Path and URL options

The Verity Spider path and URL options are described below.

-auth

Syntax: -auth path_and_filename

Specifies an authorization file to support authentication for secure paths. The file contains one record per line; each record consists of a server, realm, user name, and password, separated by whitespace. The following is a sample authorization file:

# This is the Authorization file for HTTP's Basic Authentication
#server   realm   username      password
doleary   MACR    my_username   my_password

-cgiok

Type: Web crawling only

Lets you index URLs containing query strings; that is, a question mark (?) followed by additional information. Such a URL typically leads to a CGI or other processing program. The document that the web server returns is indexed and parsed for links, which are followed and in turn indexed and parsed. If the web server does not return a page, perhaps because the URL is missing parameters that are required to produce one, there is nothing to index or parse.

Example

The following is a URL without parameters:

http://server.com/cgi-bin/program?

You can use the -start option to include parameters in the URL that you are indexing. The parameters are processed and any resulting pages are indexed and parsed. By default, a URL containing a question mark (?) is skipped.

-domain

Syntax: -domain name_1 [name_n] ...

Limits indexing to the specified domains. Use only complete text strings for domains; you cannot use wildcard expressions. URLs outside the specified domains are not downloaded or parsed. To list multiple domains, separate each one with a single space.

Note: You must have the appropriate Verity Spider licensing capability to use this option. The version of Verity Spider that is included with ColdFusion is licensed for websites that are defined on, and reside on, the same computer on which ColdFusion is installed. Contact Verity Sales for licensing options regarding the use of Verity Spider for external websites.
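For reference, the following sketch shows how these options might be combined on a vspider command line. The collection name, authorization-file path, and URLs are placeholders, and the sketch assumes the usual -collection option for naming the target collection:

vspider -collection cf_docs -start http://doleary/ -auth c:\spider\auth.txt -cgiok -domain doleary.com

This crawl starts at http://doleary/, authenticates against secure paths using the records in auth.txt, follows and indexes query-string URLs, and stays within the doleary.com domain.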
-followdup

Specifies that Verity Spider follows links within duplicate documents, although only the first instance of any duplicate document is indexed. This option is useful if you use the same home page on multiple sites. By default, only the first instance of a document is indexed and subsequent instances are skipped. If the sites have different secondary documents, the -followdup option lets Verity Spider reach them for indexing while still indexing the common home page only once.

-jumps

Syntax: -jumps num_jumps

Specifies the maximum number of levels an indexing job can go from the starting URL. Specify a number from 0 to 254; the default value is unlimited. If you see large numbers of documents in a collection where you do not expect them, consider experimenting with this option, together with the Content options, to pare down your collection.

-nodocrobo

Specifies that ROBOT META tag directives are ignored. In HTML 3.0 and earlier, robot directives could be given only in the robots.txt file under the root directory of a website. In HTML 4.0, every document can have robot directives embedded in its META fields; this option ignores them. Use this option with discretion.

-nofollow

Syntax: -nofollow "exp"

Specifies that Verity Spider does not follow any URLs that match the exp expression. If you do not specify an exp value, Verity Spider assumes a value of "*", and no documents are followed. You can use wildcard expressions, where the asterisk (*) matches text strings and the question mark (?) matches single characters. Always enclose exp values in double-quotation marks so that they are interpreted properly. If you use backslashes, double them so that they are properly escaped; for example: C:\\test\\docs\\path. To use regular expressions, also specify the -regexp option. Earlier versions of Verity Spider did not accept an expression, which meant that for each starting-point URL, only the first document was indexed. With the expression functionality, you can selectively skip URLs, even within documents.

-norobo

Type: Web crawling only

Specifies that any robots.txt files encountered are ignored. Many websites use the robots.txt file to specify which parts of the site indexers should avoid; the default is to honor any robots.txt files. If you are reindexing a site and the robots.txt file has changed, Verity Spider deletes documents that the robots.txt file newly disallows. Use this option with discretion and extreme care, especially in combination with the -cgiok option.

-pathlen

Syntax: -pathlen num_pathsegments

Limits indexing to the specified number of path segments in the URL or file-system path. The host name or drive letter is not counted; each directory level and the file name are. For example, http://www.example.com/world/champs/soccer.html and C:\world\champs\soccer.html each have a path length of three.
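The following sketch shows how several of these link-following limits might be combined; the collection name, host, and expression are placeholders:

vspider -collection cf_docs -start http://server.example.com/ -jumps 4 -nofollow "*archive*" -pathlen 3

This crawl goes no more than four link levels from the starting URL, does not follow URLs containing the string archive, and limits indexing to paths of three or fewer segments.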
-refreshtime

Syntax: -refreshtime timeunits

Specifies that documents indexed since the timeunits value began are not refreshed. The following is the syntax for timeunits:

n day n hour n min n sec

where n is a positive integer. You must include the spaces, and because only the first three letters of each time unit are parsed, you can use either the singular or the plural form of each word. If you specify the following:

-refreshtime 1 day 6 hours

only those documents that were last indexed at least 30 hours and 1 second ago are refreshed.

Note: This option is valid only with the -refresh option. When you use vsdb -recreate, the last-indexed date is cleared.
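For example, a refresh run using this option might look like the following sketch (the collection name and starting URL are placeholders):

vspider -collection cf_docs -start http://server.example.com/ -refresh -refreshtime 1 day 6 hours

Documents indexed within the last 30 hours are left alone; anything older is refreshed.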
-reparse

Type: Web crawling only

Forces parsing of all HTML documents already in the collection. Specify a starting point with the -start option when you use the -reparse option. Use the -reparse option to include paths and documents that were previously skipped because of exclusion or inclusion criteria; change the criteria through the -cmdfile option.

-unlimited

Specifies that no limits are placed on Verity Spider if neither the -host nor the -domain option is specified. The default is to limit the crawl based on the host of the first starting point listed.

-virtualhost

Syntax: -virtualhost name_1 [name_n] ...

Specifies that DNS lookups are avoided for the hosts listed. Use only complete text strings for hosts; you cannot use wildcard expressions. This option lets you index by alias, such as when multiple web servers run on the same host. Normally, when Verity Spider resolves host names, it uses DNS lookups to convert the names to canonical names, of which there can be only one per computer. This allows duplicate documents to be detected so that results are not diluted. For multiple aliased hosts, however, duplication is not a barrier: documents can be referred to by more than one alias and yet remain distinct because of the different alias names.

Example

You can have both marketing.verity.com and sales.verity.com running on the same host. Each alias has a different document root, although document names such as index.htm can occur under both. With the -virtualhost option, both server aliases are indexed as distinct sites. Without it, both would resolve to the same host name, and only the first document encountered from any duplicate pair would be indexed.

Note: If you use Netscape Enterprise Server and specify only the host name as a virtual host, Verity Spider cannot index the virtual host site, because Verity Spider always adds the domain name to the document key.
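The example above might translate into a command line like the following sketch (the collection name is a placeholder):

vspider -collection cf_docs -virtualhost marketing.verity.com sales.verity.com -start http://marketing.verity.com/ http://sales.verity.com/

Because DNS lookups are skipped for the two aliases, each alias is indexed as a distinct site even though both resolve to the same physical host.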