Path and URL options



The Verity Spider path and URL options are:

-auth

Syntax

-auth path_and_filename

Specifies an authorization file to support authentication for secure paths.

Use the -auth option to specify the authorization file. The file contains one record per line. Each record consists of a server, realm, user name, and password, separated by whitespace.

The following is a sample authorization file:

# This is the Authorization file for HTTP's Basic Authentication 
#server   realm    username      password 
doleary   MACR     my_username   my_password
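The record layout above (one record per line, four whitespace-separated fields, # for comments) can be checked with a short script. This is a minimal sketch in POSIX shell; `check_auth_file` is a hypothetical helper for illustration, not part of Verity Spider:

```shell
# Hypothetical helper: verify that every non-comment, non-blank record
# in an authorization file has exactly four whitespace-separated fields
# (server, realm, user name, password).
check_auth_file() {
  awk '!/^#/ && NF > 0 && NF != 4 { bad++ } END { exit (bad > 0) }' "$1"
}

# Build the sample file from above and check it.
cat > auth.txt <<'EOF'
# This is the Authorization file for HTTP's Basic Authentication
#server   realm    username      password
doleary   MACR     my_username   my_password
EOF

check_auth_file auth.txt && echo "auth file OK"
```

A record with the wrong number of fields makes the check fail, which is an easy way to catch malformed authorization files before running an indexing job.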

-cgiok

Type

Web crawling only

Lets you index URLs that contain query strings; that is, a question mark (?) followed by additional information. Such a URL typically leads to a CGI or other processing program.

The return document produced by the web server is indexed and parsed for document links, which are followed and in turn indexed and parsed. However, if the web server does not return a page, perhaps because the URL is missing parameters that are required to produce a page, nothing happens. There is no page to index and parse.

Example

The following is a URL without parameters:

http://server.com/cgi-bin/program?

You can use the -start option to include parameters in the URL that you are indexing. The parameters are processed and any resulting pages are indexed and parsed.

By default, a URL with a question mark (?) is skipped.
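The default skip rule can be sketched in POSIX shell. The function name is illustrative, not part of Verity Spider; the only behavior taken from the text above is that, without -cgiok, any URL containing a question mark is skipped:

```shell
# Illustrative sketch of the default rule: a URL with a query string
# (a literal "?") is skipped unless -cgiok is given.
should_skip_url() {
  case "$1" in
    *\?*) return 0 ;;   # query string present: skip by default
    *)    return 1 ;;   # no query string: eligible for indexing
  esac
}

should_skip_url "http://server.com/cgi-bin/program?id=1" && echo "skipped"
should_skip_url "http://server.com/index.html" || echo "indexed"
```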

-domain

Type

Web crawling only

Syntax

-domain name_1 [name_n] ...

Limits indexing to the specified domains. Use only complete text strings for domains. You cannot use wildcard expressions. URLs not in the specified domains are not downloaded or parsed.

You can list multiple domains by separating each one with a single space.

Note: You must have the appropriate Verity Spider licensing capability to use this option. The version of Verity Spider that is included with ColdFusion is licensed for websites that are defined and reside on the same computer on which ColdFusion is installed. Contact Verity Sales for licensing options regarding the use of Verity Spider for external websites.
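The domain restriction can be sketched as a host filter. This is an illustrative POSIX shell sketch, assuming the usual semantics that a host belongs to a listed domain when the domain is a complete trailing component of the host name; `in_domain` is a hypothetical helper, not Verity code:

```shell
# Illustrative filter: a host is eligible only if it matches one of the
# listed domains, compared as complete text strings (no wildcards).
in_domain() {
  url_host=$1; shift
  for d in "$@"; do
    case "$url_host" in
      *."$d" | "$d") return 0 ;;   # host is in, or is, the domain
    esac
  done
  return 1
}

in_domain "www.verity.com" "verity.com" "example.org" && echo "allowed"
in_domain "www.other.net"  "verity.com" "example.org" || echo "filtered"
```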

-followdup

Specifies that Verity Spider follows links within duplicate documents, although only the first instance of any duplicate document is indexed.

You might find this option useful if you use the same home page on multiple sites. By default, only the first instance of the document is indexed, while subsequent instances are skipped. If you have different secondary documents on the different sites, using the -followdup option lets you get to them for indexing, while still indexing the common home page only once.

-followsymlink

Type

File system only

Specifies that Verity Spider follows symbolic links when indexing UNIX file systems.

-host

Type

Web crawling only

Syntax

-host name_1 [name_n] ...

Limits indexing to the specified host or hosts. Use only complete text strings for hosts. You cannot use wildcard expressions.

You can list multiple hosts by separating each one with a single space. URLs not on the specified hosts are not downloaded or parsed.

-https

Type

Web crawling only

Lets you index SSL-enabled websites.

Note: You must have the Verity SSL Option Pack installed to use the -https option. The Verity SSL Option Pack is a Verity Spider add-on available separately from a Verity salesperson.

-jumps

Type

Web crawling only

Syntax

-jumps num_jumps

Specifies the maximum number of levels an indexing job can go from the starting URL. Specify a number from 0 to 254.

The default value is unlimited. If you see large numbers of unexpected documents in a collection, consider experimenting with this option, in combination with the Content options, to pare down your collection.

-nodocrobo

Specifies to ignore ROBOT META tag directives.

In HTML 3.0 and earlier, robot directives could be given only in the robots.txt file under the root directory of a website. In HTML 4.0, every document can have robot directives embedded in the META field. The -nodocrobo option ignores these embedded directives; use it with discretion.

-nofollow

Type

Web crawling only

Syntax

-nofollow "exp"

Specifies that Verity Spider does not follow any URLs that match the exp expression. If you do not specify an exp value for the -nofollow option, Verity Spider assumes a value of "*", and no documents are followed.

You can use wildcard expressions, where the asterisk (*) is for text strings and the question mark (?) is for single characters. Always encapsulate the exp values in double-quotation marks to ensure that they are properly interpreted.

If you use backslashes, double them so that they are properly escaped; for example:

C:\\test\\docs\\path

To use regular expressions, also specify the -regexp option.

Earlier versions of Verity Spider did not allow the use of an expression. This meant that for each starting point URL, only the first document would be indexed. With the addition of the expression functionality, you can now selectively skip URLs, even within documents.
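The wildcard semantics described above, where the asterisk (*) matches any text string and the question mark (?) matches a single character, happen to be the same wildcards POSIX shell case patterns use, so the matching rule can be sketched directly. The function name is hypothetical; this is a stand-in for the matcher, not Verity code:

```shell
# Sketch of -nofollow wildcard matching: "*" matches any text string,
# "?" matches a single character (shell case patterns use the same rules).
matches_nofollow() {
  url=$1
  exp=$2
  case "$url" in
    $exp) return 0 ;;   # URL matches the expression: do not follow
    *)    return 1 ;;
  esac
}

matches_nofollow "http://site.com/archive/old.html" "*archive*" && echo "not followed"
matches_nofollow "http://site.com/doc1.html" "*doc?.html" && echo "not followed"
```

As the text notes, quote the expression on the command line so the shell passes it to Verity Spider rather than expanding it itself.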

See also

-regexp

-norobo

Type

Web crawling only

Specifies to ignore any robots.txt files encountered. The robots.txt file is used on many websites to specify what parts of the site indexers should avoid. The default is to honor any robots.txt files.

If you are reindexing a site and the robots.txt file has changed, Verity Spider deletes documents that have been newly disallowed by the robots.txt file.

Use this option with discretion and extreme care, especially with the -cgiok option.

See also

-nodocrobo

-pathlen

Syntax

-pathlen num_pathsegments

Limits indexing to the specified number of path segments in the URL or file system path. The path length is determined as follows:

  • The host name and drive letter are not included. For example, www.spider.com:80/ or C:\ is not included in determining the path length.

  • All elements following the host name are included.

  • The actual filename, if present, is included; for example, /world.html would be included in determining the path length.

  • Any directory paths between the host and the actual filename are included.

Example

For the following URL, the path length would be four:

http://www.spider:80/comics/fun/funny/world.html
                     <-1--> <2> <-3-> <---4---->

For the following file system path, the path length would be three:

C:\files\docs\datasheets
   <-1-> <-2> <---3---->

The default value is 100 path segments.
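The counting rule above (drop the scheme and host, or the drive letter, then count the remaining path components including the filename) can be sketched in POSIX shell. The `path_len` helper is illustrative, not Verity code:

```shell
# Sketch of the path-length rule: strip the scheme and host (or the
# drive letter), then count the remaining path segments, including the
# filename if present.
path_len() {
  p=$(printf '%s' "$1" | tr '\\' '/')   # normalize \ to / for counting
  case "$p" in
    *://*)       p=${p#*://}; p=${p#*/} ;;  # drop scheme and host
    [A-Za-z]:/*) p=${p#??} ;;               # drop drive letter, e.g. C:
  esac
  # count non-empty segments between slashes
  printf '%s\n' "$p" | awk -F/ '{ n = 0; for (i = 1; i <= NF; i++) if ($i != "") n++; print n }'
}

path_len "http://www.spider:80/comics/fun/funny/world.html"   # prints 4
path_len 'C:\files\docs\datasheets'                           # prints 3
```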

-refreshtime

Syntax

-refreshtime timeunits

Specifies that documents indexed more recently than the timeunits value are not refreshed.

The following is the syntax for timeunits:

n day n hour n min n sec

Where n is a positive integer. You must include spaces, and since the first three letters of each time unit are parsed, you can use the singular or plural form of the word.

If you specify the following:

-refreshtime 1 day 6 hours

Only those documents that were last indexed at least 30 hours and 1 second ago are refreshed.

Note: This option is valid only with the -refresh option. When you use vsdb -recreate, the last indexed date is cleared.
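The timeunits arithmetic can be sketched in POSIX shell. Only the first three letters of each unit are compared, mirroring the parsing rule above; the `to_seconds` helper is illustrative, not Verity code:

```shell
# Sketch of the timeunits syntax: pairs of "n unit", where only the
# first three letters of the unit matter ("day"/"days" both work).
to_seconds() {
  total=0
  while [ $# -ge 2 ]; do
    n=$1 unit=$2
    shift 2
    case "$unit" in
      day*) total=$((total + n * 86400)) ;;
      hou*) total=$((total + n * 3600)) ;;
      min*) total=$((total + n * 60)) ;;
      sec*) total=$((total + n)) ;;
    esac
  done
  echo "$total"
}

to_seconds 1 day 6 hours   # 30 hours = 108000 seconds
```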

-reparse

Type

Web crawling only

Forces parsing of all HTML documents already in the collection. Specify a starting point with the -start option when you use the -reparse option.

You can use the -reparse option when you want to include paths and documents that were previously skipped due to exclusion or inclusion criteria. Make sure that you change the criteria, which you can specify with the -cmdfile option.

-unlimited

Specifies that no limits are placed on Verity Spider if the -host or the -domain option is not specified. The default is to limit based on the host of the first starting point listed.

-virtualhost

Syntax

-virtualhost name_1 [name_n] ...

Specifies that DNS lookups are avoided for the hosts listed. Use only complete text strings for hosts. You cannot use wildcard expressions. This lets you index by alias, such as when multiple web servers are running on the same host.

Normally, when Verity Spider resolves host names, it uses DNS lookups to convert the names to canonical names, of which there can be only one per computer. This allows for the detection of duplicate documents, to prevent results from being diluted. For multiple aliased hosts, however, duplication is not a barrier as documents can be referred to by more than one alias and yet remain distinct because of the different alias names.

Example

You can have both marketing.verity.com and sales.verity.com running on the same host. Each alias has a different document root, although document names such as index.htm can occur for both. With the -virtualhost option, both server aliases can be indexed as distinct sites. Without the -virtualhost option, they would both be resolved to the same host name, and only the first document encountered from any duplicate pair would be indexed.

Note: If you are using Netscape Enterprise Server, and you have specified only the host name as a virtual host, Verity Spider cannot index the virtual host site. This is because Verity Spider always adds the domain name to the document key.