Networking options



The Verity Spider networking options are listed here.

-agentname

Type

Web crawling only

Syntax

-agentname string

Specifies the value for the agent name field (the User-Agent header) that is part of the HTTP request. You can use the -agentname option to impersonate a browser client, which is useful because web servers can be configured to return different versions of the same page depending on the requesting agent.

Use double-quotation marks if the name contains a space. Use the -cmdfile option if the agent name you want to use contains forbidden characters, such as slashes or backslashes.

-connections

Syntax

-connections num_connections

Specifies the maximum number of simultaneous socket connections to make to websites for indexing. Each connection implies a separate thread.

The default value is 6.

Note: The Verity Spider dynamic flow control makes the best use of all available connections when indexing websites. If you are indexing multiple sites, you might want to increase this number. Increasing the number of connections does not always help, because throughput also depends on factors such as your network connection and the capabilities of the remote hosts.

-delay

Type

Web crawling only

Syntax

-delay num_milliseconds

Specifies the minimum time between HTTP requests, in milliseconds. The default value is 0 milliseconds (no delay).
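To picture what a minimum time between requests means, the following Python sketch spaces successive calls by at least a given number of milliseconds. It is an illustration only, not Verity Spider code; the class name `MinDelay` and its `wait` method are hypothetical.

```python
import time

class MinDelay:
    """Hypothetical helper: enforce a minimum gap between successive requests."""

    def __init__(self, num_milliseconds: int = 0):
        # 0 mirrors the documented default: no delay.
        self.gap = num_milliseconds / 1000.0
        self.last = None  # time of the previous request, if any

    def wait(self) -> float:
        """Sleep until the minimum gap has elapsed; return the seconds slept."""
        slept = 0.0
        if self.last is not None:
            remaining = self.gap - (time.monotonic() - self.last)
            if remaining > 0:
                time.sleep(remaining)
                slept = remaining
        self.last = time.monotonic()
        return slept
```

A caller would invoke `wait()` immediately before each request; the first call never sleeps.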

-header

Type

Web crawling only

Syntax

-header string

Specifies an HTTP header to add to the request; for example:

-header "Referer: http://www.verity.com/"

Verity Spider sends some predefined headers, such as Accept and User-Agent, by default. Special headers are sometimes necessary to correctly index a site.

For example, earlier versions of Verity Spider did not support the Host header, which is needed for indexing virtual hosts. Also, a Proxy-Authorization header was required to pass a user name and password to a proxy server. In the current version of Verity Spider, the Host header is supported by default, and the -proxyauth option is available for proxy server authentication. Therefore, the -header option is maintained only for backward compatibility and possible future enhancements.

Note: Misuse of this option can cause Verity Spider to fail. If this happens, rerun the indexing task with modified -header values.

-hostcache

Syntax

-hostcache num_hostnames

Specifies the number of host names to cache to avoid DNS lookups. Without this option, the host cache continues to grow.

The default value is 256.
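The idea of a size-limited host name cache can be sketched in Python as follows. This is not Verity Spider's implementation: the class name `HostCache`, the injected `resolve` function, and the least-recently-used eviction policy are all assumptions made for illustration, because the documentation does not say how entries are discarded.

```python
from collections import OrderedDict

class HostCache:
    """Hypothetical bounded cache of host name -> address lookups."""

    def __init__(self, resolve, num_hostnames: int = 256):
        self.resolve = resolve        # lookup function, e.g. socket.gethostbyname
        self.limit = num_hostnames    # 256 mirrors the documented default
        self.entries = OrderedDict()

    def lookup(self, host: str) -> str:
        if host in self.entries:
            # Cache hit: no DNS lookup; mark the entry as recently used.
            self.entries.move_to_end(host)
            return self.entries[host]
        address = self.resolve(host)  # cache miss: perform the real lookup
        self.entries[host] = address
        if len(self.entries) > self.limit:
            # Assumed policy: evict the least recently used entry.
            self.entries.popitem(last=False)
        return address
```

Passing the resolver in as a parameter keeps the sketch testable without network access.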

-noflowctrl

Type

Web crawling only

Disables round-robin indexing of websites with network flow control.

By default, Verity Spider uses round-robin indexing of websites to avoid overwhelming a web server and to improve indexing performance. Verity Spider connects to each web server in a round-robin manner, using up to the value for the -connections option. This means that one URL is fetched from each web server, in turn.

Note: Using the -noflowctrl option can result in a significant drop in performance.
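The round-robin order described for the default behavior can be pictured with a short Python sketch. It is an illustration only, not Verity Spider's implementation: the function name `round_robin_order` and the example host names are hypothetical, and the sketch ignores the -connections limit and network conditions that real flow control takes into account.

```python
from collections import deque

def round_robin_order(queues: dict[str, list[str]]) -> list[str]:
    """Return URLs in round-robin order: one URL from each host, in turn."""
    pending = deque((host, deque(urls)) for host, urls in queues.items() if urls)
    order = []
    while pending:
        host, urls = pending.popleft()
        order.append(urls.popleft())
        if urls:
            # This host still has URLs left: send it to the back of the line.
            pending.append((host, urls))
    return order

queues = {
    "a.example.com": ["http://a.example.com/1", "http://a.example.com/2"],
    "b.example.com": ["http://b.example.com/1"],
}
for url in round_robin_order(queues):
    print(url)
```

With the two hypothetical hosts above, the first URL of each host is fetched before the second URL of a.example.com, so no single server receives a burst of consecutive requests.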

-noproxy

Type

Web crawling only

Syntax

-noproxy name_1 [name_n] ...

Used with the -proxy option, the -noproxy option specifies that Verity Spider directly access the hosts whose names match those specified. By default, when you specify the -proxy option, Verity Spider first tries to access every host with the proxy information. To improve performance, use the -noproxy option for the hosts you know can be accessed without a proxy host. For the name variable, you can use the asterisk (*) wildcard for text strings; for example:

'*.verity.com'

You cannot use the question mark (?) wildcard, and specifying the -regexp option does not enable regular expressions for these names.

On Windows, enclose the argument in double-quotation marks to protect the asterisk special character (*). On UNIX, use single-quotation marks. Quoting is required only when you run the indexing job from a command line; quotation marks are not necessary within a command file (the -cmdfile option).
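As a rough model of the name matching described for -noproxy, the following Python sketch treats the asterisk as the only wildcard and every other character, including the question mark, as literal text. The function name `matches_noproxy` is hypothetical, and the exact matching rules that Verity Spider applies (case sensitivity, for example) are not stated here, so treat this as an approximation.

```python
import re

def matches_noproxy(host: str, patterns: list[str]) -> bool:
    """Return True if host matches any pattern; only '*' acts as a wildcard."""
    for pattern in patterns:
        # Escape every character, then turn each escaped '*' into "any text".
        regex = re.escape(pattern).replace(r"\*", ".*")
        if re.fullmatch(regex, host):
            return True
    return False

print(matches_noproxy("www.verity.com", ["*.verity.com"]))   # True
print(matches_noproxy("www.example.com", ["*.verity.com"]))  # False
```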

Note: You must have valid Verity Spider licensing capability to use this option.

-proxy

Type

Web crawling only

Syntax

-proxy proxyhost:port

Specifies the host and port of the proxy server.

Note: You must have valid Verity Spider licensing capability to use this option.

See also -proxyauth for proxy servers that require authentication, and -noproxy for hosts that you know are accessible without having to go through a proxy server.

-proxyauth

Type

Web crawling only

Syntax

-proxyauth login:password

Specifies login information for proxy server connections that require authorization to get outside the firewall. Use this option with the -proxy option.

Note: You must have valid Verity Spider licensing capability to use this option. Information Server V3.7 does not support retrieving documents for viewing through secure proxy servers. Do not use the -proxyauth option for indexing documents that are viewed through Information Server V3.7.

-retry

Type

Web crawling only

Syntax

-retry num_retries

Specifies the number of times that Verity Spider should attempt to access a URL. Use the -retry option when an unstable network connection returns a false rejection.

The default value is 4.
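The effect of a limit on access attempts can be sketched in Python as follows. This is an illustration, not Verity Spider code: the function name `fetch_with_retry` and the caller-supplied `fetch` function are hypothetical, and the sketch assumes that num_retries counts the total number of attempts, as the description above suggests.

```python
def fetch_with_retry(fetch, url: str, num_retries: int = 4):
    """Call fetch(url) up to num_retries times; re-raise the last failure."""
    if num_retries < 1:
        raise ValueError("num_retries must be at least 1")
    last_error = None
    for _ in range(num_retries):
        try:
            return fetch(url)
        except OSError as error:
            # Network-level failure (for example, a dropped connection): try again.
            last_error = error
    raise last_error
```

A transient failure on the first or second attempt is therefore absorbed, while a host that fails every attempt is still reported as an error.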

-timeout

Type

Web crawling only

Syntax

-timeout num_seconds

Specifies the time period, in seconds, that Verity Spider should wait before timing out on a network connection and on accessing data. The data access time-out is automatically set to twice the value that you specify for the network connection time-out.

The default value for the network connection time-out is 30 seconds, and therefore the default value for the data access time-out is 60 seconds.
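The relationship between the two time-outs is simple arithmetic, shown here as a small Python sketch. The function name `spider_timeouts` is hypothetical; only the doubling rule and the 30-second default come from the description above.

```python
def spider_timeouts(num_seconds: int = 30) -> tuple[int, int]:
    """Return (connection_timeout, data_access_timeout) in seconds."""
    # The data access time-out is twice the network connection time-out.
    return num_seconds, 2 * num_seconds

print(spider_timeouts())    # (30, 60)
print(spider_timeouts(45))  # (45, 90)
```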