Content options



The Verity Spider content options are:

-casesen

Makes processing case sensitive by specifying that the spider separately process keys that differ only in case. Use only for indexing UNIX servers.

-exclude

Syntax

-exclude exp_1 [exp_n] ...

Specifies that files, paths, and URLs matching the specified expressions are not followed. If you use backslashes, double them so that they are properly escaped; for example:

C:\\test\\docs\\path
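
If you generate arguments from a script, the doubling can be automated. The following Python sketch is only an illustration; the helper name escape_backslashes is hypothetical and is not part of Verity Spider.

```python
def escape_backslashes(path):
    """Double every backslash in a Windows path."""
    return path.replace("\\", "\\\\")

# The raw string holds the path with single backslashes.
escaped = escape_backslashes(r"C:\test\docs\path")
print(escaped)
```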

You can use wildcard expressions, where the asterisk (*) is for text strings and the question mark (?) is for single characters; for example:

'/my_doc*/year199?'
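
As a rough illustration of these wildcard semantics, the following Python sketch uses the standard fnmatch module, whose asterisk and question mark behave the same way for this pattern. It is not Verity Spider's own matcher; in particular, whether Verity's asterisk crosses path separators is not stated here, and the candidate paths are hypothetical.

```python
from fnmatch import fnmatchcase

# Pattern taken from the example above; the candidate paths are hypothetical.
pattern = "/my_doc*/year199?"

candidates = [
    "/my_docs/year1995",      # '*' matches "s", '?' matches "5"
    "/my_doc/year1990",       # '*' may match an empty string
    "/my_documents/year199",  # no character left for '?'
    "/my_docs/year19950",     # '?' matches exactly one character, not two
]

matches = [path for path in candidates if fnmatchcase(path, pattern)]
print(matches)
```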

In Windows, include double-quotation marks around the argument to protect special characters, such as the asterisk (*). On UNIX, use single-quotation marks. This is only required when you run the indexing job from a command line. Quotation marks are not necessary within a command file (the -cmdfile option).

To use regular expressions, also specify the -regexp option.

To specify a file, path, or URL that you want followed but not indexed, use the -indexclude option. For document types, use the -mimeexclude option instead; for example, specify -mimeexclude application/pdf rather than -exclude *.pdf.

Note: When specifying a URL, use full, absolute paths using the same format that appears in the HTML hypertext link. If the link is relative, change it to absolute to use it with the -exclude option.

See also

-regexp.

-include

Syntax

-include exp_1 [exp_n] ...

Specifies that only those files, paths, and URLs that match the specified expression or expressions are followed. If you use backslashes, double them so that they are properly escaped; for example:

C:\\test\\docs\\path

You can use wildcard expressions, where the asterisk (*) is for text strings and the question mark (?) is for single characters; for example:

'/my_doc*/year199?'

In Windows, include double-quotation marks around the argument to protect the special characters, such as the asterisk (*). On UNIX, use single-quotation marks. This is only required when you run the indexing job from a command line. Quotation marks are not necessary within a command file (the -cmdfile option).

To use regular expressions, also specify the -regexp option.

If your starting points do not match the specified -include expressions, nothing is indexed, because the -include option prevents Verity Spider from even following anything that does not match them. In that case, consider the -indinclude option instead: it allows Verity Spider to follow anything while indexing only what matches the specified expressions.

For document types, use the -mimeinclude option instead; for example, specify -mimeinclude text/html rather than -include *.htm.

Note: When specifying a URL, use full, absolute paths using the same format that appears in the HTML hypertext link. If the link is relative, change it to absolute to use it with the -include option.

See also

-regexp.

-indexclude

Syntax

-indexclude exp_1 [exp_n] ...

Specifies that the files and paths in URLs that match the expressions are not indexed. They are, however, still followed. If you use backslashes, double them so that they are properly escaped; for example:

C:\\test\\docs\\path

You can use wildcard expressions, where the asterisk (*) is for text strings and the question mark (?) is for single characters; for example:

'/my_doc*/year199?'

In Windows, include double-quotation marks around the argument to protect the special characters, such as the asterisk (*). On UNIX, use single-quotation marks. This is only required when you run the indexing job from a command line. Quotation marks are not necessary within a command file (the -cmdfile option).

To use regular expressions, also specify the -regexp option.

Use this option to gather some documents, such as HTML tables of contents, only to gain access to the other documents that they link to for indexing.

Where the -exclude option prevents Verity Spider from even following anything that matches the specified expressions, the -indexclude option allows Verity Spider to follow everything and skips indexing only for what matches the specified expressions.
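
The difference between the two options can be modeled as a pair of decisions per URL: whether to follow it and whether to index it. The following Python sketch is a simplified model under stated assumptions, not Verity Spider's implementation; the function decide, the example.com URLs, and the use of fnmatch for wildcard matching are all hypothetical.

```python
from fnmatch import fnmatchcase

def decide(url, exclude=(), indexclude=()):
    """Return (follow, index) for a URL under this simplified model."""
    if any(fnmatchcase(url, pat) for pat in exclude):
        return (False, False)   # -exclude: neither followed nor indexed
    if any(fnmatchcase(url, pat) for pat in indexclude):
        return (True, False)    # -indexclude: followed, but not indexed
    return (True, True)         # no match: followed and indexed

toc = "http://www.example.com/docs/toc.html"      # hypothetical URLs
guide = "http://www.example.com/docs/guide.html"

print(decide(toc, exclude=["*/toc.html"]))
print(decide(toc, indexclude=["*/toc.html"]))
print(decide(guide, indexclude=["*/toc.html"]))
```

With -exclude, the links inside the table of contents are never reached; with -indexclude, they are still discovered even though the table of contents itself stays out of the collection.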

For document types, use the -indmimeexclude option instead.

Note: When specifying a URL, use full, absolute paths using the same format that appears in the HTML hypertext link. If the link is relative, change it to absolute to use it with the -indexclude option.

See also

-regexp.

-indinclude

Syntax

-indinclude exp_1 [exp_n] ...

Specifies that only those files and paths in URLs that match the expressions are indexed. Files and paths that do not match are still followed. If you use backslashes, double them so that they are properly escaped; for example:

C:\\test\\docs\\path

You can use wildcard expressions, where the asterisk (*) is for text strings and the question mark (?) is for single characters; for example:

'/my_doc*/year199?'

In Windows, include double-quotation marks around the argument to protect the special characters, such as the asterisk (*). On UNIX, use single-quotation marks. This is only required when you run the indexing job from a command line. Quotation marks are not necessary within a command file (the -cmdfile option).

To use regular expressions, also specify the -regexp option.

Where the -include option prevents Verity Spider from even following anything that does not match the specified expressions, the -indinclude option allows Verity Spider to follow everything while indexing only what matches the specified expressions.

Example

If you want to index all documents that include "search" in the URL at http://web.verity.com, you cannot use the following:

vspider -collection collname -start http://web.verity.com  
    -include '*search*'

This is because the starting point does not match the -include option criteria. Instead, use the -indinclude option to follow all documents (unless you have specified any of the exclude options) and index only those documents that match your criteria. Replace the -include option with the -indinclude option in the preceding example.
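
The effect described in this example can be reproduced with a toy crawler. The following Python sketch is a simplified model, not Verity Spider's implementation: the link graph below the starting point is invented for illustration, and wildcard matching is approximated with the standard fnmatch module.

```python
from collections import deque
from fnmatch import fnmatchcase

# Hypothetical link graph: each URL maps to the URLs it links to.
LINKS = {
    "http://web.verity.com": ["http://web.verity.com/products/search.html",
                              "http://web.verity.com/about.html"],
    "http://web.verity.com/products/search.html": [],
    "http://web.verity.com/about.html": ["http://web.verity.com/search/faq.html"],
    "http://web.verity.com/search/faq.html": [],
}

def crawl(start, include=None, indinclude=None):
    """Return the sorted list of indexed URLs under this simplified model."""
    def matches(url, patterns):
        return patterns is None or any(fnmatchcase(url, p) for p in patterns)

    indexed, seen, queue = [], {start}, deque([start])
    while queue:
        url = queue.popleft()
        if not matches(url, include):
            continue                  # -include: not followed, not indexed
        if matches(url, indinclude):
            indexed.append(url)       # -indinclude: gates indexing only
        for link in LINKS.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return sorted(indexed)

start = "http://web.verity.com"
with_include = crawl(start, include=["*search*"])
with_indinclude = crawl(start, indinclude=["*search*"])
print(with_include)      # empty: the starting point itself does not match
print(with_indinclude)   # both "search" pages are reached and indexed
```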

Note: When specifying a URL, use full, absolute paths using the same format that appears in the HTML hypertext link. If the link is relative, change it to absolute to use it with the -indinclude option.

See also

-regexp.

-indmimeexclude

Syntax

-indmimeexclude mime_1 [mime_n] ...

Specifies that documents whose MIME types match the expressions are followed but not indexed.

In Windows, include double-quotation marks around the argument to protect the special characters, such as the asterisk (*). On UNIX, use single-quotation marks. This is only required when you run the indexing job from a command line. Quotation marks are not necessary within a command file (the -cmdfile option).

Use this option to gather some documents, such as HTML tables of contents, to gain access to other documents for indexing. The -mimeexclude option, on the other hand, prevents specified documents from being followed at all. For the MIME variable, you can include the asterisk (*) wildcard for text strings; for example:

'text/*'

You cannot use the question mark (?) wildcard, and the -regexp option does not let you use regular expressions.
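
To make the MIME wildcard rule concrete, the following Python sketch matches a MIME type against a pattern in which only the asterisk is special. It is an illustrative model only: the function name mime_matches is hypothetical, and treating the question mark as a literal character is an assumption about how an unsupported wildcard behaves.

```python
import re

def mime_matches(mime_type, pattern):
    """Match a MIME type against a pattern in which only '*' is a wildcard."""
    # Escape every literal piece so that '?' and other characters stay literal,
    # then let each '*' stand for any run of characters.
    regex = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.fullmatch(regex, mime_type) is not None

print(mime_matches("text/html", "text/*"))
print(mime_matches("application/pdf", "text/*"))
print(mime_matches("text/html", "text/?tml"))
```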

-indmimeinclude

Syntax

-indmimeinclude mime_1 [mime_n] ...

Specifies that only those documents whose MIME types match the expressions are indexed. Documents of other MIME types are still followed.

By contrast, the -mimeinclude option cannot index the documents you want if the starting URL does not match the specified MIME types, because the starting URL is then not followed. For the MIME variable, you can include the asterisk (*) wildcard for text strings; for example:

'text/*'

In Windows, include double-quotation marks around the argument to protect the special character (*). On UNIX, use single-quotation marks. This is only required when you run the indexing job from a command line. Quotation marks are not necessary within a command file (the -cmdfile option).

You cannot use the question mark (?) wildcard, and the -regexp option does not allow you to use regular expressions.

Example

If you want to index all Word documents at http://web.verity.com, you cannot use:

vspider -collection collname -style style_dir -start  
    http://web.verity.com -mimeinclude 'application/msword'

This is because the starting point does not match the -mimeinclude criteria. You can use the -indmimeinclude option to follow all documents (unless you have specified any of the exclude options) and index only those documents that match your criteria. Replace the -mimeinclude option with the -indmimeinclude option in the preceding example.

-indskip

Syntax

-indskip HTML_tag "exp"

Type

Web crawling only

Specifies that Verity Spider parse and follow the links in, but not index, any HTML document that contains the text of exp within the given HTML_tag. For multiple HTML_tag and exp combinations, use multiple instances of the -indskip option.

You can use wildcard expressions, where the asterisk (*) is for text strings and the question mark (?) is for single characters; for example:

'/my_doc*/year199?'

In Windows, include double-quotation marks around the argument to protect the special characters, such as the asterisk (*). On UNIX, use single-quotation marks. This is only required when you run the indexing job from a command line. Quotation marks are not necessary within a command file (the -cmdfile option).

If you use backslashes, double them so that they are properly escaped; for example:

C:\\test\\docs\\path

To use regular expressions, also specify the -regexp option.

Example 1

To skip all HTML documents that contain the word "personnel" in the Title element, while still parsing those documents for links to other documents, use the following:

-indskip title "personnel"

Example 2

To avoid indexing directory listing pages, while still parsing the document and path links except for the link to the parent directory, use one of the following, depending on the web server being indexed:

  1. For Netscape web servers, use the following:

    -indskip title "*Index of*"
    -nofollow "*parent directory*"

  2. For Microsoft Internet Information Server, use the following:

    -indskip a "*to parent directory*"
    -nofollow "*parent directory*"
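
The check that -indskip performs can be pictured as "does any occurrence of this tag contain text matching the expression?". The following Python sketch models that check with the standard html.parser module. It is not Verity Spider's implementation: the names TagText and indskip_applies are hypothetical, and case-insensitive wildcard matching against the whole text of the element is an assumption.

```python
from fnmatch import fnmatchcase
from html.parser import HTMLParser

class TagText(HTMLParser):
    """Collect the text found inside every occurrence of one HTML tag."""
    def __init__(self, tag):
        super().__init__()
        self.tag, self.depth, self.texts = tag, 0, []

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.depth += 1
            self.texts.append("")

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.texts[-1] += data

def indskip_applies(html, tag, exp):
    """True if the page should be parsed for links but left out of the index."""
    parser = TagText(tag)
    parser.feed(html)
    parser.close()
    # Assumption: the comparison ignores case.
    return any(fnmatchcase(t.strip().lower(), exp.lower()) for t in parser.texts)

# Hypothetical directory listing page.
listing = ('<html><head><title>Index of /docs</title></head><body>'
           '<a href="../">Parent Directory</a></body></html>')
print(indskip_applies(listing, "title", "*Index of*"))
```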

-maxdocsize

Syntax

-maxdocsize integer

Specifies the maximum size, in kilobytes, for documents to be indexed. Any documents larger than the value specified by the -maxdocsize option are ignored.

The default is to index documents of any size.

-metafile

Type

Web crawling only

Syntax

-metafile path_and_filename

Lets you use a text file to map custom meta tags to valid HTTP header fields. If you use backslashes, double them so that they are properly escaped; for example:

C:\\test\\docs\\path

This lets a meta tag of your own, embedded in the document, replace the value returned by the web server, or supply a value when the server returns none. Currently, the only header fields of real value are Last-Modified and Content-Length. Future enhancements could allow for greater variety.

The following is the syntax for entries in the text file:

name Last-Modified y|n

or

name Content-Length y|n

Where name is the name of your custom meta tag, and y|n is the override flag, which can be y (yes) or n (no).

Example

A mapping file for the -metafile option might include the following:

Doc_Last_Touched Last-Modified n 
Doc_Size Content-Length y

If you use the y override flag, the value for the custom meta tag overrides the value for the valid field, even if both values are present and differ. This can be useful when the valid field value is always sent, but you want to specify your own value with a custom meta tag.

If you use the n override flag, the value for the custom meta tag is used only if there is no value for the valid field returned by the server. If a value for the valid field exists, it is given precedence.

Note: If you have several entries mapping to the same valid field, only the last entry takes effect.
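
The override rules can be summarized in a few lines of code. The following Python sketch parses a mapping file in the format shown above and resolves one header field; it is a model of the documented rules, not Verity Spider's implementation, and the function names and sample values are hypothetical.

```python
def parse_metafile(text):
    """Map each valid header field to (custom_meta_tag, override_flag)."""
    mapping = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        name, field, flag = parts
        # A later entry for the same field replaces an earlier one.
        mapping[field] = (name, flag.lower() == "y")
    return mapping

def resolve(field, server_headers, meta_tags, mapping):
    """Pick the value to use for a header field."""
    server_value = server_headers.get(field)
    if field not in mapping:
        return server_value
    name, override = mapping[field]
    custom_value = meta_tags.get(name)
    if custom_value is None:
        return server_value
    if override or server_value is None:
        return custom_value   # y flag, or nothing returned by the server
    return server_value       # n flag and the server sent a value

mapping = parse_metafile("Doc_Last_Touched Last-Modified n\n"
                         "Doc_Size Content-Length y\n")
server = {"Last-Modified": "Mon, 01 Jan 2001 00:00:00 GMT",
          "Content-Length": "2048"}
tags = {"Doc_Last_Touched": "Tue, 02 Jan 2001 00:00:00 GMT",
        "Doc_Size": "1024"}

print(resolve("Last-Modified", server, tags, mapping))   # server value wins (n)
print(resolve("Content-Length", server, tags, mapping))  # custom value wins (y)
```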

-mimeexclude

Syntax

-mimeexclude mime_1 [mime_n] ...

Specifies MIME types that are not followed or indexed.

In Windows, include double-quotation marks around the argument to protect the special characters, such as the asterisk (*). On UNIX, use single-quotation marks. This is only required when you run the indexing job from a command line. Quotation marks are not necessary within a command file (the -cmdfile option).

The default is to include all MIME types. For the MIME variable, you can include the asterisk (*) wildcard for text strings; for example:

'text/*'

You cannot use the question mark (?) wildcard, and the -regexp option does not let you use regular expressions.

Use the -indmimeexclude option to allow Verity Spider to follow documents, without indexing them, to gain access to other desirable document types.

-mimeinclude

Syntax

-mimeinclude mime_1 [mime_n] ...

Specifies the MIME types to be included. Documents of any other MIME type are not followed or indexed.

In Windows, include double-quotation marks around the argument to protect the special characters, such as the asterisk (*). On UNIX, use single-quotation marks. This is only required when you run the indexing job from a command line. Quotation marks are not necessary within a command file (the -cmdfile option).

The default is to include all MIME types. For the MIME variable, you can include the asterisk (*) wildcard for text strings; for example:

'text/*'

You cannot use the question mark (?) wildcard, and the -regexp option does not let you use regular expressions.

-mindocsize

Syntax

-mindocsize integer

Specifies the minimum size, in kilobytes, for documents to be indexed. Any documents smaller than the value specified by the -mindocsize option are ignored.

The default is to index documents of any size.
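
The -mindocsize and -maxdocsize options together define a size window. The following Python sketch shows the combined check; it is illustrative only, the function name is hypothetical, and it assumes that one kilobyte is 1024 bytes and that a document exactly at a limit is still indexed.

```python
def within_size_limits(size_bytes, mindocsize=None, maxdocsize=None):
    """Return True if a document of size_bytes passes both limits (in KB)."""
    size_kb = size_bytes / 1024  # assumption: 1 kilobyte = 1024 bytes
    if mindocsize is not None and size_kb < mindocsize:
        return False  # smaller than -mindocsize: ignored
    if maxdocsize is not None and size_kb > maxdocsize:
        return False  # larger than -maxdocsize: ignored
    return True       # no limit set, or inside the window

print(within_size_limits(512, mindocsize=1))
print(within_size_limits(2048, mindocsize=1, maxdocsize=10))
print(within_size_limits(20 * 1024, maxdocsize=10))
```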

-skip

Type

Web crawling only

Syntax

-skip HTML_tag "exp"

Specifies that Verity Spider not index any HTML document that contains the text of exp within the given HTML_tag. For multiple HTML_tag and exp combinations, use multiple instances of the -skip option.

You can use wildcard expressions, where the asterisk (*) is for text strings and the question mark (?) is for single characters; for example:

'/my_doc*/year199?'

In Windows, include double-quotation marks around the argument to protect the special characters, such as the asterisk (*). On UNIX, use single-quotation marks. This is only required when you run the indexing job from a command line. Quotation marks are not necessary within a command file (the -cmdfile option).

If you use backslashes, double them so that they are properly escaped; for example:

C:\\test\\docs\\path

To use regular expressions, also specify the -regexp option.

Example 1

To skip all HTML documents that contain the word "personnel" in the Title element, use the following:

-skip title "personnel"

Example 2

To specify two HTML_tag and exp combinations, one for the word "personnel" in the Title element and one for the phrase "internal use" in any paragraph element, use the following:

-skip title "personnel" 
-skip p "*internal use*"

See also

-regexp.