ColdFusion 9.0 Resources
Content options

The Verity Spider content options are:

-casesen

Makes processing case sensitive by specifying that the spider separately process keys that differ only in case. Use only for indexing UNIX servers.

-exclude

Syntax

    -exclude exp_1 [exp_n] ...

Specifies that files, paths, and URLs matching the specified expressions are not followed.

If you use backslashes, double them so that they are properly escaped; for example:

    C:\\test\\docs\\path

You can use wildcard expressions, where the asterisk (*) matches text strings and the question mark (?) matches single characters; for example:

    '/my_doc*/year199?'

In Windows, enclose the argument in double-quotation marks to protect special characters, such as the asterisk (*). On UNIX, use single-quotation marks. Quoting is required only when you run the indexing job from a command line; quotation marks are not necessary within a command file (the -cmdfile option).

To use regular expressions, also specify the -regexp option.

To specify a file, path, or URL that you want followed but not indexed, use the -indexclude option.

For document types, use the -mimeexclude option instead; for example, specify -mimeexclude application/pdf rather than -exclude *.pdf.

Note: When specifying a URL, use full, absolute paths in the same format that appears in the HTML hypertext link. If the link is relative, change it to absolute to use it with the -exclude option.
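The wildcard rules described for -exclude can be tried out with a short Python sketch. It assumes that the Verity asterisk and question mark behave like the shell-style globs implemented by Python's fnmatch module and that a pattern must match the whole path. This documentation confirms neither assumption, and the sample paths are hypothetical.

```python
from fnmatch import fnmatchcase

# Pattern taken from the documentation example.
PATTERN = "/my_doc*/year199?"

def matches(path: str, pattern: str = PATTERN) -> bool:
    """Return True if the whole path matches the glob pattern (case sensitive)."""
    return fnmatchcase(path, pattern)

# Hypothetical paths, used only to illustrate the two wildcards.
samples = [
    "/my_docs/year1998",       # "*" matches "s", "?" matches "8"
    "/my_documents/year1990",  # "*" matches "uments", "?" matches "0"
    "/my_docs/year19981",      # "?" stands for exactly one character
    "/my_docs/year200",        # literal "year199" is not present
]

for path in samples:
    print(path, matches(path))
```

Under these assumptions the first two paths match and the last two do not. Actual Verity Spider matching might differ, so test a pattern against a small collection before relying on it.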
-include

Specifies that only those files, paths, and URLs that match the specified expression or expressions are followed.

If you use backslashes, double them so that they are properly escaped; for example:

    C:\\test\\docs\\path

You can use wildcard expressions, where the asterisk (*) matches text strings and the question mark (?) matches single characters; for example:

    '/my_doc*/year199?'

In Windows, enclose the argument in double-quotation marks to protect special characters, such as the asterisk (*). On UNIX, use single-quotation marks. Quoting is required only when you run the indexing job from a command line; quotation marks are not necessary within a command file (the -cmdfile option).

To use regular expressions, also specify the -regexp option.

If your starting points do not match the specified -include expressions, nothing is indexed, because the -include option prevents Verity Spider from even following anything that does not match. You might want to use the -indinclude option instead: it allows Verity Spider to follow anything while indexing only what matches the specified expressions.

For document types, use the -mimeinclude option instead; for example, specify -mimeinclude text/html rather than -include *.htm.

Note: When specifying a URL, use full, absolute paths in the same format that appears in the HTML hypertext link. If the link is relative, change it to absolute to use it with the -include option.
-indexclude

Syntax

    -indexclude exp_1 [exp_n] ...

Specifies that the files and paths in URLs that match the expressions are not indexed. They are, however, still followed.

If you use backslashes, double them so that they are properly escaped; for example:

    C:\\test\\docs\\path

You can use wildcard expressions, where the asterisk (*) matches text strings and the question mark (?) matches single characters; for example:

    '/my_doc*/year199?'

In Windows, enclose the argument in double-quotation marks to protect special characters, such as the asterisk (*). On UNIX, use single-quotation marks. Quoting is required only when you run the indexing job from a command line; quotation marks are not necessary within a command file (the -cmdfile option).

To use regular expressions, also specify the -regexp option.

Use this option to gather some documents, such as HTML tables of contents, to gain access to other documents for indexing. Where the -exclude option prevents Verity Spider from even following anything that matches the specified expressions, the -indexclude option allows Verity Spider to follow anything and skips indexing only for what matches the specified expressions.

For document types, use the -indmimeexclude option instead.

Note: When specifying a URL, use full, absolute paths in the same format that appears in the HTML hypertext link. If the link is relative, change it to absolute to use it with the -indexclude option.
-indinclude

Syntax

    -indinclude exp_1 [exp_n] ...

Specifies that only those files and paths in URLs that match the expressions are indexed. Documents that do not match are still followed.

If you use backslashes, double them so that they are properly escaped; for example:

    C:\\test\\docs\\path

You can use wildcard expressions, where the asterisk (*) matches text strings and the question mark (?) matches single characters; for example:

    '/my_doc*/year199?'

In Windows, enclose the argument in double-quotation marks to protect special characters, such as the asterisk (*). On UNIX, use single-quotation marks. Quoting is required only when you run the indexing job from a command line; quotation marks are not necessary within a command file (the -cmdfile option).

To use regular expressions, also specify the -regexp option.

Where the -include option prevents Verity Spider from even following anything that does not match the specified expressions, the -indinclude option allows Verity Spider to follow anything while indexing only what matches the specified expressions.

Example

If you want to index all documents that include "search" in the URL at http://web.verity.com, you cannot use the following:

    vspider -collection collname -start http://web.verity.com -include '*search*'

This is because the starting point does not match the -include option criteria. Instead, use the -indinclude option to follow all documents (unless you have specified any of the exclude options) and index only those documents that match your criteria. Replace the -include option with the -indinclude option in the preceding example.

Note: When specifying a URL, use full, absolute paths in the same format that appears in the HTML hypertext link. If the link is relative, change it to absolute to use it with the -indinclude option.
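The difference between -include and -indinclude in this example can be modeled in a few lines of Python. This is a simplified sketch of the behavior described in this section, not Verity Spider's implementation. The classify function and the linked page URL are hypothetical, and wildcard matching is assumed to behave like Python's fnmatch globs.

```python
from fnmatch import fnmatchcase
from typing import Optional, Tuple

def classify(url: str,
             include: Optional[str] = None,
             indinclude: Optional[str] = None) -> Tuple[bool, bool]:
    """Return (followed, indexed) for one URL under a simplified model."""
    # -include: anything that does not match is not even followed.
    if include is not None and not fnmatchcase(url, include):
        return (False, False)
    # -indinclude: everything is followed, but only matches are indexed.
    if indinclude is not None and not fnmatchcase(url, indinclude):
        return (True, False)
    return (True, True)

START = "http://web.verity.com"
PAGE = "http://web.verity.com/search/index.html"  # hypothetical linked page

print(classify(START, include="*search*"))     # (False, False): the crawl never starts
print(classify(START, indinclude="*search*"))  # (True, False): followed, not indexed
print(classify(PAGE, indinclude="*search*"))   # (True, True): followed and indexed
```

In this model the starting point fails the -include test, so no links are ever discovered. With -indinclude the starting point is still parsed for links, and only the matching page is indexed.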
-indmimeexclude

Syntax

    -indmimeexclude mime_1 [mime_n] ...

Specifies that documents whose MIME types match the expressions are followed but not indexed.

In Windows, enclose the argument in double-quotation marks to protect special characters, such as the asterisk (*). On UNIX, use single-quotation marks. Quoting is required only when you run the indexing job from a command line; quotation marks are not necessary within a command file (the -cmdfile option).

Use this option to gather some documents, such as HTML tables of contents, to gain access to other documents for indexing. The -mimeexclude option, on the other hand, prevents the specified documents from being followed at all.

For the MIME variable, you can include the asterisk (*) wildcard for text strings; for example:

    'text/*'

You cannot use the question mark (?) wildcard, and the -regexp option does not let you use regular expressions.

-indmimeinclude

Syntax

    -indmimeinclude mime_1 [mime_n] ...

Specifies that only documents whose MIME types match the expressions are indexed; other documents are still followed. By contrast, the -mimeinclude option does not let you index the desired documents if the starting URL does not match and is therefore not followed.

For the MIME variable, you can include the asterisk (*) wildcard for text strings; for example:

    'text/*'

In Windows, enclose the argument in double-quotation marks to protect the special character (*). On UNIX, use single-quotation marks. Quoting is required only when you run the indexing job from a command line; quotation marks are not necessary within a command file (the -cmdfile option).

You cannot use the question mark (?) wildcard, and the -regexp option does not let you use regular expressions.

Example

If you want to index all Word documents at http://web.verity.com, you cannot use:

    vspider -collection collname -style style_dir -start http://web.verity.com -mimeinclude 'application/msword'

This is because the starting point does not match the -mimeinclude criteria. You can use the -indmimeinclude option to follow all documents (unless you have specified any of the exclude options) and index only those documents that match your criteria. Replace the -mimeinclude option with the -indmimeinclude option in the preceding example.

-indskip

Type

Web crawling only

Specifies that Verity Spider follow and parse links in, but not index, any HTML document that contains the text of exp within the given HTML_tag. For multiple HTML_tag and exp combinations, use multiple instances of the -indskip option.

You can use wildcard expressions, where the asterisk (*) matches text strings and the question mark (?) matches single characters; for example:

    '/my_doc*/year199?'

In Windows, enclose the argument in double-quotation marks to protect special characters, such as the asterisk (*). On UNIX, use single-quotation marks. Quoting is required only when you run the indexing job from a command line; quotation marks are not necessary within a command file (the -cmdfile option).

If you use backslashes, double them so that they are properly escaped; for example:

    C:\\test\\docs\\path

To use regular expressions, also specify the -regexp option.

Example 1

To skip all HTML documents that contain the word "personnel" in the Title element, while still parsing those documents for links to other documents, use the following:

    -indskip title "personnel"

Example 2

To avoid indexing directory listing pages, while still parsing the document and path links except for the link to the parent directory, use one of the following, depending on the web server being indexed:
-metafile

Syntax

    -metafile path_and_filename

Lets you use a text file to map custom meta tags to valid HTTP header fields. This means that you can use your own meta tag, in the document, to replace what is returned by the web server, or to insert a value if nothing is returned.

If you use backslashes, double them so that they are properly escaped; for example:

    C:\\test\\docs\\path

Currently, the only header fields of real value are "Last-Modified" and "Content-Length." Future enhancements could allow for greater variety.

The following is the syntax for entries in the text file:

    name Last-Modified y|n

or

    name Content-Length y|n

Where y|n is an override flag, which can be yes (y) or no (n).

Example

A mapping file for the -metafile option might include the following:

    Doc_Last_Touched Last-Modified n
    Doc_Size Content-Length y

If you use the y override flag, the value for the custom meta tag overrides the value for the valid field, even if both values are present and differ. This can be useful when the valid field value is always sent, but you want to specify your own value with a custom meta tag.

If you use the n override flag, the value for the custom meta tag is used only if the server returns no value for the valid field. If a value for the valid field exists, it is given precedence.

Note: If you have several entries mapping to the same valid field, only the last entry takes effect.
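The override rules can be summarized with a small Python model. This is an illustrative sketch of the behavior described for the -metafile option, not Verity Spider code. The parse_mapping and resolve functions, the sample header values, and the handling of malformed lines and missing meta tags are all assumptions.

```python
from typing import Dict, Optional, Tuple

Mapping = Dict[str, Tuple[str, bool]]

def parse_mapping(text: str) -> Mapping:
    """Map each valid header field to (custom_meta_name, override)."""
    mapping: Mapping = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # assumed: blank or malformed lines are ignored
        name, field, flag = parts
        # Keyed by field, so a later entry for the same field replaces an earlier one.
        mapping[field] = (name, flag.lower() == "y")
    return mapping

def resolve(field: str, mapping: Mapping,
            headers: Dict[str, str], meta: Dict[str, str]) -> Optional[str]:
    """Pick the value used for a header field, given server headers and meta tags."""
    server_value = headers.get(field)
    if field not in mapping:
        return server_value
    name, override = mapping[field]
    meta_value = meta.get(name)
    if meta_value is None:
        return server_value  # assumed: no custom meta tag means no change
    if override or server_value is None:
        return meta_value    # y flag, or nothing returned by the server
    return server_value      # n flag and the server sent a value

MAPPING = parse_mapping("Doc_Last_Touched Last-Modified n\nDoc_Size Content-Length y")
HEADERS = {"Last-Modified": "Mon, 01 Jan 2001 00:00:00 GMT", "Content-Length": "1024"}
META = {"Doc_Last_Touched": "Tue, 02 Jan 2001 00:00:00 GMT", "Doc_Size": "2048"}

print(resolve("Last-Modified", MAPPING, HEADERS, META))   # server value wins (n flag)
print(resolve("Content-Length", MAPPING, HEADERS, META))  # 2048 (y flag overrides)
```

With the example mapping file, the server's Last-Modified value is kept because the n flag defers to it, while the custom Doc_Size value replaces Content-Length.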
-mimeexclude

Syntax

    -mimeexclude mime_1 [mime_n] ...

Specifies MIME types that are not followed or indexed.

In Windows, enclose the argument in double-quotation marks to protect special characters, such as the asterisk (*). On UNIX, use single-quotation marks. Quoting is required only when you run the indexing job from a command line; quotation marks are not necessary within a command file (the -cmdfile option).

The default is to include all MIME types. For the MIME variable, you can include the asterisk (*) wildcard for text strings; for example:

    'text/*'

You cannot use the question mark (?) wildcard, and the -regexp option does not let you use regular expressions.

Use the -indmimeexclude option to allow Verity Spider to follow documents, without indexing them, to gain access to other desirable document types.

-mimeinclude

Syntax

    -mimeinclude mime_1 [mime_n] ...

Specifies MIME types to be included.

In Windows, enclose the argument in double-quotation marks to protect special characters, such as the asterisk (*). On UNIX, use single-quotation marks. Quoting is required only when you run the indexing job from a command line; quotation marks are not necessary within a command file (the -cmdfile option).

The default is to include all MIME types. For the MIME variable, you can include the asterisk (*) wildcard for text strings; for example:

    'text/*'

You cannot use the question mark (?) wildcard, and the -regexp option does not let you use regular expressions.

-skip

Syntax

    -skip HTML_tag "exp"

Specifies that Verity Spider not index any HTML document that contains the text of exp within the given HTML_tag. For multiple HTML_tag and exp combinations, use multiple instances of the -skip option.

You can use wildcard expressions, where the asterisk (*) matches text strings and the question mark (?) matches single characters; for example:

    '/my_doc*/year199?'

In Windows, enclose the argument in double-quotation marks to protect special characters, such as the asterisk (*). On UNIX, use single-quotation marks. Quoting is required only when you run the indexing job from a command line; quotation marks are not necessary within a command file (the -cmdfile option).

If you use backslashes, double them so that they are properly escaped; for example:

    C:\\test\\docs\\path

To use regular expressions, also specify the -regexp option.

Example 1

To skip all HTML documents that contain the word "personnel" in the Title element, use the following:

    -skip title "personnel"
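To make the -skip test concrete, the following Python sketch checks whether a given element of an HTML document contains an expression. It is only a model of the rule described above. The TagText class, the should_skip function, and the sample document are hypothetical, the match is assumed to be a plain case-sensitive substring test, and wildcards and the -regexp option are not modeled.

```python
from html.parser import HTMLParser
from typing import List

class TagText(HTMLParser):
    """Collect the text found inside every occurrence of one tag."""

    def __init__(self, tag: str) -> None:
        super().__init__()
        self.tag = tag.lower()  # HTMLParser reports tag names in lowercase
        self.depth = 0
        self.chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)

def should_skip(html: str, tag: str, exp: str) -> bool:
    """Return True if the text inside the tag contains exp (assumed substring test)."""
    parser = TagText(tag)
    parser.feed(html)
    parser.close()
    return exp in "".join(parser.chunks)

# Hypothetical document for illustration.
DOC = ("<html><head><title>personnel records</title></head>"
       "<body><p>hello</p></body></html>")

print(should_skip(DOC, "title", "personnel"))  # True
print(should_skip(DOC, "title", "hello"))      # False
```

Here the document would be skipped for -skip title "personnel", because the expression appears in the Title element, but not for an expression that appears only in the body.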