About Verity Spider



Verity Spider enables you to index web-based and file system documents throughout your enterprise. Verity Spider lets you index more than two hundred application document formats, including Microsoft Office, WordPerfect, ASCII text, HTML, SGML, XML, and PDF (Adobe Acrobat) documents.

Another advantage of using Verity Spider is that the index created by the vspider command includes dynamic content. Indexes built with the cfindex tag or through the ColdFusion Administrator do not include dynamic content.

The Verity Spider that is included with ColdFusion is licensed for websites that are defined and reside on the same host on which ColdFusion is installed. Contact Verity Sales for licensing options regarding the use of Verity Spider for external websites.

Web standard support

Verity Spider supports key web standards used by Internet and intranet sites. Standard href links and frame pointers are recognized, so navigation through them is supported. Redirected pages are followed so that the real underlying document is indexed. Verity Spider adheres to the robots exclusion standard specified in robots.txt files, so that administrators can maintain friendly visits to remote websites. The HTTP Basic Authentication mechanism is supported so that password-protected sites can be indexed.
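
To illustrate what adhering to the robots exclusion standard involves, the following Python sketch filters candidate URLs against a robots.txt file by using the standard library's urllib.robotparser module. It is not Verity Spider code; the allowed_urls function, the example-crawler user agent, and the example.com URLs are assumptions made only for this illustration.

```python
from urllib.robotparser import RobotFileParser


def allowed_urls(robots_txt: str, user_agent: str, urls: list[str]) -> list[str]:
    """Return only the URLs that robots_txt permits for user_agent."""
    parser = RobotFileParser()
    # parse() accepts the lines of a robots.txt file that was already fetched.
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(user_agent, url)]


# Hypothetical robots.txt content and candidate URLs.
ROBOTS = """\
User-agent: *
Disallow: /private/
"""

candidates = [
    "http://example.com/index.html",
    "http://example.com/private/report.html",
]

print(allowed_urls(ROBOTS, "example-crawler", candidates))
# prints ['http://example.com/index.html']
```

A crawler that honors the standard skips every URL that this kind of check rejects before it issues a request.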

Restart capability

When an indexing job fails, or for some reason Verity Spider cannot index a significant number or type of URLs, you can restart the indexing job to update the collection. Only those URLs that were not successfully indexed previously are processed.

State maintenance through a persistent store

Verity Spider stores the state of gathered and indexed URLs in a persistent store, which lets it track progress for the purposes of gracefully and efficiently restarting stopped indexing jobs.
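
The general idea of tracking URL state in a persistent store can be sketched as follows. This is a minimal illustration that uses SQLite, not a description of the actual Verity Spider store; the table layout, the state names (gathered, indexed, failed), and the function names are all assumptions.

```python
import sqlite3


def open_store(path: str) -> sqlite3.Connection:
    """Open (or create) a hypothetical URL-state store."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS urls (url TEXT PRIMARY KEY, state TEXT NOT NULL)"
    )
    return conn


def record(conn: sqlite3.Connection, url: str, state: str) -> None:
    """Insert or overwrite the state of a URL and persist it immediately."""
    conn.execute("INSERT OR REPLACE INTO urls (url, state) VALUES (?, ?)", (url, state))
    conn.commit()


def pending(conn: sqlite3.Connection) -> list[str]:
    """Return the URLs a restarted job would still need to process."""
    rows = conn.execute("SELECT url FROM urls WHERE state != 'indexed' ORDER BY url")
    return [row[0] for row in rows]


# ":memory:" keeps the demonstration self-contained; a real store would use a file path.
store = open_store(":memory:")
record(store, "http://example.com/a.html", "indexed")
record(store, "http://example.com/b.html", "gathered")
record(store, "http://example.com/c.html", "failed")

print(pending(store))
# prints ['http://example.com/b.html', 'http://example.com/c.html']
```

Because each state change is committed as it happens, a job that stops unexpectedly can resume from the store and process only the URLs that were not yet indexed.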

Performance

Verity Spider performs considerably better than previous versions because of low memory requirements, flow control, multi-threading, and efficient Domain Name System (DNS) lookups.

Flow control

When indexing websites, Verity Spider distributes requests to web servers in a round-robin manner: one URL is fetched from each web server in turn. With flow control, a faster website can finish before a slower one. Verity Spider optimizes indexing on every web server.
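
Round-robin distribution can be pictured with the short Python sketch below, which takes one URL from each host's queue in turn. It only illustrates the ordering; the round_robin function and the host and URL names are hypothetical, and the real scheduler is not documented here.

```python
from collections import deque


def round_robin(queues: dict[str, list[str]]) -> list[str]:
    """Return the fetch order when one URL is taken from each host in turn."""
    # Hosts with nothing queued are skipped up front.
    pending = deque((host, deque(urls)) for host, urls in queues.items() if urls)
    order = []
    while pending:
        host, urls = pending.popleft()
        order.append(urls.popleft())
        if urls:
            # The host still has work, so it rejoins the back of the rotation.
            pending.append((host, urls))
    return order


# Hypothetical per-host queues.
queues = {
    "a.example": ["a1", "a2", "a3"],
    "b.example": ["b1"],
}

print(round_robin(queues))
# prints ['a1', 'b1', 'a2', 'a3']
```

A host whose queue empties simply drops out of the rotation, so the remaining hosts are not held back by it.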

Verity Spider adjusts the number of connections per server depending on the download bandwidth. When the download bandwidth from a web server falls below a certain value, Verity Spider automatically scales back the number of connections to that web server. There is always at least one connection to a web server. When the download bandwidth increases to an acceptable level, Verity Spider reallocates connections (per the value of the -connections option, which is 4 by default). You can turn off flow control with the -noflowctrl option.
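
The scaling behavior described above can be approximated with the following sketch. The threshold value, the step size of one connection, and the adjust_connections function are assumptions for illustration; only the floor of one connection and the default ceiling of 4 (the -connections default) come from the description above.

```python
def adjust_connections(
    current: int,
    bandwidth: float,
    threshold: float,
    max_connections: int = 4,  # mirrors the default of the -connections option
) -> int:
    """Return the new connection count for one web server.

    bandwidth and threshold use the same (hypothetical) unit, such as KB/s.
    """
    if bandwidth < threshold:
        # Scale back, but always keep at least one connection open.
        return max(1, current - 1)
    # Bandwidth is acceptable again: reallocate up to the configured limit.
    return min(max_connections, current + 1)


print(adjust_connections(current=4, bandwidth=10.0, threshold=50.0))  # prints 3
print(adjust_connections(current=1, bandwidth=10.0, threshold=50.0))  # prints 1
print(adjust_connections(current=3, bandwidth=80.0, threshold=50.0))  # prints 4
```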

Multi-threading

Verity Spider separates the gathering and indexing jobs into multiple threads for concurrency. Additionally, Verity Spider can create concurrent connections to web servers for fetching documents, and run concurrent indexing threads for maximum utilization. This translates to an overall improvement in throughput.
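
As a rough illustration of separating gathering from indexing, the Python sketch below runs one gathering thread and several indexing threads that communicate through a queue. It is a generic producer-consumer pattern, not the Verity Spider implementation; run_pipeline, fetch, index, and the example URLs are hypothetical names.

```python
import queue
import threading


def run_pipeline(urls, fetch, index, indexer_count=2):
    """Gather documents in one thread and index them in indexer_count threads."""
    docs = queue.Queue()
    indexed = []
    lock = threading.Lock()

    def gather():
        for url in urls:
            docs.put((url, fetch(url)))
        # One sentinel per indexing thread signals that gathering is finished.
        for _ in range(indexer_count):
            docs.put(None)

    def indexer():
        while True:
            item = docs.get()
            if item is None:
                break
            url, body = item
            entry = index(url, body)
            with lock:
                indexed.append(entry)

    threads = [threading.Thread(target=gather)]
    threads += [threading.Thread(target=indexer) for _ in range(indexer_count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    # Sort so that the result does not depend on thread scheduling.
    return sorted(indexed)


# Stand-ins for real fetching and indexing work.
urls = ["http://example.com/a", "http://example.com/b", "http://example.com/c"]
result = run_pipeline(
    urls,
    fetch=lambda url: "body of " + url,
    index=lambda url, body: (url, len(body)),
)
print(result)
```

The queue decouples the two stages, so slow downloads do not stall indexing of documents that have already been gathered.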

Efficient DNS lookups

Verity Spider minimizes DNS lookups, which greatly improves throughput. If lookups are limited by domain or host, no DNS lookups are made for hosts that fall outside that range. In earlier versions, DNS lookups were made for all candidate URLs.
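
The effect of limiting lookups by domain can be sketched as follows: hosts outside the allowed domain are discarded before any resolver call is made. The resolve_in_scope function, the per-host cache, and the stub resolver are assumptions for this illustration; a real job would pass socket.gethostbyname (the default here) instead of the stub.

```python
import socket
from urllib.parse import urlsplit


def resolve_in_scope(urls, allowed_domain, resolver=socket.gethostbyname):
    """Resolve only hosts inside allowed_domain, at most once per host."""
    cache = {}
    for url in urls:
        host = urlsplit(url).hostname
        in_scope = host is not None and (
            host == allowed_domain or host.endswith("." + allowed_domain)
        )
        if not in_scope:
            continue  # out of range: no DNS lookup is made
        if host not in cache:
            cache[host] = resolver(host)
    return cache


# Stub resolver that records calls, so the example needs no network access.
calls = []


def fake_resolver(host):
    calls.append(host)
    return "192.0.2.1"  # address from the documentation-only TEST-NET-1 range


candidates = [
    "http://www.example.com/a.html",
    "http://www.example.com/b.html",
    "http://other.test/c.html",
]

print(resolve_in_scope(candidates, "example.com", resolver=fake_resolver))
# prints {'www.example.com': '192.0.2.1'}
print(calls)
# prints ['www.example.com']
```

Three candidate URLs lead to a single resolver call: the out-of-range host is never looked up, and the in-range host is looked up only once.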

Proxy handling efficiency

To allow for greater flexibility when dealing with indexing jobs that involve proxy servers and firewalls, use the following options:

-noproxy
To reduce proxy checking for certain hosts

-proxyauth
To authenticate on proxy servers