About Verity Spider
Verity Spider enables you to index web-based and file system documents throughout
your enterprise. Verity Spider lets you index more than two hundred application
document formats, including Microsoft Office, WordPerfect, ASCII text,
HTML, SGML, XML, and PDF (Adobe Acrobat) documents.
Another advantage of indexing with the vspider command is that the index
it creates includes dynamic content. Indexes created with the cfindex tag
or through the ColdFusion Administrator do not include dynamic content.
The Verity Spider that is included with ColdFusion is licensed
for websites that are defined and reside on the same host on which
ColdFusion is installed. Contact Verity Sales for licensing options
regarding the use of Verity Spider for external websites.
Web standard support
Verity Spider supports key web standards used by Internet and intranet
sites. It recognizes standard href links and frame pointers, so it can
navigate through them. It follows redirected pages so that the real
underlying document is indexed. Verity Spider adheres to the robots
exclusion standard specified in robots.txt files, so that administrators
can maintain friendly visits to remote websites. The HTTP Basic
Authentication mechanism is supported so that password-protected sites
can be indexed.
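To illustrate the two standards mentioned above, the following Python sketch checks a URL against robots.txt rules and builds an HTTP Basic Authentication header. It is not Verity Spider code. The robots.txt content, the user agent name, and the helper function names are assumptions made for illustration.

```python
import base64
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download it from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def allowed_by_robots(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the robots.txt rules let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def basic_auth_header(username: str, password: str) -> str:
    """Build the value of an HTTP Basic Authentication 'Authorization' header."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8"))
    return "Basic " + token.decode("ascii")
```

A crawler that respects the exclusion standard would call a check like allowed_by_robots before each request and skip any URL for which it returns False.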
Restart capability
When an indexing job fails, or for some reason Verity Spider cannot index
a significant number or type of URLs, you can restart the indexing
job to update the collection. Only those URLs that were not successfully
indexed previously are processed.
State maintenance through a persistent store
Verity Spider stores the state of gathered and indexed URLs in a persistent
store. The store lets it track progress so that it can restart stopped
indexing jobs gracefully and efficiently.
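The restart behavior can be pictured with a small persistent table of URL states. The following Python sketch uses SQLite as a stand-in store. The table layout, the status names, and the function names are assumptions for illustration and do not describe the format of the Verity Spider persistent store.

```python
import sqlite3


def open_store(path: str) -> sqlite3.Connection:
    """Open (or create) a store that maps each URL to its last known status."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS urls ("
        "url TEXT PRIMARY KEY, status TEXT NOT NULL)"
    )
    conn.commit()
    return conn


def record(conn: sqlite3.Connection, url: str, status: str) -> None:
    """Save the status of a URL, for example 'gathered', 'indexed', or 'failed'."""
    conn.execute(
        "INSERT OR REPLACE INTO urls (url, status) VALUES (?, ?)", (url, status)
    )
    conn.commit()


def pending(conn: sqlite3.Connection) -> list:
    """Return the URLs a restarted job still has to process."""
    rows = conn.execute(
        "SELECT url FROM urls WHERE status != 'indexed' ORDER BY url"
    )
    return [row[0] for row in rows]
```

Because the state survives on disk, a restarted job only needs to ask for the pending URLs instead of gathering everything again.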
Performance
Verity Spider performs much better than previous versions because of its
low memory requirements, flow control, multi-threading, and efficient
Domain Name System (DNS) lookups.
Flow control
When indexing websites, Verity Spider distributes requests to web servers
in a round-robin manner: one URL is fetched from each web server in turn.
With flow control, a faster website can finish before a slower one.
Verity Spider optimizes indexing on every web server.
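The round-robin ordering described above can be sketched in Python as follows. The function name and the sample URLs are assumptions for illustration. The sketch only shows the fetch order, not how Verity Spider schedules requests internally.

```python
from collections import deque
from urllib.parse import urlsplit


def round_robin(urls):
    """Order URLs so that one URL is taken from each web server in turn."""
    queues = {}  # host -> queue of that host's URLs, in first-seen order
    for url in urls:
        host = urlsplit(url).netloc
        queues.setdefault(host, deque()).append(url)

    order = []
    hosts = deque(queues)
    while hosts:
        host = hosts.popleft()
        order.append(queues[host].popleft())
        if queues[host]:
            # This server still has URLs left, so it rejoins the rotation.
            hosts.append(host)
    return order
```

A server with few URLs drops out of the rotation early, while servers with more URLs keep taking turns.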
Verity Spider adjusts the number of connections per server depending
on the download bandwidth. When the download bandwidth from a web
server falls below a certain value, Verity Spider automatically
scales back the number of connections to that web server. There
is always at least one connection to a web server. When the download
bandwidth increases to an acceptable level, Verity Spider reallocates
connections (per the value of the -connections option, which
is 4 by default). You can turn off flow control with the -noflowctrl option.
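One way to picture the connection scaling is a rule that removes a connection when bandwidth is low and restores one when it recovers. In the following Python sketch, the threshold parameter and the one-connection step are assumptions, because the documentation does not state the actual values or algorithm. The default of 4 mirrors the default of the -connections option.

```python
def adjust_connections(current: int, bandwidth: float, threshold: float,
                       max_connections: int = 4) -> int:
    """Return the new connection count for one web server.

    bandwidth and threshold use the same (hypothetical) unit, such as KB/s.
    """
    if bandwidth < threshold:
        # Scale back, but always keep at least one connection open.
        return max(1, current - 1)
    # Bandwidth is acceptable: reallocate up to the configured maximum.
    return min(max_connections, current + 1)
```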
Multi-threading
Verity Spider separates the gathering and indexing jobs into multiple
threads for concurrency. Additionally, Verity Spider can create concurrent
connections to web servers for fetching documents, and run concurrent
indexing threads for maximum utilization. This translates to an
overall improvement in throughput.
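The split between gathering and indexing threads resembles a producer and consumer pipeline. The following Python sketch shows that general pattern with two queues. The function names, the thread counts, and the fetch and index callables are hypothetical, and the sketch makes no claim about how Verity Spider implements its threads.

```python
import queue
import threading


def run_pipeline(urls, fetch, index, gatherers=2, indexers=2):
    """Fetch URLs on gatherer threads and index the results on indexer threads."""
    to_fetch = queue.Queue()
    to_index = queue.Queue()
    indexed = []
    lock = threading.Lock()

    def gather():
        while True:
            url = to_fetch.get()
            if url is None:  # sentinel: no more URLs to gather
                break
            to_index.put((url, fetch(url)))

    def do_index():
        while True:
            item = to_index.get()
            if item is None:  # sentinel: no more documents to index
                break
            result = index(*item)
            with lock:
                indexed.append(result)

    g_threads = [threading.Thread(target=gather) for _ in range(gatherers)]
    i_threads = [threading.Thread(target=do_index) for _ in range(indexers)]
    for thread in g_threads + i_threads:
        thread.start()

    for url in urls:
        to_fetch.put(url)
    for _ in g_threads:
        to_fetch.put(None)
    for thread in g_threads:
        thread.join()

    # All documents are queued now, so the indexers can be told to stop.
    for _ in i_threads:
        to_index.put(None)
    for thread in i_threads:
        thread.join()
    return indexed
```

Because indexing starts as soon as the first document arrives, slow downloads and slow indexing overlap instead of running one after the other.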
Efficient DNS lookups
Verity Spider minimizes DNS lookups, which greatly improves lookup
throughput. If lookups are limited by domain or host, then no DNS lookups
are made on hosts that fall outside that range. In earlier versions,
DNS lookups were made on all candidate URLs.
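The idea of skipping lookups for out-of-range hosts can be sketched in Python as follows. The function name, the allowed_hosts parameter, and the per-host cache are assumptions for illustration. The resolver defaults to the standard socket.gethostbyname call and can be swapped for testing.

```python
import socket
from urllib.parse import urlsplit


def resolve_in_scope(urls, allowed_hosts, resolver=socket.gethostbyname):
    """Resolve only hosts inside the allowed set, and each host only once."""
    cache = {}
    for url in urls:
        host = urlsplit(url).hostname
        if host is None or host not in allowed_hosts:
            continue  # out of range: no DNS lookup is made
        if host not in cache:
            cache[host] = resolver(host)
    return cache
```

With this filter in front of the resolver, candidate URLs on excluded hosts never trigger a network lookup.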
Proxy handling efficiency
To allow for greater flexibility when dealing with indexing jobs that
involve proxy servers and firewalls, use the following options:
- -noproxy: To reduce proxy checking for certain hosts
- -proxyauth: To authenticate on proxy servers