About Verity Spider syntax
Before you create an indexing task for a new collection,
make copies of the relevant default style files to ensure that you
have a set of template style files in a known, stable state.
Running multiple simultaneous Verity Spider jobs can cause performance problems
for searches. This does not mean that you should never run indexing jobs
when users might be searching, because your collections are available
for searching even while indexing jobs are running. To optimize
performance, try staggering your indexing jobs to avoid overloading
your server.
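One way to stagger jobs is to run them one after another rather than in parallel. The following is a minimal sketch, assuming each job is already expressed as an argument list; the run_staggered name, the pause_seconds parameter, and the injectable runner are hypothetical helpers for illustration, not part of Verity Spider.

```python
import subprocess
import time
from typing import Callable, List, Sequence


def run_staggered(
    jobs: Sequence[Sequence[str]],
    runner: Callable = subprocess.run,
    pause_seconds: float = 0.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[int]:
    """Run indexing jobs strictly one at a time and collect their exit codes.

    Hypothetical helper: each job is an argument list such as
    ["vspider", "-collection", ..., "-start", ...].
    """
    return_codes = []
    for index, job in enumerate(jobs):
        # Optionally wait between jobs so the server can recover.
        if index > 0 and pause_seconds > 0:
            sleep(pause_seconds)
        completed = runner(list(job))  # blocks until this job has finished
        return_codes.append(completed.returncode)
    return return_codes
```

Because each call blocks until the job ends, no two Verity Spider processes started by this helper overlap. A scheduler such as cron or Windows Task Scheduler could achieve the same effect with offset start times.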
The Verity Spider command
The vspider executable file, which starts the Verity Spider utility, is located in the platform/bin
directory, as follows:
- Server and multiserver configuration: The vspider.exe (Windows) or vspider (UNIX) file is located
in cf_root/verity/k2/platform/bin (server configuration)
or jrun_root/verity/k2/platform/bin (multiserver configuration),
where platform is _nti40 for Windows, _solaris for Solaris,
or _ilnx21 for Linux.
- J2EE configuration: The vspider.exe (Windows) or vspider (UNIX) file is located
in verity_root/k2/platform/bin, where platform is
_nti40 for Windows, _solaris for Solaris, or _ilnx21 for Linux.
At its most basic level, a Verity Spider command consists of the following:
vspider -initialize -collection coll [options]
Here, -initialize is -start or -refresh (use -refresh when
starting points have changed), -collection is required and names
the target collection for Verity Spider, and [options] can
be any combination of the options described later.
For example:
c:\coldfusion9\verity\k2\_nti40\bin\vspider -common c:\coldfusion9\verity\k2\common -collection c:\new -start http://localhost -indinclude *
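If you launch Verity Spider from a script, it can help to assemble the command as an argument list. The following is a minimal sketch that reproduces the example command; the build_vspider_command function and its parameters are hypothetical names, and only options mentioned in this section are used.

```python
from typing import List, Optional, Sequence


def build_vspider_command(
    executable: str,
    collection: str,
    start_points: Sequence[str],
    common_dir: Optional[str] = None,
    refresh: bool = False,
    extra_options: Sequence[str] = (),
) -> List[str]:
    """Assemble a vspider argument list (hypothetical helper)."""
    command = [executable]
    if common_dir is not None:
        command += ["-common", common_dir]
    # -collection is required: it names the target of the indexing task.
    command += ["-collection", collection]
    if refresh:
        command.append("-refresh")
    # -start may be given several times, once per starting point.
    for point in start_points:
        command += ["-start", point]
    command += list(extra_options)
    return command


command = build_vspider_command(
    executable=r"c:\coldfusion9\verity\k2\_nti40\bin\vspider",
    collection=r"c:\new",
    start_points=["http://localhost"],
    common_dir=r"c:\coldfusion9\verity\k2\common",
    extra_options=["-indinclude", "*"],
)
print(" ".join(command))
```

Passing the list to subprocess.run without shell=True keeps the shell from expanding characters such as * before Verity Spider sees them.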
Dependencies exist for other options, depending on the nature of the indexing task.
The following are some examples:
- To build a new collection, use -style.
- To control how Verity Spider operates, including which documents
it indexes, use the options described in the command-line option reference.
If you do not run the Verity Spider executable from its default installation directory,
include that directory in your path. This is because the Verity
Spider executable depends on other files to run properly.
To use the vspider command on UNIX and Linux, the
directory that contains the libvdk30.so file must be in your LD_LIBRARY_PATH variable.
In the server configuration, this directory is cf_root/verity/k2/platform/bin;
in the multiserver configuration, this directory is jrun_root/servers/cfusion/WEB-INF/cfusion/verity/k2/platform/bin.
For example, in the server configuration on Linux, this directory
is cf_root/verity/k2/_ilnx21/bin.
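When vspider is started from a script on UNIX or Linux, the library path can be set for that process only. The following is a minimal sketch; the env_with_verity_libs name is hypothetical, and /opt/coldfusion9 is an assumed value for cf_root that you would replace with your own installation path.

```python
from typing import Dict, Mapping


def env_with_verity_libs(bin_dir: str, base_env: Mapping[str, str]) -> Dict[str, str]:
    """Return a copy of base_env with bin_dir first in LD_LIBRARY_PATH.

    Hypothetical helper; bin_dir is the directory holding libvdk30.so.
    """
    env = dict(base_env)
    # LD_LIBRARY_PATH entries are separated by ":" on UNIX and Linux.
    parts = [p for p in env.get("LD_LIBRARY_PATH", "").split(":") if p]
    if bin_dir not in parts:
        parts.insert(0, bin_dir)
    env["LD_LIBRARY_PATH"] = ":".join(parts)
    return env


# Assumed cf_root of /opt/coldfusion9 on Linux (server configuration).
example_env = env_with_verity_libs(
    "/opt/coldfusion9/verity/k2/_ilnx21/bin",
    {"LD_LIBRARY_PATH": "/usr/local/lib"},
)
print(example_env["LD_LIBRARY_PATH"])
```

The returned dictionary can be passed as the env argument of subprocess.run, typically after building it from os.environ, so the change does not affect other processes.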
Using a command file
For simpler reuse and archiving of your indexing commands, store a task’s options
in a command file and pass it with the -cmdfile option. By using an ASCII text file to store a task’s options,
you avoid the potential problem of using special characters in an
option’s parameter value. For example, the -processbif option
requires the use of "!*" and therefore any task using that option
must also use the -cmdfile option.
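A command file can be generated from a script so that each task's options are archived alongside it. The following is a minimal sketch that assumes the file holds one option per line followed by its values; this layout is an assumption, not confirmed by this section, so check it against your Verity Spider documentation. The write_command_file name and the newcoll.txt file name are hypothetical.

```python
import tempfile
from pathlib import Path
from typing import Sequence, Tuple


def write_command_file(path: Path, options: Sequence[Tuple[str, Sequence[str]]]) -> str:
    """Write options to an ASCII text file, one option per line (assumed layout)."""
    lines = [" ".join([name, *values]) for name, values in options]
    text = "\n".join(lines) + "\n"
    # encoding="ascii" raises an error on non-ASCII values instead of
    # silently writing a file that is not plain ASCII.
    path.write_text(text, encoding="ascii")
    return text


with tempfile.TemporaryDirectory() as tmp:
    cmdfile = Path(tmp) / "newcoll.txt"  # hypothetical file name
    written = write_command_file(
        cmdfile,
        [
            ("-collection", [r"c:\new"]),
            ("-start", ["http://localhost"]),
            ("-indinclude", ["*"]),
        ],
    )
    read_back = cmdfile.read_text(encoding="ascii")

print(written, end="")
```

The task would then be started with vspider and the -cmdfile option pointing at the generated file.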
Command-line option reference
Verity Spider V 5.0 command-line options are case sensitive.
-start
Specifies a starting point for an indexing job. You can
specify multiple instances, or use multiple values in a single instance.
When you execute an indexing job from a command line, and you
do not use a command file (with the -cmdfile option),
you must URL-escape any special characters in the starting point.
To URL-escape a special character, use "%hex-ASCII-character-number"
in place of the character. For example, use /time%26/ instead of
/time&/. This allows the operating system to properly process
the command string.
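The escaping rule can be applied programmatically. The following is a minimal sketch using Python's standard urllib.parse.quote; the escape_start_point name is hypothetical, and the choice to leave ":" and "/" unescaped is an assumption that suits simple URLs like the one in the example.

```python
from urllib.parse import quote


def escape_start_point(url: str) -> str:
    """Percent-encode special characters in a starting point (hypothetical helper).

    ":" and "/" are kept so the scheme and path separators survive;
    other special characters such as "&" become %hex codes.
    """
    return quote(url, safe=":/")


print(escape_start_point("http://localhost/time&/"))  # http://localhost/time%26/
```

A starting point that is already escaped should not be passed through this function again, because the "%" characters would themselves be encoded as %25.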
If an indexing task stops, you can rerun the task as-is. The
persistent store for the specified collection is read, and only
those candidate URLs that are in the queue but not yet processed
are parsed. Candidate URLs correspond to URLs of the following status,
as reported by vsdb:
cand, used, inse, upda, dele, fail
Repository type | Starting point
Web | The URL or URLs from which Verity Spider is to begin indexing. Use other options, such as the -jumps option, to control how far from the starting point Verity Spider goes.
File | The starting directory or directories in which Verity Spider starts indexing. All subdirectories beneath the starting point are indexed, unless you use the -pathlen option or any of the inclusion or exclusion criteria.
Note: By using the -start option
with the -refresh option, you provide a starting
point for Verity Spider and therefore are not required to also use
one of the following options: -host, -domain, -nofollow,
or -unlimited.
-refresh
Used for updating a collection. Specifies that Verity Spider
process only those documents that qualify, as follows:
- They are new documents in the repository, and they qualify
for indexing under the criteria.
- They exist in the collection and are recorded in the Verity
Spider persistent store with a status of done. If Verity Spider
determines that these indexed documents have been updated in the
repository, then they are retrieved again to be reparsed and reindexed.
The document VdkVgwKey values do not change.
- They are deleted in the collection. If Verity Spider determines
that documents have been deleted from the repository, then they
are also deleted from the persistent store and the collection. The
exception to this rule is when you use the -nooptimize option
with the -refresh option. In this case, any document
deleted from the repository is marked for deletion in the collection. It
is removed from the collection and the persistent store when the
next indexing task is run for the collection.
When you rerun an existing indexing job, Verity Spider automatically
refreshes the collection. If you add or remove any of the starting
points, however, you must manually specify the -refresh option
to refresh existing documents.
Note: You can also use the -start option
to provide a starting point for Verity Spider. If you do not use
the -start option, use at least one of the following
options: -host, -domain, or -nofollow.
For further control, also see the -refreshtime option.
If you do not use any constraint criteria, Verity Spider operates
without limits and can index far more than you intended.
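The rule in the note above can be checked before a refresh job is launched. The following is a minimal sketch that inspects an argument list; the check_refresh_options name is hypothetical, and the check covers only the -start, -host, -domain, and -nofollow rule stated in this section.

```python
from typing import List, Sequence

# Options that can stand in for -start on a refresh, per the note above.
CONSTRAINT_OPTIONS = ("-host", "-domain", "-nofollow")


def check_refresh_options(args: Sequence[str]) -> List[str]:
    """Return a list of problems found in a -refresh argument list (hypothetical helper)."""
    problems = []
    if "-refresh" in args and "-start" not in args:
        if not any(option in args for option in CONSTRAINT_OPTIONS):
            problems.append(
                "-refresh without -start needs at least one of: "
                + ", ".join(CONSTRAINT_OPTIONS)
            )
    return problems


unconstrained = check_refresh_options(["-collection", r"c:\new", "-refresh"])
with_start = check_refresh_options(
    ["-collection", r"c:\new", "-refresh", "-start", "http://localhost"]
)
print(unconstrained)
print(with_start)
```

Option names are matched exactly because Verity Spider options are case sensitive.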