Using the mkvdk utility
The mkvdk utility is an indexing application, provided
with other Verity utilities, that you can use to create and maintain
collections. It is a command-line utility that you can use within
other applications or shell scripts to provide more sophisticated
scheduling and other capabilities.
The mkvdk executable file, which starts the mkvdk utility, is located
in the platform/bin directory. For more information on the
specific location of this directory, see Location of Verity utilities.
Note: To display a list of mkvdk command-line options,
enter the following command: mkvdk -help
The mkvdk utility syntax
The following is the basic syntax of the mkvdk command:
mkvdk -collection path [option] [dockey]
Multiple options and dockeys can be included, as needed. If dockey
is a list of files, it should consist of an at sign (@) followed
by the filename that contains a simple list of files (for example,
@filelist). For more information about the options for the mkvdk
utility, see Getting started with the Verity mkvdk utility.
The following operations occur when you use the mkvdk utility
to create a collection:
- New collection directories are created and the specified style files are copied to the style subdirectory.
- The style file settings are read and the required information is passed to the Verity search engine.
- The gateway is used to open the document files, which are parsed according to the settings in various style files.
- A new partition is created, which includes an index and an attribute table.
- Assist data is generated, which might include a spanning word list.
When problems occur during an operation, the mkvdk utility writes error
messages to the system log file (sysinfo.log). You can direct error and
other messages to the console by using the mkvdk command with the -outlevel
option. You can direct messages to a file of your choice by using
the -loglevel and -logfile options.
The log file contains the following fields:
Date
Time
Level
Code
Component
Description
You can use the log file to view details about what happens during
the collection creation process. Use the mkvdk -loglevel command
and specify the numeric identifier for the message level you want,
as summarized in the following table:
| Type | Number |
|---|---|
| Fatal | 1 |
| Error | 2 |
| Warning | 4 |
| Status | 8 |
| Info | 16 |
| Verbose | 32 |
| Debug | 64 |
To calculate the numeric parameter, add the numbers for the message
types you want to include. The default for both -outlevel and -loglevel is
15, which selects fatal, error, warning, and status messages (1+2+4+8).
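The arithmetic above can be sketched in a few lines of Python. This is only an illustration: the helper name level_mask is hypothetical and is not part of the Verity tools; the numbers are the ones listed in the table.

```python
# Message-type numbers, as listed in the table above.
MESSAGE_LEVELS = {
    "fatal": 1,
    "error": 2,
    "warning": 4,
    "status": 8,
    "info": 16,
    "verbose": 32,
    "debug": 64,
}


def level_mask(*types):
    """Return the numeric value to pass to -outlevel or -loglevel.

    Hypothetical helper. Each number is a distinct power of two, so a
    bitwise OR gives the same result as adding the numbers, and naming a
    type twice does not count it twice.
    """
    mask = 0
    for name in types:
        mask |= MESSAGE_LEVELS[name.lower()]
    return mask


default_mask = level_mask("fatal", "error", "warning", "status")  # 1+2+4+8
debug_mask = level_mask("fatal", "error", "debug")                # 1+2+64
print(default_mask, debug_mask)
```

The first value matches the documented default of 15. The second value, 67, could be passed as -loglevel 67 to log only fatal, error, and debug messages.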
Getting started with the Verity mkvdk utility
The following is the basic mkvdk syntax:
mkvdk -collection path [option] [...] [filespec] [...]
Where:
- Brackets ( [ ] ) indicate optional items.
- An ellipsis (...) indicates repetition of the previous item. Thus, [filespec] [...] indicates an optional series of filespec items.
- filespec represents a document filename or a list of document filenames. If filespec is a list of files, it should consist of an at sign (@) followed by the filename containing the list (for example, @filelist).
The -collection path argument creates or
opens a collection. This argument is required.
The remaining options are described in the sections that follow. All
options must precede the first filespec parameter.
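The syntax rules above can be illustrated with a short Python sketch that writes a file list and assembles an argument list with every option ahead of the first filespec. The helper names, the sample paths, and the one-path-per-line layout of the list file are assumptions made for this illustration; the sketch builds the arguments but does not run mkvdk.

```python
import os
import tempfile


def write_filelist(paths, list_path):
    """Write one document path per line and return the @filelist form.

    The one-path-per-line layout is an assumption about what the
    "simple list of files" looks like.
    """
    with open(list_path, "w", encoding="utf-8") as handle:
        for path in paths:
            handle.write(path + "\n")
    return "@" + list_path


def build_mkvdk_args(collection, options, filespecs):
    """Place all options before the first filespec, as the syntax requires."""
    return ["mkvdk", "-collection", collection, *options, *filespecs]


# Hypothetical paths, for illustration only.
list_file = os.path.join(tempfile.mkdtemp(), "filelist")
spec = write_filelist(["docs/a.txt", "docs/b.txt"], list_file)
args = build_mkvdk_args("/verity/coll", ["-synch"], [spec])
print(args)
```

Keeping the options and the filespecs in separate lists makes it hard to place an option after a filespec by accident.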
Creating a collection
Creating a collection with the mkvdk utility involves setting up a collection
directory structure and inserting documents into this structure. You can
create a collection using the following steps.
1. Set up a collection using the following syntax:
   mkvdk -create -collection collectionname
   Where collectionname is the path to the collection directory. Running this command creates a collection directory that includes style files with configuration information.
2. Insert documents using the following syntax:
   mkvdk -collection collectionname -bulk -insert filespec
   Where filespec is the name of a bulk insert file that specifies which documents to index and insert into the collection.
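As a rough sketch, the two steps could be scripted from Python by building one argument list per command. The function name, the sample collection path, and the bulk file name docs.bif are hypothetical. The commands are only printed here; a real script might pass each list to subprocess.run.

```python
import shlex


def create_collection_commands(collection, bulk_file, style_dir=None):
    """Return the create command and the bulk-insert command, in order.

    Hypothetical helper. The optional style_dir maps to the -style option,
    which the documentation says can be used only with -create.
    """
    create = ["mkvdk", "-create"]
    if style_dir is not None:
        create += ["-style", style_dir]
    create += ["-collection", collection]
    insert = ["mkvdk", "-collection", collection, "-bulk", "-insert", bulk_file]
    return [create, insert]


commands = create_collection_commands("/verity/coll", "docs.bif")
for command in commands:
    # shlex.join quotes arguments so the line can be pasted into a shell.
    print(shlex.join(command))
```

Returning lists rather than a single string avoids quoting problems when a collection path contains spaces.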
Collection setup options
The mkvdk utility provides the following collection setup options:
| Option | Description |
|---|---|
| -create | Creates a collection in the specified collection directory. It creates the directory structure, determines the index contents and sets up the document’s table schema according to the style files used. If the specified collection exists, the mkvdk utility exits rather than overwriting the existing collection. |
| -style dir | Specifies the style directory that contains the style files to use to create a collection. This option can only be used with the -create option. If you do not specify this option when you use the mkvdk utility to create a collection, the mkvdk utility uses the style files in the common/style directory. |
| -description desc | Sets the collection’s description. Enter alphanumeric text, such as “This collection contains electronic mail from ABC Company.” Include the quotation marks. |
| -words | Builds the word list for all partitions in the collection. |
Examples: setting up collections
The following examples show the commands for creating a collection and building
the word list:
- Creating a collection: The following command creates a collection in path_2 using the style files in path_1, and submits and indexes the documents in filespec:
mkvdk -create -style path_1 -collection path_2 filespec
- Building the word list: The following command builds the word list in the collection residing in the path directory:
mkvdk -words -collection path
General processing options
The mkvdk utility provides the following general processing options:
| Option | Description |
|---|---|
| -collection path | Specifies the path of the collection to create or open. This option is required to execute the mkvdk utility. |
| -nolock | Turns off file locking. Locking is on by default. |
| -synch | Performs work immediately. If this option is not used, indexing work is done in the background, as time permits. |
| -about | Shows information about the collection, such as its description and the date when it was last modified. |
| -datapath path | Specifies the datapath to use to find documents that are added to the specified collection. All relative document paths are relative to this setting. If you do not set this option, the mkvdk utility looks for documents next to the collection directory. |
| -topicset path | Creates a topic index for the collection, based on the specified topic set, and stores it in the collection directory. This facilitates quick and efficient searches over the collection data when using topics. |
| -mode mode | Sets the indexing mode. Values are not case sensitive. The following are the valid settings: The default is Generic mode. |
| -common | Specifies the path of the Verity common directory. If you do not use this option, the Verity engine looks for the common directory in the directory containing the mkvdk executable, and then along the executable search path. The executable search path depends on your operating system environment settings. It is the path used by the OS to find the programs you run. |
| -help | Displays the mkvdk utility syntax options. |
| -debug | Runs the mkvdk command in debugging mode. |
| -nooptimize | Prevents optimization by this instance of the mkvdk utility. Using this option turns off the service-level VdkServiceType_Optimize. The service types determine the type of work the Verity engine and its self-administration features execute on a collection. |
| -nohousekeep | Prevents housekeeping by this instance of the mkvdk utility. Housekeeping includes deleting files that are no longer needed. Using this option turns off the service-level VdkServiceType_DBA. (Service types are described under -nooptimize.) |
| -noindex | Prevents indexing by this instance of mkvdk. Documents are not inserted or deleted. Using this option turns off the service-level VdkServiceType_Index. (Service types are described under -nooptimize.) |
| -charmap name | Specifies the name of the character set to which to map all strings for your application. Set this to a character set that your system can display properly. Using the search engine with the English locale, the character set that any version of Windows displays is 8859. This is NOT the name of the character set of documents being indexed; it is only the name of the character set that your display can handle properly. (The character set of the document is set in the style.dft file using the /charmap option.) Valid options are 850 and 8859. The default is no mapping. |
| -locale name | Specifies the name of the Verity locale for the mkvdk utility. The locale name must correspond to the name of an existing locale directory, which must exist in the install_dir/common/locale directory. Valid options are english, deutsch, and francais. The default is english. |
| -datefmt format | Converts a date field value into Verity’s internal data representation. You can use this option with the mkvdk options -extract (for the field extraction feature) and -bulk (for the bulk submit feature). The named format string tells the date parsing routines in what order dates are written when the date string consists only of a sequence of numbers (for example, 03/03/96). Valid options are described in Date format options. The default is MDY. |
| -servlev level | Specifies service level. The specifier, level, is a string consisting of keywords separated by hyphens, such as search-index-optimize. Valid keywords are described in Service-level keyword options. |
Examples: processing documents
The following examples show the commands for processing documents.
Using the default options
By default, the mkvdk command submits and indexes documents
specified in the command, and services the specified collection.
The following command executes the default options:
mkvdk -collection path filespec
Servicing only
The following command performs servicing only. Use this
command to only index submitted documents and service the collection:
mkvdk -collection path
Deleting documents from a collection
The following command deletes documents from a collection:
mkvdk -delete -collection path filespec
Bulk inserting or deleting
The following command specifies bulk insertion of a list
of documents:
mkvdk -collection coll -bulk -insert filespec
Where filespec is the list of files to insert. Since insert is
the default, the following command is equivalent to the preceding
command:
mkvdk -collection coll -bulk filespec
The following command specifies bulk deletion of a list of documents:
mkvdk -collection coll -bulk -delete filespec
Where filespec is the list of files to delete. It can be the
same file used to insert documents; the only difference is that -delete is
specified instead of -insert (or no specification).
Date format options
The Verity engine supports many import date formats, including many
textual date formats, and the numeric date formats listed in the
following table:
| Format variable | Description |
|---|---|
| MDY | Dates written as month-day-year (US format, the default) |
| DMY | Dates written as day-month-year (European format) |
| YMD | Dates written as year-month-day (ISO international format) |
| YDM | Dates written as year-day-month (Swedish format) |
| USA | Dates written in US format (the same as MDY) |
| EUR | Dates written in European format (the same as DMY) |
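The format variable matters because the same numeric string names different days under different field orders. The Python sketch below shows the ambiguity. It is not Verity's parser: the function name, the separator handling, and the rule that maps two-digit years to the 1900s are assumptions made for this illustration.

```python
from datetime import date

# Field order implied by each -datefmt value, per the table above.
FIELD_ORDER = {
    "MDY": ("month", "day", "year"),
    "DMY": ("day", "month", "year"),
    "YMD": ("year", "month", "day"),
    "YDM": ("year", "day", "month"),
    "USA": ("month", "day", "year"),  # same as MDY
    "EUR": ("day", "month", "year"),  # same as DMY
}


def parse_numeric_date(text, fmt="MDY", century=1900):
    """Interpret a numeric date such as 03/04/96 in the given field order.

    Illustrative only: accepting "/" or "-" as separators and adding
    `century` to two-digit years are assumptions, not documented behavior.
    """
    parts = [int(piece) for piece in text.replace("-", "/").split("/")]
    fields = dict(zip(FIELD_ORDER[fmt.upper()], parts))
    if fields["year"] < 100:
        fields["year"] += century
    return date(fields["year"], fields["month"], fields["day"])


print(parse_numeric_date("03/04/96", "MDY"))  # 1996-03-04
print(parse_numeric_date("03/04/96", "DMY"))  # 1996-04-03
```

A string such as 03/03/96 reads the same under MDY and DMY, which is why a mismatched -datefmt setting can go unnoticed until a date like 03/04/96 appears.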
Service-level keyword options
The following table describes the valid keywords for the -servlev option:
| Keyword | Description |
|---|---|
| search | Enables search and retrieval |
| insert | Enables adding and updating documents |
| optimize | Enables opportunistic collection optimization |
| assist | Enables building of word list |
| housekeep | Enables housekeeping of unneeded files |
| delete | Enables document deletion |
| backup | Enables backup |
| purge | Enables background purging |
| repair | Enables collection repair |
| dataprep | Same as search-index-optimize-assist-housekeep |
| index | Same as insert-delete |
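To make the two shorthand keywords concrete, the following Python sketch expands a hyphen-separated -servlev string into the base keywords it enables. The helper is hypothetical and only mirrors the table; it does not call the Verity engine.

```python
# Base keywords from the table above.
BASE_KEYWORDS = {
    "search", "insert", "optimize", "assist", "housekeep",
    "delete", "backup", "purge", "repair",
}
# Shorthand keywords and their expansions, as the table describes them.
ALIASES = {
    "index": ["insert", "delete"],
    "dataprep": ["search", "index", "optimize", "assist", "housekeep"],
}


def expand_service_level(spec):
    """Expand a -servlev string such as search-index-optimize (illustrative)."""
    expanded = []
    pending = spec.lower().split("-")
    while pending:
        keyword = pending.pop(0)
        if keyword in ALIASES:
            # Replace the shorthand with its parts and keep processing.
            pending = ALIASES[keyword] + pending
        elif keyword in BASE_KEYWORDS:
            if keyword not in expanded:
                expanded.append(keyword)
        else:
            raise ValueError(f"unknown service-level keyword: {keyword}")
    return expanded


print(expand_service_level("search-index-optimize"))
print(expand_service_level("dataprep"))
```

Rejecting unknown keywords in a wrapper script is a design choice for catching typos early; how mkvdk itself reacts to an unknown keyword is not stated in this document.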
Message options
The mkvdk utility provides the following messaging options:
| Option | Description |
|---|---|
| -quiet | Displays only fatal and error messages to the console. It overrides the -outlevel setting. For a list of message types, see the table in The mkvdk utility syntax. |
| -outlevel (num) | Indicates which message types to display to the console. Valid values are determined by adding the numbers that correspond to the desired message types. The default value is 15. For more information, see the table in The mkvdk utility syntax. |
| -logfile filename | Saves messages in the specified file. |
| -loglevel (num) | Indicates which message types to route to the optional log file. Valid values are determined by adding together the numbers that correspond to the desired message types. The default value is 15. For more information, see the table in The mkvdk utility syntax. |
Document processing options
The mkvdk utility provides the following document processing options:
| Option | Description |
|---|---|
| -extract | Extracts field values from documents, using the field extraction rules specified in the style.tde file. |
| -insert | Adds documents to the collection. This is the default option for the mkvdk command. |
| -update | Adds documents to the collection by replacing all previous information about the specified documents. |
| -delete | Marks the specified documents as deleted, and makes them unavailable for searches. To actually remove deleted documents from the collection’s internal documents table and word indexes, use the squeeze keyword of the -optimize option (see Squeezing deleted documents). |
| -nosave | Specifies that a work list, which is generated by the mkvdk utility automatically when you use the -extract option, is not saved in the collection directory in a file called worklist (in the Verity bulk submit file format). By default, the mkvdk utility saves the work list in the worklist file. |
| -nosubmit | Specifies that a work list, which is generated by the mkvdk utility automatically when you use the -extract option, is not submitted to the indexing engine and is saved in the collection directory in a file called worklist (in the Verity bulk submit file format). This option allows the mkvdk utility to process field extraction separately from other indexing tasks. |
Bulk submit options
The mkvdk utility provides the following bulk submit options:
| Option | Description |
|---|---|
| -bulk | Interprets filespec as a bulk submit file. You can use this option with the -insert, -update, and -delete options. |
| -offset num | Specifies the offset into a bulk submit file or files. If you specify multiple bulk submit files and use the -offset option, the offset is applied to all the bulk submit files. |
| -numdocs num | Specifies the number of documents to insert or delete from the bulk insert file or files. If you specify multiple bulk insert or delete files and use the -numdocs option, the -numdocs setting is applied to all the bulk insert or delete files. |
| -autodel | Deletes the bulk submit file or files when the bulk submission work has finished. |
Use bulk insert and delete options
Use the bulk submit feature to populate fields. The bulk submit feature
supports the insertion of documents and related field values into
collections.
1. Define the fields in the style.sfl and style.ufl files, as appropriate.
2. Create a bulk submit file that specifies the documents to insert and the field values for each document.
3. Run the mkvdk utility using the -bulk option and specifying the bulk submit file or files.
Collection maintenance options
The mkvdk utility provides the following collection maintenance options:
| Option | Description |
|---|---|
| -backup dir | Backs up the collection into the specified directory. The backup does not include the tde subdirectory. The tde subdirectory is created by and for Topic Document Entry if Topic Document Entry is used to create or maintain the collection. |
| -repair | Repairs the collection. The repair is performed by an API call. |
| -purge | Waits the amount of time specified by the -purgewait option and then deletes all documents in the collection, but not the collection itself. It leaves the collection directory structure intact. To specify a different wait period, use the -purgewait option instead of the -purge option. If you do not use the -purgewait option, the default is 600 seconds. |
| -purgeback | Used with the -purge option, performs a purge in the background. |
| -purgewait sec | Specifies to the -purge option how many seconds to wait. If you do not specify sec, the default is 600. |
| -noservice | Prevents collection servicing, which includes indexing, by this instance of the mkvdk command, performed by an API call. |
| -persist | Services the collection repeatedly, at default intervals of 30 seconds. Use the -sleeptime option to set a different interval. |
| -sleeptime sec | Specifies the interval between service calls when the mkvdk utility is run with the -persist option. |
| -optimize spec | Performs various optimizations on the collection, depending on the value of spec. The specifier, spec, is a string consisting of keywords separated by hyphens, such as maxmerge-squeeze-readonly. For valid keywords, see Optimization keywords. |
| -noexit | Windows only. Causes the I/O window to remain after the program has finished. By default, the window closes and the program exits, so that scripts calling the mkvdk utility do not hang. |
Examples: maintaining collections
The following examples show the commands for maintaining a collection.
Repairing a collection
The following command automatically repairs a collection, or enables it after manual repairs:
mkvdk -repair -collection path
Backing up a collection
The following command backs up a collection to the specified directory:
mkvdk -backup path_1 -collection path_2
Deleting a collection
To delete a collection, use the appropriate command for your operating
system. For example, to remove the collection directory structure and
control files on a UNIX system, use the following command, where
collection_path is the path to the collection directory:
rm -r collection_path
Purging a collection
The following command deletes all documents from a collection,
but does not delete the collection itself:
mkvdk -purge -collection path
Purging a collection in the background
The following command purges the specified collection in
the background:
mkvdk -purge -purgeback -collection path
Specifying persistent service
The following command runs the mkvdk command
as a persistent process, so that servicing is performed repeatedly
after num idle seconds:
mkvdk -persist -sleeptime num -collection path
Deleting a collection
The -purge option deletes all documents in a collection, but does not
delete the collection itself. To delete a collection, use operating system
commands, such as the rm command on UNIX, to remove the collection
directory structure and control files.
Optimization keywords
The following table describes the optimization keywords for the -optimize option:
| Keyword | Description |
|---|---|
| maxclean | Performs the most comprehensive housekeeping possible, and removes out-of-date collection files. Adobe recommends this optimization only when you are preparing an isolated collection for publication. When using this type, if the collection is being searched, files sometimes get deleted too early, which can affect search results. |
| maxmerge | Performs maximal merging on the partitions to create partitions that are as large as possible. This creates partitions that can have up to 64000 documents in them. |
| readonly | Marks the collection as read-only and unchanged after the function call is done. This is appropriate for CD-ROM collections. |
| spanword | Creates a spanning word list across all the collection’s partitions. A collection consists of numerous smaller units, called partitions, each of which includes a word list. Optionally, a spanning word list can be built with an ngram index. |
| ngramindex | Builds an ngram index for the collection. An ngram index is designed to improve the search performance for queries with the <TYPO> and <WILDCARD> operators. An ngram index cannot be built without a spanning word list. You can build a spanning word list and ngram index in the same command, for example: mkvdk -collection collname -optimize spanword-ngramindex |
| squeeze | Squeezes deleted documents from the collection. Squeezing deleted documents recovers space in a collection, and improves search performance. (For more information about squeeze, see Squeezing deleted documents.) Using this option can invalidate search results that users are reviewing when the squeeze occurs. |
| vdbopt | Configures the collection’s Verity databases (VDBs). Each collection consists of smaller units called VDBs. This keyword has the effect of linearizing the data in a VDB, and making the collection metadata contained in the VDB more streamlined. It also lets the VDB grow to a much larger size. |
| tuneup | Performs the same as combining the maxmerge, vdbopt, and spanword keywords. |
| publish | Performs the same as all of the optimization types combined. Use this keyword to optimize the collection for the best possible retrieval performance, such as for publication to a network on a server or on a CD‑ROM. |
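A wrapper script could assemble the spec string for -optimize from the keywords in the table, joining them with hyphens as the -optimize description specifies. The Python sketch below is a hypothetical helper, not part of mkvdk. It also enforces the rule that an ngram index needs a spanning word list. Treating tuneup and publish as satisfying that rule is an assumption based on the table's statement that they include spanword.

```python
OPTIMIZE_KEYWORDS = {
    "maxclean", "maxmerge", "readonly", "spanword", "ngramindex",
    "squeeze", "vdbopt", "tuneup", "publish",
}
# Keywords assumed to produce a spanning word list (see lead-in).
PROVIDES_SPANWORD = {"spanword", "tuneup", "publish"}


def build_optimize_spec(*keywords):
    """Join optimization keywords with hyphens for -optimize (illustrative)."""
    chosen = []
    for keyword in keywords:
        keyword = keyword.lower()
        if keyword not in OPTIMIZE_KEYWORDS:
            raise ValueError(f"unknown optimization keyword: {keyword}")
        if keyword not in chosen:
            chosen.append(keyword)
    if "ngramindex" in chosen and not PROVIDES_SPANWORD & set(chosen):
        raise ValueError("ngramindex requires a spanning word list; add spanword")
    return "-".join(chosen)


print(build_optimize_spec("maxmerge", "squeeze", "readonly"))
print(build_optimize_spec("spanword", "ngramindex"))
```

The first call prints maxmerge-squeeze-readonly, the example spec given for the -optimize option.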
Squeezing deleted documents
When a document is deleted from a collection, its space is not recovered.
It is merely marked as deleted and not available for subsequent
searches. Squeezing actually removes deleted documents from the
collection’s internal documents table and word indexes, thus creating
a smaller collection and reducing the collection’s disk space. A
smaller collection has a more efficient structure that makes searching
slightly faster and uses slightly less memory.
You can safely squeeze deleted documents for a collection at
any time, because the mkvdk utility ensures that the collection is
available for searching and servicing through its self-administration
features. The application does not need to temporarily disable a
collection to squeeze deleted documents, because when a squeeze
request is made, the mkvdk utility assigns a new revision code to
the collection. After a squeeze has occurred, the next time the
application accesses the collection, the Verity engine notifies
the application that dramatic changes have been made, and points
the application to the new collection data.
Squeezing deleted documents out of a collection is a significant
update to the collection. If users are reviewing search results
at the time when squeezing occurs, the search results might be invalidated
after the squeeze operation.
Optimized Verity databases
The Verity database (VDB) is the fundamental storage mechanism responsible
for supporting dynamic access to documents in collections. A VDB
consists of simple tables with rows and columns that relate to each
other by row position. VDB tables are not relational, and their
architecture supports quick and efficient searching over textual
data. A VDB consists of segments that are packed into a single file.
One of the advantages of having one packed VDB file is optimized search
performance. The fewer files that need to be opened during search processing,
the faster the search performance.
The VDB optimization option optimizes the packing of a collection’s
VDBs. When VDBs are built during normal indexing operations, the
segments are not stored sequentially in the one-file VDB file system.
As a result of VDB optimization, performance can be improved by
reserializing the packed segments in the VDBs so that all segments
are contiguous, and VDBs can grow in size. Optimized VDBs can grow
up to 2 GB, as opposed to the maximum 64 MB for an unoptimized VDB.
Using this option might degrade your indexing performance when
certain indexing modes are set for the collection.
Performance tuning options
The mkvdk utility provides the following performance tuning options:
| Option | Description |
|---|---|
| -maxfiles num | Sets the maximum number of files that the mkvdk utility can have open at once. The default is 50. |
| -diskcache num | Sets the size of the mkvdk disk cache in kilobytes. The default is 128. |