Refining your searches with zones and fields



One of the strengths of Verity is its ability to perform full-text searches on documents of many formats. However, sometimes you want to restrict a search to certain portions of a document, to improve search relevance. If a Verity collection contains some documents about baseball and other documents about caves, a search for the word bat can retrieve several irrelevant results.

If the documents are structured documents, you can take advantage of the ability to search zones and fields. The following are some examples of structured documents:

  • Documents created with markup languages (XML, SGML, HTML)

  • Internet Message Format documents

  • Documents created by many word-processing applications

Note: Although your word processor opens with what appears to be a blank page, the document has many regions such as title, subject, and author. Refer to the documentation of your application or online help system for how to view a document’s properties.

Zone searches

You can perform zone searches on markup language documents. The Verity zone filter includes built-in support for HTML and several file formats; for a list of supported file formats, see Building a Search Interface. Verity searches XML files by treating the XML tags as zones. When you use the zone filter, the Verity engine builds zone information into the collection’s full-word index. This index, enhanced with zone information, permits quick and efficient searches over zones. The zone filter can automatically define a zone, or you can define it yourself in the style.zon file. You can use zone searching to limit your search to a particular zone. This can produce more accurate, but not necessarily faster, search results than searching an entire file.

Note: The contents of a zone cannot be returned in the results list of an application.

Examples

The following examples perform zone searching on XML files. In a list of rock bands, you could have XML files with tags for the instruments and for comments. In the following XML file, the word Pete appears only inside a comment tag:

<band.xml> 
        <Lead_Guitar>Dan</Lead_Guitar> 
        <Rhythm_Guitar>Jake</Rhythm_Guitar> 
        <Bass_Guitar>Mike</Bass_Guitar> 
        <Drums>Chris</Drums> 
        <COMMENT_A>Dan plays guitar, better than Pete.</COMMENT_A> 
        <COMMENT_B>Jake plays rhythm guitar.</COMMENT_B> 
</band.xml>

The following CFML code shows a search for the word Pete:

<cfsearch name = "band_search" 
    collection = "my_collection" 
    type = "simple" 
    criteria = "Pete">

The preceding search for Pete returns this XML file because the word appears in the COMMENT_A zone, even though Pete is not a member of the band. In contrast, Pete is the lead guitarist in the following XML file:

<band.xml> 
        <Lead_Guitar>Pete</Lead_Guitar> 
        <Rhythm_Guitar>Roger</Rhythm_Guitar> 
        <Bass_Guitar>John</Bass_Guitar> 
        <Drums>Kenny</Drums> 
        <COMMENT_A>Who knows who's better than this band?</COMMENT_A> 
        <COMMENT_B>Ticket prices correlated with decibels.</COMMENT_B> 
</band.xml>

To retrieve only the files in which Pete is the lead guitarist, perform a zone search using the IN operator according to the following syntax:

(query) <IN> (zone1, zone2, ...)

Note: As with other operators, IN can be uppercase or lowercase. Unlike AND, OR, and NOT, IN must be enclosed in angle brackets.

Thus, the following explicit search retrieves files in which Pete is the lead guitarist:

(Pete) <in> Lead_Guitar

This is expressed in CFML as follows:

<cfsearch name = "band_search" 
    collection="my_collection"  
    type = "explicit" 
    criteria="(Pete) <in> Lead_Guitar">
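The search term and zone name do not have to be hard-coded in the criteria attribute. The following is a minimal sketch, assuming two hypothetical variables, searchTerm and zoneName, that your application sets elsewhere (for example, from validated form input); it only assembles the same explicit query string shown above.

```cfml
<!--- searchTerm and zoneName are hypothetical names used for illustration. --->
<cfset searchTerm = "Pete">
<cfset zoneName = "Lead_Guitar">

<!--- Build the explicit zone query: (Pete) <in> Lead_Guitar --->
<cfset zoneCriteria = "(#searchTerm#) <in> #zoneName#">

<cfsearch name = "band_search" 
    collection = "my_collection" 
    type = "explicit" 
    criteria = "#zoneCriteria#">
```

Keeping the query string in a variable makes it easy to display or log the exact criteria that were sent to Verity when a search returns unexpected results.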

To retrieve files in which Pete plays either lead or rhythm guitar, use the following explicit search:

(Pete) <in> (Lead_Guitar,Rhythm_Guitar)

This is expressed in CFML as follows:

<cfsearch name = "band_search" 
    collection="my_collection"  
    type = "explicit" 
    criteria="(Pete) <in> (Lead_Guitar,Rhythm_Guitar)">
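A zone search returns an ordinary query object, so you can list the matching files with cfoutput. The following is a minimal sketch that repeats the two-zone search and relies only on the KEY and SCORE columns that cfsearch returns; what KEY contains depends on how the collection was indexed, which is an assumption here.

```cfml
<!--- Find files in which Pete plays lead or rhythm guitar. --->
<cfsearch name = "band_search" 
    collection = "my_collection" 
    type = "explicit" 
    criteria = "(Pete) <in> (Lead_Guitar,Rhythm_Guitar)">

<!--- Report how many documents matched. --->
<cfoutput>Found #band_search.recordcount# matching file(s).</cfoutput>

<!--- List each match. Only document-level columns are available here;
      the text of the zone itself is not returned. --->
<ul>
<cfoutput query = "band_search">
    <li>#KEY# (score: #SCORE#)</li>
</cfoutput>
</ul>
```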

Field searches

Fields are extracted from the document and stored in the collection for retrieval and searching, and can be returned on a results list. Zones, on the other hand, are merely the definitions of “regions” of a document for searching purposes, and are not physically extracted from the document in the same way that fields are extracted.

You must define a region of text as a zone before it can be a field. Therefore, a region of text can be a zone only, or it can be both a zone and a field. Which of the two you choose depends on your requirements: define a field as well as a zone if you need to return the text on a results list.

A field must be defined in the style file, style.ufl, before you create the collection. To map zones to fields (to display field data), define and add these extra fields to style.ufl.

You can specify the values for the cfindex attributes TITLE, KEY, and URL as document fields for use with relational operators in the criteria attribute. (A cfsearch returns the SCORE and SUMMARY values automatically; these values differ for each record of a collection and change as the search criteria change.) Text comparison operators can reference the following document fields:

  • cf_title

  • cf_key

  • cf_url

  • cf_custom1

  • cf_custom2

  • cf_custom3

  • cf_custom4

Text comparison operators can also reference the following automatically populated document fields:

  • title

  • key

  • url

  • vdksummary

  • author

  • mime-type

To explore how to use document fields to refine a search, consider the following database table, named Calls. This table has four columns and three records, as the following table shows:

call_ID  Problem_Description                                            Short_Description   Product
-------  -------------------------------------------------------------  ------------------  ----------
1        Can’t bold text properly under certain conditions              Bold Problem        HomeSite+
2        Certain optional attributes are acting as required attributes  Attributes Problem  ColdFusion
3        Can’t do a File/Open in certain cases                          File Open Problem   HomeSite+

A Verity search for the word certain returns all three records. However, you can use the document fields to restrict your search; for example, you can retrieve only the HomeSite+ problems that have the word certain in the problem description.

These are the requirements to run this procedure:

  • Create and populate the Calls table in a database of your choice.

  • Create a collection named Training (you can do this in CFML or in the ColdFusion Administrator).
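If you create the collection in CFML rather than in the ColdFusion Administrator, the cfcollection tag can do it. The following is a minimal sketch; the path value is a hypothetical example and must be replaced with the collections directory of your own server.

```cfml
<!--- Create the Training collection used in this procedure.
      The path below is a hypothetical example, not a required location. --->
<cfcollection 
    action = "create" 
    collection = "Training" 
    path = "C:\CFusionMX7\verity\collections\" 
    language = "English">
```

Run this code once; creating a collection that already exists results in an error.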

The following table shows the relationship between the database column and cfindex attribute:

Database column      cfindex attribute  Comment
-------------------  -----------------  -------------------------------------------------------------
call_ID              key                The primary key of a database table is often a key attribute.
Problem_Description  body               This column is the information to be indexed.
Short_Description    title              A short description is conceptually equivalent to a title, as in a running title of a journal article.
Product              custom1            This field refines the search.

You begin by selecting all data in a query:

<cfquery name = "Calls" datasource = "MyDSN"> 
    Select * from Calls 
</cfquery>

The following code shows the cfindex tag for indexing the collection (the type attribute is set to custom for tabular data):

<cfindex 
    query = "Calls" 
    collection = "training" 
    action = "UPDATE" 
    type = "CUSTOM" 
    title = "Short_Description" 
    key = "Call_ID" 
    body = "Problem_Description" 
    custom1 = "Product">

To perform the refined search for HomeSite+ problems with the word certain in the problem description, the cfsearch tag uses the CONTAINS operator in its criteria attribute:

<cfsearch 
    collection = "training" 
    name = "search_calls" 
    criteria = "certain and CF_CUSTOM1 <CONTAINS> HomeSite">
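To see the effect of the field restriction, you can run the search with and without it and compare the record counts. The following is a minimal sketch; the query names all_certain and homesite_certain are hypothetical and chosen only for this comparison.

```cfml
<!--- Unrestricted search: every record that contains the word certain. --->
<cfsearch 
    collection = "training" 
    name = "all_certain" 
    criteria = "certain">

<!--- Restricted search: the same word, but only for HomeSite+ records. --->
<cfsearch 
    collection = "training" 
    name = "homesite_certain" 
    criteria = "certain and CF_CUSTOM1 <CONTAINS> HomeSite">

<cfoutput>
    Unrestricted search: #all_certain.recordcount# record(s)<br>
    HomeSite+ only: #homesite_certain.recordcount# record(s)
</cfoutput>
```

With the three sample records in the Calls table, the restricted search should return fewer records than the unrestricted one, because call 2 belongs to a different product.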

The following code displays the results of the refined search:

<table border="1" cellspacing="5"> 
<tr> 
    <th align="LEFT">KEY</th> 
    <th align="LEFT">TITLE</th> 
    <th align="LEFT">CUSTOM1</th> 
</tr> 
 
<cfoutput query = "search_calls"> 
<tr> 
    <td>#KEY#</td> 
    <td>#TITLE#</td> 
    <td>#CUSTOM1#</td> 
</tr> 
</cfoutput> 
</table>
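
Because cfsearch also returns the SCORE and SUMMARY values for each record, you can include them in the results display. The following is a minimal sketch that assumes the search_calls query from the preceding cfsearch tag is still in scope.

```cfml
<!--- Show each matching call with its relevance score and summary. --->
<cfoutput query = "search_calls">
    <p>
        <b>#TITLE#</b> (call #KEY#, score #SCORE#)<br>
        #SUMMARY#
    </p>
</cfoutput>
```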