Field Indexing

For every Content, Field values can be indexed so that a search for a value returns the corresponding Content in the result set. It is also possible to search within a specific Field by explicitly naming that Field in the query. The way a specific Field of a Content is indexed is defined in the Content Type Definition.

It is possible to switch off indexing for certain fields or content types. In that case nobody will be able to find the instances of those content types using Content Query, but the index will be smaller. For more details, see the Index description in the Content Type Definition article.

The portal uses the Lucene search engine by default to index the Content Repository and to return query results quickly. Apart from the indexing of some basic built-in properties, every Field can be configured to be indexed separately.

Indexing and storing

There are two ways to put Field data in the index: indexing and storing. Indexing means that an analyzer processes the Field data and breaks it into terms; the Content ID is then stored under each corresponding term, making it possible to search for a term and get the Content. Storing means that the Field data itself is kept in the index for a Content (for example, the base system stores the content Path in the index for convenience). Indexing and storing are independent of each other: each can be switched on or off regardless of the state of the other.
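
As a hedged illustration of this independence, the sketch below uses the Indexing element described in the Indexing definition section of this article. The Field names are hypothetical, and the `Yes` value of Store is an assumption (the examples in this article only show `No`); `Analyzed` and `No` are the Mode values named in the Mode section.

```xml
<!-- Hypothetical Field: searchable by its terms, but the raw value is not kept in the index -->
<Field name="MySearchableText" type="LongText">
  <Indexing>
    <Mode>Analyzed</Mode>
    <Store>No</Store>
  </Indexing>
</Field>

<!-- Hypothetical Field: not searchable, but the raw value is kept in the index.
     "Yes" is an assumed value for Store. -->
<Field name="MyStoredOnlyText" type="LongText">
  <Indexing>
    <Mode>No</Mode>
    <Store>Yes</Store>
  </Indexing>
</Field>
```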

Analyzers

The goal of an analyzer is to extract all relevant terms from a text, filtering out stop-words and the like. It is important that the same analyzer is used both in the indexing process and when building the query. For example, suppose your document contains the text "Writing Sentences" and your query text is "writing". After analysis, the indexed terms are "writing" and "sentences", and the search term is "writing". This ensures that the original text can be found even if the query word and the word in the original text do not match exactly, character by character.

We use a PerFieldAnalyzerWrapper, which supports a separate analyzer for every Field. Analyzer-Field bindings are defined in the CTD. A Field without an analyzer binding is analyzed with the default analyzer, KeywordAnalyzer.
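
A minimal sketch of what per-Field analyzer binding might look like in a CTD, using the Indexing element described later in this article. The Field names are hypothetical; the analyzer type name is the one used in the examples of this article.

```xml
<!-- Hypothetical Field bound to StandardAnalyzer -->
<Field name="MyDescription" type="LongText">
  <Indexing>
    <Analyzer>Lucene.Net.Analysis.Standard.StandardAnalyzer</Analyzer>
  </Indexing>
</Field>

<!-- Hypothetical Field without an analyzer binding:
     it falls back to the default analyzer, KeywordAnalyzer -->
<Field name="MyCode" type="LongText">
</Field>
```

Under these assumptions, text in MyDescription would be split into terms by StandardAnalyzer, while the value of MyCode would be indexed as a single term, as the KeywordAnalyzer example in this article shows.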

Stop-word dictionary

Some of the built-in analyzers (StandardAnalyzer and StopAnalyzer) use a stop-word dictionary to exclude certain words that will not be indexed as terms. For example when indexing written English texts it is useful not to index the word ‘the’, as it is usually irrelevant in relation to the text content. Besides, searching for ‘the’ would come up with results including Content containing any written English text. The built-in stop-word dictionary contains the following words:

"a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if", "in", "into", "is", "it", "no", "not", "of", 
"on", "or", "such", "that", "the", "their", "then", "there", "these", "they", "this", "to", "was", "will", "with"

Custom stop-word dictionaries are not yet supported.

Indexing definition

Below you can see the skeleton of a Field definition with indexing definition included:

    <Field name="" type="">
      ...
      <Indexing>
         <Mode></Mode>
         <Store></Store>
         <TermVector></TermVector>
         <Analyzer></Analyzer>
         <IndexHandler></IndexHandler>
      </Indexing>
    </Field>

You can configure the indexing mode, storing mode, analyzer, and Field IndexHandler association for every Field. The indexing configuration is an optional XML element named Indexing, placed under the Field element after the Bind element (if defined) and before the Configuration element (if defined). The Indexing element can contain the following sub-elements, in this order: Mode, Store, TermVector, Analyzer, IndexHandler. All of them are optional because each has a default value.
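
The placement and ordering rules can be sketched as follows. This is an illustrative fragment, not a complete CTD: TermVector and IndexHandler are left out so that their defaults apply, and the values shown are taken from the other examples in this article.

```xml
<Field name="MyKeywords" type="LongText">
  <DisplayName>MyKeywords</DisplayName>
  <!-- A Bind element, if present, would come before Indexing -->
  <Indexing>
    <!-- Sub-elements keep the documented order:
         Mode, Store, TermVector, Analyzer, IndexHandler -->
    <Mode>Analyzed</Mode>
    <Store>No</Store>
    <Analyzer>Lucene.Net.Analysis.Standard.StandardAnalyzer</Analyzer>
  </Indexing>
  <!-- A Configuration element, if present, would come after Indexing -->
</Field>
```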

Mode

Indexing mode settings (refer to Lucene indexing documentation). Available values:

This setting is exposed only to make the indexing subsystem easier to configure; the default install uses only the Analyzed and No settings.

Store

The native Field value storage in the index can be switched on or off (refer to Lucene storing documentation). Available values:

Term vector

Term vector settings (refer to Lucene term vector documentation). Available values:

This setting is exposed only to make the indexing subsystem easier to configure; the default install uses only the default setting.

Analyzer

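You can associate any Lucene analyzer with a Field (refer to the Lucene analyzer documentation). The element value is the fully qualified type name of the desired Lucene analyzer. The Examples section of this article demonstrates the following analyzers: Lucene.Net.Analysis.Standard.StandardAnalyzer, Lucene.Net.Analysis.KeywordAnalyzer, Lucene.Net.Analysis.SimpleAnalyzer, Lucene.Net.Analysis.StopAnalyzer and Lucene.Net.Analysis.WhitespaceAnalyzer.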

The built-in standard analyzer is based upon the English language. Please note that when using the system in different language environments it is reasonable to develop a custom analyzer with stop-word dictionary (and optionally a stemmer) specialized for the given language.

Only one analyzer can be bound to a specific Field; that is, this setting cannot be overridden. The analyzer of a Field can be changed only at the topmost level where the Field is defined. To change an analyzer, first re-register the CTD with the analyzer setting omitted:

    <Field name="MyKeywords" type="LongText">
      <DisplayName>MyKeywords</DisplayName>
      <Indexing>
      </Indexing>
    </Field>

After registration you may provide the new analyzer settings and reinstall the CTD:

    <Field name="MyKeywords" type="LongText">
      <DisplayName>MyKeywords</DisplayName>
      <Indexing>
        <Analyzer>Lucene.Net.Analysis.Standard.StandardAnalyzer</Analyzer>
      </Indexing>
    </Field>

Warning! Changing the analyzer of a Field is valid only at development time; it should not be carried out on live portals! After changing an analyzer, the affected Content should be saved and reindexed. A full index repopulation is highly recommended!

IndexHandler

Every content Field has a corresponding FieldIndexHandler that generates the indexable value from the Field’s value. This association is configurable using the IndexHandler element in the CTD. The element value is the fully qualified type name of the desired FieldIndexHandler. The default depends on the Field’s Field Setting; the master default is LowerStringIndexHandler (unless a Field Setting overrides the CreateDefaultIndexFieldHandler method). Available built-in Field index handlers and their usages:
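
For illustration only, an explicit handler binding might look like the fragment below. The Field is reused from the examples of this article. The namespace `SenseNet.Search.Indexing` is an assumption that should be checked against the installed version; only the class name LowerStringIndexHandler comes from this article.

```xml
<Field name="MyKeywords" type="LongText">
  <Indexing>
    <!-- Assumed fully qualified type name; verify the namespace in your installation -->
    <IndexHandler>SenseNet.Search.Indexing.LowerStringIndexHandler</IndexHandler>
  </Indexing>
</Field>
```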

Indexing of built-in properties

The following is a list of the properties that are indexed regardless of Field indexing settings:

For Developers

The indexing of content is carried out in two steps. First, an IndexDocument is created and stored in the database when the content is saved. After that, this IndexDocument data is used to add the analyzed data to the index. This two-step procedure allows fast index creation using the index SnAdmin tool. Please bear in mind, though, that IndexDocuments are not regenerated automatically when the field index configuration changes, so in that case it is necessary to run the index tool with the level DatabaseAndIndex.

It is also possible to re-create the index of a content or a subtree using the following API:

    content.RebuildIndex();

Indexing binaries

Binary fields are special fields that hold the actual content of a file. Indexing these kinds of fields depends on the type of the file (e.g. pdf files need a different algorithm than docx files). For more information about extracting text and the customization possibilities please visit the following article:

Examples

Disabling Field indexing

The following example shows an indexing configuration that disables the indexing of the field:

  <Field name="Versions" type="Reference">
    <Title>Versions</Title>
    <Description>Content version history</Description>
    <Indexing> <!-- Indexing configuration -->
      <Mode>No</Mode>
      <Store>No</Store>
    </Indexing>
    ...
  </Field>

Using StandardAnalyzer

    <Field name="MyKeywords" type="LongText">
      <DisplayName>MyKeywords</DisplayName>
      <Indexing>
        <Analyzer>Lucene.Net.Analysis.Standard.StandardAnalyzer</Analyzer>
      </Indexing>
    </Field>

Enter the following text into the MyKeywords Field:

the testing tesT2 and test3/test4;test5

The following terms will be present in the index:

Fields Text
MyKeywords testing
MyKeywords test2
MyKeywords test3/test4
MyKeywords test5

Different queries will return the following results:

MyKeywords:test
Result count: 0
 
MyKeywords:test*
Result count: 1
 
MyKeywords:testing
Result count: 1
 
MyKeywords:testING
Result count: 1
 
MyKeywords:test2
Result count: 1
 
MyKeywords:test3
Result count: 0
 
MyKeywords:test3*
Result count: 1
 
MyKeywords:test3/test4;test5
Result count: 1
 
MyKeywords:test3/test4;testing
Result count: 0
 
MyKeywords:tested
Result count: 0
 
MyKeywords:"testing tesT2 test3/test4;test5"
Result count: 1

Using KeywordAnalyzer

    <Field name="MyKeywords" type="LongText">
      <DisplayName>MyKeywords</DisplayName>
      <Indexing>
        <Analyzer>Lucene.Net.Analysis.KeywordAnalyzer</Analyzer>
      </Indexing>
    </Field>

Enter the following text into the MyKeywords Field:

the testing tesT2 and test3/test4;test5

The following terms will be present in the index:

Fields Text
MyKeywords the testing tesT2 and test3/test4;test5

Different queries will return the following results:

MyKeywords:the
Result count: 0
 
MyKeywords:the*
Result count: 1
 
MyKeywords:testing
Result count: 0
 
MyKeywords:testING
Result count: 0
 
MyKeywords:test2
Result count: 0
 
MyKeywords:test3
Result count: 0
 
MyKeywords:test3*
Result count: 0
 
MyKeywords:*test3*
Result count: 1
 
MyKeywords:test3/test4;test5
Result count: 0
 
MyKeywords:"the testing tesT2 and test3/test4;test5"
Result count: 1

Using SimpleAnalyzer

    <Field name="MyKeywords" type="LongText">
      <DisplayName>MyKeywords</DisplayName>
      <Indexing>
        <Analyzer>Lucene.Net.Analysis.SimpleAnalyzer</Analyzer>
      </Indexing>
    </Field>

Enter the following text into the MyKeywords Field:

the testing tesT2 and test3/test4;test5

The following terms will be present in the index:

Fields Text
MyKeywords test
MyKeywords testing
MyKeywords the
MyKeywords and

Different queries will return the following results:

MyKeywords:test
Result count: 1
 
MyKeywords:test*
Result count: 1
 
MyKeywords:testing
Result count: 1
 
MyKeywords:testING
Result count: 1
 
MyKeywords:test2
Result count: 1
 
MyKeywords:test4334
Result count: 1
 
MyKeywords:tester
Result count: 0
 
MyKeywords:*test3*
Result count: 0
 
MyKeywords:test3/test4;test5
Result count: 1
 
MyKeywords:"the testing tesT2 and test3/test4;test5"
Result count: 1

Just to make it clearer: enter the following text into the MyKeywords Field:

helo12bye

The following terms will be present in the index:

Fields Text
MyKeywords helo
MyKeywords bye

Using StopAnalyzer

    <Field name="MyKeywords" type="LongText">
      <DisplayName>MyKeywords</DisplayName>
      <Indexing>
        <Analyzer>Lucene.Net.Analysis.StopAnalyzer</Analyzer>
      </Indexing>
    </Field>

Enter the following text into the MyKeywords Field:

the testing tesT2 and test3/test4;test5

The following terms will be present in the index:

Fields Text
MyKeywords testing
MyKeywords test

Different queries will return the following results:

MyKeywords:test
Result count: 1
 
MyKeywords:test*
Result count: 1
 
MyKeywords:testing
Result count: 1
 
MyKeywords:testING
Result count: 1
 
MyKeywords:test2
Result count: 1
 
MyKeywords:test4334
Result count: 1
 
MyKeywords:tester
Result count: 0
 
MyKeywords:*test3*
Result count: 0
 
MyKeywords:test3/test4;test5
Result count: 1
 
MyKeywords:"the testing tesT2 and test3/test4;test5"
Result count: 1

Using WhitespaceAnalyzer

    <Field name="MyKeywords" type="LongText">
      <DisplayName>MyKeywords</DisplayName>
      <Indexing>
        <Analyzer>Lucene.Net.Analysis.WhitespaceAnalyzer</Analyzer>
      </Indexing>
    </Field>

Enter the following text into the MyKeywords Field:

the testing tesT2 and test3/test4;test5

The following terms will be present in the index:

Fields Text
MyKeywords testing
MyKeywords tesT2
MyKeywords test3/test4;test5
MyKeywords the
MyKeywords and

Different queries will return the following results:

MyKeywords:test
Result count: 0
 
MyKeywords:test*
Result count: 1
 
MyKeywords:testing
Result count: 1
 
MyKeywords:testING
Result count: 0
 
MyKeywords:test2
Result count: 0
 
MyKeywords:tesT2
Result count: 1
 
MyKeywords:test3
Result count: 0
 
MyKeywords:test3*
Result count: 1
 
MyKeywords:test3/test4;test5
Result count: 1
 
MyKeywords:the
Result count: 1
