Index Options

Allows you to set indexing options including number and date recognition, security, meta data and title handling.

Shared Index

Allow query users while index is being updated

Activate (or de-activate) this option by selecting the checkbox.

When you choose to allow the users to access the index at the same time as an index update, it will typically degrade performance of both the query and update processes by about 15%. However, this is offset by the benefit of updating the index without having to log off your query users before running the update.

Note: You must not change this option while there are any users using the index. Take the index offline before changing this setting.

Character Encoding

Use this option to select the language that documents to be indexed by Perceptive Enterprise Search are written in. Selecting Korean, Chinese, Hong Kong Chinese or Japanese enables multi-byte character support in Perceptive Enterprise Search. This will not change the language of the documents being indexed.

Unicode Support

Setting this option indexes all characters as Unicode, using the encoding of the document where available (formats that specify encoding include Microsoft Office formats and Adobe Acrobat files). For documents that do not specify their encoding, such as text files, the encoding set in the Character Encoding option above will be used.

This allows a single index to contain documents from multiple languages, e.g. French, Chinese, Russian and English in a single index.

Setting this option will lead to a larger index file.

You must perform a Reindex for any changes made to this option to take effect.

Language Pack

The language pack controls language-specific settings for the index, such as common words or synonyms.

Language Detection

When this option is enabled, Perceptive Enterprise Search will try to determine the language of each document. It will inject a metadata property called ISYS_LANG that contains the two-character language code. You can then use this value to search for documents in a given language, or to refine a result list to a particular language.

Characters

Significant Characters

These are individual characters which convey meaning and are an important part of a word. Although only the numbers 0 through 9 appear in this field, the letters A to Z (in both upper and lower case) and the international character set are also considered significant and are automatically placed in this category. You may want to specify other characters to suit your purposes, for instance the dollar sign ($) if you need to search prices. As an example, if the hyphen were made significant, then post-graduate would be indexed as post-graduate; a query for postgraduate would not return post-graduate, and vice versa.

To add a new significant character enter it into the Significant field. To delete a significant character simply delete it from the field.

You must perform a Reindex for any changes made to significant characters to take effect.

Insignificant Characters

These are individual characters which are part of a word but are not regarded as important. Insignificant characters are treated as if they are invisible. For instance, if the hyphen is made insignificant then words which are hyphenated will be treated as one word, for example post-graduate would be indexed as postgraduate.

The default insignificant characters are the apostrophe ('), underscore (_) and single quote (' '). To add a new insignificant character, enter it in the Insignificant field. To delete an insignificant character, simply delete it from the field.

You must perform a Reindex for any changes made to insignificant characters to take effect.
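The three character classes described above (significant, insignificant and, by implication, everything else acting as a word delimiter) can be illustrated with a minimal sketch. This is not the product's tokenizer; the function name tokenize and its parameters are hypothetical, and the defaults only mirror the defaults mentioned in the text.

```python
def tokenize(text, significant="0123456789", insignificant="'_"):
    """Split text into index terms using three character classes.

    - letters and `significant` characters are kept as part of a word
    - `insignificant` characters are dropped as if they were invisible
    - every other character acts as a word delimiter
    """
    words, current = [], []
    for ch in text:
        if ch.isalpha() or ch in significant:
            current.append(ch)
        elif ch in insignificant:
            continue  # invisible: neither kept nor treated as a word break
        else:
            if current:
                words.append("".join(current))
                current = []
    if current:
        words.append("".join(current))
    return words


# Hyphen as a delimiter (default), as significant, and as insignificant.
print(tokenize("post-graduate"))                            # ['post', 'graduate']
print(tokenize("post-graduate", significant="0123456789-"))  # ['post-graduate']
print(tokenize("post-graduate", insignificant="'_-"))        # ['postgraduate']
```

The three calls show why a change to either field requires a Reindex: the same source text produces different index terms under each setting.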

Intelligent Recognition

Intelligent recognition options allow Perceptive Enterprise Search to automatically identify parts of the text as dates, numbers or entities.

Intelligent date Recognition

By selecting this option Perceptive Enterprise Search will be able to recognize a variety of date formats in queries and documents. Examples of valid dates are:

4 15 94 or 4-15-94 or 4/15/94
April 15th, 1994 or 15-apr-94
15 4 94 or 15-4-94 or 15/4/94
940415
15th day of April, 1994

Dates are located regardless of the form in which they are expressed in the query or in the document. For the purposes of proximity searching the date is considered to be a single hit, even if the date is actually expressed in three or four words. The exact location of the hit is taken to be that of the last component of the date. This is why only the final portion of the date sequence is highlighted in the Browse window.

Selecting this option will slightly increase the size of your indexes. Unless your data is mostly dates, the increase in index size should not be significant. Enabling the feature may also slightly reduce indexing performance.

If you intend to use the Intelligent Date Handling option on highly numeric data files, such as financial transactions that contain primarily numeric data intermixed with dates, and if you expect the dates to be in YY-MM-DD format or MM/DD/YY format, it is recommended that you configure your index with the "-" or "/" character defined as either significant or insignificant (see above). That is, do not accept the default, in which "-" is interpreted as a word delimiter (punctuation character). This will greatly assist Perceptive Enterprise Search in making the correct interpretation of the dates when they appear in highly numeric data.

Where a date is ambiguous, for example 1-4-94, Perceptive Enterprise Search resolves the day/month ambiguity according to the regional settings that were in effect when you started Perceptive Enterprise Search (i.e. the Regional options under Windows Control Panel).

Note that any ambiguity in the indexed data is resolved at the time the data is indexed, whereas ambiguities in a query are resolved at the time of the query. This means that you can, for example, index your source data with ambiguities resolved according to your REGIONAL settings, then ship the index to another user running a different REGIONAL setting. Perceptive Enterprise Search automatically normalizes and remaps the ambiguities accordingly.
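The way a day/month ambiguity depends on a regional preference can be sketched as follows. This is an illustration only, not the product's date parser: the function name normalize_date and the day_first flag are hypothetical, only a few of the numeric forms listed above are handled, and two-digit years are assumed to fall in the 1900s.

```python
import re
from datetime import date


def normalize_date(text, day_first=False):
    """Return a datetime.date for a few numeric date forms, or None."""
    text = text.strip()
    compact = re.fullmatch(r"(\d{2})(\d{2})(\d{2})", text)
    if compact:  # YYMMDD, for example 940415
        yy, mm, dd = (int(g) for g in compact.groups())
    else:
        parts = re.fullmatch(r"(\d{1,2})[ /-](\d{1,2})[ /-](\d{2})", text)
        if not parts:
            return None
        a, b, yy = (int(g) for g in parts.groups())
        if a > 12:        # first field cannot be a month
            dd, mm = a, b
        elif b > 12:      # second field cannot be a month
            mm, dd = a, b
        else:             # genuinely ambiguous: use the regional preference
            dd, mm = (a, b) if day_first else (b, a)
    try:
        return date(1900 + yy, mm, dd)  # assumption: two-digit years are 19xx
    except ValueError:
        return None


print(normalize_date("4/15/94"))                 # 1994-04-15
print(normalize_date("1-4-94", day_first=True))  # 1994-04-01
print(normalize_date("1-4-94", day_first=False)) # 1994-01-04
```

Because every recognized form is reduced to the same date value, a query written in one format can match a document written in another.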

Intelligent number Recognition

This option works in a similar way to the Intelligent Date recognition option, but for numbers only. For example, the number 1,029 could be expressed in any of the following ways:

1029
1,029
one thousand and twenty nine
one thousand, twenty nine
1,029.00
1 029
one zero two nine
a thousand and twenty nine

Perceptive Enterprise Search will interpret each of the above examples as the same number.
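As a rough illustration of how the spellings listed above can collapse to one value, the following sketch maps each of them to the same integer. It is not the product's algorithm; normalize_number and the word tables are hypothetical, and only whole English numbers are covered.

```python
import re

UNITS = {w: i for i, w in enumerate(
    "zero one two three four five six seven eight nine ten eleven twelve "
    "thirteen fourteen fifteen sixteen seventeen eighteen nineteen".split())}
TENS = {w: 10 * i for i, w in enumerate(
    "twenty thirty forty fifty sixty seventy eighty ninety".split(), start=2)}
SCALES = {"thousand": 1000, "million": 1000000}


def normalize_number(text):
    """Map digit or English-word spellings of a whole number to an int, else None."""
    cleaned = text.strip().lower()
    # Digit forms: 1029, 1,029, 1 029, 1,029.00
    digits = re.fullmatch(r"(\d{1,3}(?:[ ,]\d{3})+|\d+)(?:\.0+)?", cleaned)
    if digits:
        return int(re.sub(r"[ ,]", "", digits.group(1)))
    words = [w for w in re.split(r"[\s,-]+", cleaned) if w and w != "and"]
    if not words:
        return None
    # Digit-by-digit reading: "one zero two nine"
    if len(words) > 1 and all(w in UNITS and UNITS[w] < 10 for w in words):
        return int("".join(str(UNITS[w]) for w in words))
    total, group = 0, 0
    for i, w in enumerate(words):
        if w == "a" and i == 0:
            group = 1                      # "a thousand" means "one thousand"
        elif w in UNITS:
            group += UNITS[w]
        elif w in TENS:
            group += TENS[w]
        elif w == "hundred":
            group = max(group, 1) * 100
        elif w in SCALES:
            total += max(group, 1) * SCALES[w]
            group = 0
        else:
            return None                    # unknown word: not a number
    return total + group


for form in ["1029", "1,029", "one thousand and twenty nine", "1 029",
             "one zero two nine", "a thousand and twenty nine"]:
    print(form, "->", normalize_number(form))  # each line ends in 1029
```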

The Index dots when embedded in words or numbers option will be automatically enabled when Intelligent Number recognition is used.

Detect Entities in documents

When this option is selected, Perceptive Enterprise Search will identify the "who, what and where" entities involved in documents. The entities Perceptive Enterprise Search recognizes include:

People
Organizations
Locations
Email addresses
Website domain names

It does this automatically using a combination of heuristic and dictionary based techniques.

When queries are performed Perceptive Enterprise Search will find all the documents which match your search terms. As well as displaying the found documents, it will also provide an outline of the main entities involved in the cluster of found documents. This can be used to identify the key players related to your search term.

You can then drill-down by clicking on one of the entities. The presence of an entity may even suggest a whole new line of inquiry. Sometimes the entity itself will constitute the answer for which you are looking (for example, what company does John Smith work for).

Perceptive Enterprise Search Entity detection works very well without any sort of user configuration. However, because it is being done by a computer rather than a person, it will occasionally mis-categorize entities. For example, "John Street" could be either a person or a location. In general, Perceptive Enterprise Search tries to avoid making judgments which should require human knowledge to make properly.

Other times, Perceptive Enterprise Search may fail to recognize entities about which you care greatly. For example, you may be involved with organizations whose names are not being reliably detected by Perceptive Enterprise Search, or dealing with individuals with unusual names. In these scenarios, you can augment the standard Perceptive Enterprise Search lexicon to include your own local knowledge.

Additional Information

Index document meta data

Choosing this option will allow Perceptive Enterprise Search to index the summary information in formats that support it. These formats include Microsoft Word, Excel, WordPerfect, Acrobat PDF and HTML. This information can then be searched on and will appear at the top of each document when browsed. This information can also be used in Field Level Searching: see Named Sections in the Query help.

Cache document meta data

Store a copy of the metadata information in the index for faster retrieval and processing. Allows for metadata to be used for document categories. Recommended if Index document metadata is enabled.

Index annotations

If selected, Perceptive Enterprise Search will automatically detect and index any text note annotations that have been created for documents in the index.

Because this option affects which files will and will not be indexed (much like an indexing rule), rather than affecting how previously indexed documents should be read (like Intelligent Date Handling or Smart Dot Handling), you can change it without Reindexing the index. Changes will be reflected in the next Update run just like a rule change.

Note that only the text annotations are indexed - not hyperactivities, images or linked queries.

Index filenames as well as contents

If you choose to index filenames you can then search for files by their names, extensions or any portion thereof. For example:

TEST.DOC
TEST
TE*
MYDIR \\ ASC

Perceptive Enterprise Search indexes each portion of the filename as though it were a word located at the beginning of the document.
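One way to picture "each portion of the filename as a word at the beginning of the document" is the sketch below. The splitting rule (on path separators, dots and colons) is an assumption made for illustration, and the names filename_terms and words_for_index are hypothetical.

```python
import re


def filename_terms(path):
    """Split a path into the portions a user might search for."""
    return [part for part in re.split(r"[\\/.:]+", path) if part]


def words_for_index(path, body_words):
    """Place the filename portions ahead of the document's own words."""
    return filename_terms(path) + list(body_words)


print(words_for_index(r"C:\MYDIR\TEST.DOC", ["quarterly", "report"]))
# ['C', 'MYDIR', 'TEST', 'DOC', 'quarterly', 'report']
```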

Other

Index embedded dots in words

When this option is checked, dots occurring in the middle of a string of characters are not treated as word separators. Dots are considered significant in cases like 3.2.12 but not significant at the end or start of a word.
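The rule above (a dot counts only when it sits between other characters of the word) can be approximated with a regular expression. This is a sketch under that reading of the rule, not the product's implementation; split_words and its flag are hypothetical names.

```python
import re


def split_words(text, index_embedded_dots=True):
    """Split text into words, optionally keeping dots that are embedded in a word."""
    if index_embedded_dots:
        # A dot is kept only when it has a word character on both sides.
        pattern = r"\w+(?:\.\w+)*"
    else:
        pattern = r"\w+"
    return re.findall(pattern, text)


print(split_words("Version 3.2.12. Done."))
# ['Version', '3.2.12', 'Done']
print(split_words("Version 3.2.12. Done.", index_embedded_dots=False))
# ['Version', '3', '2', '12', 'Done']
```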

Compensate for OCR and typographic errors

Also known as Fuzzy Pre-compensation.

The use of Fuzzy Pre-compensation is appropriate when you know your source data is likely to have a high proportion of errors as a result of being captured by means of Optical Character Recognition (OCR) scanning.

When Fuzzy pre-compensation is chosen, Perceptive Enterprise Search queries will automatically and transparently retrieve words that it considers may be OCR scanning errors or other typographical errors. For example, searches for "duck" will also retrieve "cluck", as it is possible that the "d" was slightly broken and misread as a "cl" by the OCR process.
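The "duck"/"cluck" example can be pictured as a table of character sequences that OCR tends to confuse, applied in both directions. The sketch below is a simplification for illustration only: the product is described as using heuristic, algorithmic and statistical means, and of the pairs in the hypothetical OCR_CONFUSIONS table only d/cl comes from the text.

```python
# Only ("d", "cl") is taken from the text; the other pairs are illustrative assumptions.
OCR_CONFUSIONS = [("d", "cl"), ("m", "rn"), ("l", "1"), ("o", "0")]


def ocr_variants(word):
    """Return the word plus every variant produced by one confusion substitution."""
    variants = {word}
    for a, b in OCR_CONFUSIONS:
        for src, dst in ((a, b), (b, a)):
            start = word.find(src)
            while start != -1:
                variants.add(word[:start] + dst + word[start + len(src):])
                start = word.find(src, start + 1)
    return variants


print(sorted(ocr_variants("duck")))  # ['cluck', 'duck']
```

A query for "duck" would then also be run against each variant, which is why users may see additional, unexpected words among their hits.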

Perceptive Enterprise Search uses sophisticated heuristic, algorithmic and statistical means to determine which words are likely errors of other words. In some cases, Perceptive Enterprise Search may incorrectly suggest that one word is a misspelling of another. There will always be some degree of 'false alarms'.

It is recommended you inform your users when an index has been configured with the fuzzy pre-compensation feature enabled. This should avoid any possible confusion as to why additional words are being returned during their query activities.

Note that use of this feature will slightly increase your index size and indexing time.

Do not index pure numbers

This option treats pure numbers or words which are made exclusively of numerals (e.g. 765390) as common words. This means these numbers or words will not be indexed, and thus not be searchable. Note that this option has no effect on alphanumeric strings, e.g. A1234B, which are always indexed.

Note that changing this option will require you to perform a Reindex.
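The distinction between pure numbers and alphanumeric strings can be sketched as a simple filter. The function name indexable_terms is hypothetical, and the check shown (ASCII digits only) is an assumption about what counts as a numeral.

```python
def indexable_terms(terms, skip_pure_numbers=True):
    """Drop terms made exclusively of digits when the option is on."""
    if not skip_pure_numbers:
        return list(terms)
    return [t for t in terms if not (t.isascii() and t.isdigit())]


print(indexable_terms(["765390", "A1234B", "invoice"]))
# ['A1234B', 'invoice']
```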

Create spelling tips

Creates a list of similar words for the index. This allows for 'Did you mean?' prompts for queries.

Cache file security information

Cache the NTFS file security descriptor at indexing time. This dramatically increases the speed at which result lists can be filtered to show only documents that are accessible to the current user.

The cached security information will be updated when documents are updated in the index. To force an update of the cache use Refresh Security. This can also be updated via a Scheduled Update task.

Note that some files may be shown in the result list that the user no longer has access to, though the document itself will not be able to be viewed.

De-duplicate documents

Indicates that the documents should be de-duplicated at query time. De-duplication is based on the text content of the document, so Perceptive Enterprise Search may recognize the same document stored in two different formats.
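Content-based de-duplication can be sketched by comparing a fingerprint of the extracted text rather than the file itself. The text does not say how Perceptive Enterprise Search compares content; the SHA-256 hash, the whitespace and case normalization, and the names content_key and deduplicate are all assumptions made for this illustration.

```python
import hashlib


def content_key(text):
    """Fingerprint the text after collapsing whitespace and case (an assumption)."""
    normalized = " ".join(text.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def deduplicate(results):
    """results: list of (path, extracted_text); keep the first hit for each content."""
    seen, unique = set(), []
    for path, text in results:
        key = content_key(text)
        if key not in seen:
            seen.add(key)
            unique.append(path)
    return unique


hits = [("report.doc", "Annual  Report 2012"),
        ("report.pdf", "annual report 2012"),
        ("notes.txt", "Meeting notes")]
print(deduplicate(hits))  # ['report.doc', 'notes.txt']
```

Because the comparison is made on extracted text, the Word and PDF copies of the same report collapse into one result even though the files differ byte for byte.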

To fully enable this option, go to WebSites --> Default --> Search Settings --> Search Defaults and perform the following steps:

Set Sort to anything other than Natural Order.
Set Remove duplicate documents from the result list to Yes.

Content Quality

Indicates the quality of the documents within this index. This value is used when calculating document relevance across multiple indexes. Documents from an index rated as high quality should appear higher in the results than documents in an index rated as low quality.

Note: this value is used as a factor in the relevance algorithm and does not guarantee that a document will appear higher.

Document Titles

Use title from document metadata when available

Use title from document metadata or summary information if available.
If Perceptive Enterprise Search finds metadata in your indexed documents it will use the title included in this metadata as the document title.

This option requires metadata indexing to be enabled.

Otherwise use text from the document using the following rule:

On or after line - Perceptive Enterprise Search will use the first non-blank line of the document after the line number you specify as the document title.

That contains - Perceptive Enterprise Search will use the first line of the document that contains text you specify as the document title.

You can combine the two parameters if required. For example if the string Re: appears somewhere after line number 4, then enter 4 into the After line number field and Re: into the That contains field.
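The two title parameters and their combination can be expressed as a short function. This is a sketch of the rule as described, not the product's code: pick_title and its parameter names are hypothetical, and the line number is treated as inclusive ("on or after"), which is one reading of the text.

```python
def pick_title(lines, on_or_after_line=1, contains=None):
    """Return the first non-blank line at or after a 1-based line number
    that also contains `contains` (when given), or None if no line qualifies."""
    for line in lines[max(on_or_after_line, 1) - 1:]:
        stripped = line.strip()
        if not stripped:
            continue  # blank lines are never used as a title
        if contains is None or contains in line:
            return stripped
    return None


memo = ["ACME Corp", "", "Memo", "To: staff", "Re: Budget", "Body text"]
print(pick_title(memo))                                        # ACME Corp
print(pick_title(memo, on_or_after_line=2))                    # Memo
print(pick_title(memo, on_or_after_line=4, contains="Re:"))    # Re: Budget
```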

You must perform a Reindex for any changes made to Document titles to take effect.

Automatic Index Backups

Option to automatically keep multi-generational backups of indexes. Backups are always placed in a sub-folder of the current index location called ISYSIndexBackups with individual folders for each version.

The index is backed up every time the index is modified. This can result in multiple backups for a single index update run.

Deferred Deindexing

Sometimes the slowest part of an index update can be removing references to changed or deleted documents. Perceptive Enterprise Search lets you defer this work until later. Then Perceptive Enterprise Search only 'marks' those documents as no longer existing, or no longer existing in their old form. The benefit is your index updates may go much faster where changed or deleted documents are involved. The deferred documents remain in the index taking up space, but will never be found by your queries.

Documents marked for deferred deindexing are automatically purged when the amount of space used by them reaches about 20% of your total index size. This way you never have to worry about deferred deindexing consuming large amounts of space (the actual proportion is determined by many factors including the number of words and documents in the index).
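The mark-now, purge-later behavior can be modeled with a toy index. The class DeferredIndex and its fields are hypothetical, document sizes stand in for the space actually used, and the fixed 20% threshold is a simplification, since the text notes that the real proportion depends on several factors.

```python
class DeferredIndex:
    PURGE_PERCENT = 20  # approximate threshold taken from the text

    def __init__(self):
        self.docs = {}        # doc_id -> size of its index entries
        self.deleted = set()  # doc_ids marked as deleted but still stored

    def add(self, doc_id, size):
        self.docs[doc_id] = size

    def remove(self, doc_id):
        """Mark a document as deleted; purge once marked space reaches the threshold."""
        if doc_id not in self.docs:
            return
        self.deleted.add(doc_id)
        deferred = sum(self.docs[d] for d in self.deleted)
        total = sum(self.docs.values())
        if deferred * 100 >= self.PURGE_PERCENT * total:
            self.purge()

    def search_ids(self):
        """Marked documents still occupy space but are never returned."""
        return sorted(d for d in self.docs if d not in self.deleted)

    def purge(self):
        for doc_id in self.deleted:
            del self.docs[doc_id]
        self.deleted.clear()


idx = DeferredIndex()
for i in range(10):
    idx.add(f"d{i}", 10)
idx.remove("d0")       # 10% marked: deferred, still stored
print(len(idx.docs))   # 10
idx.remove("d1")       # 20% marked: purge runs
print(len(idx.docs))   # 8
```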

Additionally you may manually purge deferred deleted items at any time using the Purge Deleted option.

Default File Options

These options determine how indexing of documents occurs. These settings apply to all email attachments, SQL BLOBs, Lotus Notes attachments and other non-file system objects. The options can be overridden for File and FTP indexing rules.

Document Options

Plain
Selecting this option from the drop-down list will use the default document options for the indexing rule.
Contains Ventura desk-top publishing paste-up markers
If you use the Ventura Publisher desktop publishing package and the documents covered by this rule may contain Ventura paste-up markers, select this option from the drop-down list. Perceptive Enterprise Search will then correctly interpret the special markers that Ventura inserts into documents, wherever they may occur.
Every line has hard return. Two hard returns indicate paragraph ends.
Use this option if your documents are in a word-processor format in which every line ends with a hard return and every paragraph ends with two consecutive hard returns.
Entire document is double-spaced. Three hard returns indicate paragraph ends.
Use this option if your documents are in a word-processor format, are completely double-spaced and every paragraph ends with three consecutive hard returns.

The preceding two options only apply to documents in a word processor format where there is a concept of a hard-return. ASCII, by comparison, doesn't have the concept of paragraphs and so it is normal for every line to end with a hard return.
Document contains lines formatted wider than 78 characters
By default the browser automatically wraps wide documents to a right margin of 76 characters for maximum browsing performance. Select this option if you need to browse highly formatted, wide ASCII material, such as 132-column mainframe report files. You can also compensate for wide documents by using the Quality presentation option or by amending the ISYS.CFG file.
Documents should be interpreted as OEM (DOS ASCII) character set
In most situations, with English-style languages, there is no real difference between ASCII files (as usually generated by DOS programs) and ANSI files (as usually generated by Windows programs). By default Perceptive Enterprise Search assumes ASCII files have been created using the ASCII character set. If you have ASCII files that are actually coded in ANSI, and if this distinction matters in your language, use the "ANSI" option when creating your indexing rules.

Presentation Options

High Definition (HD) HTML - Near pixel perfect paginated viewing renditions. Lowest indexing speed. Maximum viewing fidelity.
Select this option if you want your documents to appear in the Browser almost exactly as they would in your word processor or native application, including pagination and optional thumbnails. This option may result in slower indexing and document browsing. Query performance is not affected.
Standard HTML - Maintain majority of original formatting. Good indexing speed. Good viewing fidelity.
Select this option if you want your documents to appear in the Browser with the majority of formatting preserved. This option may result in slower indexing and document browsing. Query performance is not affected.
Standard - Document text and properties only, maximum speed
Select this option to achieve the fastest indexing and browsing speeds possible within your environment. When browsing a document Perceptive Enterprise Search will make use of its own internal viewers which will allow for multiple hit colors.
Use index presentation setting
This is the default option. Indexing presentation is controlled by the Default File Options, set in the Index Options dialog.

Document Caching

Document caching stores the textual content of each document within the index. This gives greater performance when displaying the document content on the website. It is most useful when using the {CONTEXT} tag, or for complex document formats such as PDF.

Note: This option approximately doubles the size of your index.

Extended Options

This section allows you to set any desired extended options to be applied during indexing.
Contains a semi-colon (;) separated list of names and values in the following format:
- <name>=<value>;
For example, you can set the number of index worker processes created during indexing by adding the option INDEXWORKERS, whose value is the number of workers to use in addition to the indexer itself.
- Syntax: INDEXWORKERS=0;
- By default (or if not set) the number of workers will be equal to the number of cores minus 2 (with a minimum of 1), but can be overridden by setting the value here.
- A value of 0 means that there are no workers created and that the indexing will be done within a single process.
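The option-list format and the default worker rule described above can be sketched as follows. This is an illustration rather than the product's parser: the function names are hypothetical, option names are assumed to be case-insensitive, and os.cpu_count() stands in for however the product counts cores.

```python
import os


def parse_extended_options(text):
    """Parse a 'NAME=value;NAME=value;' string into a dict keyed by upper-case name."""
    options = {}
    for item in text.split(";"):
        item = item.strip()
        if not item:
            continue  # tolerate the trailing semicolon
        name, _, value = item.partition("=")
        options[name.strip().upper()] = value.strip()
    return options


def worker_count(options, cores=None):
    """Use INDEXWORKERS when set; otherwise cores minus 2, with a minimum of 1."""
    if "INDEXWORKERS" in options:
        return int(options["INDEXWORKERS"])
    if cores is None:
        cores = os.cpu_count() or 1
    return max(cores - 2, 1)


print(worker_count(parse_extended_options("INDEXWORKERS=0;")))  # 0
print(worker_count({}, cores=8))                                # 6
print(worker_count({}, cores=2))                                # 1
```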