
Spider Options

Navigation

Allow navigation above starting directory

Only applies where the starting URL specifies a starting location other than the root folder. For example:

www.mysite.com/level1/level2/start.html

When this option is selected, Perceptive Enterprise Search is allowed to follow links to directory levels above the starting point, for example, to www.mysite.com/level1/mypage.html.

When this option is deselected, Perceptive Enterprise Search may only follow links at that level or lower. This is very useful when you wish to index a portion of a site rather than the entire site, and the portion you wish to index is reflected in the site's directory structure.

Allow HTTP as well as HTTPS from same starting URL

When selected, a starting URL which begins with http:// can traverse a link to a URL starting with https:// and vice versa. When deselected, spidering is restricted to the protocol of the starting URL and will not traverse across to the other.

Consider URL case to be significant

When URL case is not considered significant, Perceptive Enterprise Search normalizes (converts) all retrieved URLs to lowercase, allowing the Perceptive Enterprise Search Web Spider to perform case-insensitive URL comparisons. This is useful when you do not want duplicate URLs that differ only in their case. Note that by default URLs are case sensitive; you can take advantage of this when spidering Lotus Domino servers, where some Views may contain Notes documents with the same URL name in a different case.

Traverse through related domains

When selected, this option allows the Spider to traverse across sub-domains of the starting URL. For example, if the starting URL is www.mysite.com/index.html, then www.denver.mysite.com/page.html is a related domain.

Connection

Number of connections

This controls the number of retrieval threads that the Web Spider uses, in other words, how many page requests are processed simultaneously. The default of 8 is suitable for most situations; depending on the speed of your connection, a higher value may increase the speed of spidering. For troubleshooting you may want to limit this to a single connection thread until you are satisfied that the Perceptive Enterprise Search Web Spider is working correctly. Although it is very unusual, some web servers may block large numbers of connections from the same IP address, assuming them to be a Denial of Service attack.

Retrieval timeout

If a requested page results in no reply at all, the Perceptive Enterprise Search Web Spider waits for the specified time (the 'timeout') before retrying (if retries are configured under Maximum retries). If it fails after the final attempt (which may be the first if Maximum retries is 0), it logs the failure.

Maximum retries

If a requested page results in a 'Page not found' or other similar error, the Perceptive Enterprise Search Web Spider can retry the request. The number set here is the maximum number of retries before it gives up and logs a failure. (The default is 1.)

Pause between pages

You can also nominate that the Web Spider pause between pages for a specified time. This may be desirable if you do not want the Web Spider to dominate your machine or network resources (or if the web server is particularly slow) and you are happy for traversal to take longer in exchange for better resource sharing.

User Agent

User Agent

When a browser requests a page from a web server it supplies certain information in the form of 'headers', including the type and version of the browser making the request. Some web servers tailor their responses based on these headers, supplying a page specifically for browsers that can display certain elements or support a certain level of scripting. More often, however, the headers are used for logging information such as the browser agent, operating system and geographical location of requests.
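As an illustration only (the exact values depend on the agent setting you choose and the page being requested), a page request might carry headers such as the following:

GET /index.html HTTP/1.1
Host: www.mysite.com
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64)
Accept: text/html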

There are several options available:

Proxy Server

Connect via proxy server

Select "Yes" if you need to connect via a proxy server from your machine to the remote site.

Proxy Server

Enter the proxy server's IP address or URL in the input box. You also need to provide the port number for the proxy.

Proxy Account

If your proxy server requires authentication, type your username and password in the fields provided.

Basic HTTP Authentication

The external site you wish to index may have security in place. If so, the external web server may reject the Perceptive Enterprise Search Spider's request and issue a password challenge. When challenged, Perceptive Enterprise Search re-sends the page request, transmitting the Username and Password set here.
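Behind the scenes this follows the standard HTTP Basic authentication exchange. As a sketch (the realm text and page are illustrative), the server's challenge and the Spider's re-sent request look like this:

HTTP/1.1 401 Unauthorized
WWW-Authenticate: Basic realm="Secure Area"

GET /protected/page.html HTTP/1.1
Host: www.mysite.com
Authorization: Basic <Username:Password encoded in Base64>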

Custom POST Authentication

Some websites require you to enter your details into a form to permit entry, for example when logging in to a web mail account. To use this functionality you will need to view the HTML source of the login page and identify the following from it:

The FORM's action (which will be a URL) to specify as the 'Post to URL'. This may be shown as a fully qualified URL such as http://www.mysite.com/login/formProcessor.php, or as a relative address such as formProcessor.php.

Relative addresses must be translated into fully qualified addresses by adding the relevant domain name and directories. For example, if the login page is at http://www.mysite.com/login/login.html and the form's action specifies formProcessor.php, then the Post to URL should be set to http://www.mysite.com/login/formProcessor.php. Note that an action of /formProcessor.php (starting with a forward slash) is relative to the root of the domain, e.g. http://www.mysite.com/.

Each input field will need to be identified and included in the 'Post fields' in the format fieldname1=value1&fieldname2=value2&fieldname3=value3.

For example, a user called 'JohnSmith' logs in to a website with four input fields (including the Submit button):
username=JohnSmith&password=notTelling&Group=coders&submit=Submit
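For illustration only, a Post fields string like the one above might correspond to a login form whose HTML source looks like the following (the action and field names are hypothetical):

<form action="formProcessor.php" method="post">
  <input type="text" name="username">
  <input type="password" name="password">
  <input type="hidden" name="Group" value="coders">
  <input type="submit" name="submit" value="Submit">
</form>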
The custom server authentication credentials are sent once before the first starting URL is retrieved.

Robots File

Respect Robot Exclusion Protocol

Many sites employ what is known as the Robot Exclusion Protocol to control what external robots (such as the Perceptive Enterprise Search Web Spider) are allowed to index. The protocol is implemented in a file called ROBOTS.TXT at the remote site. It is a gentleman's agreement between the site administrator and the robot operator, as the remote site cannot really know whether a page is being accessed by a browser or an automated agent. When you select the Respect Robot Exclusion Protocol option, Perceptive Enterprise Search looks for a file called ROBOTS.TXT at the remote site and follows the restriction guidelines therein.
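For example, a ROBOTS.TXT file such as the following (the directory names are illustrative) asks all robots to stay out of two directories while allowing the rest of the site:

User-agent: *
Disallow: /cgi-bin/
Disallow: /private/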

Exclusions

Exclude by URL pattern

The excluded URL patterns specify files (pages) for the indexer to ignore. This is much more efficient than screening based on MIME types because Perceptive Enterprise Search knows not to request the file at all. This filter matches only the filename, not the full URL path.

To exclude a URL pattern, create a new line in the text box and add the pattern to exclude. You can use one or more file pattern wild-cards (*) in each pattern.
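For example, the following patterns (illustrative only) would prevent the Spider from requesting archive, executable and temporary files:

*.zip
*.exe
*.tmp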

Exclude by MIME Types

The excluded MIME types control the files (pages) retrieved by the indexer. They are not as efficient as URL patterns, but can be more accurate, as a file may not have the correct extension and would otherwise bypass the URL pattern filter.

To exclude a MIME Type, create a new line in the text box and add the type to exclude.
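For example, to avoid indexing images and compressed archives you might exclude MIME types such as the following (illustrative only):

image/jpeg
image/gif
application/zip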

Exclude by URL pattern, traverse links

The exclude URL patterns, traverse links filter is similar to the 'non traverse links' Exclude by URL pattern filter, with the difference that it performs the comparison against the full URL path and filename of the URL.

The pattern can include any number of '*' and '?' symbols. For example, to match MyPage.htm in the root directory use the pattern '/MyPage.htm'. To match it at any level use '*MyPage.htm'. To match all files in the 'MyDir' directory use '*/MyDir/*'. The pattern matching is not case sensitive.

Skip large files

You can configure the Web Spider to skip pages or files larger than a certain size using this option. Large files are sometimes dumps, logs, archives, or other types of files that are not worth indexing. To skip large files, set the option to Yes and enter the size limit to use (the default is 50 MB).

Note: If you have access to the HTML source of a website, you can specify sections of an HTML file that the Indexer will skip over, to prevent common and insignificant content from being indexed. You can achieve this by adding the HTML comments <!-- ISYSINDEXINGOFF --> and <!-- ISYSINDEXINGON --> around the text to be excluded.
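For example, a navigation menu that is repeated on every page could be wrapped as follows (the surrounding markup is illustrative only):

<!-- ISYSINDEXINGOFF -->
<div class="menu">
  Repeated navigation links that should not be indexed
</div>
<!-- ISYSINDEXINGON -->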

Logging

Level

Set the amount of data to be collected into the log file. The default is Errors only, which reports only when things go wrong, such as a file not being found or authentication failing.

Log Filename

Specifies where the Spider will create a log file of its activities. The default is to create a file called isyspdr.log within the index folder.

Lotus Domino

De-duplicating

This option provides special functionality for Domino servers. Domino is designed heavily around the concept of views and it is normal for a Domino site to have many views of the same underlying documents.

This means that in the process of exploring the site the Spider will see the same documents many times, accessed via different paths and with different URLs. If you select the De-duplicate Lotus Domino Documents option, the Spider applies special processing logic so that it indexes each of the underlying documents only once. This is the recommended setting for Lotus Domino sites.

Document Timestamps

Timestamp Calculation

By default Perceptive Enterprise Search Web Spider will store the page date/time where available and otherwise compute a checksum over the entire page. Using this option you can configure Perceptive Enterprise Search to always utilize checksums for changed page detection, even when date/times are available.

If you are unsure how date/times work on a particular site, just visit the site in a browser and display the page properties (in Internet Explorer, go to File > Properties). If the date/time fields are blank, the server is not providing time information. If the fields are non-blank but very recent (effectively the current time, allowing for time zone variations), then the pages are being generated dynamically with a scratch date/time stamp.

Checksum Exclusion Area

These options let you specify regions of the page to exclude for checksum purposes. Such regions may contain variable advertisement information, and the page should not be considered changed when only the content within these regions changes.

The portions to exclude from the checksum are indicated by start/end sequences and can occur many times within the page. If no such regions are found the checksum is computed over the entire page.
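For example, if a site happened to mark its rotating advertisements with HTML comments (a hypothetical convention, shown here for illustration), you would enter <!-- AD START --> as the start sequence and <!-- AD END --> as the end sequence, and the content between each such pair would be left out of the checksum:

<!-- AD START -->
<div class="advert">Today's rotating banner advertisement</div>
<!-- AD END -->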

Document Deindexing

Deindex document

It is not necessarily straightforward to determine when pages should be removed from the index, as pages may seem to disappear and later reappear due to relinking. Pages may also be temporarily unavailable due to server errors on the external site, in which case Perceptive Enterprise Search should not de-index all the previously seen pages.

Pages are therefore removed from the index only when they have not been seen for the period of time configured in this section, or never, if you choose not to deindex at all.

Pages which still exist but which have been changed since they were last visited are detected immediately and processed in the next indexing batch. The deindexing control only applies to the removal of pages which have disappeared completely.