Many sites employ the Robots Exclusion Protocol to control what external robots (such as Perceptive Search WebSite Spider) are allowed to index. It is implemented in a file called robots.txt at the root of the remote site and amounts to a gentleman's agreement between the site administrator and the robot operator, since the remote site cannot reliably tell whether a page is being accessed by a browser or by an automated agent. When you select the Respect Robot Exclusion Protocol option, Perceptive Search looks for the robots.txt file at the remote site and follows the restrictions listed there.
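The general mechanism can be sketched with Python's standard urllib.robotparser module. This is an illustration of how any crawler can honor robots.txt rules, not Perceptive Search's actual implementation; the robots.txt content and user-agent name below are made up for the example.

```python
# Sketch: honoring robots.txt rules with the standard library.
from urllib.robotparser import RobotFileParser

# Example robots.txt content a spider might fetch from the site root.
robots_txt = """\
User-agent: *
Disallow: /private/
Disallow: /tmp/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Pages outside the disallowed paths may be fetched; others may not.
print(rp.can_fetch("MySpider", "https://example.com/index.html"))      # True
print(rp.can_fetch("MySpider", "https://example.com/private/a.html"))  # False
```

A spider that respects the protocol simply skips any URL for which this check returns False.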
Some sites publish a sitemap under the site root in a file named Sitemap.xml. For the domain of each starting URL, Web Spider can check whether this file exists and parse it for links, which helps ensure that all relevant links are found. Note that links found in the sitemap are treated as level one relative to the associated starting URL.
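Sitemap files follow the sitemaps.org XML schema, where each link appears in a loc element. The following sketch shows how such a file can be parsed for links with Python's standard XML library; the sitemap content is a made-up example, and this is not the product's implementation.

```python
# Sketch: extracting links from a sitemap.xml document.
import xml.etree.ElementTree as ET

# Example sitemap content as served from the site root.
sitemap_xml = """\
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/products.html</loc></url>
</urlset>
"""

# The sitemaps.org namespace must be given explicitly when querying.
ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
root = ET.fromstring(sitemap_xml)
links = [loc.text for loc in root.findall("sm:url/sm:loc", ns)]
print(links)
```

Each extracted link would then be queued for indexing at level one below the starting URL.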
Allows you to specify a list of URL patterns that Perceptive Search WebSite Spider will not traverse while indexing a web site. Enter a pattern that matches the URLs you wish to exclude from the index. For example, to exclude all files under "/sitemap/", enter */sitemap/*.
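Patterns of this shell-wildcard form can be illustrated with Python's standard fnmatch module; the URLs below are made-up examples, and the product's own matcher may differ in details such as case sensitivity.

```python
# Sketch: shell-style wildcard matching for URL exclusion patterns.
from fnmatch import fnmatch

pattern = "*/sitemap/*"

# URLs under /sitemap/ match the exclusion pattern; others do not.
print(fnmatch("https://example.com/sitemap/page1.html", pattern))  # True
print(fnmatch("https://example.com/docs/page1.html", pattern))     # False
```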