Spider Rules
A website index contains a set of spider rules that tell
Perceptive Enterprise Search what to index and how to navigate the website. A spider rule can be one
of the following:
-
Starting URL
A starting URL is the launch point where Perceptive Enterprise Search will begin crawling. Perceptive Enterprise Search will crawl
this address and look for links to follow. A website index can contain multiple starting
URLs.
-
Exclusion Filter
Exclusion filters allow you to exclude certain URLs from being indexed.
-
External Domains
By default, Perceptive Enterprise Search only indexes web pages that reside on the same domain as the starting URL. This option
allows you to specify external domains that Perceptive Enterprise Search is allowed to crawl. When Perceptive Enterprise Search encounters a link
to an external domain, it follows the link only if the domain is specified here.
Adding a Starting URL
-
Select Spider Rules for the index to which you wish to add a Spider Rule. If
Spider Rules is not visible, the index is most likely configured as a "File System" index.
You cannot add Spider Rules to a File System index.
-
Click New Starting URL. The Perceptive Enterprise Search Spider URL Wizard will appear.
-
Enter the URL for the website you wish to index. The URL should be formatted like
http://www.lexmark.com. Click Next.
-
Select the crawl depth for the website. The crawl depth indicates how many levels of links to follow
from the starting URL. A link on the starting URL is considered to be level 0. A document linked to
from the starting URL is level 1, and so on. Click Next.
-
If a site map for this site is found, you will be given the option to have the Perceptive Enterprise Search Spider either crawl the website
to the depth specified in the previous step or use the site map. If you choose the site map option, all
URLs listed in the site map will be indexed (this sets the crawl depth in the spider rule to -2).
Select an option and click Next.
-
Review the information. If you wish to make any changes click Back. Once you are satisfied
with the settings click Finish to add the new Starting URL.
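As a rough illustration of how a crawl depth limits what gets reached, the sketch below walks a small in-memory link graph breadth-first. It is not the product's implementation: the `crawl` function, the `SITE` dictionary, and the example.com URLs are all hypothetical, and the sketch assumes the starting page is level 0 with each followed link adding one level. The product's own numbering may differ.

```python
from collections import deque

def crawl(start_url, links, max_depth):
    """Breadth-first walk of a link graph, stopping at max_depth.

    `links` maps each URL to the URLs it links to; it stands in for
    fetching and parsing real pages (an assumption of this sketch).
    Returns a dict of every reached URL and the level it was found at.
    """
    depth_of = {start_url: 0}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        if depth_of[url] == max_depth:
            continue  # do not follow links beyond the configured depth
        for target in links.get(url, []):
            if target not in depth_of:  # visit each URL only once
                depth_of[target] = depth_of[url] + 1
                queue.append(target)
    return depth_of

# Hypothetical site used only for illustration.
SITE = {
    "http://www.example.com/": [
        "http://www.example.com/a",
        "http://www.example.com/b",
    ],
    "http://www.example.com/a": ["http://www.example.com/a/deep"],
    "http://www.example.com/a/deep": ["http://www.example.com/a/deeper"],
}

if __name__ == "__main__":
    for url, depth in crawl("http://www.example.com/", SITE, 2).items():
        print(depth, url)
```

With a depth of 2 in this sketch, `http://www.example.com/a/deep` is reached but `http://www.example.com/a/deeper` is not.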
Adding an Exclusion Filter
-
Select Spider Rules for the index to which you wish to add an exclusion filter. If
Spider Rules is not visible, the index is most likely configured as a "File System" index.
You cannot add Spider Rules to a File System index.
-
Click New Exclusion Filter. The Perceptive Enterprise Search Spider Filter Wizard will appear.
-
Enter a pattern that will match the URLs you wish to exclude from the index. For example,
to exclude all files under "/sitemap/", enter */sitemap/*. Click Next.
-
Review the information. If you wish to make any changes click Back. Once you are satisfied
with the settings click Finish to add the new Exclusion Filter.
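To get a feel for which URLs a pattern such as */sitemap/* would catch, the sketch below tests URLs against wildcard patterns. It assumes the filter behaves like shell-style wildcard matching, which the text above does not confirm; `is_excluded` and the example.com URLs are hypothetical. Note that Python's `fnmatchcase` also treats `?` and `[...]` in a pattern as wildcards, which the product may not.

```python
from fnmatch import fnmatchcase

def is_excluded(url, patterns):
    """Return True if the URL matches any exclusion pattern.

    Assumption: '*' matches any run of characters, including '/'.
    """
    return any(fnmatchcase(url, pattern) for pattern in patterns)

if __name__ == "__main__":
    filters = ["*/sitemap/*"]
    for url in (
        "http://www.example.com/sitemap/index.html",
        "http://www.example.com/products/index.html",
    ):
        print(url, "excluded" if is_excluded(url, filters) else "kept")
```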
Adding an External Domain
-
Select Spider Rules for the index to which you wish to add an external domain. If
Spider Rules is not visible, the index is most likely configured as a "File System" index.
You cannot add Spider Rules to a File System index.
-
Enter the external domain which contains web pages that should also be crawled and indexed. Click Next.
-
Select the crawl depth for the external domain. The crawl depth indicates how many
links from the external link Perceptive Enterprise Search will follow. Click Next.
-
Review the information. If you wish to make any changes click Back. Once you are satisfied
with the settings click Finish to add the new External Domain.
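The external domain rule amounts to a check made each time the spider meets a link. The sketch below is one way to picture that check, not the product's logic: `may_follow` is a hypothetical helper, hosts are compared by exact name (how the product treats subdomains is not stated here), and partner.example.org is a made-up domain.

```python
from urllib.parse import urlparse

def may_follow(link, start_url, external_domains):
    """Decide whether a link may be followed.

    A link is followed if it is on the same host as the starting URL,
    or if its host is listed in `external_domains` (lowercase host names).
    Assumption: hosts are compared by exact match.
    """
    host = urlparse(link).hostname
    start_host = urlparse(start_url).hostname
    return host == start_host or host in external_domains

if __name__ == "__main__":
    start = "http://www.lexmark.com"
    allowed = {"partner.example.org"}  # hypothetical external domain
    print(may_follow("http://www.lexmark.com/about", start, allowed))
    print(may_follow("http://partner.example.org/docs", start, allowed))
    print(may_follow("http://other.example.net/", start, allowed))
```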
Note: If you have access to the HTML source of a website, you can control
which content the spider indexes within an HTML file. The spider indexes content from the top of the page until
the first <!-- ISYSINDEXINGOFF --> HTML comment, and resumes only when it finds a
<!-- ISYSINDEXINGON --> comment. Indexing can be switched off and on as many times as needed, and the comments do not affect
the page's display. In this way sections of the page can be excluded from indexing, so they will never be
found as a hit on that page. Prime candidates are headers, footers, navigation panels, adverts, etc., which
appear on every page.
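The effect of the two comments can be previewed with a short script. The sketch below only approximates the behaviour described in the note and is not the spider's own code: `indexable_text` is a hypothetical helper, and it assumes the markers are written in upper case with optional whitespace inside the comment.

```python
import re

# Matches either marker comment; group 1 is "ON" or "OFF".
MARKER = re.compile(r"<!--\s*ISYSINDEXING(ON|OFF)\s*-->")

def indexable_text(html):
    """Return the parts of `html` that lie outside OFF ... ON sections."""
    kept = []
    indexing = True  # indexing is on from the top of the page
    pos = 0
    for match in MARKER.finditer(html):
        if indexing:
            kept.append(html[pos:match.start()])
        indexing = match.group(1) == "ON"
        pos = match.end()
    if indexing:
        kept.append(html[pos:])  # keep the tail if indexing is still on
    return "".join(kept)

if __name__ == "__main__":
    page = (
        "<h1>Title</h1>"
        "<!-- ISYSINDEXINGOFF --><nav>Menu</nav><!-- ISYSINDEXINGON -->"
        "<p>Body</p>"
    )
    print(indexable_text(page))
```

In this example the navigation block between the two comments is dropped, while the heading and body remain.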