Crawl settings

These options affect how the crawler visits pages, and changing them can have a dramatic effect on the number of pages scanned. If crawls have stopped working, you can restore factory settings using the Reset button in Scan Options.

You can block specific links by opening the Blocks tab in Scan Options and adding URLs to the Blocked links box. These can be either full URLs or wildcard patterns (see the sketch after this list):

  • https://www.example.com/copyright.htm blocks a single page
  • https://www.example.com/legal/* blocks all pages in the subdirectory called “legal”
  • http:* blocks all HTTP links
  • *.pdf blocks links to all pages with .PDF extension
  • *print_friendly.htm blocks links to URLs ending in print_friendly.htm
  • *action=edit* blocks links containing action=edit
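
For illustration, these patterns behave much like shell-style wildcards. The Python sketch below uses the standard fnmatch module to show the general idea; the pattern list and test URLs are made-up examples, and SortSite's own matching may differ in detail (the comparison here is lower-cased so that *.pdf also catches .PDF links).

    from fnmatch import fnmatch

    # Hypothetical blocked-link patterns, mirroring the examples above
    blocked = [
        "https://www.example.com/legal/*",
        "*.pdf",
        "*action=edit*",
    ]

    def is_blocked(url):
        # Lower-case both sides so *.pdf also matches links ending in .PDF
        return any(fnmatch(url.lower(), pattern.lower()) for pattern in blocked)

    print(is_blocked("https://www.example.com/legal/terms.htm"))   # True
    print(is_blocked("https://www.example.com/files/report.PDF"))  # True
    print(is_blocked("https://www.example.com/index.htm"))         # False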

Robots.txt

The robots.txt file is a digital “Keep Out” sign that limits access to specific pages, or blocks certain crawlers. We strongly advise keeping this option checked.
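
For reference, a robots.txt file sits at the root of a site and lists paths that crawlers should not fetch. The Python sketch below, using the standard urllib.robotparser module, shows the kind of rule that is honoured when this option is checked; the rules and URLs are made-up examples.

    from urllib import robotparser

    # A made-up robots.txt that keeps all crawlers out of the /private/ area
    rules = [
        "User-agent: *",
        "Disallow: /private/",
    ]

    rp = robotparser.RobotFileParser()
    rp.parse(rules)

    print(rp.can_fetch("*", "https://www.example.com/private/report.htm"))  # False
    print(rp.can_fetch("*", "https://www.example.com/index.htm"))           # True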

Including other domains in the scan

By default, only pages from a single host name are scanned. This can be changed using the Links tab in Scan Options.

  • Follow links to related domains - if unchecked, only pages on a single host name are visited. If checked, peer host names and subdomains are also visited. For example, checking this box visits pages on www2.example.com and support.example.com if the start page is on www.example.com (see the sketch after this list).

  • Follow links to additional domains - add any additional domains you want visited during a scan (one per line).
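
As a rough illustration of the first option, a crawler can treat a host as related when it shares the same parent domain as the start page. The Python sketch below is an approximation for illustration only, not SortSite's exact rule; the helper names and the simple two-label parent-domain assumption are ours.

    from urllib.parse import urlparse

    START_HOST = "www.example.com"

    def parent_domain(host):
        # Naive approximation: keep the last two labels (example.com).
        # Real crawlers use the public-suffix list to handle e.g. .co.uk.
        return ".".join(host.split(".")[-2:])

    def is_related(url):
        host = urlparse(url).hostname or ""
        return parent_domain(host) == parent_domain(START_HOST)

    print(is_related("https://support.example.com/faq.htm"))  # True (subdomain)
    print(is_related("https://www2.example.com/home.htm"))    # True (peer host)
    print(is_related("https://www.other.org/page.htm"))       # False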

Link depth can be changed using the Links tab in Scan Options.

  • Link depth controls how many clicks the scanner will follow from the start page. The default setting scans your entire site, but you can restrict this to the top-level pages of your site (e.g. visit up to 3 clicks from the start page).

  • External links controls how many links to each external site are checked. You may want to restrict this if you have many links to a single site (e.g. social media sharing links). Both limits are illustrated in the sketch after this list.
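
The sketch below shows how the two limits fit into a simple breadth-first crawl: links more than a chosen number of clicks from the start page are not followed, and only a capped number of links per external site are checked. It is an illustration of the idea only, using Python's requests library and a deliberately crude link extractor; none of it reflects SortSite internals.

    import re
    from collections import deque
    from urllib.parse import urljoin, urlparse

    import requests

    def extract_links(base_url, html):
        # Crude link extraction for illustration; a real crawler parses the HTML.
        return [urljoin(base_url, href) for href in re.findall(r'href="([^"]+)"', html)]

    def crawl(start_url, max_depth=3, max_external_per_site=10):
        start_host = urlparse(start_url).hostname
        queue = deque([(start_url, 0)])   # (url, clicks from the start page)
        seen = {start_url}
        external_counts = {}              # host -> external links already checked

        while queue:
            url, depth = queue.popleft()
            host = urlparse(url).hostname

            if host != start_host:
                # External link: check that it responds, but never crawl beyond it,
                # and stop once the per-site limit is reached.
                if external_counts.get(host, 0) >= max_external_per_site:
                    continue
                external_counts[host] = external_counts.get(host, 0) + 1
                requests.head(url, timeout=30)
                continue

            page = requests.get(url, timeout=30)
            if depth >= max_depth:
                continue                  # beyond the link-depth limit

            for link in extract_links(url, page.text):
                if link.startswith("http") and link not in seen:
                    seen.add(link)
                    queue.append((link, depth + 1))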

Timeouts

You can change how quickly the crawler requests pages, and how long it waits for each page before timing out, using the Crawler tab in Scan Options.

  • Server Load lets you set a delay between loading pages to avoid placing undue load on a server
  • Page Timeout controls the maximum time spent loading a page - the default setting is 180 seconds (see the sketch below)
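
For readers who prefer to see the idea in code, the sketch below shows a rough equivalent using Python's widely used requests library: a fixed delay between page loads and a per-page timeout. The 180-second value mirrors the default above; the one-second delay, the function name and the URL handling are just examples, not SortSite internals.

    import time
    import requests

    PAGE_TIMEOUT = 180   # seconds, mirroring the default Page Timeout above
    SERVER_DELAY = 1.0   # example delay between page loads (Server Load)

    def fetch(url):
        time.sleep(SERVER_DELAY)   # pause between requests to spread server load
        try:
            # Note: requests applies the timeout to the connection and to each
            # read, rather than to the whole download, but the idea is the same.
            return requests.get(url, timeout=PAGE_TIMEOUT)
        except requests.Timeout:
            return None            # the page took too long to load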

User agent

The User-Agent HTTP header can be set using the Crawler tab in Scan Options. This only needs to be changed if you’re scanning a site that does User-Agent detection.
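
For comparison, this is roughly what User-Agent switching looks like in code, using Python's requests library. The User-Agent string shown is a made-up example, not the one SortSite sends.

    import requests

    # Example only: send a custom User-Agent so the server responds as it would
    # to that particular browser or crawler. The string below is made up.
    headers = {"User-Agent": "Mozilla/5.0 (compatible; ExampleCrawler/1.0)"}
    response = requests.get("https://www.example.com/", headers=headers)
    print(response.status_code)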

HTTP authentication

If you need to scan sites that use HTTP authentication, you can enable it using the Crawler tab in Scan Options.
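
For comparison, a request to a site protected by HTTP Basic authentication looks like this in code (Python requests library); the URL and credentials are placeholders.

    import requests

    # Placeholder credentials for a site protected by HTTP Basic authentication
    response = requests.get(
        "https://intranet.example.com/",
        auth=("username", "password"),
    )
    print(response.status_code)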