Crawling (also sometimes called “spidering”) is a common technique computers use to discover the content of a website. Major search engines like Google rely on crawling, as does Silktide.
Crawling is a simple process:
1. Download a webpage
2. Remember all the pages that webpage links to
3. If you have pages you haven’t downloaded yet, repeat from #1
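The steps above can be sketched in a few lines of code. This is a minimal sketch, not how any particular crawler is implemented: the example “site” is an in-memory dictionary standing in for real HTTP downloads, so the sketch stays self-contained.

```python
from collections import deque

def crawl(site, start):
    """Breadth-first crawl of a site represented as {url: [linked urls]}.
    A real crawler would download each page over HTTP; here a 'download'
    is just a dictionary lookup so the sketch is self-contained."""
    seen = {start}
    queue = deque([start])
    crawled = []
    while queue:
        url = queue.popleft()           # 1. download a webpage
        crawled.append(url)
        for link in site.get(url, []):  # 2. remember the pages it links to
            if link not in seen:        # 3. queue pages not downloaded yet
                seen.add(link)
                queue.append(link)
    return crawled

site = {
    "/": ["/about", "/contact"],
    "/about": ["/", "/team"],
    "/team": ["/about"],
    "/contact": [],
    "/orphan": [],  # never linked to, so the crawl never discovers it
}
print(crawl(site, "/"))  # ['/', '/about', '/contact', '/team']
```

Note that `/orphan` exists in the site but is never returned, because nothing links to it.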
This is somewhat simplified, but it illustrates several important concepts:
You can only crawl pages that are linked to
If a page hasn’t been linked to from another page, there is no way for crawling to discover it. This is important in both Silktide and Google. For example, a web address that is written on a poster but never linked to elsewhere on your website is known as an ‘orphaned page’ and will never be crawled.
Orphaned pages can be tested in Silktide by manually adding the page URL(s) to a website report.
Crawling takes time
To crawl a website, a crawler must download a page to find new links, then follow those links and test any new pages… and so on, until all pages are found.
Most crawlers – including Google and Silktide – will download multiple pages at once to speed this process up, but it still takes time. If you try to download a website too quickly, you can put too much demand on the website and cause it to crash.
To prevent significant load on your website’s servers, Silktide limits the number of simultaneous connections to 6 – equivalent to 6 regular website users browsing at the same time.
Crawling can go on forever
Some websites might include so-called ‘spider traps’, which can cause a crawler to go on crawling forever.
A common example is a calendar widget. Typically a calendar contains a link to view the next day, and the next day, and so on. These links can continue until the year 300,000 AD and beyond. A crawler doesn’t understand that following them makes no sense, and will keep trying to find the end of a series of URLs that can go on forever.
As a result, most crawlers have some built-in constraints to make them give up if they find too many pages. Often these constraints can also prevent ‘real’ pages that you want to test from being discovered.
Using custom rules, Silktide can be configured to ignore the URLs that lead to spider traps, while ensuring that the relevant pages you do want to test are included in your website reports.
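Silktide’s custom rules are configured in the product itself, but the underlying idea can be sketched in code: skip URLs that match a trap pattern, and give up once a page budget is spent. The calendar-style URL pattern and the limit of 100 pages below are illustrative assumptions, not Silktide’s real defaults.

```python
import re

# Illustrative assumptions: a calendar-style spider trap under /calendar/,
# and a crawl budget of 100 pages.
TRAP_PATTERN = re.compile(r"/calendar/\d{4}-\d{2}-\d{2}")
MAX_PAGES = 100

def should_crawl(url, pages_crawled):
    """Return True if a crawler following these rules would fetch the URL."""
    if pages_crawled >= MAX_PAGES:
        return False  # give up once the page budget is spent
    if TRAP_PATTERN.search(url):
        return False  # ignore URLs that lead into the spider trap
    return True

print(should_crawl("/calendar/2999-01-01", 5))  # False: spider trap
print(should_crawl("/contact", 5))              # True: a real page
print(should_crawl("/contact", 100))            # False: budget exhausted
```

A crawler would call a check like this before queuing each discovered link, so real pages are still tested while the endless calendar URLs are never followed.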