Why can page limits prevent PDFs from being tested?

Page and PDF Allowance

Your contract with Silktide gives you a certain number of Web pages and PDF documents that can be downloaded during testing. You will have a Page allowance and a PDF allowance to split across your sites. While the purpose of these two figures seems obvious, there is an important nuance in how pages are classified during tests.

The Page and PDF allowances set the maximum number of Web pages and PDF documents our crawler can scan. Once the limit for either type is reached, the crawler stops trying to discover more documents of that type.

You can set both allowances in the site-level settings. To learn how to adjust your Page and PDF allowance, refer to the following support article:

https://help.silktide.com/en/articles/9067052-adjusting-your-page-and-pdf-allowances

When testing your site, any URL that includes the .pdf extension will automatically use the PDF allowance, provided the URL leads to a valid document.

Why PDFs can be missed

Even when you have documents remaining in your PDF allowance, Silktide may be unable to find more PDFs once you have met the Page allowance for the site. This can happen when:

the PDF document URL does not include the .pdf extension
the URL uses a 301 redirect to the actual PDF document from a different source URL
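
Both cases come down to whether the URL the crawler sees ends in .pdf. The following is a minimal Python sketch of that check, useful for auditing your own links. It is a simplified model of the rule described in this article, not Silktide's actual implementation, and the function name has_pdf_extension is hypothetical.

```python
from urllib.parse import urlparse


def has_pdf_extension(url: str) -> bool:
    """Return True when the URL path ends in .pdf (case-insensitive).

    Only the path is checked, so a query string such as ?download=1
    does not affect the result.
    """
    path = urlparse(url).path
    return path.lower().endswith(".pdf")


urls = [
    "https://www.example.com/media/pdfs/media-strategy.pdf",
    "https://www.example.com/docs/sample-1",
]

for url in urls:
    # Assumption: URLs without the extension are counted as Pages,
    # as described in this article.
    kind = "PDF allowance" if has_pdf_extension(url) else "Page allowance"
    print(f"{url} -> {kind}")
```

Any PDF link on your site that this check reports under the Page allowance is a candidate for being missed once the Page allowance is used up.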

Demonstrations

Let’s go through an example of each of the scenarios mentioned above to understand the outcomes. For both examples, let’s say you have a small site with the following numbers of Pages and PDFs hosted.

Pages: 50
PDFs: 10

Assume you have set the allowances to match these figures exactly.

URL does not contain .pdf

Our crawler scans the full 50 Pages, and your PDF documents use a URL format that does not include the .pdf extension.

For example, say your PDF is hosted at https://www.example.com/docs/sample-1 and that URL is dynamically rewritten to return the actual PDF document at something like https://www.example.com/media/pdfs/media-strategy.pdf.

Here the Web request has been intercepted and the destination URL rewritten in the background. You can see this because the URL in the browser address bar stays as https://…/sample-1 when you navigate to the document.

URL uses an HTTP 301 redirect

Our crawler scans the full 50 Pages, and your PDF documents use a URL that redirects to the PDF document.

For example, say your PDF is hosted at https://www.example.com/docs/sample-2 and this URL is configured to redirect (HTTP 301) to the actual document at something like https://www.example.com/media/pdfs/seasonal-offering.pdf.

Here the Web request for https://…/sample-2 has been accepted, and another request is created automatically to fetch the document from the other URL. You can see this because the URL in the browser address bar changes to https://…/seasonal-offering.pdf when you navigate to the document.
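
If you are unsure which of the two mechanisms a given link uses, the first HTTP response tells you. The sketch below is one way to interpret that response in Python. It assumes you have already captured the status code and headers yourself, for example from your browser's developer tools or an HTTP client with redirect following turned off. The function name describe_delivery and the sample header values are illustrative assumptions.

```python
from urllib.parse import urlparse

# HTTP status codes that indicate a redirect with a Location header.
REDIRECT_CODES = {301, 302, 303, 307, 308}


def describe_delivery(requested_url: str, status: int, headers: dict) -> str:
    """Describe how a requested URL delivers its content.

    status and headers are those of the FIRST response, before any
    redirect is followed.
    """
    # Header names are case-insensitive, so normalise them first.
    headers = {name.lower(): value for name, value in headers.items()}
    has_ext = urlparse(requested_url).path.lower().endswith(".pdf")

    if status in REDIRECT_CODES and "location" in headers:
        return f"redirect to {headers['location']}"

    content_type = headers.get("content-type", "").lower()
    if status == 200 and content_type.startswith("application/pdf"):
        if has_ext:
            return "PDF served directly"
        return "PDF served in place, URL lacks .pdf"

    return "not a PDF response"


# Scenario 1: the URL is rewritten in the background.
print(describe_delivery(
    "https://www.example.com/docs/sample-1",
    200,
    {"Content-Type": "application/pdf"},
))

# Scenario 2: the URL answers with an HTTP 301 redirect.
print(describe_delivery(
    "https://www.example.com/docs/sample-2",
    301,
    {"Location": "https://www.example.com/media/pdfs/seasonal-offering.pdf"},
))
```

In both scenarios the URL that appears in your site's links lacks the .pdf extension, which is the detail that matters for how the link is classified.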

Result

Neither source URL includes the .pdf extension, although each uses a different mechanism to deliver the PDF document. Silktide therefore treats https://…/sample-1 and https://…/sample-2 as Pages, not as PDFs. Because the Page allowance of 50 has already been reached, these URLs are not scanned and the PDFs are not added to the Inventory.
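
The arithmetic behind this result can be reproduced with a toy model in Python. This is only a sketch under stated assumptions, not how the Silktide crawler is implemented: the page-N URLs are hypothetical, the 50 ordinary Pages are assumed to be discovered first, and every URL without a .pdf extension is assumed to draw on the Page allowance.

```python
from urllib.parse import urlparse


def simulate_crawl(urls, page_allowance, pdf_allowance):
    """Toy model: URLs ending in .pdf draw on the PDF allowance,
    all other URLs draw on the Page allowance."""
    pages_used = 0
    pdfs_used = 0
    skipped = []
    for url in urls:
        if urlparse(url).path.lower().endswith(".pdf"):
            if pdfs_used < pdf_allowance:
                pdfs_used += 1
            else:
                skipped.append(url)
        elif pages_used < page_allowance:
            pages_used += 1
        else:
            # Page allowance exhausted: the URL is never scanned.
            skipped.append(url)
    return pages_used, pdfs_used, skipped


# 50 ordinary Pages (hypothetical URLs), discovered first.
pages = [f"https://www.example.com/page-{i}" for i in range(1, 51)]
# 10 PDFs reachable only through URLs without the .pdf extension.
hidden_pdfs = [f"https://www.example.com/docs/sample-{i}" for i in range(1, 11)]

pages_used, pdfs_used, skipped = simulate_crawl(pages + hidden_pdfs, 50, 10)
print(pages_used, pdfs_used, len(skipped))  # 50 0 10
```

The PDF allowance of 10 is untouched, yet all 10 documents are skipped because the URLs that lead to them compete for the Page allowance.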

Solution

To work around this, either increase the Page allowance for the site or update your URL schemes so that PDF URLs include the .pdf extension.