Google has recently published details explaining the term "crawl budget" and how it affects the way Googlebot crawls your website. We will look at the various concepts in detail in this article, but it is useful to note that if your website has fewer than a few thousand pages, crawl budget is most likely not something you need to worry about.
Googlebot is the name given to Google's web crawling bot. It crawls, or fetches, the billions of pages on the web so that they can be indexed in Google's search engine. Googlebot uses an algorithmic process to determine which websites to crawl, how often, and how many web pages to fetch from each website. It is this algorithm that has spawned the concept of "crawl budget" as webmasters seek to optimize the number of pages crawled by Google. You can read more about how Googlebot works here.
In the recent announcement, Google stated that:
Prioritizing what to crawl, when, and how much resource the server hosting the website can allocate to crawling is more important for bigger websites, or those that auto-generate pages based on URL parameters, for example.
This makes perfect sense. A website that has duplicate content issues, or millions of auto-generated pages, may struggle to have all of its pages indexed, or, if they are indexed, the content may not be refreshed often.
Crawl rate limit
The crawl rate limit set by Google relates to the maximum number of simultaneous parallel connections Googlebot will use to crawl your website and the time it will wait before fetching another page. Because Googlebot uses your server resources, Google limits the number of connections to your website to prevent it from slowing down and affecting the experience of your visitors.
The crawl rate is affected by the following factors:
- Crawl health: When Google crawls the website it monitors its responsiveness. If the website responds fast, then the limits increase and more connections are used to crawl the website. If the website is slow, slows down once crawling starts or responds with server errors, the limit goes down, and Googlebot crawls the website less.
Google specifically has confirmed in the follow-up FAQs that website speed will increase the crawl rate, and conversely, a high number of errors will decrease the crawl rate.
- Limit set in Search Console: If you wish to reduce the rate at which Googlebot crawls your website, you can set this within Search Console. You cannot increase the rate of crawling by setting higher limits. Note that Googlebot does not obey the "crawl-delay" directive in the robots.txt file, so a lower rate must be set from within Search Console instead.
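As a quick check for the last point, the sketch below uses Python's standard `urllib.robotparser` to report whether a robots.txt file declares a crawl-delay. The robots.txt content and the `declared_crawl_delay` helper are made up for illustration. The point is only that such a directive may sit in the file and be respected by other crawlers, while Googlebot ignores it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used only for illustration.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 10
Disallow: /cart/
"""


def declared_crawl_delay(robots_txt, user_agent="*"):
    """Return the Crawl-delay declared for user_agent, or None if there is none."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.crawl_delay(user_agent)


delay = declared_crawl_delay(ROBOTS_TXT)
if delay is not None:
    # Other crawlers may honor this value; Googlebot does not.
    print(f"robots.txt declares Crawl-delay: {delay}")
```

If this prints a value and your goal is to slow Googlebot specifically, the setting in Search Console is the one that applies.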
You can improve the crawl rate limit by ensuring your server is as responsive as possible. One of the easiest ways to achieve this is to configure page caching by using W3 Total Cache, CloudFlare Page Rules or one of the many other solutions, and choosing a great host that utilizes RAM-based caching such as SiteGround or TMD Hosting.
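To picture why responsiveness matters, the crawl health feedback described above can be modelled as a toy rule. This is not Google's algorithm, which is not public. The function name, the one-second threshold, the halving step and the connection bounds are all assumptions chosen only to show the direction of the effect: fast responses let the limit grow, while slow responses or server errors shrink it.

```python
def adjust_connection_limit(limit, response_ms, had_server_error,
                            slow_ms=1000, min_limit=1, max_limit=10):
    """Toy feedback rule (assumed values): back off on trouble, otherwise ramp up."""
    if had_server_error or response_ms > slow_ms:
        # Slow or failing server: halve the number of parallel connections.
        return max(min_limit, limit // 2)
    # Healthy response: allow one more parallel connection, up to the cap.
    return min(max_limit, limit + 1)


limit = 4
# Hypothetical sequence of (response time in ms, server error?) observations.
observations = [(200, False), (250, False), (1800, False), (300, True)]
for response_ms, error in observations:
    limit = adjust_connection_limit(limit, response_ms, error)
    print(response_ms, error, "->", limit)
```

In this model two fast responses raise the limit, and a single slow response followed by an error wipes out those gains, which is why caching and reliable hosting are worth the effort.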
While a higher crawl rate will help get all your pages indexed, or updated content re-indexed, Google has specifically confirmed that a higher crawl rate is not a ranking factor. That being said, we suspect the same actions you undertake for optimizing the crawl rate will also have search ranking benefits. For instance, reducing the number of duplicate pages will help concentrate the page rank on the website. Fewer duplicate pages will also help prevent your website being marked as low-quality and help prevent a Panda penalty.
Googlebot may also crawl fewer pages than the crawl rate limit allows if there is little demand from indexing. There appear to be many factors that influence crawl demand, but Google points out two of the main ones:
- Popularity: Web pages that are more popular on the web tend to be crawled more often. A page that is frequently shared on social media, or has many backlinks, will be discovered more often because there are more links through which Googlebot can reach it.
- Staleness: Google endeavors to prevent URLs from becoming stale in the index.
Google has also confirmed that significant changes to a website, such as a website move or a change of URL structure, may trigger increased crawling.
To improve crawl demand, you can share your web page on social media, or frequently update the content on the page. Creating high-quality content that may attract backlinks will also help.
Factors affecting crawl budget
Google has analyzed various factors affecting the crawl budget and subsequent indexing. Described as "low-value-add" URLs, they fall into the following categories, which are listed in order of significance (according to Google):
- Faceted navigation and session identifiers: This relates to the problem of duplicate content that may be created by URL parameters. An example of this can be seen below:
With URL parameters and session ids you can potentially have many thousands of virtually identical pages, each counting toward the crawl budget.
Not only that, the link juice, or page rank, may be spread across all these pages, lowering the rankings of the pages on your website.
- On-site duplicate content: The same principle that applies to the preceding point is relevant to any duplicate content.
- Soft error pages: You can now see your soft errors from within your search console. Wasting your crawl budget on files or pages that result in an error is obviously bad. In some cases, you may find hundreds or even thousands of errors caused by mistakes in your website configuration. An example of the search console page can be seen below, although fortunately in the following case there is just one error revealed:
- Hacked pages: One of the most common types of hacks is either to inject links into your pages (thus using your crawl budget by sending the bot away from your website) or to create hidden pages on your website which then provide outbound links. A great article published by Google some time ago discusses this in more detail. However, an easy way to check your website is to enter your domain into the Sucuri website checker. This will check various pages on your website, and give you a result as follows:
- Infinite spaces and proxies: Imagine you have a page that contains a calendar. On that calendar, you can click on any day, or even to the previous or next month. Each of those links leads to a new page. In this situation, Googlebot will continuously find new pages as it jumps from month to month, and it is this type of situation that is referred to as an infinite space. You can read more about infinite spaces here.
- Low quality and spam content: Low quality and spam content have very little value for search engines, and in addition to risking a Panda penalty, they also use up your crawl budget.
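To make the first two categories concrete, here is a minimal sketch of how URL parameters and session ids multiply near-identical pages. It collapses URLs that differ only in parameters assumed not to change the content. The parameter names in `IGNORED_PARAMS`, the helper names and the example.com URLs are all hypothetical, so the list would need adjusting to your own website.

```python
from collections import defaultdict
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed parameter names that do not change page content; adjust for your site.
IGNORED_PARAMS = {"sessionid", "sid", "sort", "utm_source", "utm_medium", "utm_campaign"}


def canonicalize(url):
    """Drop ignored query parameters and sort the rest so equivalent URLs match."""
    parts = urlsplit(url)
    kept = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in IGNORED_PARAMS
    )
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


def group_duplicates(urls):
    """Map each canonical URL to the variants that collapse into it (two or more)."""
    groups = defaultdict(list)
    for url in urls:
        groups[canonicalize(url)].append(url)
    return {canon: variants for canon, variants in groups.items() if len(variants) > 1}


# Hypothetical URLs, as they might appear in a crawl report.
urls = [
    "https://example.com/shoes?color=red&sessionid=abc123",
    "https://example.com/shoes?sessionid=xyz789&color=red",
    "https://example.com/shoes?color=red&sort=price",
    "https://example.com/shoes?color=blue",
]
for canon, variants in group_duplicates(urls).items():
    print(canon, "<-", len(variants), "variants")
```

Each group with several variants is a candidate for a canonical tag or for parameter handling, since every variant would otherwise be fetched separately.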
Google confirms that:
Wasting server resources on pages like these will drain crawl activity from pages that do actually have value, which may cause a significant delay in discovering great content on a website.
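One rough way to see where crawl activity is going is to count Googlebot requests per path in your server access log. The sketch below assumes the common "combined" log format and uses made-up log lines. The regular expression and the `googlebot_hits_by_path` helper are illustrative only, and because a user agent string can be spoofed, the counts are an approximation.

```python
import re
from collections import Counter
from urllib.parse import urlsplit

# Pulls the request target and the final quoted field (the user agent)
# out of a combined-format access log line.
LOG_LINE = re.compile(
    r'"(?:GET|HEAD) (?P<target>\S+) [^"]*" (?P<status>\d{3}) .*"(?P<agent>[^"]*)"$'
)


def googlebot_hits_by_path(lines):
    """Count requests per path whose user agent mentions Googlebot."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.search(line.strip())
        if match and "Googlebot" in match.group("agent"):
            # Ignore the query string so parameter variants are counted together.
            hits[urlsplit(match.group("target")).path] += 1
    return hits


GOOGLEBOT = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
BROWSER = "Mozilla/5.0 (X11; Linux x86_64)"

# Hypothetical log lines, for illustration only.
sample = [
    f'66.249.66.1 - - [10/Feb/2017:10:00:00 +0000] "GET /calendar?month=2017-03 HTTP/1.1" 200 512 "-" "{GOOGLEBOT}"',
    f'66.249.66.1 - - [10/Feb/2017:10:00:05 +0000] "GET /calendar?month=2017-04 HTTP/1.1" 200 512 "-" "{GOOGLEBOT}"',
    f'66.249.66.1 - - [10/Feb/2017:10:00:09 +0000] "GET /calendar?month=2017-05 HTTP/1.1" 200 512 "-" "{GOOGLEBOT}"',
    f'66.249.66.1 - - [10/Feb/2017:10:00:14 +0000] "GET /blog/crawl-budget HTTP/1.1" 200 2048 "-" "{GOOGLEBOT}"',
    f'203.0.113.7 - - [10/Feb/2017:10:00:20 +0000] "GET / HTTP/1.1" 200 1024 "-" "{BROWSER}"',
]

hits = googlebot_hits_by_path(sample)
for path, count in hits.most_common():
    print(path, count)
```

A path such as the calendar in this sample, which attracts most of the requests while offering little unique content, is the kind of drain the quote above describes.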
Alternate URLs, and embedded content
It can be useful to use the nofollow directive on parts of your website that do not need to be indexed, such as shopping cart pages. While this won't necessarily stop them from being crawled if you have a dofollow link elsewhere on your website linking to them, it can be a great strategy to consider and look into should you have a crawl budget problem.
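Because a single followed link is enough to leave such pages open to crawling, it can help to scan your templates for links that are missing the attribute. The following sketch uses Python's standard `html.parser`. The `FollowedLinkFinder` class, the `/cart/` prefix and the HTML fragment are hypothetical examples rather than part of any particular platform.

```python
from html.parser import HTMLParser


class FollowedLinkFinder(HTMLParser):
    """Collect hrefs under a path prefix whose <a> tag lacks rel="nofollow"."""

    def __init__(self, path_prefix):
        super().__init__()
        self.path_prefix = path_prefix
        self.followed = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        href = attributes.get("href") or ""
        # rel may hold several space-separated tokens, e.g. "nofollow noopener".
        rel_tokens = (attributes.get("rel") or "").lower().split()
        if href.startswith(self.path_prefix) and "nofollow" not in rel_tokens:
            self.followed.append(href)


# Hypothetical page fragment, used only for illustration.
html = """
<a href="/cart/view" rel="nofollow">Cart</a>
<a href="/cart/checkout">Checkout</a>
<a href="/blog/">Blog</a>
"""

finder = FollowedLinkFinder("/cart/")
finder.feed(html)
print(finder.followed)
```

Any link this reports is one that Googlebot can still follow into the section you meant to exclude.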
Logic would suggest that thinking about your crawl budget if you have under two thousand pages is a waste of time. Technically, this is correct, but we think that even small websites will benefit from reviewing all these issues from a page rank point of view. Having a well-thought-out website structure with no duplicate content, and ensuring your page rank is directed to your most important pages, can be very beneficial for SEO.