How many times have you asked yourself (or been asked) whether a website is well indexed on Google? Read this guide and, by the end, you will be able to answer with numbers and data in hand.
In this guide I show you the process I follow to evaluate the indexing status of a website, along with some ratios I find useful to calculate when starting an SEO audit.
Contents:
- Indexable pages of the website
- Crawl budget
- Number of pages indexed by Google
- Number of pages inserted in sitemap.xml
- Crawl ratio
- Index ratio
- Sitemap ratio
- Rendering ratio
- Quality ratio
- Conclusions
Premise
First of all, let me clarify an important but often misunderstood point: being indexed does NOT mean being on the first page of results.
An indexed web page is simply one that is present in Google's index, regardless of its ranking (its position in the search results). Indexing is the second step in the process a web page goes through before receiving organic traffic:
1. Crawling (guide) > 2. Indexing (guide) > 3. Ranking (guide).
Today I’m going to show you a practical method for evaluating the indexing status of a website on Google. The same method can also be used for Bing, Yahoo and Yandex.
To continue, you need some data that you can get from a crawl with Screaming Frog (guide) and from Google Search Console – GSC (guide).
In detail, the data we need are:
- Total number of pages offered to search engines, i.e. the pages of the website that must (or should) be indexed.
- Crawl budget, i.e. the number of pages crawled by Googlebot daily.
- Number of pages indexed by Google.
- Number of pages inserted in sitemap.xml.
Let’s see how to find this information.
Indexable pages of the website
To find this value, you need to crawl the website with a crawler such as Screaming Frog or Visual SEO.
The goal is to find all the pages of the website that Google would index. To find them, run a crawl and use the HTML report in Screaming Frog. Count the pages with status code 200, a self-referencing canonical, and meta robots index (which is implicit, so it does not need to be declared).
The pages must be canonical of themselves: canonicalized pages (those pointing to a different canonical URL) are not indexed, so they should not be counted.
Noindex pages or pages with an error status code (4xx, 5xx) or redirection (3xx) should be ignored for the purposes of this guide, as they would not be indexed.
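If you prefer to verify the count programmatically, here is a minimal Python sketch, assuming a CSV export of the Screaming Frog HTML report; the file name and column labels are hypothetical and may differ in your version of the tool.

```python
import csv

# Hypothetical export of the Screaming Frog "Internal - HTML" report.
# Column names may differ depending on your Screaming Frog version and language.
indexable_pages = 0
with open("internal_html.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        status_ok = row.get("Status Code") == "200"
        # Screaming Frog marks self-canonical, meta robots index pages as "Indexable".
        is_indexable = row.get("Indexability") == "Indexable"
        if status_ok and is_indexable:
            indexable_pages += 1

print(f"Indexable pages found in the crawl: {indexable_pages}")
```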
Crawl budget – number of pages crawled by Googlebot daily
The number of pages crawled every day by Googlebot is a very interesting value because it allows us to understand the reputation of the website in the eyes of the search engine.
Without going into topics already covered elsewhere, I can summarize it like this:
- few crawls: the site is not updated often and is not very authoritative
- many crawls: the site is updated frequently and has good online authority
For more information, I recommend the guide to increase the Crawl Budget .
The value can be found in GSC > Crawl > Crawl Stats. In the new GSC the report is under Legacy tools and reports > Crawl stats (in the Italian interface, under Previous tools and reports > Scan statistics).
Refer to the average daily value if the trend is stable; if the trend is growing or declining, use the average of the last 15 days instead.
To obtain more precise data, I recommend analyzing the web server logs. For large sites it takes time to process them, but you get richer and more accurate data than the approximation provided by GSC.
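As a rough illustration, here is a minimal Python sketch for counting Googlebot hits per day in a web server log. It assumes a standard Apache/Nginx combined log format and a hypothetical file name, and it only matches the user agent string; a real analysis should also verify Googlebot via reverse DNS and filter out static resources.

```python
import re
from collections import Counter

# Hypothetical log file in Apache/Nginx "combined" format.
LOG_FILE = "access.log"
date_re = re.compile(r"\[(\d{2}/\w{3}/\d{4})")  # e.g. [10/Mar/2024:12:00:00 +0000]

hits_per_day = Counter()
with open(LOG_FILE, encoding="utf-8", errors="ignore") as f:
    for line in f:
        # Naive match on the user agent string only; verify the IP in a real audit.
        if "Googlebot" in line:
            m = date_re.search(line)
            if m:
                hits_per_day[m.group(1)] += 1

if hits_per_day:
    avg = sum(hits_per_day.values()) / len(hits_per_day)
    print(f"Average Googlebot requests per day: {avg:.0f}")
```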
Number of pages indexed by Google
The number of indexed pages can be found in two ways, each with strengths and weaknesses:
- Through the site: search operator – an approximate value that is updated frequently and changes according to the data center queried
- Via GSC – a more accurate value, but updated less frequently
To use the search operator, go to Google and type site:www.tuosito.com. Be careful to enter the correct subdomain: if the site uses www you must also use it in the operator, otherwise you would see the indexing data of all subdomains (if any).
The value can be found in GSC > Google Index > Index Status.
In the new GSC the report is called Coverage. The value we are interested in is the sum of the counts shown under Valid and Valid with warnings.
Number of pages inserted in sitemap.xml
To find out how many URLs are passed to Google via the sitemap.xml, you can use GSC > Crawl > Sitemaps and take note of the web pages submitted value.
In the new GSC, the section dedicated to sitemap.xml is linked directly from the menu in the left sidebar, under the Index group.
To count the indexable URLs present in the sitemap.xml you can also use the Coverage report and, with the drop-down filter, show only the URLs submitted via the sitemap.xml.
Alternatively, you can load the sitemap.xml into Screaming Frog (list mode), start the crawl and see how many pages with status code 200, meta robots index and a self-referencing canonical are found.
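If you just need a quick count of the URLs listed in the file, a minimal Python sketch like the one below can help. The sitemap URL is a placeholder, and a sitemap index file would require an extra loop over its child sitemaps.

```python
import urllib.request
import xml.etree.ElementTree as ET

# Placeholder URL: replace it with your own sitemap.xml.
SITEMAP_URL = "https://www.example.com/sitemap.xml"
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

with urllib.request.urlopen(SITEMAP_URL) as resp:
    root = ET.fromstring(resp.read())

# Count every <loc> entry; for a sitemap index these are child sitemaps,
# which would each need to be fetched and counted in turn.
urls = [loc.text for loc in root.findall(".//sm:loc", NS)]
print(f"URLs listed in the sitemap: {len(urls)}")
```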
Now that you have collected all the data, it is time to calculate some indexes to evaluate the indexing status of the website.
Crawl ratio
This ratio indicates Google's interest in crawling the pages of a website: the higher the value, the greater Googlebot's interest in checking the content for updates.
A good ratio is greater than 80%. Lower values indicate that the site is uninteresting in the eyes of the crawler, possibly because it is rarely updated or because it has low online authority.
Higher values are positive, indicating strong interest from Googlebot.
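The section does not spell out the formula, so here is a minimal sketch based on the data collected above, assuming the crawl ratio is the daily crawl budget divided by the number of indexable pages; the figures are hypothetical.

```python
# Assumed formula: crawl ratio = daily crawl budget / indexable pages.
crawl_budget = 450      # hypothetical pages crawled by Googlebot per day
indexable_pages = 500   # hypothetical indexable pages found in the crawl

crawl_ratio = crawl_budget / indexable_pages * 100
print(f"Crawl ratio: {crawl_ratio:.0f}%")  # 90% -> above the 80% threshold
```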
Index ratio – indexing ratio
This ratio indicates the indexing status, that is, the percentage of the website's pages that Google indexes. The perfect ratio is obviously 100%, but you will rarely find such a precise value (personally, it has never happened to me).
A good ratio is between 80% and 120%. Lower values indicate that the site is poorly indexed, possibly due to low-value or copied content, wrong canonical tags, spam, or a thousand other reasons.
Higher values are more complex to evaluate and can mean several things:
- Google has indexed old files, such as .pdf, still on the FTP server but no longer linked internally to the site
- the site has duplicate content in the index, for example both the www and non-www versions, or because of incorrect parameters or rel canonical tags
- the crawl with Screaming Frog or Visual SEO was misconfigured and the site actually has more pages than were detected. Check your Screaming Frog settings: the crawler setup must be tailored to your website (do you use nofollow links? Do you use links in JavaScript?)
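Again, a minimal sketch with hypothetical figures, assuming the index ratio is the number of pages indexed by Google divided by the number of indexable pages found in the crawl.

```python
# Assumed formula: index ratio = indexed pages / indexable pages.
indexed_pages = 470     # hypothetical GSC Coverage: Valid + Valid with warnings
indexable_pages = 500   # hypothetical indexable pages found in the crawl

index_ratio = indexed_pages / indexable_pages * 100
print(f"Index ratio: {index_ratio:.0f}%")  # 94% -> within the 80-120% range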
Sitemap ratio – indexing ratio of sitemap.xml
The perfect ratio is 100% but, even in this case, it rarely happens. A good ratio is between 90% and 110%.
Lower values indicate that the sitemap.xml is incomplete. Compare the site crawl data with the sitemap and find the differences.
Higher values may indicate that:
- the crawl of the website is wrong and the site actually has more pages than were detected
- the sitemap.xml includes URLs it shouldn't, such as noindex pages. Crawl the sitemap.xml with Screaming Frog, compare the results with the site crawl data, and you will find the excess URLs
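A minimal sketch with hypothetical figures, assuming the sitemap ratio compares the URLs submitted via sitemap.xml with the indexable pages found in the crawl, as the interpretation above suggests.

```python
# Assumed formula: sitemap ratio = URLs in sitemap.xml / indexable pages.
sitemap_urls = 480      # hypothetical URLs submitted via sitemap.xml
indexable_pages = 500   # hypothetical indexable pages found in the crawl

sitemap_ratio = sitemap_urls / indexable_pages * 100
print(f"Sitemap ratio: {sitemap_ratio:.0f}%")  # 96% -> within the 90-110% range
```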
Rendering ratio
As you know, JavaScript content is not indexed immediately after being crawled: it must be rendered first. It is useful to calculate the rendering ratio to understand how many of the site's resources are seen correctly by Google.
The perfect value is clearly 100%, but with a JavaScript framework-based site you can rarely get perfect results. A good value is between 80% and 100%.
First, find a phrase that is always present in the HTML-only version of your pages; you could take the VAT number or another element that you know appears in the raw HTML. Then, find a phrase from JavaScript-dependent content, which is therefore only present in the rendered version of your pages: this text should not appear in the initial page response from the server, but only after rendering. Now run two queries to get the indexing values:
- Use the “site:” search operator + the HTML-only phrase. This will show you how many total pages Google has indexed.
- Use the search operator “site:” + the phrase present in the pages with JavaScript. This will show you how many total pages Google has rendered.
By dividing the number of rendered pages by the number of indexed pages, you will get the rendering ratio.
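A minimal sketch of the calculation, with hypothetical figures from the two queries above.

```python
# Hypothetical results of the two site: queries described above.
pages_indexed = 500    # site: + phrase present in the raw HTML
pages_rendered = 430   # site: + phrase only present after JavaScript rendering

rendering_ratio = pages_rendered / pages_indexed * 100
print(f"Rendering ratio: {rendering_ratio:.0f}%")  # 86% -> within the 80-100% range
```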
Quality ratio
The Quality Ratio is the ratio that measures the quality of crawling by comparing the pages that Google decides to include in the index with those it excludes.
For the calculation of the included pages we have to add up the valid pages and the valid pages with warnings, since both page categories end up in the index.
For the value of the excluded pages, add the pages with errors to the excluded pages.
The result will be a percentage value ranging from 0% to 100%. What value or range of values can be considered good? From the crawl budget point of view, it is a good idea to avoid wasting Googlebot's time: don't give it useless pages to crawl.
In my opinion, a good range for the Quality ratio is between 35% and 100%; the closer it is to 100%, the better.
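A minimal sketch with hypothetical figures, assuming the Quality ratio is the included pages divided by the sum of included and excluded pages, with errors counted among the exclusions as described above.

```python
# Hypothetical numbers taken from the GSC Coverage report.
valid = 470
valid_with_warnings = 30
errors = 20
excluded = 480

# Assumed formula: quality ratio = included / (included + errors + excluded).
included = valid + valid_with_warnings
quality_ratio = included / (included + errors + excluded) * 100
print(f"Quality ratio: {quality_ratio:.0f}%")  # 50% -> within the 35-100% range
```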
Conclusions
With these ratios you can evaluate the indexing status of a website and identify critical issues that may hold back its SEO.
I hope this method helps you in the analyses you will face; if you have any questions, leave a comment.