If you work on a large site that gets updated frequently, accidents are a big concern and can be hard to catch. Think of noindex tags or robots.txt “disallow: /” directives being copied from staging to production. There are a lot of indexation killers that are easy to overlook and can have a catastrophic impact on your organic traffic. While continuous deployment may be great for development and for quickly releasing new features, it can also be a source of anxiety for SEOs.
The more releases you do, the more people touching code, and the bigger your site, the more likely you are to have some of these accidents. At some point, they happen. The key is to do SEO quality assurance testing so that you can quickly identify the problem and fix it before it has a big impact on your traffic and revenue, and hopefully before Google finds many of the accidents.
When you’re working on an enterprise site, the real question is: how do you find and fix accidents before they become problems? Due to the frequency of releases and competing demands for your time (you know, the other projects you’re working on to grow traffic and those fires you’re also putting out), it simply isn’t realistic to do a manual review of the site. Nor is it reliable. If you manually review a lot of pages every release, you’re likely to make mistakes.
Why traditional crawlers aren’t a good solution for SEO QA
The logical next step for many SEOs might be to use their crawling tool of choice for their quality assurance efforts. While this is generally a good solution for small sites and even some mid-sized sites, there are some drawbacks for enterprise sites. Depending on the situation, there can even be drawbacks for sites that aren’t at the enterprise level but have other complexities that can throw a wrench in using a traditional crawler for SEO QA.
When you’re looking for accidents, you are typically comparing to a baseline. A noindex tag isn’t inherently good or bad; it is the change that is significant. Detecting true change can be challenging with a typical crawl tool because of how it is designed. Most crawlers start with a given page, usually your homepage, and then add URLs to their crawl list as they are discovered on the initial and subsequent pages. This means that you could see a spike in the number of noindex tags on your site, not because the noindex tag was applied to more pages but because of variance in the URLs crawled this time versus the last crawl.
A crawler can find a different set of URLs for a few different reasons, such as A/B testing exposing different URLs, calls to action highlighting different URLs, or the crawler running out of resources (more on this below).
A decrease in “bad” directives isn’t always good – if you see your noindex tags or pages blocked by robots.txt dip, you need to investigate. Controlling crawling and indexation is a critical component of SEO. The noindex command exists for a reason. If the noindex got removed from pages, this could have a negative impact.
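The idea of comparing against a baseline can be sketched in a few lines of Python. This is a minimal illustration, not a description of any particular tool: the function name `diff_snapshots`, the snapshot layout (a dict of URL to attribute dict), and the example URLs are all assumptions made for the example. It only compares URLs that appear in both snapshots, so a change in which URLs were crawled cannot show up as a change in directives, and it reports changes in both directions, since a noindex that disappears deserves as much attention as one that appears.

```python
def diff_snapshots(baseline, current):
    """Return (url, attribute, old, new) for URLs present in BOTH snapshots."""
    changes = []
    # Only the overlap is compared, so newly discovered URLs cannot
    # inflate the numbers.
    for url in sorted(baseline.keys() & current.keys()):
        for attribute, old_value in baseline[url].items():
            new_value = current[url].get(attribute)
            if new_value != old_value:
                changes.append((url, attribute, old_value, new_value))
    return changes


# Hypothetical data: one snapshot per release.
baseline = {
    "https://example.com/p/1": {"noindex": False, "status": 200},
    "https://example.com/search?q=a": {"noindex": True, "status": 200},
}
current = {
    "https://example.com/p/1": {"noindex": True, "status": 200},
    "https://example.com/search?q=a": {"noindex": False, "status": 200},
    # Not in the baseline, so it is ignored rather than counted as a "spike".
    "https://example.com/p/2": {"noindex": True, "status": 200},
}

changes = diff_snapshots(baseline, current)
for change in changes:
    print(change)
```

Here the product page gaining a noindex and the search page losing one are both reported, while the URL that only appears in the second snapshot is left out of the comparison.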
Most crawlers don’t handle enterprise sites well
As referenced above, size and complexity can be limiting factors for crawlers. Desktop-based applications, such as Screaming Frog, work well for smaller sites, but I have a hard time crawling more than 150k URLs. For an enterprise site, this won’t cut it. Even if you run Screaming Frog in the cloud, it still can’t handle many enterprise sites in full. With SaaS-based crawling platforms, there are two problems. First, some crawlers will hit a resource wall and not be able to handle your entire site. Second, you can run into a self-imposed wall, as most crawlers have tiered pricing that limits how many URLs you can crawl. In either situation, enterprise sites will rarely be able to crawl all their URLs. As such, you are only looking at part of your site, and you don’t know how much overlap there is with the previous crawl or how much change really exists.
Lack of organization
When you’re doing your SEO QA, you need to be able to identify problems and patterns quickly. This is difficult because most crawlers group output by a given metric, for example: here are all the URLs that are noindexed. As covered above, noindex isn’t necessarily a bad thing. So when you get a list of URLs that are noindexed, you have to go in and investigate which pages are impacted and whether that is OK or a problem.
Big sites take a while to crawl. More URLs = more time required to crawl. It can take several hours just to crawl a couple hundred thousand URLs. If you’re waiting on a tool to crawl hundreds of thousands or millions of URLs before sending you a report, it’s going to take a while. With SEO quality assurance, you want to be fast: quickly identify problems and fix them before they adversely impact your rankings in Google.
Poor signal to noise ratio
When you’re focused on QA, the truth is you don’t have time to fix every problem. You know you have missing alt text, duplicate meta descriptions, short page titles, long URLs, etc. These aren’t going to tank your organic revenue. When you’re doing QA, what you’re looking for are the fires that are going to cause big problems. I’m not saying deep dives and the crawlers that do them don’t have their place, but when you’re validating a release, all that extra detail lowers the signal to noise ratio in your data and takes more time to sift through.
Framework for an SEO quality assurance checklist
There are three important components to performing SEO QA on a release: what you check, how reliable and fast your data is, and how you get your data.
SEO QA attributes
As referenced earlier, you don’t need to find every problem with your site when you’re reviewing a release. Save that for your site audit; this is about finding catastrophic problems. With this in mind, you should be looking at the things that will have the biggest impact on your site’s organic performance, such as whether all your pages are blocked by robots.txt or whether all your on-page content disappeared. Here are the attributes that you should be checking to avoid an organic search meltdown:
- meta robots noindex
- X-Robots-Tag noindex
- blocked by robots.txt
- wrong or missing canonical tags
- on-page content
- internal links
- nofollowed internal links
- nofollowed external links
- hreflang tags
- HTTP response codes
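To make the checklist concrete, here is a rough sketch of how several of these attributes could be pulled from a single fetched page using only the Python standard library. The class `SeoSignalParser`, the function `extract_signals`, its parameters, and the sample HTML are hypothetical names invented for this example. It assumes you already have the response body, headers, and status code from whatever fetcher you use, and it leaves out the robots.txt and on-page content checks.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class SeoSignalParser(HTMLParser):
    """Collects the handful of tags that matter for release QA."""

    def __init__(self):
        super().__init__()
        self.meta_robots = ""
        self.canonical = None
        self.hreflang = []  # (language code, href) pairs
        self.links = []     # (href, is_nofollow) pairs

    def handle_starttag(self, tag, attrs):
        attr = {name: (value or "") for name, value in attrs}
        rel = attr.get("rel", "").lower().split()
        if tag == "meta" and attr.get("name", "").lower() == "robots":
            self.meta_robots = attr.get("content", "").lower()
        elif tag == "link" and "canonical" in rel:
            self.canonical = attr.get("href")
        elif tag == "link" and "alternate" in rel and attr.get("hreflang"):
            self.hreflang.append((attr["hreflang"], attr.get("href")))
        elif tag == "a" and attr.get("href"):
            self.links.append((attr["href"], "nofollow" in rel))


def extract_signals(html, headers, status, site_host):
    """Summarize one page; `site_host` decides what counts as internal."""
    parser = SeoSignalParser()
    parser.feed(html)
    header_map = {name.lower(): value.lower() for name, value in headers.items()}

    def is_internal(href):
        # Links without a host (relative links) are treated as internal.
        return urlparse(href).netloc in ("", site_host)

    links = parser.links
    return {
        "status": status,
        "meta_noindex": "noindex" in parser.meta_robots,
        "x_robots_noindex": "noindex" in header_map.get("x-robots-tag", ""),
        "canonical": parser.canonical,
        "hreflang": parser.hreflang,
        "internal_links": sum(1 for href, _ in links if is_internal(href)),
        "nofollow_internal": sum(1 for href, nf in links if nf and is_internal(href)),
        "nofollow_external": sum(1 for href, nf in links if nf and not is_internal(href)),
    }


sample_html = """
<html><head>
<meta name="robots" content="noindex, follow">
<link rel="canonical" href="https://example.com/widgets">
<link rel="alternate" hreflang="de" href="https://example.com/de/widgets">
</head><body>
<a href="/about">About</a>
<a href="https://partner.example.org/" rel="nofollow">Partner</a>
</body></html>
"""
signals = extract_signals(sample_html, {"X-Robots-Tag": "noindex"}, 200, "example.com")
print(signals)
```

Returning a flat dict per URL keeps the output easy to store after each release and compare with the previous one.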
Quality data, fast
You need to get a consistent baseline, cover all your bases (page types), and do it quickly. You don’t need to attempt to crawl every page each release. The best way to achieve this is with sampling. The big problems that are going to tank your SEO performance are going to happen not in one-off instances but at the template level. This means you only need to check a relatively small number of URLs for each page type that you care about, such as 100 product detail page URLs.
This approach will ensure that you are comparing the same URLs every week to see if there is actually a change and it is really fast since you’re crawling a small percentage of your site.
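One way to get a fixed sample, sketched here under assumptions, is to order each page type's URLs by a hash of the URL and keep the first N. The function name `stable_sample` and the example URLs are made up for illustration, and hashing is just one option; saving the chosen list to a file after the first run works equally well. The point is that the selection does not depend on crawl order or on a random seed, so the same URLs are checked release after release.

```python
import hashlib


def stable_sample(urls, size):
    """Pick `size` URLs in a way that does not depend on input order."""
    def hash_key(url):
        return hashlib.sha256(url.encode("utf-8")).hexdigest()

    # Sorting by hash gives an arbitrary but repeatable ordering.
    return sorted(set(urls), key=hash_key)[:size]


# Hypothetical product detail page URLs.
product_urls = [f"https://example.com/product/{n}" for n in range(1, 1001)]

sample = stable_sample(product_urls, 100)
same_sample = stable_sample(list(reversed(product_urls)), 100)
print(len(sample), sample == same_sample)
```

A URL that is already in the sample stays there unless a newly added URL happens to sort ahead of it, so the sample drifts only slightly as the catalog grows.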
Another benefit of creating structured groups is that the URLs are grouped by common characteristics so you don’t have to sort through a list of URLs with noindex tags to find a pattern. This makes your data analysis much faster. Below are some examples of groups that reflect different page types and expected behaviors of different pages.
| Ecommerce Pages | SaaS & Community Site Pages | Expected Behaviors |
|---|---|---|
| product detail pages | features pages | https pages |
| product category pages | customer support pages | pages that should 301 |
| sale pages | content / resources pages | pages that should be noindexed |
| manufacturer pages | user profile pages | pages that should be blocked by robots.txt |
| key landing pages | user profile pages | 404 pages |
| store location pages | search result pages | canonicaled pages |
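Groups like these can carry their expected behavior with them, so the output is organized by page type rather than by metric. The sketch below is illustrative only: the group names, the `EXPECTED` mapping, the `check_groups` function, and the observed values are all assumptions for the example, and the observed data would come from whatever fetches your sampled URLs.

```python
# Expected behavior per group (hypothetical group names and rules).
EXPECTED = {
    "product_detail": {"status": 200, "noindex": False},
    "should_301": {"status": 301},
    "should_noindex": {"status": 200, "noindex": True},
}


def check_groups(groups, observed):
    """Return {group: [(url, attribute, expected, actual), ...]}."""
    report = {}
    for group, urls in groups.items():
        failures = []
        for url in urls:
            seen = observed.get(url, {})
            for attribute, expected in EXPECTED[group].items():
                actual = seen.get(attribute)
                if actual != expected:
                    failures.append((url, attribute, expected, actual))
        report[group] = failures
    return report


groups = {
    "product_detail": ["/p/1", "/p/2"],
    "should_301": ["/old-category"],
    "should_noindex": ["/search?q=a"],
}
# Made-up post-release observations: every product page picked up a noindex.
observed = {
    "/p/1": {"status": 200, "noindex": True},
    "/p/2": {"status": 200, "noindex": True},
    "/old-category": {"status": 301},
    "/search?q=a": {"status": 200, "noindex": True},
}

report = check_groups(groups, observed)
for group, failures in report.items():
    print(f"{group}: {len(failures)} failure(s) across {len(groups[group])} URL(s)")
```

When every URL in a group fails the same check, as the product pages do here, that points to a template-level problem rather than a one-off.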
How to do your SEO QA
There are a few different ways that you can use this sampling and grouping method for your SEO quality assurance efforts. The first is to work with your web development team to develop a series of scripts that they will use to check the attributes above. The limitations here are that a) the scripts must be run manually, b) typically only the QA team will get the results, and c) you won’t have a baseline. Another option is to create predetermined lists of URLs for your groups and crawl these URLs after every release. The drawbacks here are that a) it is very manual, b) you won’t have much of a baseline, and c) it’s going to be a pain either re-running the crawl for each individual group or doing a bunch of analysis afterwards to segment the results into groups.
Right now there aren’t great options, but the best things you can do are:
- Focus on sampling and groups
- Don’t get bogged down with metrics and checks that don’t matter for QA
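As an example of the kind of check such a script might run, here is a small sketch that tests a predetermined URL list against a robots.txt body using Python's standard `urllib.robotparser`. The function name `blocked_urls`, the rule sets, and the URLs are assumptions for the example. Note also that the standard library parser does simple prefix matching and may not agree with Google's own parser in every edge case, so treat it as a smoke test for something like a stray “disallow: /” rather than a full emulation.

```python
from urllib.robotparser import RobotFileParser


def blocked_urls(robots_txt, urls, user_agent="Googlebot"):
    """Return the URLs that the given robots.txt text disallows."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is needed here.
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch(user_agent, url)]


# Hypothetical rule sets: staging blocks everything, production only search.
staging_rules = "User-agent: *\nDisallow: /\n"
production_rules = "User-agent: *\nDisallow: /search\n"

urls = [
    "https://example.com/",
    "https://example.com/p/1",
    "https://example.com/search?q=a",
]

blocked_with_staging = blocked_urls(staging_rules, urls)
blocked_with_production = blocked_urls(production_rules, urls)
print(len(blocked_with_staging), blocked_with_production)
```

If the staging rules ever reach production, every sampled URL comes back blocked, which is exactly the kind of all-or-nothing signal that is hard to miss.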