Ecommerce URLs & Parameters
While the optimization of URL structures (covered earlier) is helpful, another element to review is the set of potential technical problems created by your URLs.
Though search engines have come a long way and no longer have issues with many URL structures (such as using underscores instead of dashes), parameters still pose a significant problem. Previously, search engines were largely confused by parameters; now they crawl parameterized URLs readily, which can lead to significant duplicate content problems, dilution of link equity, and wasted crawl bandwidth.
The most common parameter types found on ecommerce sites include:
- Tracking parameters (analytics, path/session based, etc)
- Product variations
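For example, a tracking parameter typically creates a duplicate of the page it is appended to (the URLs below are hypothetical):

/mens/shoes
/mens/shoes?utm_source=newsletter

Both URLs return identical content, but search engines may treat them as two separate pages.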
Google will attempt to figure out whether a parameter is significant in order to resolve the duplicate content problem, but even if Google gets this right (and that is often a big if), we can still run into problems with diluted link equity and wasted crawl bandwidth.
There are four ways to attack parameter problems: the robots.txt file, the canonical tag, the noindex tag, and parameter handling in Google Webmaster Tools. None of them is a perfect solution, though.
Robots.txt
The robots.txt file is the simplest solution and often the fastest to get implemented. You can use your robots.txt file to prevent search engines from crawling parameterized URLs entirely. To do this, identify the common parameter you want to keep from being crawled, then add a Disallow directive for it to your robots.txt file.
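As a minimal sketch, assuming a hypothetical tracking parameter named sessionid (Google supports the * wildcard in robots.txt paths):

User-agent: *
Disallow: /*?*sessionid=

This blocks crawling of any URL whose query string contains sessionid, while leaving the parameter-free versions of those pages crawlable.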
There are a couple of problems with this solution. The first is that adding a directive to the robots.txt file will not remove a URL from the index; it only prevents crawling. If you need to remove a URL from the index, look into the noindex directive, the canonical tag, or the Webmaster Tools URL Removal Tool. The second problem is that if any link equity is associated with the parameter-based URLs (such as when someone links to a category page with a filter applied, or to a URL carrying campaign tracking parameters), you will be orphaning that link equity.
If you are adding parameter-based functionality to your site, blocking the parameters in robots.txt effectively keeps those pages from being crawled, and therefore from being indexed. Again, the downside is that any links pointing to these pages will be lost and the link equity will not be associated with the page.
Canonical Tag
To employ the canonical tag, you need to be able to modify the <head> section of every page, or alternatively insert the canonical tag in the HTTP header of a page. It works by telling search engines which URL is the canonical, or best, version of the page and its content.
If your parameter URL were /mens/shoes?color=black and it were causing a duplicate content problem with your shoes category page (/mens/shoes), you could add the canonical tag to the parameterized page, pointing to the clean URL:
<link rel="canonical" href="http://store.com/mens/shoes" />
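If you cannot modify the <head> section, Google also supports declaring the canonical in the HTTP header via a Link header; a sketch using the hypothetical store.com URL from above:

Link: <http://store.com/mens/shoes>; rel="canonical"

This is most useful for non-HTML resources (such as PDFs), where there is no <head> section to edit.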
Depending on how your site is set up, you may be able to canonical filtered URLs to a more relevant category page rather than to the parameter-less version of the URL (e.g., /mens/shoes/black instead of /mens/shoes).
Implementing this solution should allow you to remove the parameter-based URLs from the index, solving duplicate content issues (though it does not do this 100% of the time). Further, this solution helps consolidate any link equity pointing at the parameter-based URLs onto the proper URLs.
As the canonical tag is able to concentrate link equity, it is typically one of the optimal solutions for duplicate content problems.
That said, it is not without its drawbacks. The canonical tag doesn't help manage crawl bandwidth at all. If you have a huge site with many parameter-based pages, you could find Googlebot spending a lot of time on the wrong pages with this solution. To mitigate the crawling, you'll want to update the parameter settings in Google Webmaster Tools to ignore the parameters. This should reduce the crawl resources being wasted on your parameter pages, though it is only a suggestion for Googlebot, not a directive.
Meta Robots Noindex Tag
The meta robots tag is another solution that requires being able to modify the <head> section of your site. The sole function of this tag is to direct search engines not to index a page (or to remove it from the index). To do this, simply add the following code to the <head> section of the page you want to keep out of the index:
<meta name="robots" content="noindex,follow">
While this is a great tool to prevent a page from being indexed, it has limitations. The first is that you may not be able to make the tag affect only parameterized URLs: your site configuration will determine whether the tag can be applied to parameter-specific URLs or only to the base URL. For many configurations this is not the right solution, because you want the base page indexed. It can be the right answer if you have two different versions of a page and you only want one indexed.
As with the canonical tag, this tag won't prevent search engines from crawling the URLs, so it does nothing to conserve crawl bandwidth.
X Robots HTTP Header Noindex Tag
The X Robots HTTP Header works the same as the meta robots noindex tag, but it is applied in the HTTP headers rather than the <head> section. This is useful when you can’t access the <head> section of a site.
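A minimal sketch of the header (X-Robots-Tag is the header name Google supports; it accepts the same directives as the meta robots tag):

X-Robots-Tag: noindex

Like the meta tag, this must be served with the specific URLs you want kept out of the index.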
You can learn more about applying noindex in the HTTP header in Google's Webmaster documentation.
Google Webmaster Tools Parameter Handling
The final option for preventing parameter-based URL problems is to change how Google handles URL parameters in Google Webmaster Tools. To change these settings, sign in to Webmaster Tools, click Crawl, and then URL Parameters.
The default behavior for any parameter that Google has detected is "Let Googlebot Decide". To change how Google treats a parameter, click Edit. From there you will see a screen asking what the parameter does.
Select whether or not the parameter changes the content on the page. If you select that it does not change the content, simply click save. If the parameter does alter the content, you will have to tell Google what it does.
From there you can tell Google how to crawl URLs with the given parameter. If you’re having duplicate content problems as a result of parameters, you most likely want to select “No URLs”. This should prevent Google from crawling the URLs, much like the robots.txt directive.
- Determine whether parameters are causing duplicate content on your site
- Review how the canonical tag, robots.txt, meta robots, and Webmaster Tools parameter settings are currently applied
- Implement updates to the canonical tag, robots.txt, meta robots, or Webmaster Tools settings as needed