Reduce Duplicate Content in WordPress Archives with robots.txt

My WordPress blog uses this URL structure for posts: /blog/year/month/day/post-title/

Setting aside that a post can also appear under /blog/, /blog/category/, and other places, the same post shows up at every one of these date-based URLs:

/blog/2007/

/blog/2007/07/

/blog/2007/07/21/

/blog/2007/07/21/post-title/

The duplication is not as big a concern (to me) as the potentially poor user experience when someone lands on one of the first three URL patterns from a search engine results page. Presumably, they’re interested in reading a particular post. If the URL doesn’t go directly to the post, they have to scroll through the page to look for it, or use Find. Sometimes results from those multi-post pages are irrelevant altogether, because the matching keywords come from posts that have nothing to do with each other.

I’ve been thinking of using robots.txt to block crawlers from the year, month, and day archives, but wasn’t sure how to preserve the posts, since their URLs start with the same patterns. Using the robots.txt analysis tool in Google’s Webmaster Tools, I played around with some rules and tested them against each URL pattern. I came up with a solution that preserves the posts while blocking the archives and the feeds. :)

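In robots.txt, add these lines (this assumes you have a User-agent line already). The Allow pattern only matches URLs with at least four slash-terminated segments after /blog/, which is the post itself. Note that the 200 prefix only covers the years 2000 through 2009: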

Allow: /blog/200*/*/*/*/
Disallow: /blog/200

Example results according to Webmaster Tools:

URL                                             Googlebot
http://takethu.com/blog/2007/                   Blocked by line 15: Disallow: /blog/200
http://takethu.com/blog/2007/07/                Blocked by line 15: Disallow: /blog/200
http://takethu.com/blog/2007/07/21/             Blocked by line 15: Disallow: /blog/200
http://takethu.com/blog/2007/07/21/post-title/  Allowed by line 14: Allow: /blog/200*/*/*/*/
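
If you want to sanity-check rules like these offline before pasting them into robots.txt, here is a minimal Python sketch. It is not Google's implementation. It assumes the precedence Google currently documents (the longest matching rule wins, and Allow wins a tie), treats `*` as "any run of characters", and measures rule length simply as the length of the pattern string. The names `pattern_to_regex`, `is_allowed`, and `RULES` are made up for this example.

```python
import re

# The two rules from this post, as (directive, pattern) pairs.
RULES = [
    ("allow", "/blog/200*/*/*/*/"),
    ("disallow", "/blog/200"),
]


def pattern_to_regex(pattern):
    """Turn a robots.txt path pattern into a compiled regex.

    Assumption: '*' matches any run of characters and a trailing '$'
    anchors the end of the URL path. Everything else is literal.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_allowed(path, rules):
    """Return True if `path` may be crawled under `rules`.

    Assumption: the longest matching pattern wins, Allow wins a tie,
    and a path that matches no rule is allowed.
    """
    best = None
    for directive, pattern in rules:
        # re.match anchors at the start of the path, like a prefix rule.
        if pattern_to_regex(pattern).match(path):
            candidate = (len(pattern), directive == "allow")
            if best is None or candidate > best:
                best = candidate
    return True if best is None else best[1]


if __name__ == "__main__":
    for path in (
        "/blog/2007/",
        "/blog/2007/07/",
        "/blog/2007/07/21/",
        "/blog/2007/07/21/post-title/",
    ):
        print(path, "allowed" if is_allowed(path, RULES) else "blocked")
```

Run as a script, this reports the first three paths as blocked and the post URL as allowed, which agrees with the table above. Treat it as a quick check only and confirm the real behavior with the search engine's own testing tool.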

If you haven’t tried out Webmaster Tools, maybe you can see now how useful it is. I’m recommending it as a webmaster myself, and not because I work at Google. :)

Aside:

I’ve read that this particular pattern (/year/month/day/post-title/) may not help ranking, since a search engine might not like content that sits several directories deep. That’s too bad. I personally find it helpful to see the date in the URL because it indicates how timely the content is. It also resembles a file structure on a computer: go up one directory and you find the other posts from the same date, and so on. We’re supposed to build sites for users, not search engines. Although there are times when a compromise is made to appease search engines so they can help users find content, this was something I didn’t want to compromise on. The pattern is also useful if I ever want to reuse a post title at a different time without putting a post ID in the URL, which isn’t as informative as the date structure.

Before applying these changes to your own site, please test the patterns thoroughly against your own site’s URLs. I’m providing this information as a starting point so bloggers can get an idea of what patterns to use. The responsibility for proper implementation still rests with the webmaster.
