Tag Archives: robots.txt

Webmaster Tool: Find Sources of 404 Errors

Thanks to Matt Cutts for the heads up on how to use Google Webmaster Tools to get more information on our websites’ 404 File Not Found errors. As the Google Webmaster Central Blog announced: Webmaster Tools shows Crawl error sources.

For those who have not used the diagnostic tools of Webmaster Central, one of the informative features is getting a listing of the urls that resulted in 404 File Not Found errors. Previously, it was an exercise in futility, though, since we did not know where the incorrect link(s) originated. Now that we know where the sources are, we have a better chance of correcting the links.

When I went to find out what caused my site’s 404 errors, it was pretty enlightening. I found what appeared to be a hack attempt on my blog: some bad urls in the form of takethu.com/blog/page/NUM/?ref=BADSITE.COM. Fortunately, my blog was up-to-date, so those urls didn’t do anything malicious or put anything bad on the pages. I checked Google’s cache to confirm that there was no spam. However, those results did show up in a Google site search of my blog, so I needed to do something to get rid of them. This is what I added to my robots.txt to tell search engines to drop those urls from their indices:

Disallow: /blog/page/*/?ref=*

I love being able to use wildcards in robots.txt. Another nifty tool in Webmaster Tools is “Analyze robots.txt”, which enables testing of robots.txt disallow/allow patterns against actual urls to see if Googlebot will respond to the urls correctly.
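Wildcard support isn’t part of the original robots.txt spec (Python’s standard urllib.robotparser, for example, treats * literally), so here’s a minimal sketch of how Google-style wildcard matching can be emulated by translating a pattern to a regex. The helper names are mine, not from any library:

```python
import re

def robots_pattern_to_regex(pattern):
    """Translate a Google-style robots.txt path pattern to a regex.

    '*' matches any run of characters (including '/'); a trailing '$'
    anchors the pattern to the end of the url. Otherwise matching is
    prefix-based, so no trailing anchor is added.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape regex metacharacters, then stitch the pieces back
    # together with '.*' where each '*' wildcard was.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))

def matches(pattern, path):
    """True if the robots.txt pattern applies to this url path."""
    return robots_pattern_to_regex(pattern).match(path) is not None

# The Disallow rule from this post, tested against a suspicious url:
print(matches("/blog/page/*/?ref=*", "/blog/page/3/?ref=BADSITE.COM"))  # True
print(matches("/blog/page/*/?ref=*", "/blog/page/3/"))                  # False
```

This mirrors what the “Analyze robots.txt” tool does interactively: you can check that a pattern catches the junk urls without accidentally blocking legitimate pages.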

Another thing I found was that there was a broken file path for a flash file on my site. Once I found out on what page it occurred, I was able to come up with a solution and fix it.

Thanks to the Google Webmaster Tools team for giving us webmasters such a useful tool.

Reduce Duplicate Content in WordPress Archives with robots.txt

My WordPress blog url structure for posts is of this pattern: /blog/year/month/date/post-title/

Ignoring the fact that a post could also show up in /blog/ and /blog/category/ and other places, the same post could show up in:

/blog/2007/
/blog/2007/07/
/blog/2007/07/21/
/blog/2007/07/21/post-title/
The duplication is not as big of a concern (to me) as the possibly poor user experience when someone encounters one of the first 3 url patterns in a search engine results page. Presumably, they’re interested in reading a particular post. If they get a url that doesn’t go directly to the post, they’d have to scroll through the page to look for it, or use Find. Sometimes, results are irrelevant in those multi-post pages because the keywords are taken from posts that have nothing to do with each other.

I’ve been thinking of using robots.txt to block out indexing of the year, month, date archives, but wasn’t sure how to preserve the posts, since they also contain the same patterns in the url. Using Google’s Webmaster Tools robots.txt analysis tool, I played around with some patterns and tested the url patterns. I came up with the solution that preserves the posts while blocking the archives and the feeds. :)

In robots.txt, add these lines (this assumes you have a user-agent line already):

Allow: /blog/200*/*/*/*/
Disallow: /blog/200

Example results according to Webmaster Tools:

URL                                             Googlebot
http://takethu.com/blog/2007/                   Blocked by line 15: Disallow: /blog/200
http://takethu.com/blog/2007/07/                Blocked by line 15: Disallow: /blog/200
http://takethu.com/blog/2007/07/21/             Blocked by line 15: Disallow: /blog/200
http://takethu.com/blog/2007/07/21/post-title/  Allowed by line 14: Allow: /blog/200*/*/*/*/

If you haven’t tried out Webmaster Tools, maybe you can see now how useful it is. I’m recommending it as a webmaster myself, and not because I work at Google. :)


I’ve read that this particular pattern (y/m/d/p/) isn’t helpful for ranking, since a search engine might not like that the content is a few directories deep. That’s too bad. I personally find it helpful to see the date in the url because it indicates how timely the content is. It’s also similar to a file structure on a computer: go up a directory, and there are the other files from the same date, and so on.

We’re supposed to build sites for users, not search engines, and although compromises are sometimes made to appease search engines so they can help users find content, this was something I didn’t want to compromise on. This pattern is also useful if I ever want to reuse the same post title at a different time, without resorting to a post id in the url, which isn’t as informative as the date structure.

Before applying these changes, please thoroughly test the patterns against your own site’s urls. I’m providing this information as a starting point so bloggers can have an idea of what patterns to use. The responsibility of proper implementation still rests upon the webmaster.

Woman Sues Archive.org

Colorado Woman Sues To Hold Web Crawlers To Contracts

The Internet Archive, archive.org, goes around the web and stores copies of websites indefinitely. This is a useful resource when the information is no longer available and you want to view a copy of it. It’s not so great for webmasters who want to have no evidence of old content, either to maintain privacy or avoid embarrassment, etc.

Personally, I can relate to how that woman feels about archive.org archiving site contents; I felt embarrassed about mine because I was a little silly when I started out on the web. Instead of suing them, though, I did what any reasonable webmaster would do. I added the following to that site’s robots.txt:

User-agent: ia_archiver
Disallow: /

Consequently, that site can’t be found in archive.org. You can’t see my Pooh bear-adorned website.

This woman’s lawsuit is ridiculous and frivolous. I hope it gets thrown out and she is forced to pay archive.org’s legal fees.

I got this story from slashdot, where I like to read the comments: http://yro.slashdot.org/article.pl?sid=07/03/17/1455214

Someone at slashdot found the site: profane-justice.org*. A whois lookup confirmed it is the same woman.

* This marks the first time that I have manually added nofollow code to my blog post. I initially wasn’t going to hyperlink it so she didn’t get credit, but then I didn’t want to inconvenience my visitors. I decided to hyperlink it, but switched to html mode and added rel="nofollow".

Update: it looks like archive.org has indeed removed the site: http://web.archive.org/web/*/profane-justice.org/ is showing “Blocked Site Error.” FYI, if a site did the correct thing by setting up the robots file, the message would say: “Robots.txt Query Exclusion.”