Webmaster Tool: Find Sources of 404 Errors

Thanks to Matt Cutts for the heads-up on how to use Google Webmaster Tools to get more information about our websites’ 404 File Not Found errors. As the Google Webmaster Central Blog announced: Webmaster Tools shows Crawl error sources.

For those who have not used the diagnostic tools of Webmaster Central, one of the informative features is a listing of the URLs that resulted in 404 File Not Found errors. Previously, this was an exercise in futility, since we did not know where the incorrect link(s) originated. Now that we know where the sources are, we have a much better chance of correcting the links.

When I went to find out what caused my site’s 404 errors, it was pretty enlightening. I discovered what appeared to be a hack attempt on my blog: bad URLs of the form takethu.com/blog/page/NUM/?ref=BADSITE.COM. Fortunately, my blog was up to date, so those URLs didn’t do anything malicious, nor did the resulting pages contain anything bad. I checked Google’s cache to confirm that there was no spam. However, those results did show up in a Google site search of my blog, so I needed to do something to get rid of them. This is what I added to my robots.txt to tell search engines to drop those URLs from their indices:

Disallow: /blog/page/*/?ref=*
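
For context, a Disallow line only takes effect inside a User-agent group, so a minimal robots.txt group containing this rule (assuming it should apply to all crawlers) would look roughly like this:

User-agent: *
Disallow: /blog/page/*/?ref=*

Note that the * wildcard is an extension supported by Google and some other major crawlers; it is not part of the original robots.txt standard.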

I love being able to use wildcards in robots.txt. Another nifty feature in Webmaster Tools is “Analyze robots.txt”, which lets you test robots.txt disallow/allow patterns against actual URLs to see whether Googlebot will handle them as expected.
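
For a quick local sanity check before reaching for the tool, Googlebot-style wildcard matching can be approximated in a few lines of Python. This is only a rough sketch of the matching rules (* matches any run of characters, $ anchors the end of the path), not Google’s actual implementation, and the helper name and test URLs below are made up for illustration:

    import re

    def googlebot_match(pattern, path):
        # Rough approximation of Googlebot's robots.txt matching:
        # '*' matches any run of characters, '$' anchors the end of the path.
        parts = []
        for ch in pattern:
            if ch == "*":
                parts.append(".*")
            elif ch == "$":
                parts.append("$")
            else:
                parts.append(re.escape(ch))
        return re.match("".join(parts), path) is not None

    rule = "/blog/page/*/?ref=*"  # the disallow pattern from this post

    # Hypothetical test paths:
    print(googlebot_match(rule, "/blog/page/3/?ref=badsite.com"))  # True  -> blocked
    print(googlebot_match(rule, "/blog/page/3/"))                  # False -> still crawlable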

I also found a broken file path for a Flash file on my site. Once I knew which page it occurred on, I was able to come up with a solution and fix it.

Thanks to the Google Webmaster Tools team for giving us webmasters such a useful tool.

2 thoughts on “Webmaster Tool: Find Sources of 404 Errors”

  1. Hi Thu – the final asterisk isn’t needed in your disallow rule. Yay, saved another byte :-)) (I imagine the functionality is exactly the same though). Also, I’d suggest using the URL removal tool for those URLs; by using a “disallow” you’re just preventing Googlebot from recrawling the URLs; they can still be indexed.

  2. Hey John,

    Thanks for stopping by! I appreciate the tip about the asterisk. I had several of those, so it would save more than a byte. :)

    I looked at the URL removal tool, but there seemed to be no efficient way for me to remove the existing patterns as well as any that may get created in the future. If I could enter wildcards, that would be cool, but it’s not clear whether that is possible.

    That’s an interesting point about the effect of disallow with regard to crawling and indexing. I used this robots.txt method in the past to get rid of duplicate patterns, and it seemed to work, though slowly over time. Another thing is that robots.txt also tells other search engines not to crawl the undesirable stuff. Still, it’s good advice for anyone who stops by with a situation where the URL removal tool would be effective.
