Tag Archives: webmaster

Webmaster Tool: Find Sources of 404 Errors

Thanks to Matt Cutts for the heads up on how to use Google Webmaster Tools to get more information on our websites’ 404 File Not Found errors. As the Google Webmaster Central Blog announced: Webmaster Tools shows Crawl error sources.

For those who have not used the diagnostic tools of Webmaster Central, one of the informative features is getting a listing of the urls that resulted in 404 File Not Found errors. Previously, it was an exercise in futility, though, since we did not know where the incorrect link(s) originated. Now that we know where the sources are, we have a better chance of correcting the links.

When I went to find out what caused my site’s 404 errors, it was pretty enlightening. I discovered what appeared to be a hack attempt on my blog: some bad urls in the form of takethu.com/blog/page/NUM/?ref=BADSITE.COM. Fortunately, my blog was up-to-date, so those urls didn’t do anything malicious, and the pages didn’t contain anything bad. I checked Google’s cache to confirm that there was no spam. However, those results did show up in a Google site search of my blog, so I needed to do something to get rid of them. This is what I added to my robots.txt to tell search engines to drop those urls from their indices:

Disallow: /blog/page/*/?ref=*

I love being able to use wildcards in robots.txt. Another nifty tool in Webmaster Tools is “Analyze robots.txt”, which enables testing of robots.txt disallow/allow patterns against actual urls to see if Googlebot will respond to the urls correctly.
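Wildcards aren’t part of the original robots.txt spec, so it helps to understand how Googlebot interprets them: `*` matches any run of characters, `$` anchors the end of the path, and everything else matches as a literal prefix. Here’s a minimal local sketch of that matching (my own simplified approximation, not Google’s actual code; the urls are just illustrative):

```python
import re

def robots_pattern_to_regex(pattern):
    """Translate a Googlebot-style robots.txt path pattern into a regex.

    '*' matches any sequence of characters; a trailing '$' anchors the
    end of the path; everything else matches literally as a prefix.
    """
    regex = re.escape(pattern).replace(r"\*", ".*")
    if regex.endswith(r"\$"):
        regex = regex[:-2] + "$"
    return re.compile(regex)

def matches(pattern, path):
    return robots_pattern_to_regex(pattern).match(path) is not None

# The pattern from this post, targeting the bad referral urls:
pattern = "/blog/page/*/?ref=*"
print(matches(pattern, "/blog/page/3/?ref=BADSITE.COM"))  # True: blocked
print(matches(pattern, "/blog/page/3/"))  # False: normal paging urls stay indexable
```

A quick check like this can catch a pattern that accidentally blocks legitimate urls before it ever reaches the live robots.txt.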

Another thing I found was a broken file path for a flash file on my site. Once I found out which page it occurred on, I was able to come up with a solution and fix it.

Thanks to the Google Webmaster Tools team for giving us webmasters such a useful tool.

Reduce Duplicate Content in WordPress Archives with robots.txt

My WordPress blog url structure for posts is of this pattern: /blog/year/month/date/post-title/
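For reference, this is the kind of custom permalink structure WordPress uses to produce those urls. Assuming the blog is installed under /blog/, the permalink structure field (under Options → Permalinks, or Settings → Permalinks in newer versions) would look something like:

```
/%year%/%monthnum%/%day%/%postname%/
```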

Ignoring the fact that a post could also show up in /blog/ and /blog/category/ and other places, the same post could show up in:

/blog/year/
/blog/year/month/
/blog/year/month/date/
/blog/year/month/date/post-title/
The duplication is not as big of a concern (to me) as the possibly poor user experience when someone encounters one of the first 3 url patterns in a search engine results page. Presumably, they’re interested in reading a particular post. If they get a url that doesn’t go directly to the post, they’d have to scroll through the page to look for it, or use the browser’s Find. Sometimes the results from those multi-post pages are irrelevant, because the matched keywords come from posts that have nothing to do with each other.

I’ve been thinking of using robots.txt to block out indexing of the year, month, date archives, but wasn’t sure how to preserve the posts, since they also contain the same patterns in the url. Using Google’s Webmaster Tools robots.txt analysis tool, I played around with some patterns and tested the url patterns. I came up with the solution that preserves the posts while blocking the archives and the feeds. :)

In robots.txt, add these lines (this assumes you have a user-agent line already):

Allow: /blog/200*/*/*/*/
Disallow: /blog/200

Example results according to Webmaster Console:

URL                                             Googlebot
http://takethu.com/blog/2007/                   Blocked by line 15: Disallow: /blog/200
http://takethu.com/blog/2007/07/                Blocked by line 15: Disallow: /blog/200
http://takethu.com/blog/2007/07/21/             Blocked by line 15: Disallow: /blog/200
http://takethu.com/blog/2007/07/21/post-title/  Allowed by line 14: Allow: /blog/200*/*/*/*/
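These results follow from Googlebot’s precedence rule: when several patterns match a url, the longest (most specific) pattern wins, and Allow wins a tie. That is why the longer Allow line rescues post urls from the shorter Disallow. Here’s a rough local sketch of that logic (the precedence rule is documented by Google; the matcher itself is my own simplified approximation):

```python
import re

def to_regex(pattern):
    # '*' matches any character run; trailing '$' anchors the end;
    # everything else matches literally as a prefix.
    regex = re.escape(pattern).replace(r"\*", ".*")
    if regex.endswith(r"\$"):
        regex = regex[:-2] + "$"
    return re.compile(regex)

def googlebot_allows(rules, path):
    """rules: list of (directive, pattern) pairs from one user-agent group.
    The longest matching pattern wins; 'allow' wins a length tie."""
    verdict, best_len = "allow", -1  # no matching rule means allowed
    for directive, pattern in rules:
        if to_regex(pattern).match(path) and (
            len(pattern) > best_len
            or (len(pattern) == best_len and directive == "allow")
        ):
            verdict, best_len = directive, len(pattern)
    return verdict == "allow"

rules = [("allow", "/blog/200*/*/*/*/"), ("disallow", "/blog/200")]
for path in ("/blog/2007/", "/blog/2007/07/", "/blog/2007/07/21/",
             "/blog/2007/07/21/post-title/"):
    print(path, googlebot_allows(rules, path))  # only the post url is allowed
```

Note the Allow pattern requires four more slashes after /blog/200, which only full post urls have; the year, month, and date archives fall through to the Disallow.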

If you haven’t tried out Webmaster Tools, maybe you can see now how useful it is. I’m recommending it as a webmaster myself, and not because I work at Google. :)


I’ve read that this particular pattern (y/m/d/p/) isn’t helpful for ranking, since a search engine might not like that the content sits a few directories deep. That’s too bad. I personally find it helpful to see the date in the url because it indicates how timely the content is. It’s also similar to a file structure on a computer: go up a directory and you find other files from the same date.

We’re supposed to build sites for users, not search engines. Although compromises are sometimes made to appease search engines so they can help users find content, this was something I didn’t want to compromise on. This pattern also lets me reuse the same post title at different times without resorting to a post id in the url, which isn’t as informative as the date structure.

Before applying these changes to your own site, please thoroughly test the patterns from your own site. I’m providing this information as a starting point so bloggers can have an idea of what patterns to use. The responsibility of proper implementation still rests upon the webmaster.

Webmasters: Check Site’s Keywords in Stats

One of the things I am most interested in when reviewing my site’s statistics is to see the keywords/queries that helped people find my site. It encourages me as a blogger because it shows that people are looking for topics that I am writing about, and they are able to find it accordingly.

A bonus effect of checking a site’s keyword stats is becoming aware of hidden issues with the site. Recently, I upgraded my WordPress blog to version 2.2. It turned out that this affected one of my plugins, which denoted private posts. I wasn’t able to see the problem while logged in to my blog as admin. In the stats, I saw that there was a query for some error message and wondered why my blog would show up for that error. I did a search and found the page that showed the error message to Googlebot.

It turned out that WordPress 2.2 made that plugin obsolete because it now denotes private posts natively. However, the two pieces of code conflicted. I couldn’t see the error because, as admin, I saw the post; a visitor who is not logged in, like Googlebot, would not see the private post and instead saw errors caused by the plugin. I checked the post while logged out and saw the error. I deactivated the plugin and the problem went away. Now it’s just a matter of time for Googlebot to update its cache.

Another good reason to check a site’s keyword stats is to see if there are unusual results. Some people whose sites get hacked don’t know about it until they review their stats. Even then, some think Google is broken because it is sending off-topic traffic, such as for porn or pharmaceutical queries. However, by replicating the search, it’s possible to find that the content does exist on the site. There might not be corresponding files, though, because the hacker could have used .htaccess to dynamically generate content.