Tag Archives: seo

Reduce Duplicate Content in WordPress Archives with robots.txt

My WordPress blog url structure for posts is of this pattern: /blog/year/month/date/post-title/

Ignoring the fact that a post could also show up in /blog/ and /blog/category/ and other places, the same post could show up in:

/blog/2007/

/blog/2007/07/

/blog/2007/07/21/

/blog/2007/07/21/post-title/

The duplication concerns me less than the poor user experience when someone lands on one of the first three url patterns from a search engine results page. Presumably, they’re interested in reading a particular post. If the url doesn’t go directly to the post, they have to scroll through the page to look for it, or use Find. Sometimes the result isn’t even relevant, because the matching keywords on a multi-post page come from posts that have nothing to do with each other.

I’ve been thinking of using robots.txt to block crawling of the year, month, and date archives, but wasn’t sure how to preserve the posts, since their urls begin with the same patterns. Using the robots.txt analysis tool in Google’s Webmaster Tools, I played around with some rules and tested them against my urls. I came up with a solution that preserves the posts while blocking the archives and the feeds. :)

In robots.txt, add these lines (this assumes you have a user-agent line already):

Allow: /blog/200*/*/*/*/
Disallow: /blog/200

Example results according to Webmaster Tools:

URL                                              Googlebot
http://takethu.com/blog/2007/                    Blocked by line 15: Disallow: /blog/200
http://takethu.com/blog/2007/07/                 Blocked by line 15: Disallow: /blog/200
http://takethu.com/blog/2007/07/21/              Blocked by line 15: Disallow: /blog/200
http://takethu.com/blog/2007/07/21/post-title/   Allowed by line 14: Allow: /blog/200*/*/*/*/
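
To see why the two rules sort the urls this way, here is a minimal Python sketch of the matching. It assumes the precedence Google documents today (the longest matching pattern wins, and Allow wins a tie); the 2007 analysis tool may have resolved conflicts differently. The function names are mine, not part of any tool.

```python
import re

def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex.

    '*' matches any run of characters and a trailing '$' anchors the end.
    Everything else is literal, and matching starts at the beginning of the path.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))

def is_allowed(path, rules):
    """Return True if `path` may be crawled under `rules`.

    `rules` is a list of (directive, pattern) pairs. Assumption: the longest
    matching pattern wins, Allow wins a tie, and no match means allowed.
    """
    best_len, allowed = -1, True
    for directive, pattern in rules:
        if pattern_to_regex(pattern).match(path):
            is_allow = directive.lower() == "allow"
            if len(pattern) > best_len or (len(pattern) == best_len and is_allow):
                best_len, allowed = len(pattern), is_allow
    return allowed

# The two rules from this post.
RULES = [
    ("Allow", "/blog/200*/*/*/*/"),
    ("Disallow", "/blog/200"),
]

for path in ["/blog/2007/", "/blog/2007/07/", "/blog/2007/07/21/",
             "/blog/2007/07/21/post-title/"]:
    print(path, "allowed" if is_allowed(path, RULES) else "blocked")
```

The Allow pattern needs four slashes after "200", so only urls at post depth or deeper match it. The three archive urls match only the shorter Disallow rule.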

If you haven’t tried out Webmaster Tools, maybe you can see now how useful it is. I’m recommending it as a webmaster myself, and not because I work at Google. :)

Aside:

I’ve read that this particular pattern (year/month/date/post-title/) isn’t helpful for ranking, since a search engine might not like that the content is a few directories deep. That’s too bad. I personally find it helpful to see the date in the url because it indicates how timely the content is. It’s also similar to a file structure on a computer: go up a directory and you find the other posts from the same date, then the same month, and so on. We’re supposed to build sites for users, not search engines. There are times when a compromise is made to help search engines help users find content, but this was something I didn’t want to compromise on. The pattern is also useful if I ever want to reuse a post title at a different time without putting a post id in the url, which isn’t as informative as the date structure.

Before applying these changes to your own site, please thoroughly test the patterns from your own site. I’m providing this information as a starting point so bloggers can have an idea of what patterns to use. The responsibility of proper implementation still rests upon the webmaster.

How to Minimize Duplicate Content in WordPress Blog

After reading this post at Google’s Webmaster Group, I was inspired to find out how to stop displaying full content on pages where posts are listed in categories, archives, etc. I am less concerned about duplicate content than about the experience of someone who clicks a search result for a category page and then has to look around the page to find the post, or can’t find it at all because the post has since moved to another page in the category.

I didn’t know how to go about doing it so I searched. This page, “Showing full posts on homepage, but snippets elsewhere”, was a good start.

In short, the key is to edit the theme’s archive.php:

Change:

<?php the_content(); ?>

to:

<?php the_excerpt(); ?>

A while back, I had copied the archive.php from the Default theme for another purpose, because Ocadia didn’t have its own version of the file. I ended up not using the file for anything important so I didn’t modify it much. Once I edited it to show snippets, or excerpts, on category and archive pages, I noticed that the listings didn’t look like the ones on the homepage. For consistency, I copied over the code inside <div class="post"> from the theme’s index.php. With that change, pages for categories and archives no longer showed the posts in their entirety.

However, posts listed on “previous pages”, such as the one linked to at the bottom of the home page, continued to be a source of duplicate content. In Ocadia’s index.php, the code for the search page is already written to show excerpts. I extended that condition so that paged listings show excerpts, too, like this:

<?php if (is_search() || is_paged()) { ?>

<?php the_excerpt(); ?>

Now, if you go to the deeper pages of the index, category, etc., they show post excerpts.
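
To summarize which listings end up with excerpts after both edits, here is a small Python sketch of the decision. It is only an illustration of the logic, not WordPress code; the `view` labels and the function name are made up for this example.

```python
def listing_mode(view, page_number=1):
    """Return 'excerpt' or 'full' for a post shown in a listing.

    `view` is a hypothetical label: 'home', 'search', 'category', or 'archive'.
    """
    if view in ("category", "archive"):
        return "excerpt"  # archive.php now calls the_excerpt()
    if view == "search" or page_number > 1:
        return "excerpt"  # mirrors is_search() || is_paged() in index.php
    return "full"         # first page of the home listing keeps full posts

for view, page in [("home", 1), ("home", 2), ("search", 1), ("category", 1)]:
    print(view, page, listing_mode(view, page))
```

Only the first page of the home listing still shows posts in full.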

I also read recommendations to do the same to the home page. When I made the change there, I did not find it aesthetically pleasing. The excerpted posts made the home page of the blog look like a splog that had scraped its content, which is not a good first impression for visitors. This was a situation where user experience trumped search engine optimization.

I just made these changes tonight. Time will tell if this will help or not.

How to Switch Title Tokens in WordPress

I wanted to switch my blog’s title tag to go from:

Thu Tu’s Blog > Post Title

to

Post Title | Thu Tu’s Blog

It turned out that it wasn’t as simple as switching the tokens around in the theme’s header.php file.

I did a search and found the solution here.

When I copied and pasted the example code, I got a parse error. It turned out that the blogger had changed the ' to `. If you want to copy and paste code, here’s what I’m using:

<title><?php wp_title(' '); ?><?php if(wp_title(' ', false)) { echo ' | '; } ?><?php bloginfo('name'); ?></title>

Since I use Google Custom Search Engine for my blog search, this should improve the experience of searching for a post, since the post title will be the left-most element of each result instead of the blog’s name.

Implemented Pretty Permalinks

After much internal debate, I finally decided to change my WordPress blog to use pretty permalinks. It was a difficult decision to make with lots of pros and cons to consider. I placed an asterisk next to cons that turned out to be irrelevant.

Cons to using or switching to pretty urls

  • I like the flexibility and quickness of accessing posts by their id numbers. I didn’t want to have to type out the whole url of a post when using pretty yet longer urls.*
  • Because of the above, I have been using ?p=num since the beginning. I feared breaking links to my blog entries.*
  • Changing the url structure so drastically could hurt rankings, and thus my blogging efforts will be for naught. ~*
  • My site stats won’t be as accurate because each post will have two different urls associated with it. I will lose historical data.
  • I tend to change my mind with the post titles so I didn’t want to commit to using urls that are dependent on the title.
  • (This wasn’t part of the decision-making; I realized it after the fact.) My custom search engine for my blog relied upon parameters like ?p and ?m to differentiate the types of results. I can’t do that anymore. I could add /post/ to the url pattern but that would make the urls even longer, resulting in a higher likelihood of truncation. Bummer.

Pros to switching to pretty urls

  • My biggest pet peeve when looking at site stats is not knowing what entry is being referenced. All I see is /blog/?p=1. Using pretty permalinks will help me see which post I am seeing stats for.
  • Outside of stats, descriptive urls help me, visitors, and search engines.
  • It turned out I can still access posts with their id numbers.
  • Because I can still access posts with id numbers, existing backlinks will still work.
  • I found and installed a permalink redirect plugin that will do a 301 redirect from the old urls to the new urls. This should help the search engines recognize the change in urls, thus reducing the likelihood of duplicate content issues.
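
For anyone curious what such a redirect amounts to, here is a rough Python sketch. It is not the plugin’s code: the lookup table, the example id and path, and the function name are all made up for illustration.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical lookup table: post id -> pretty permalink path.
PERMALINKS = {
    42: "/blog/2007/06/02/example-post/",
}

def redirect_for(url):
    """Return (301, new_path) for an old ?p=ID url, or None if no redirect applies."""
    values = parse_qs(urlsplit(url).query).get("p")
    if not values or not values[0].isdecimal():
        return None  # no usable post id in the query string
    new_path = PERMALINKS.get(int(values[0]))
    if new_path is None:
        return None  # unknown post id: leave the request alone
    return (301, new_path)

print(redirect_for("http://example.com/blog/?p=42"))
```

The 301 status tells a crawler that the move is permanent, which is what lets the new url replace the old one instead of sitting beside it as a duplicate.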

I decided upon the pattern that uses /year/month/date/post_id/ instead of other recommended patterns such as /post_name/post_id/ or /category/post_name/. Since my posts tend to have multiple categories, I didn’t want to deal with the complications of using a url that involved the category. For me, the advantage of the time-based url pattern is that navigating up the “directory” structure still shows posts for the day, month, or year. Another bonus, which I appreciate from sites that use this structure, is being able to see at a glance the age of a post. For certain topics, timeliness is important to consider.

I will be keeping my eye on the results of implementing this change. I am going to watch for:

  • effects on traffic
  • effects on ranking
  • how long it takes for the new urls to replace the old ones in the index

Update: Less than an hour after implementation, Google Blog Search had already updated a bunch of my blog’s urls.

Update 6/04/2007: Two days** after implementation, a site search showed Google has indexed the new urls. I went through the pages of results to see if all urls had been re-indexed but the change stopped at around 50 results. The weird thing is that none of the posts were indexed with the new urls. The results showing the new urls were navigational links like months, categories, and feeds. Yahoo and Microsoft have not indexed the new urls.

Update 6/05/2007: Site search for takethu.com/blog/ still doesn’t show individual posts being indexed with the new url. However, if I do a site search for takethu.com/blog/2007, for example, I can see the new urls for some posts.

My concern about losing traffic has been alleviated. My traffic has increased, and not only is at a record high for the month, but is at the second highest point ever since I started this site over 1.5 years ago.** The highest point was when people were searching about the tax due date this year and found my blog post, so that is an outlier that I would remove. Discounting that anomalous spike, this is the highest traffic level this domain has ever had in one day. Well, it’s only been three days since I switched to permalinks, but the results are promising.

Update 6/13/2007: After a week and a half, about 200 of the results in a site search show the updated urls.
** This was what I observed in my particular experience for my site. Your mileage may vary.

Five Reasons I Blog

I’ve been blog-tagged by Adam (tag) and Sebastian (tag) to write about why I blog.

First, let me clarify that there have been three incarnations of my blog. When I started in 2004, I used a phpBB forum to post “blog” entries. I did it for three months but then lost interest. In early 2006, I decided to try again but installed WordPress instead of a forum application. I’ve been hooked since then. I don’t write regularly, but at least I don’t think I will stop blogging the way I did in 2004. Also, some of my blog entries came from transferring journal-like entries from my photo gallery. That would explain the discrepancy in the dates of my archive if you were following along.

Although some of my core reasons for blogging have been the same since the first incarnation, some are no longer as relevant.

  1. I love to write. In high school, I was editor of the school newspaper. I also like to read the news so that’s something else I would blog about.
  2. I like to educate and inform and give tips when I can. I learn a lot from people on the internet and I feel that I can give back by sharing what I know. I use Google Analytics to see the keywords that visitors use to find my blog. When I see the queries, it further encourages me to blog, knowing that I am providing information that people are seeking.
  3. I do less of this now, but when I started my first blog, it was mainly to rant, though I call it Venting. It was a nice release to purge the thoughts I had in my head. Now that I’m happier with life, I don’t really feel the need to rant that much, or have something bother me so much that I will remember to write about it.
  4. The reason why I started to blog again was that I actually had more to write about. In the intervening time, I became a member of the Coppermine dev team, where I contributed code, and I got two cats, new gadgets, and a Wii. Plus, Google continues to release cool products/features.
  5. You know how if you have a significant other–especially one who lives with you–you can tell them random thoughts at random times? Well, I don’t have that at home. Thus my blog is a way for me to express myself whenever and however.

Now, it’s my turn to blog-tag. I’m going to go beyond the SEO world since I don’t know any other SEO who hasn’t already been tagged (or tagged me).

Joachim Mueller: (“gaugau.de – noch eine unnötige Webseite”; aka GauGau, Coppermine’s project manager)

Dr. Tarique Sani: (“Tarique’s Travails: Shades of Darkness”; Coppermine developer)

Rich Jhong: (“The taste of Ho Ho Puffs”)

I feel like I should post a photo of my cat(s) and me the way Matt did when he wrote about why he blogged. His cats helped inspire me to get my own.