"robots.txt" for WordPress

Robot exclusion is something I don’t often think about, but today I decided to focus on it and clean up some of the search engine results my site will be generating, now that I have this shiny new WordPress log up.

I was looking at my robots.txt and trying to figure out what to disallow, and I came up with the following:

User-agent: *
Disallow: /wordpress
Disallow: /archive/category
Disallow: /feed
Disallow: /comments
Disallow: /styles
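One way to sanity-check rules like these before relying on them is to feed them to Python's standard `urllib.robotparser` and ask which paths a crawler would be allowed to fetch. This is only a sketch: the sample paths (other than the ones named in the rules) and the `build_parser` helper are made up for illustration, and real crawlers may interpret edge cases differently.

```python
from urllib.robotparser import RobotFileParser

# The rules listed above (field names are matched case-insensitively).
ROBOTS_TXT = """\
User-agent: *
disallow: /wordpress
disallow: /archive/category
disallow: /feed
disallow: /comments
disallow: /styles
"""


def build_parser(text):
    """Hypothetical helper: parse robots.txt text without fetching anything."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Illustrative paths only; the post slug and image name are assumptions.
SAMPLE_PATHS = [
    "/wordpress/wp-login.php",
    "/archive/category/weblog",
    "/feed",
    "/comments/feed",
    "/styles/header.png",
    "/archive/2005/01/15/some-post",
]

for path in SAMPLE_PATHS:
    verdict = "allowed" if parser.can_fetch("*", path) else "blocked"
    print(f"{verdict:8} {path}")
```

Only the dated permalink should come back as allowed; everything else falls under one of the disallowed prefixes.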

“/wordpress” is an obvious inclusion. I don’t want my log-in page cached.

I’ve taken a vow of permanence for certain links. Anything stored at /archive/ccyy/mm/dd/slug will remain until the end of time. Similarly, anything at the dd and mm levels should remain the same. I actually have nothing at the ccyy level; I hope to fix that at some point. I consider every other page to have ever-changing primary content. This includes feed links and links to categories, because posts may or may not remain in a particular category forever.

(Note: I moved this post from general to weblog during revision. This would definitely make it disappear from one page and appear on another.)

Re-categorising posts means that “/archive/category” is too fluid for search engines to index properly. While indexing it would be somewhat convenient, I’d hate to have someone come in for an article that has been re-categorised and move on because he or she did not find it.

“/feed” is likewise verboten, unless someone knows something about search engines that I don’t. The special case “/comments” takes care of comment feeds as well.
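The reason “/comments” also covers comment feeds is that disallow rules, under the original robots exclusion convention, are plain path-prefix matches. Here is a minimal Python sketch of that matching. The `matching_rule` helper and the sample paths are hypothetical; in particular, the last path (a feed nested under a post permalink) is a made-up example of something none of these prefixes would catch.

```python
# Disallowed prefixes from the robots.txt above.
DISALLOWED = ["/wordpress", "/archive/category", "/feed", "/comments", "/styles"]


def matching_rule(path, rules=DISALLOWED):
    """Return the first disallowed prefix that matches path, or None.

    Hypothetical helper: simple prefix matching only, no wildcards.
    """
    for prefix in rules:
        if path.startswith(prefix):
            return prefix
    return None


# Illustrative paths; the dated slug is an assumption, not a real URL.
for path in ["/comments/feed", "/feed/atom", "/archive/2005/01/15/some-post/feed"]:
    rule = matching_rule(path)
    if rule is None:
        print(f"{path}: not matched by any rule")
    else:
        print(f"{path}: blocked by {rule}")
```

A rule only matches from the start of the path, so any feed that lives somewhere other than under “/feed” or “/comments” would need its own entry.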

I don’t want “/styles” mirrored, because I get enough junk during a Google image search without polluting the service further with my interface images.

That concludes my explanation. I now open the floor to comments, suggestions, etc. about this configuration.