Do you ever forget to do something that’s really simple? It’s easy to overlook the simple things when you’re worried about more complex issues like Search Engine Optimization (SEO) or driving traffic to your website. Robots.txt files fall into that category. Do you even have a robots.txt file on your site? It’s very simple to create and can help your site’s ranking in the search engines in a couple of ways.
What is a robots.txt file?
A robots.txt file is a small text file that you place in the root directory of your website. In it you list the directories that robots (search engine spiders) should not visit. If you like, you can get specific and give different instructions to different robots (search engines) by targeting their user-agents, but generally that’s not necessary.
Here’s a sample robots.txt file:
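Based on the directories described in the notes below, such a file might read:

```text
User-agent: *
Disallow: /cgi-bin/
Disallow: /print-friendly/
Disallow: /~john/
```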
- List one directory per Disallow line.
- The above example would stop robots from crawling the cgi-bin, print-friendly, and ~john directories.
- You can have only one robots.txt file, and it must be in the root directory of your site.
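If you’re curious how a well-behaved crawler interprets these rules, Python’s standard library ships a robots.txt parser you can experiment with. This sketch uses the directory names from the sample above and a made-up crawler name (“MyBot”) and domain (example.com):

```python
from urllib import robotparser

# Parse the same rules a crawler would fetch from /robots.txt
rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /cgi-bin/",
    "Disallow: /print-friendly/",
    "Disallow: /~john/",
])

# A compliant spider checks each URL before fetching it
print(rp.can_fetch("MyBot", "http://example.com/print-friendly/post.html"))  # False
print(rp.can_fetch("MyBot", "http://example.com/blog/post.html"))            # True
```

Because “MyBot” has no entry of its own, the `User-agent: *` rules apply to it by default.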
Other ways to do it:
There is also a META tag that serves much the same purpose.
Place this META tag in the head section of any page you don’t want crawled (indexed by a search engine):
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
Because not all robots support or respect the robots.txt file or the META tag, your best bet is to use both.
You may be wondering exactly how this affects SEO. The biggest way is with duplicate content. Search engines do not want to find duplicate content. If you have a printer-friendly version of your blog entries, you want to stop those duplicate layouts from being crawled. I have my templates and publishing parameters set up to put all the printer-friendly pages in a specific folder, and I list this folder in my robots.txt file. I also configured the publishing template for the printer-friendly pages to use the META tag shown above. This stops most search engines from indexing the printer-friendly versions of the pages and eliminates a potential duplicate-content problem.
The second benefit is keeping the printer-friendly pages out of the search engine results pages entirely. I want people who find my site through search to land on the regular versions of my pages, not the printer-friendly versions. My printer-friendly template strips off the left and right columns, and with them most of the navigation, so listing only the regular pages in the search engines makes for a better visitor experience.
Of course, there are other reasons to stop certain subdirectories from being crawled. You may have products such as eBooks, training videos, or scripts and test pages that you do not want showing up in search results. Because not every spider crawling around out there respects the robots.txt file, you should always keep sensitive data in subdirectories protected by a username and password.
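On an Apache server, for example, a subdirectory can be locked down with HTTP basic authentication. A minimal sketch (the .htpasswd path is just a placeholder for your own password file):

```text
# .htaccess placed inside the protected subdirectory (Apache)
AuthType Basic
AuthName "Restricted"
# Placeholder path; point this at your own .htpasswd file
AuthUserFile /home/example/.htpasswd
Require valid-user
```

Unlike a robots.txt rule, this actually refuses the request, so it works even against spiders that ignore your robots.txt file.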
The robots.txt file is just a small, easy-to-create text file, but small things like this can add up to make a big difference.
Learn more about robots.txt files here: www.robotstxt.org/wc/robots.html.