Note: This post originally appeared on the Cheb.com.au design blog on the 25th of February 2009 – we have moved it to its new home here at Elastique.
After so long in web design, it still amazes us that there are web-savvy developers (and designers) who don’t realise the importance of a robots.txt file. Whilst it isn’t the be-all and end-all of a web design strategy, a clean, correctly formatted robots file helps search engine robots, affectionately known as bots, weed through the information on your site more quickly – and, more importantly, keeps them out of the places you don’t want them looking: for example, your ‘/cgi-bin’ folder, or that folder where you store all your secret CIA-level documentation – ‘/c14-files’, naturally : )
The gist behind a robots.txt file is simple. Also known as the Robots Exclusion Protocol, it is a standard, nothing-special plain text file with a particular format, used to ask (willing) web spiders and other web robots not to access all or part of a website that would otherwise be open to the general public.
A simple example
Say you’re running a competition site and you don’t want Google (or any other search engine, for that matter) to access the list of competitions you currently run, which sits in a sub-directory called /comp-list/ [and bear with us on this one, it's just an example!]. All you need to do is disallow bots/spiders from crawling that folder – and, by extension, everything inside it. The way you do that could not be simpler: create a robots.txt file in your site’s root directory with the following information:
User-agent: name-of-bot-or-spider (e.g. Googlebot)
Disallow: /path
In the above example, ‘User-agent’ dictates which bot you are targeting with the rule, whilst ‘Disallow’ tells that bot which path it should stay out of. Putting it all together, your robots.txt file should look something like this:
User-agent: Googlebot
Disallow: /comp-list
Or, to stop all robots/spiders from crawling the folder, simply replace Googlebot with an asterisk, or “*”, which in programming terms is a visual representation of “all”:
User-agent: *
Disallow: /comp-list
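To see how a well-behaved crawler would interpret rules like these, here is a minimal sketch using Python’s standard-library robots.txt parser. The rules and the example.com URLs are just illustrations based on the example above, not a real site:

```python
from urllib.robotparser import RobotFileParser

# The example rules from this post: block every bot from /comp-list.
rules = """\
User-agent: *
Disallow: /comp-list
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Anything under /comp-list is off limits; the rest of the site is fine.
print(parser.can_fetch("Googlebot", "http://example.com/comp-list/prizes.html"))  # False
print(parser.can_fetch("Googlebot", "http://example.com/about.html"))             # True
```

Note that Disallow works by prefix match, so /comp-list covers /comp-list/ and everything beneath it.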
That’s pretty much it! Well, if all you wanted to do was hide that page/folder, that would pretty much be it. Herein lies the rub, however: what most people tend to forget is that although you can stop Google and other engines from crawling your site (and most common search engines, including MSN Live, Google and Yahoo!, will honour what your robots.txt file says), nothing stops a person from opening your robots.txt file and finding out exactly what you’re trying to hide from the world.
That’s pretty important to realise, especially if you are tucking invoices – or anything else sensitive – into a folder called ‘/secret-stash’. Google and the other bots might not list it for the world to see, but anyone who reads your robots file can still find it! So be wary of that when you are working on your robots file. The question that begs to be answered is why you’d put your invoices online in a folder called ‘/secret-stash’ in the first place, obviously : ).
Another important thing to remember is that robots.txt does not guarantee privacy! It works purely on the basis of search engine cooperation; having said that, the big players in the search game do, at this stage, support it.
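To drive the point home, here is a quick illustration of why robots.txt is no privacy tool: the file itself is public, so its Disallow lines read like a map of exactly the paths the site owner would rather you didn’t see. This sketch just parses some made-up example text; against a live site you would fetch /robots.txt first:

```python
# Hypothetical robots.txt contents for illustration only.
robots_txt = """\
User-agent: *
Disallow: /cgi-bin
Disallow: /secret-stash
"""

# Pull out every disallowed path - trivially easy for any curious visitor.
disallowed = [
    line.split(":", 1)[1].strip()
    for line in robots_txt.splitlines()
    if line.lower().startswith("disallow:")
]
print(disallowed)  # ['/cgi-bin', '/secret-stash']
```

In other words, the very act of "hiding" a folder in robots.txt advertises its existence to anyone who looks.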
An intriguing scenario
One thing that never ceased to amaze us was how ‘thorough’ the White House’s official robots.txt file was – and we stress ‘was’. After eight years in power, President Bush and his web boffins had racked up nearly 2,400 lines’ worth of search engine exclusions in the robots.txt file, which included:
User-agent: *
Disallow: /cgi-bin
Disallow: /search
Disallow: /query.html
Disallow: /omb/search
Disallow: /omb/query.html
Disallow: /expectmore/search
Disallow: /expectmore/query.html
Disallow: /results/search
Disallow: /results/query.html
Disallow: /earmarks/search
Disallow: /earmarks/query.html
Disallow: /help
Disallow: /360pics/text
Disallow: /911/911day/text
Disallow: /911/heroes/text
...
And even more… Now, some would say there’s a big conspiracy going on, with political staff trying to hide information from the public about 9/11, or results, or – most importantly – “expectmore”, which is somewhat ironic, considering you were getting less information, no? Anyhow, we’ve recently taken a look at the new, Obamafied version of the same document, and it now looks a lot leaner; in fact, very much leaner… This much leaner:
User-agent: *
Disallow: /includes/
Disallow: /search/
Disallow: /omb/search/
On a side note, the last two entries were recently added in addition to ‘/includes/’. Interesting, to say the least, right? It may seem a stupid/childish topic to even get into, but there is a method to the madness of this discussion.
We wondered if this showed a change in how the world – or at least the White House – views the whole process of information gathering, retrieval and dissemination; or, more to the point, an attempt to be more open? And then we found this article on the BBC, which went something along these lines:
“Searching for data about the Obama administration should get easier as the Whitehouse.gov website gets overhauled.
Barack Obama’s new media team is letting search engines index almost everything on the site.
By contrast, after eight years of government the Bush administration was stopping huge swathes of data from being searchable.
The move is part of President Obama’s larger push to make the US government more open and transparent.”
And then it hit home! Maybe there really is a change coming? The point is, at least the team at Obama headquarters knows how these things work – and believe us, that’s a good start! It puts a whole new spin on ‘The Freedom of Information Act’.
Either way, maybe it’s an important strategy, or perhaps it’s just Obama’s team weeding out the last remains of the old administration’s shortcomings. What it does show is that government, whether local or overseas, is realising the importance the web plays in our daily lives – especially in accessing information – and is taking every step it can to remove the roadblocks of the past. And in anyone’s books, that can’t be a bad thing.
Read more about search engine optimization and robots.txt procedures here soon, or check out The Web Robots page for more information on robots.txt.