<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Sydney Web Design, Development, SEO &#38; Web Marketing Blog &#124; Elastique Web Design Blog Sydney &#187; search engine</title>
	<atom:link href="http://elastique.com.au/web-design-blog/tag/search-engine/feed/" rel="self" type="application/rss+xml" />
	<link>http://elastique.com.au/web-design-blog</link>
	<description>Sydney Web Design, Development, SEO &#38; Web Marketing Blog &#124; Elastique Web Design Blog Sydney</description>
	<lastBuildDate>Tue, 24 Feb 2009 15:24:30 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0.1</generator>
		<item>
		<title>Robots.txt file: The good, the bad, The Whitehouse</title>
		<link>http://elastique.com.au/web-design-blog/robotstxt-file-the-good-bad-the-whitehouse/</link>
		<comments>http://elastique.com.au/web-design-blog/robotstxt-file-the-good-bad-the-whitehouse/#comments</comments>
		<pubDate>Tue, 24 Feb 2009 15:18:22 +0000</pubDate>
		<dc:creator>Elastique</dc:creator>
				<category><![CDATA[Interesting]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[Information]]></category>
		<category><![CDATA[Obama]]></category>
		<category><![CDATA[Privacy]]></category>
		<category><![CDATA[Robots]]></category>
		<category><![CDATA[robots.txt]]></category>
		<category><![CDATA[Search]]></category>
		<category><![CDATA[search engine]]></category>
		<category><![CDATA[Search engine optimization]]></category>
		<category><![CDATA[Search engine ranking]]></category>
		<category><![CDATA[Whitehouse]]></category>

		<guid isPermaLink="false">http://elastique.com.au/web-design-blog/?p=315</guid>
		<description><![CDATA[After being in web design for so long it still amazes us that there are web-savvy developers (and designers) who don&#8217;t realise the importance of a robots.txt file.  Whilst it isn&#8217;t the be all and end all of a web design strategy, a clean, correctly-formatted robots file can help search engine robots, affectionately known as [...]]]></description>
			<content:encoded><![CDATA[<p>After being in web design for so long, it still amazes us that there are web-savvy developers (and designers) who don&#8217;t realise the importance of a robots.txt file. Whilst it isn&#8217;t the be-all and end-all of a web design strategy, a clean, correctly formatted robots file can help search engine robots, affectionately known as <em>bots</em>, to weed through the information on your site quicker and, more importantly, <strong>not</strong> weed through the places you don&#8217;t want them looking: for example, your &#8216;/cgi-bin&#8217; folder, or that folder where you store all your secret CIA-level documentation, &#8216;/c14-files&#8217;, naturally : )</p>
<p>In essence, the idea behind a robots.txt file is simple. Also known as the Robots Exclusion Protocol, or robots.txt protocol, it relies on a standard, nothing-special text file with a set format, which is used to prevent (willing) web spiders and other web robots from accessing all or part of a website that would otherwise be open to the general public.<span id="more-315"></span></p>
<h2>A simple example</h2>
<p>Say you&#8217;re running a competition site and you don&#8217;t want Google (or any other search engine, for that matter) to access the list of competitions you currently run, which sits in a sub-directory called /comp-list/ <em>[and bear with us on this one, it&#8217;s just an example!]</em>. All you&#8217;d have to do is stop any bots/spiders from <em>crawling</em> that page or, more importantly, anything in that folder, by <em><strong>disallowing</strong></em> access to it. The way you do that could not be simpler: create a robots.txt file, which will sit in the root directory of your site, with the following information:</p>
<pre>User-agent: name-of-bot-or-spider (e.g. Googlebot)
Disallow: /path</pre>
<p>In the above example, &#8216;User-agent&#8217; simply dictates <em>who</em> you are targeting with this rule, whilst &#8216;Disallow&#8217; is a special keyword naming the path that target should not crawl. Putting that all together, your robots.txt file should look something like this:</p>
<pre>User-agent: Googlebot
Disallow: /comp-list</pre>
<p>Or, to make it so that <em><strong>no</strong></em> robots/spiders can crawl the folder, simply replace Googlebot with an asterisk, or &#8220;*&#8221;, which is a wildcard meaning &#8220;all&#8221;:</p>
<pre>User-agent: *
Disallow: /comp-list</pre>
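<p>If you&#8217;d like to sanity-check a rule like this before uploading it, here is a minimal sketch using the <code>urllib.robotparser</code> module from Python&#8217;s standard library. The <code>is_allowed</code> helper and the www.example.com URLs are our own assumptions for illustration; only the two robots.txt lines come from the example above.</p>
<pre>
```python
from urllib.robotparser import RobotFileParser

# The wildcard rules from the example above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /comp-list
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if a cooperating bot may fetch the URL under these rules."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical URLs: one inside the disallowed folder, one outside it.
print(is_allowed(ROBOTS_TXT, "Googlebot", "http://www.example.com/comp-list/current.html"))
print(is_allowed(ROBOTS_TXT, "Googlebot", "http://www.example.com/about.html"))
```
</pre>
<p>The first call should print False and the second True, since only paths starting with /comp-list are off limits.</p>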
<p>That&#8217;s pretty much it! Well, if all you wanted to do was hide that page/folder from search engines, that would pretty much be it. However, herein lies the rub: what most people tend to forget is that just because you stop Google or other engines from accessing parts of your site (and most common search engines, including MSN Live, Google and Yahoo!, <em>will</em> listen to what your robots.txt file says), it does <em>not</em> stop people from opening your robots.txt file and finding out what you&#8217;re trying to hide from the world.</p>
<p>That&#8217;s pretty important to realise, especially when you are putting away invoices, or anything else someone could put to good use, in a folder called &#8216;/secret-stash&#8217;. Google and other bots might not post it for the world to see, but that doesn&#8217;t mean anyone who reads your robots.txt file can&#8217;t find it! So be wary of that when you are working on your robots file. The question that begs to be answered is why you&#8217;d put your invoices in a folder called &#8216;/secret-stash&#8217; <em>online</em> in the first place, obviously : ).</p>
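<p>To show how little effort that takes, here is a rough sketch that pulls every disallowed path out of the text of a robots.txt file. The <code>disallowed_paths</code> function and the sample file are our own inventions for illustration, not taken from any real site.</p>
<pre>
```python
def disallowed_paths(robots_txt):
    """Collect every path named on a Disallow line, in file order."""
    paths = []
    for line in robots_txt.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        field, sep, value = line.partition(":")
        # Keep only non-empty "Disallow: /something" lines.
        if sep and field.strip().lower() == "disallow" and value.strip():
            paths.append(value.strip())
    return paths


# Hypothetical robots.txt contents, as any visitor could read them.
SAMPLE = """\
User-agent: *
Disallow: /cgi-bin
Disallow: /secret-stash  # the invoices live here
"""

print(disallowed_paths(SAMPLE))
```
</pre>
<p>Anything that shows up in that list is effectively a signpost, so genuinely private files need proper access control rather than a Disallow line.</p>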
<p>Another important thing to remember is that robots.txt does <em>not</em> guarantee privacy! It works purely on the basis of search engine cooperation; having said that, the big players in the search game do support it at this stage!</p>
<h2>An intriguing scenario</h2>
<p>One thing that never ceased to amaze me was how &#8216;thorough&#8217; the Whitehouse&#8217;s official robots.txt file was, and we stress &#8216;was&#8217;. After eight years in power, President Bush and his web boffins had racked up nearly 2400 lines&#8217; worth of search engine exclusions in the robots.txt file, which included:</p>
<pre>User-agent: *
Disallow: /cgi-bin
Disallow: /search
Disallow: /query.html
Disallow: /omb/search
Disallow: /omb/query.html
Disallow: /expectmore/search
Disallow: /expectmore/query.html
Disallow: /results/search
Disallow: /results/query.html
Disallow: /earmarks/search
Disallow: /earmarks/query.html
Disallow: /help
Disallow: /360pics/text
Disallow: /911/911day/text
Disallow: /911/heroes/text
...</pre>
<p>And even more&#8230; Now some would say there&#8217;s a big conspiracy going on, with political staff trying to hide information from the public about 9/11 or results or, most importantly, &#8220;expectmore&#8221;, which is sort of ironic considering you are getting less information. No? Anyhow, we&#8217;ve recently taken a look at the new Obama<em>fied</em> version of the same document and it now looks a lot leaner; in fact, very much leaner&#8230; This much leaner:</p>
<pre>User-agent: *
Disallow: /includes/
Disallow: /search/
Disallow: /omb/search/</pre>
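<p>For the curious, the difference between the two files can be checked mechanically. The sketch below again leans on Python&#8217;s <code>urllib.robotparser</code>; it uses only a handful of lines from the old excerpt above (not the full file, which we haven&#8217;t reproduced), and the <code>newly_crawlable</code> helper and the sample page names are assumptions of ours.</p>
<pre>
```python
from urllib.robotparser import RobotFileParser

# A few lines taken from the old excerpt shown earlier (not the complete file).
OLD_EXCERPT = """\
User-agent: *
Disallow: /cgi-bin
Disallow: /search
Disallow: /help
Disallow: /911/heroes/text
"""

# The newer, leaner file quoted above.
NEW_FILE = """\
User-agent: *
Disallow: /includes/
Disallow: /search/
Disallow: /omb/search/
"""


def build_parser(robots_txt):
    """Build a parser from robots.txt text without touching the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def newly_crawlable(old_txt, new_txt, paths, user_agent="*"):
    """Return the paths blocked under old_txt but allowed under new_txt."""
    old, new = build_parser(old_txt), build_parser(new_txt)
    return [p for p in paths
            if not old.can_fetch(user_agent, p) and new.can_fetch(user_agent, p)]


# Hypothetical page names under folders mentioned in the two files.
SAMPLE_PATHS = ["/help/index.html", "/911/heroes/text/page.html", "/search/results.html"]
print(newly_crawlable(OLD_EXCERPT, NEW_FILE, SAMPLE_PATHS))
```
</pre>
<p>With these inputs, the two pages under /help and /911 should be reported as newly crawlable, while the search results page stays blocked under both files.</p>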
<p>On a side note, the last two entries were only recently added, in addition to &#8216;/includes/&#8217;. Interesting to say the least, right? I mean, it may seem like a stupid/childish topic to even get into, but there is a method to the madness of this discussion.</p>
<p>We wondered if this showed a change in how the world, or at least &#8216;The Whitehouse&#8217;, is viewing the whole process of information gathering, retrieval and dissemination, or more importantly, whether it is perhaps trying to be more open. And then I found this <a title="BBC" href="http://news.bbc.co.uk/1/hi/technology/7844280.stm">article on BBC</a> which went something along the lines of:</p>
<blockquote><p><em><span style="color: #1d8fe1;">&#8220;Searching for data about the Obama administration should get easier as the Whitehouse.gov website gets overhauled.<br />
Barack Obama&#8217;s new media team is letting search engines index almost everything on the site.<br />
By contrast, after eight years of government the Bush administration was stopping huge swathes of data from being searchable.<br />
<strong>The move is part of President Obama&#8217;s larger push to make the US government more open and transparent.</strong>&#8221;</span></em></p></blockquote>
<p>And then it hit home! Maybe there really is a change coming? Point is, at least the team at Obama headquarters knows how these things work, and believe me, that&#8217;s a good start! It puts a whole new spin on &#8216;The Freedom of Information Act&#8217;.</p>
<p>Either way, maybe it&#8217;s an important strategy, or perhaps it&#8217;s just Obama&#8217;s team weeding out any last remains of the old administration&#8217;s <em>shortcomings</em>? What it does show is that government, whether locally or overseas, is realising the importance the web plays in our daily lives, especially in accessing information, and is taking every step it possibly can to rectify any mistakes which may have caused a roadblock of sorts in the past. And in anyone&#8217;s books, that can&#8217;t be a bad thing.</p>
<p>Read more about <a title="Search engine optimization: Getting the process right from scratch" href="http://www.cheb.com.au/search-engine-optimization-get-the-process-right-from-scratch-part-1/">search engine optimization and robots.txt</a> procedures or check out <a title="The Web Robots" href="http://www.robotstxt.org">The Web Robots</a> page for more information on robots.txt.</p>
]]></content:encoded>
			<wfw:commentRss>http://elastique.com.au/web-design-blog/robotstxt-file-the-good-bad-the-whitehouse/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>

