Topics:
What is the robots.txt file?
Working with the robots.txt file
Advantages of the robots.txt file
Disadvantages of the robots.txt file
Optimization of the robots.txt file
Using the robots.txt file
What is the robots.txt file?
The robots.txt file is an ASCII text file that gives
search engine robots specific instructions about
content that they are not allowed to index. These
instructions are the deciding factor in how a search
engine indexes your website's pages. The universal
address of the robots.txt file is: www.example.com/robots.txt.
This is the first file that a robot visits. It picks
up instructions for indexing the site content and
follows them. This file contains two text fields.
Let's study this example:
User-agent: *
Disallow:
The User-agent field names the robot to which the
access policy in the Disallow field applies. The
Disallow field specifies the URLs that the named
robots may not access. An example:
User-agent: *
Disallow: /
Here "*" means all robots and "/" means all URLs.
This is read as, "No access for any search engine
to any URL". Since every URL path begins with "/",
a "/" with nothing after it bans access to all URLs.
If partial access has to be given, only the banned
URL is specified in the Disallow field. Let's
consider this example:
# Research access for Googlebot.
User-agent: Googlebot
Disallow:
User-agent: *
Disallow: /concepts/new/
Here we see that both fields have been repeated.
Multiple commands can be given for different user
agents on different lines. The above commands mean
that all user agents are denied access to /concepts/new/
except Googlebot, which has full access. Characters
following # are ignored up to the end of the line,
as they are considered to be comments.
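The two-record example above can be checked mechanically. What follows is
a minimal sketch using Python's standard-library urllib.robotparser; the
robot name "SomeOtherBot" and the page paths are made up for illustration,
and a blank line has been added between the two records.

```python
from urllib.robotparser import RobotFileParser

# The example file from the text; the blank line separates the two records.
ROBOTS_TXT = """\
# Research access for Googlebot.
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /concepts/new/
"""


def build_parser(text: str) -> RobotFileParser:
    """Parse robots.txt content held in a string (no network access)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Googlebot has its own record with an empty Disallow, so it may go anywhere.
googlebot_allowed = parser.can_fetch("Googlebot", "/concepts/new/page.html")

# Any other robot (hypothetical name) falls under the "*" record.
other_allowed_in_new = parser.can_fetch("SomeOtherBot", "/concepts/new/page.html")
other_allowed_elsewhere = parser.can_fetch("SomeOtherBot", "/index.html")

print(googlebot_allowed, other_allowed_in_new, other_allowed_elsewhere)
```

With this parser, Googlebot is allowed into /concepts/new/, while the
hypothetical SomeOtherBot is kept out of that directory only.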
Working with the robots.txt file
1. The robots.txt file is always named in all lowercase
(e.g. Robots.txt or robots.Txt is incorrect).
2. Wildcards are not supported in either field. Only
* can be used in the User-agent field, because it is
a special character denoting "all". At the time of
writing, Googlebot is the only robot that supports
some wildcard file extensions.
Ref: http://www.google.com/webmasters/remove.html
3. The robots.txt file is an exclusion file meant for
search engine robots to consult; a website does not
need it in order to function. An empty or absent file
simply means that all robots are welcome to index any
part of the website.
4. Only one file can be maintained per domain.
5. Website owners who do not have administrative rights
sometimes cannot create a robots.txt file. In such
situations, the Robots Meta Tag can be configured to
serve the same purpose. Keep in mind that questions
have lately been raised about robot behavior regarding
the Robots Meta Tag. Some robots might skip it
altogether. Protocol makes it obligatory for all
robots to start with the robots.txt file, thereby
making it the default starting point for all robots.
6. Separate lines are required for specifying access
for different user agents, and a Disallow field should
not carry more than one command per line in the
robots.txt file. There is no limit to the number of
lines, though: both the User-agent and Disallow fields
can be repeated with different commands any number of
times. Blank lines must not appear within a single
record (a User-agent line and its Disallow lines),
because a blank line marks the end of a record.
7. Use lowercase for all robots.txt file content. Note
also that filenames on Unix systems are case sensitive,
so be careful about case when specifying directories
or files for Unix-hosted domains.
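Two of the points above, blank lines inside a record and case-sensitive
paths, can be illustrated with a short sketch. It uses Python's
standard-library urllib.robotparser as one example of a robot-side parser;
other robots may behave differently. The robot name "AnyBot" and the
/private/ directory are hypothetical.

```python
from urllib.robotparser import RobotFileParser


def is_blocked(robots_txt: str, agent: str, path: str) -> bool:
    """Return True if this parser would refuse `path` to `agent`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(agent, path)


# A well-formed record: the Disallow line directly follows User-agent.
INTACT = "User-agent: *\nDisallow: /private/\n"

# The same record with a blank line inside it.
BROKEN = "User-agent: *\n\nDisallow: /private/\n"

intact_blocks = is_blocked(INTACT, "AnyBot", "/private/report.html")
broken_blocks = is_blocked(BROKEN, "AnyBot", "/private/report.html")

# Path matching is case sensitive: /Private/ is not /private/.
wrong_case_blocks = is_blocked(INTACT, "AnyBot", "/Private/report.html")

print(intact_blocks, broken_blocks, wrong_case_blocks)
```

In this parser the blank line ends the record early, so the Disallow line
that follows it is ignored and nothing is blocked.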
Advantages of the robots.txt file
1. Protocol demands that all search engine robots start
with the robots.txt file. This is the default entry
point for robots if the file is present. Specific
instructions can be placed in this file to help index
your site on the web. Major search engines will never
violate the Standard for Robots Exclusion.
2. The robots.txt file can be used to keep out unwanted
robots like email retrievers, image strippers, etc.
3. The robots.txt file can be used to specify the
directories on your server that you don't want robots
to access and/or index, e.g. temporary, cgi, and
private/back-end directories.
4. An absent robots.txt file could generate a 404 error
and redirect the robot to your default 404 error
page. Careful research showed that sites without a
robots.txt file but with a customized 404 error page
would serve that page to robots instead. The robot is
bound to treat it as the robots.txt file, which can
confuse its indexing.
5. The robots.txt file is used to direct select robots
to relevant pages to be indexed. This especially comes
in handy where the site has multilingual content or
where the robot is searching for only specific content.
6. The robots.txt file was also needed to stop robots
from deluging servers with rapid-fire requests or
re-indexing the same files repeatedly. If you have
duplicate content on your site for any reason, you can
prevent it from being indexed. This will help you
avoid any duplicate content penalties.
Disadvantages of the robots.txt file
Careless handling of directory names and filenames can
invite hackers to snoop around your site by studying
the robots.txt file, since you may sometimes list
filenames and directories that hold classified content.
This is not a serious issue, as effective security
checks on the content in question can take care of it.
For example, if you keep your traffic log on your site
at a URL such as www.example.com/stats/index.htm, which
you do not want robots to index, then you would have
to add a command to your robots.txt file. As an example:
User-agent: *
Disallow: /stats/
However, it is easy for a snooper to guess what you
are trying to hide, and simply typing the URL
www.example.com/stats into a browser would give access
to it. This calls for one of the following remedies:
1. Change file names:
- Change the stats filename from index.htm to something
different, such as stats-new.htm, so that your stats
URL becomes www.example.com/stats/stats-new.htm
- Place a simple text file containing the text, "Sorry,
you are not authorized to view this page", and save
it as index.htm in your /stats/ directory.
This way the snooper cannot guess your actual filename
and get to your banned content.
2. Use login passwords:
- Password-protect the sensitive content listed in your
robots.txt file.
Optimization of the robots.txt file
1. The right commands: Use correct commands. The most
common errors include putting the command meant for
the "User-agent" field in the "Disallow" field and
vice versa.
- Please note that there is no "Allow" command in the
standard robots.txt protocol. Content not blocked in
the "Disallow" field is considered allowed. Currently,
only two fields are recognized: the "User-agent" field
and the "Disallow" field. Experts are considering the
addition of more robot-recognizable commands to make
the robots.txt file more Webmaster and robot friendly.
- Please also note that Google is the only search engine
that is experimenting with certain new robots.txt
commands. There are indications that Google recognizes
the "Allow" command. Please refer to:
http://www.google.com/webmasters/remove.html
2. Bad syntax: Do not put multiple file URLs in one
Disallow line in the robots.txt file. Use a new Disallow
line for every directory that you want to block access
to. Incorrect example:
User-agent: *
Disallow: /concepts/ /links/ /images/
Correct example:
User-agent: *
Disallow: /concepts/
Disallow: /links/
Disallow: /images/
3. Files and directories: If a specific file has to be
disallowed, give its name complete with the file
extension and without a forward slash at the end.
Study the following example:
For file:
User-agent: *
Disallow: /hilltop.html
For Directory:
User-agent: *
Disallow: /concepts/
Remember, if you have to block access to all files
in a directory, you don't have to specify each and
every file in robots.txt. You can simply block the
directory as shown above. Another common error is
leaving out the slashes altogether. This would send a
very different message than intended. For example,
"Disallow: /concepts" with no trailing slash blocks
every URL whose path begins with /concepts, including
a file such as /concepts.html.
4. The right location: No robot will access a badly
placed robots.txt file. Make sure that the location
is www.example.com/robots.txt.
5. Capitalization: Never capitalize your syntax
commands. Directory names and filenames are case
sensitive on Unix platforms. The only capitals used
per the standard are the initial letters of
"User-agent" and "Disallow".
6. Correct order: If you want to block access to all
robots except one or more specific ones, then the
specific ones should be mentioned first. Let's study
this example:
User-agent: *
Disallow: /
User-agent: MSNBot
Disallow:
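In the above case, a robot that reads the records in
order and stops at the first match, as MSNBot might,
would simply leave the site without indexing after
reading the first command. The standard itself treats
"*" as the default for robots not matched by any other
record, but listing the specific robots first is the
safer order. Correct syntax is: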
User-agent: MSNBot
Disallow:
User-agent: *
Disallow: /
7. Presence: Not having a robots.txt file at all could
generate a 404 error for search engine robots, which
could redirect the robot to the default 404 error page
or your customized 404 error page. If this happens
seamlessly, it is up to the robot to decide whether the
target file is a robots.txt file or an HTML file.
Typically this would not cause many problems, but you
may not want to risk it. It is always better to put the
standard robots.txt file in the root directory than to
have none at all.
The standard robots.txt file for allowing all robots
to index all pages is:
User-agent: *
Disallow:
8. Using # carefully in the robots.txt file: Adding a
"#" comment on the same line as a syntax command is not
a good idea. Some robots might misinterpret the line,
although it is acceptable as per the robots exclusion
standard. Putting comments on lines of their own is
always preferred.
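The syntax points in the list above (one path per Disallow line, and the
order of records) can be tried out with a small sketch. It relies on
Python's standard-library urllib.robotparser, which is only one
interpretation of the rules; real search engine robots may differ. The
robot name "AnyBot" and the page paths are hypothetical.

```python
from urllib.robotparser import RobotFileParser


def is_blocked(robots_txt: str, agent: str, path: str) -> bool:
    """Return True if this parser would refuse `path` to `agent`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(agent, path)


# Incorrect: several directories crammed into one Disallow line.
ONE_LINE = "User-agent: *\nDisallow: /concepts/ /links/ /images/\n"

# Correct: one Disallow line per directory.
SEPARATE = (
    "User-agent: *\n"
    "Disallow: /concepts/\n"
    "Disallow: /links/\n"
    "Disallow: /images/\n"
)

one_line_blocks_links = is_blocked(ONE_LINE, "AnyBot", "/links/page.html")
separate_blocks_links = is_blocked(SEPARATE, "AnyBot", "/links/page.html")

# Recommended order: the specific robot first, the catch-all record last.
SPECIFIC_FIRST = "User-agent: MSNBot\nDisallow:\n\nUser-agent: *\nDisallow: /\n"

msnbot_blocked = is_blocked(SPECIFIC_FIRST, "MSNBot", "/index.html")
anybot_blocked = is_blocked(SPECIFIC_FIRST, "AnyBot", "/index.html")

print(one_line_blocks_links, separate_blocks_links)
print(msnbot_blocked, anybot_blocked)
```

With this parser, the one-line version is read as a single odd-looking path
and fails to block /links/, while the separate lines block it as intended.
In the second file, MSNBot gets in and the hypothetical AnyBot is shut out.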
Using the robots.txt file
1. Robots are configured to read text. Too much graphic
content could render your pages invisible to the search
engine. Use the robots.txt file to block irrelevant
and graphic-only content.
2. Indiscriminate access to all files, it is believed,
can dilute the relevance of your site content once it
has been indexed by robots. This could seriously affect
your site's ranking with search engines. Use the
robots.txt file to direct robots to content relevant
to your site's theme by blocking the irrelevant files
or directories.
3. The file can be used on multilingual websites to
direct robots to the relevant content for each
language. This ultimately helps the search engines
present relevant results for specific languages. It
also helps the search engine in its advanced search
options where language is a variable.
4. Some robots could cause severe server loading
problems by firing too many requests in rapid
succession at peak hours. This could affect your
business. Excluding robots that are irrelevant to your
site in the robots.txt file can take care of this
problem. It is really not a good idea to let malevolent
robots use up precious bandwidth to harvest your
emails, images, etc.
5. Use the robots.txt file to block out folders with
sensitive information, text content, demo areas, or
content yet to be approved by your editors before
it goes live.
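When several robots and folders are involved, the file can be generated
rather than typed by hand. The following is only a sketch of one way to do
that: the function name build_robots_txt, the rules mapping, and the
/drafts/ and /demo/ folders are all hypothetical. It follows the layout
rules given earlier, with named robots first, the "*" record last, one
Disallow line per folder, and a blank line only between records.

```python
def build_robots_txt(rules: dict[str, list[str]]) -> str:
    """Render {robot name: [blocked paths]} as robots.txt text."""
    # Stable sort: named robots keep their order, "*" moves to the end.
    agents = sorted(rules, key=lambda agent: agent == "*")
    records = []
    for agent in agents:
        lines = [f"User-agent: {agent}"]
        # An empty list means full access, written as a bare "Disallow:".
        paths = rules[agent] or [""]
        lines += [f"Disallow: {path}".rstrip() for path in paths]
        records.append("\n".join(lines))
    # One blank line between records, none inside a record.
    return "\n\n".join(records) + "\n"


# Hypothetical policy: keep every robot out of two folders,
# but give Googlebot full access.
rules = {"*": ["/drafts/", "/demo/"], "Googlebot": []}
robots_txt = build_robots_txt(rules)
print(robots_txt)
```

The resulting text would be saved as robots.txt in the root directory of
the site.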
The robots.txt file is an effective tool to address
certain issues regarding website ranking. Used in
conjunction with other SEO strategies, it can significantly
enhance a website's presence on the net.
Related Reading:
A Standard for Robots Exclusion
Guide to The Robots Exclusion Protocol
W3C Recommendations
Meta Tags Optimization for Search Engines
About The Author:
RedAlkemi Syndicate. RedAlkemi
is a leading Internet Marketing, eCommerce, Graphic
Design, Web & Software Development services company.
Experts at RedAlkemi have about 20 years of experience
in the field of Graphic Design, Visual Communication
& Web Development. If you have comments, or would
like to have this article republished free on
your site, please contact syndicate@redalkemi.com.
All due credits must be carried and text, hyperlinks
and headers unaltered. © Copyright 2005, RedAlkemi.com