NOTE: Old text is marked through using <strike> </strike> which may not be supported by all browsers. If this is the case, then I will remove this feature.
This is not an official standard backed by a standards body, or owned by any commercial organization. It is not enforced by anyone, and there are no guarantees that all current and future robots will use it. These are just proposed extensions to the current robot exclusion standard.
The latest version of this document can be found at http://www.conman.org/people/spc/robots2.html .
This document proposes such extensions to the
robot exclusion standard. The original standard is referred to as
Version 1.0, while the extensions
proposed here
are referred to as Version
2.0 of the robot exclusion standard.
More information about robots in general can be found on the
World Wide Web Robots, Wanderers, and Spiders page.
The following is taken verbatim from the original robot exclusion standard. That standard covers Version 1.0 of the robot exclusion standard, while this covers Version 2.0 of the robot exclusion standard. The method hasn't changed between Version 1.0 and Version 2.0, though.
The method used to exclude robots from a server is to
create a file on the server which specifies an access
policy for robots.
This file must be accessible via HTTP on the local URL
"/robots.txt".
The contents of this file are specified below.
This approach was chosen because it can be easily implemented on any existing WWW server, and a robot can find the access policy with only a single document retrieval.
A possible drawback of this single-file approach is that only a server administrator can maintain such a list, not the individual document maintainers on the server. This can be resolved by a local process to construct the single file from a number of others, but if, or how, this is done is outside of the scope of this document.
The choice of the URL was motivated by several criteria:
Any text following a "#" up to the end of a line is
to be ignored. The "#" character can appear
only after a blank (or
whitespace) character, or at the start of a line. Some
examples:
# this is a comment line
#so is this
User-agent: fredsbot # this is another comment
Disallow: * # we don't like this bot
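The comment rule above can be sketched in a few lines of Python. This is only an illustration under the stated rule (a "#" counts as a comment marker only at the start of a line or directly after whitespace); the function name strip_comment is hypothetical and not part of the proposal.

```python
def strip_comment(line: str) -> str:
    """Return the line with any trailing comment removed."""
    for i, ch in enumerate(line):
        # A "#" starts a comment only at the start of the line
        # or immediately after a whitespace character.
        if ch == "#" and (i == 0 or line[i - 1].isspace()):
            return line[:i].rstrip()
    # No comment marker found; only trim trailing whitespace.
    return line.rstrip()
```

Under this reading, a "#" embedded in a URL (for example "/page#frag") is left alone because it does not follow whitespace.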
The general match is included for compatibility with
Version 1.0
of the robots exclusion
standard. General matches do not contain regular
expression characters, but are treated as if they contain the
character "*", which is used to match zero or more
characters, at the end of the string. An example would be:
/helpme, which is to be treated as:
/helpme*.
This exists solely for
Version 1.0
compatibility and their usage can be determined by context.
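Since a general match behaves as if a "*" were appended, it amounts to a prefix test. A minimal Python sketch, assuming the URL path is compared as a plain string (the name general_match is hypothetical):

```python
def general_match(pattern: str, path: str) -> bool:
    """Version 1.0 style match: "/helpme" is treated as "/helpme*"."""
    # An implicit trailing "*" matches zero or more characters,
    # which is the same as asking whether the path starts with the pattern.
    return path.startswith(pattern)
```

For example, general_match("/helpme", "/helpme.html") holds, while general_match("/helpme", "/help") does not.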
This contains no regular expression characters and any string to be matched using an explicit string match must match all the characters present exactly.
This is a regular expression that is compatible with that used
in Perl, a popular language used on the web that contains
support for regular expression matching.
This is a regular expression that is compatible with that used by /bin/sh, a shell found on all implementations of UNIX. Within the pattern string:
See the directive Robot-version for more details.
This is in the format of HH:MM, where HH is in 24 hour time (inclusive between 00 and 23) and MM is minutes (inclusive between 00 and 59).
The time is specified in UT (or GMT) time.
<rate> is defined as <numdocuments> '/' <timeunit>. The default time unit is the second. See the directive Request-rate for more information.
An example of some rates:
10/60 - no more than 10 documents per 60 secs
10/10m - no more than 10 documents per 10 mins
20/1h - no more than 20 documents per hour
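As an illustration, a rate string could be parsed as follows. This is a sketch, not a reference parser: the names parse_rate and UNIT_SECONDS are hypothetical, and accepting an explicit "s" suffix is an assumption (the text only shows "m", "h", and the default of seconds).

```python
import re

# Seconds per time unit; "" is the default unit (seconds).
# The "s" suffix is an assumption, not shown in the proposal.
UNIT_SECONDS = {"": 1, "s": 1, "m": 60, "h": 3600}

def parse_rate(rate: str) -> tuple[int, int]:
    """Parse "<numdocuments>/<timeunit>" into (documents, seconds)."""
    m = re.fullmatch(r"(\d+)/(\d+)([smh]?)", rate.strip())
    if m is None:
        raise ValueError(f"not a valid rate: {rate!r}")
    docs = int(m.group(1))
    seconds = int(m.group(2)) * UNIT_SECONDS[m.group(3)]
    return docs, seconds
```

With this sketch, "10/10m" becomes (10, 600) and "20/1h" becomes (20, 3600).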
where <data> depends upon the directive and items in
"[" and "]" are optional.
Unless otherwise noted, each directive can appear more than once in a given
rule set. The following directives are defined for
Version 2.0
User-agent: *
Disallow: /
See the original robot exclusion standard for more information.
The first number indicates major revisions to the
robots.txt standard. The second number represents
clarifications or fixes to the robots.txt standard.
The first part is the major version number of the
robots.txt standard. Only drastic changes to the
standard shall cause this number to be increased.
Valid numbers for this part are 1 and 2.
The second part is for partial upgrades, clarifications or small added extensions. My intent is to follow the Linux Kernel numbering convention here and have even numbers be stable (or agreed upon) standards, and odd numbers to be experimental, with possible differing interpretations of headers.
The final number is a revision of the current major and minor numbers. It is hoped that this number will be 0 for even versions of the robots.txt standard.
This will follow the User-agent: header. If it does not immediately follow, or is missing, then the robot is to assume the rule set follows the Version 1.0.0 standard.
Only one Robot-version: header per rule set is allowed.
A version number of 1.0 is allowed.
When checking the version number, a robot can assume (if
the second digit is even) that a higher version number than
it's looking for is okay (i.e. if a robot is looking for
version 2.0 and comes across
2.2, then it can still use the rule
set).
If a robot comes across a lower version number, then it will have to correctly parse the headers according to that version.
A robot, if it comes across an experimental version number,
should probably ignore that rule set and use the default.
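The version checks described above might be sketched like this. The function name classify_version and the three return labels are hypothetical, and the sketch assumes the version string has at least two dot-separated numbers; it takes the text literally in treating any higher, even-numbered version as usable.

```python
def classify_version(found: str, wanted: str = "2.0") -> str:
    """Decide what a robot looking for `wanted` does with a rule set
    that declares version `found`."""
    f_major, f_minor = (int(p) for p in found.split(".")[:2])
    w_major, w_minor = (int(p) for p in wanted.split(".")[:2])
    if f_minor % 2 == 1:
        # Odd second number: experimental, so fall back to the default.
        return "ignore"
    if (f_major, f_minor) >= (w_major, w_minor):
        # Same or higher stable version: the rule set can be used.
        return "use"
    # Lower version: parse the headers according to that older version.
    return "parse-as-older"
```

A robot written for 2.0 would then accept a 2.2 rule set, skip a 2.1 rule set, and read a 1.0 rule set by the older rules.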
It has been suggested that the version number present is more for documentation purposes than for content negotiation. This is still being decided, but a version number should be included.
A new RFC Draft for the robot exclusion protocol (which is a clarification of Version 1.0 of the robot exclusion standard) has added the Allow: directive.
This directive (if included) and the Disallow: directive are to be processed in the order they appear in the rule set. This is to simplify the processing, avoid ambiguity and allow more control over what is and isn't allowed.
If a URL is not covered by any allow or disallow rules, then the URL is to be allowed (as per the Version 1.0 spec).
An explicit match string has the highest precedence and
grants the robot the explicit permission to retrieve the
URL stated.
A regular expression has the lowest precedence and grants the robot permission to retrieve matching URLs only if no disallow rule filters out the URL (see Disallow).
If there are no disallow rules, then the robot is only allowed to retrieve the URLs that match the explicit and/or regular expressions given.
This directive and the Allow: directive (if included) are to be processed in the order they appear in the rule set. This is to simplify the processing, avoid ambiguity and allow more control over what is and isn't allowed.
If a URL is not covered by any allow or disallow rules, then the URL is to be allowed (as per the Version 1.0 spec).
Any URL matching the explicit match or the wild
card/regular expression is not to be retrieved.
If there are no allow rules, then any URL not matching the rule(s) can be retrieved by the robot.
If there are allow rules, then explicit allows have a higher precedence than a disallow rule. Disallow rules have a higher precedence than regular expression allow rules. Any URL not matching the disallow rules has to then pass (any) regular expression allow rules. If there are no allow rules, then anything not covered by the disallow rule set is allowed.
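One possible reading of the in-order processing described above is "first matching rule wins, and anything unmatched is allowed". The Python sketch below covers only that reading and does not model the explicit-versus-regular-expression precedence; the name is_allowed, the (directive, pattern) tuple layout, and the use of shell-style patterns via the standard fnmatch module are all assumptions made for illustration.

```python
from fnmatch import fnmatchcase

def is_allowed(rules: list[tuple[str, str]], path: str) -> bool:
    """Check `path` against (directive, pattern) pairs in file order."""
    for directive, pattern in rules:
        # Shell-style match: "*" matches any run of characters.
        if fnmatchcase(path, pattern):
            return directive.lower() == "allow"
    # Not covered by any allow or disallow rule: allowed by default.
    return True
```

For instance, with the rules [("Allow", "*index.html"), ("Disallow", "*")], the path /blackhole/index.html is allowed and /order.cgi is not.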
This can only appear once per rule set.
More than one can appear in a rule set, allowing several
windows of access to a robot.
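A robot could test the current UT time against several windows roughly as follows. This is a sketch under assumptions: windows are written as in the examples in this document (for instance 0600-0845), an optional ":" inside each time is tolerated, both endpoints are treated as inclusive, and a window whose end is earlier than its start is taken to wrap past midnight. The names parse_window and in_any_window are hypothetical.

```python
from datetime import time

def parse_window(spec: str) -> tuple[time, time]:
    """Parse "HHMM-HHMM" (or "HH:MM-HH:MM") into a (start, end) pair."""
    start, end = (part.strip().replace(":", "") for part in spec.split("-"))
    return (time(int(start[:2]), int(start[2:])),
            time(int(end[:2]), int(end[2:])))

def in_any_window(now: time, windows: list[tuple[time, time]]) -> bool:
    """True if `now` (in UT) falls inside at least one access window."""
    for start, end in windows:
        if start <= end:
            if start <= now <= end:
                return True
        elif now >= start or now <= end:
            # Window wraps past midnight, e.g. 1700-0459.
            return True
    return False
```

With no windows given, in_any_window returns False, so a caller would need to treat "no Visit-time directive" as unrestricted before calling it.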
If more than one Request-rate: directive is given without a time range, use the one that requests the fewest documents per normalized unit of time.
A normalized rate is one document per X seconds. For example, a rate of 100/24h can be normalized as:
100 documents / 24 hours = 100 documents / 86400 seconds = 1 document / 864 seconds
Or, 1 document every 864 seconds (about 14 minutes).
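The normalization and the "fewest documents" choice can be illustrated in Python. The sketch assumes each rate has already been reduced to a (documents, seconds) pair; the names seconds_per_document and strictest_rate are hypothetical.

```python
def seconds_per_document(docs: int, seconds: int) -> float:
    """Normalize a rate to seconds per single document."""
    return seconds / docs

def strictest_rate(rates: list[tuple[int, int]]) -> tuple[int, int]:
    """Pick the rate that allows the fewest documents per unit of time,
    i.e. the one with the longest wait per document."""
    return max(rates, key=lambda r: seconds_per_document(*r))
```

For 100/24h this gives 86400 / 100 = 864 seconds per document, matching the worked example.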
If no Request-rate: is given, then the robot is encouraged to use the following rule of thumb for time between requests:
twice the amount of time it took to retrieve the document, or 10 seconds, whichever is slower.
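Read as "wait for the longer of the two", this rule of thumb is a one-line calculation. The sketch below assumes the robot measures the previous retrieval time in seconds; the name default_delay is hypothetical.

```python
def default_delay(last_fetch_seconds: float) -> float:
    """Seconds to wait before the next request when no Request-rate: is given."""
    # "Whichever is slower" is taken to mean the longer of the two waits.
    return max(2 * last_fetch_seconds, 10.0)
```

A document that took 1.5 seconds to retrieve leads to a 10 second wait, while one that took 8 seconds leads to a 16 second wait.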
The following examples are based upon a fictitious site called www.frommitz.biz with the following structure:
- /index.html
- /images/
- index.html
- fromlogo.jpg
- navbar.jpg
- blueball.gif
- redball.gif
- usamap.gif
- portrait.jpg
- /products.html
- /order.html
- /order.shtml
- /order.cgi
- /blackhole/
- index.html
- info98.html
- info98.shtml
- info99.html
- info99.gif
- info8.html
- page3.html
- info/
- index.html
- page1.html
- page2.shtml
- page4.html
- thankyou.html
- /overview.html
- /thankyou.html
Given the lack of the Robot-version: directive, the following rule set automatically defaults to the Version 1.0 standard.
#-----------------------------------------------------------
# The following robots only understand the 1.0 spec, so
# really limit where they can go
#-----------------------------------------------------------
User-agent: fredsbot
User-agent: pandabot
User-agent: chives
Disallow: /images # anything starting with /images
Disallow: /order.shtml # don't go there!
Disallow: /order.cgi # nor there either!
Disallow: /blackhole # there be bad karma here
Again, due to the lack of the Robot-version: directive, the following rule set follows the Version 1.0 standard.
#------------------------------------------------------------------
# The following robot only understands the 1.0 spec,
# but since the search engine it represents is really popular, we want more
# pages to be indexed. That means we need a longer rule set for this
# particular robot.
#------------------------------------------------------------------
User-agent: popularsite
Disallow: /images
Disallow: /order.shtml
Disallow: /order.cgi
Disallow: /blackhole/info98.shtml
Disallow: /blackhole/info99.html
Disallow: /blackhole/info99.gif
Disallow: /blackhole/info8.html
Disallow: /blackhole/info/page2.shtml
While this is a slightly larger rule set than the last example, it reasonably covers all the cases so that as the site grows, this rule set doesn't have to. Also note that alfred, newchives and oscarbot are allowed to retrieve /images/index.html, since it is explicitly stated, but that other references under /images are not allowed. Also, any references to server side include files are not allowed, nor are any images or CGI scripts.
#-------------------------------------------------------------
# The following robots understand the 2.0 spec, so we
# can allow them a bit more freedom about what to do
#-------------------------------------------------------------
User-agent: alfred
User-agent: newchives
User-agent: oscarbot
Robot-version: 2.0 # uses 2.0 spec
Allow: *index.html # allow any index pages
Allow: /images/index.html # make sure we index this
Allow: /blackhole/index.html # and we allow this page to be indexed
Allow: /blackhole/info* # as well as these
Disallow: * # nothing else will be allowed
Disallow: *.shtml # don't index server include files
Disallow: *.cgi # don't attempt to access cgi scripts
Disallow: *.gif # no images
Disallow: *.jpg
Disallow: /images* # don't index here generally
Disallow: /blackhole/info99* # these we don't want indexed
Disallow: /blackhole/info8.html # nor this one
This robot is only allowed to retrieve documents at the rate of one document every 30 minutes. Also, it is only allowed to retrieve index.html files. Everything else is not allowed.
#-----------------------------------------------------------------------
# the following robot understands the 2.0 spec, but we
# don't like the people running it, so let's limit how fast it can retrieve
# documents.
#-----------------------------------------------------------------------
User-agent: hackerbot
Robot-version: 2.0 # uses 2.0 spec
Request-rate: 1/30m # one document every 30 minutes
Allow: *index.html # allow any index pages
Disallow: * # but nothing else
The following robot is instructed to only retrieve HTML documents (and only HTML documents) between the hours of 6:00 am and 8:45 am UT (GMT), which in this example is 1:00 am and 3:45 am EST (the location of the fictitious web site).
#------------------------------------------------------------------------
# the following robot also understands the 2.0 spec, but
# we want to limit when it can visit the site
#------------------------------------------------------------------------
User-agent: suckemdry
Robot-version: 2.0
Allow: *.html # only allow HTML pages
Disallow: * # and nothing else
Visit-time: 0600-0845 # and then only between 1 am to 3:45 am EST
The following robots can retrieve any HTML document but, depending upon the time they visit, are limited in how fast they may retrieve the documents. Also, a comment is given explaining why they're being limited the way they are.
#-----------------------------------------------------------------------
# okay robots - but since they seem to keep trying over and over again,
# so let's limit them and attempt to keep them accessing us during slow
# times.
#-----------------------------------------------------------------------
User-agent: vacuumweb
User-agent: spanwebbot
User-agent: spiderbot
Robot-version: 2.0
Request-rate: 1/10m 1300-1659 # 8:00 am to noon EST
Request-rate: 1/20m 1700-0459 # noon to 11:59 pm EST
Request-rate: 5/1m 0500-1259 # midnight to 7:59 am EST
Comment: because you guys try all the time, I'm gonna limit you
Comment: to how many documents you can retrieve. So there!
Allow: *.html
Disallow: *
The next example states that no robot should visit further. This follows the Version 1.0 standard.
#--------------------------------------------------------------------
# go away you bother us
#--------------------------------------------------------------------
User-agent: *
Disallow: /