An Extended Standard for Robot Exclusion

Table of Contents:

Updates of this draft

Updates as of 11:40pm EST November 5, 2002
Updates as of 3:00am EST November 16, 1996
Updates as of 7:00pm EST November 11, 1996



Status of this document

This document represents some informal extensions that have yet to be agreed upon. A preliminary version of this document was posted to the robots mailing list (robots-request@webcrawler.com). This document is based upon that preliminary version.

This is not an official standard backed by a standards body, or owned by any commercial organization. It is not enforced by anyone, and there are no guarantees that all current and future robots will use it. These are just proposed extensions to the current robot exclusion standard.

The latest version of this document can be found at http://www.conman.org/people/spc/robots2.html .


Introduction

For some time now, it has been apparent that the current robot exclusion standard has some deficiencies in the amount of control it gives web server administrators over where robots are and are not allowed to visit. There is also no mechanism in place stating what are good times for robots to visit, nor a mechanism stating how fast a robot can safely pull documents, in addition to the reasons stated in the original robot exclusion standard.

This document proposes such extensions to the robot exclusion standard. The original standard is referred to as Version 1.0, while the extensions proposed here are referred to as Version 2.0 of the robot exclusion standard.

More information about robots in general can be found on the World Wide Web Robots, Wanderers, and Spiders page.


The Method

The following is taken verbatim from the original robot exclusion standard. That standard covers Version 1.0 of the robot exclusion standard, while this covers Version 2.0 of the robot exclusion standard. The method hasn't changed between Version 1.0 and Version 2.0, though.

The method used to exclude robots from a server is to create a file on the server which specifies an access policy for robots. This file must be accessible via HTTP on the local URL "/robots.txt". The contents of this file are specified below.

This approach was chosen because it can be easily implemented on any existing WWW server, and a robot can find the access policy with only a single document retrieval.
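
As an illustration of this single-retrieval approach, here is a minimal Python sketch. The host name is the fictitious site used in the examples later in this document; a real robot would substitute the server it is about to visit.

import urllib.request
import urllib.error

def fetch_robots_txt(host):
    # retrieve the access policy with a single document retrieval
    url = "http://%s/robots.txt" % host
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            return response.read().decode("utf-8", errors="replace")
    except urllib.error.URLError:
        # no /robots.txt (or unreachable server): no policy to honor
        return None

if __name__ == "__main__":
    print(fetch_robots_txt("www.frommitz.biz"))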

A possible drawback of this single-file approach is that only a server administrator can maintain such a list, not the individual document maintainers on the server. This can be resolved by a local process to construct the single file from a number of others, but if, or how, this is done is outside of the scope of this document.

The choice of the URL was motivated by several criteria:

  • The filename should fit in file naming restrictions of all common operating systems.
  • The filename extension should not require extra server configuration.
  • The filename should indicate the purpose of the file and be easy to remember.
  • The likelihood of a clash with existing files should be minimal.


The Format

The format for the /robots.txt file is a series of rule sets, which consist of one or more User-agent: directives followed by one or more other directives.

Tags

The following tags are used in defining the data for each directive:
<comment> - Version 1.0, 2.0 general comment line format

Any text following a "#" up to the end of a line is to be ignored. The "#" character can appear at the start of a line, or anywhere on a line provided it is preceded by a blank (or other whitespace) character. Some examples:

# this is a comment line
#so is this
User-agent: fredsbot # this is another comment
Disallow: * # we don't like this bot

<general> - Version 1.0 general string match format.

The general match is included for compatibility with Version 1.0 of the robot exclusion standard. General matches do not contain regular expression characters, but are treated as if the character "*", which matches zero or more characters, were appended to the end of the string. An example would be: /helpme, which is to be treated as: /helpme*.

This exists solely for Version 1.0 compatibility, and its usage can be determined by context.
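
As an illustration of this trailing-"*" treatment, a minimal Python sketch (the function name is purely illustrative):

def general_match(pattern, path):
    # a <general> match behaves as if "*" were appended: a prefix match
    return path.startswith(pattern)

# general_match("/helpme", "/helpme2.html") is True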

<explicit> - Version 2.0 explicit string match format.

This contains no regular expression characters and any string to be matched using an explicit string match must match all the characters present exactly.

<regex> - Version 2.0 regular expression string match format.

This is a regular expression that is compatible with that used by /bin/sh, a shell found on all implementations of UNIX. Within the pattern string, "*" matches zero or more characters, "?" matches any single character, and "[...]" matches any single character listed between the brackets.
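
A minimal sketch of how a robot might test a <regex> pattern, assuming Python's fnmatch module, which implements sh-style pattern matching:

from fnmatch import fnmatchcase

def regex_match(pattern, path):
    # sh-style matching: "*", "?" and "[...]" as described above
    return fnmatchcase(path, pattern)

# regex_match("*index.html", "/images/index.html") is True
# regex_match("*.cgi", "/order.cgi") is True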

<version> - Version 2.0 version numbering scheme format.

See the directive Robot-version for more details.

<time> - Version 2.0 time format.

This is in the format of HH:MM, where HH is in 24 hour time (inclusive between 00 and 23) and MM is minutes (inclusive between 00 and 59).

The time is specified in UT (or GMT) time.

<rate> - Version 2.0 rate format.

<rate> is defined as <numdocuments> '/' <timeunit>. The default time unit is the second. See the directive Request-rate for more information.

An example of some rates:

10/60 - no more than 10 documents per 60 secs
10/10m - no more than 10 documents per 10 mins
20/1h - no more than 20 documents per hour
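
A minimal sketch of parsing a <rate>, assuming only the forms shown in the examples above (a bare number of seconds, "m" for minutes, "h" for hours):

def parse_rate(rate):
    # return (documents, seconds) for a string such as "10/10m"
    documents, timeunit = rate.split("/")
    multiplier = 1                     # the default time unit is the second
    if timeunit.endswith("m"):
        multiplier, timeunit = 60, timeunit[:-1]
    elif timeunit.endswith("h"):
        multiplier, timeunit = 3600, timeunit[:-1]
    return int(documents), int(timeunit) * multiplier

# parse_rate("10/60")  is (10, 60)
# parse_rate("10/10m") is (10, 600)
# parse_rate("20/1h")  is (20, 3600)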

Directives

Each directive has the following format:

<directive> ':' [<whitespace>] <data> [<whitespace>] [<comment>] <end-of-line>

where <data> depends upon the directive, and items in "[" and "]" are optional. Unless otherwise noted, each directive can appear more than once in a given rule set. The following directives are defined for Version 2.0:
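
A minimal sketch of splitting a single directive line into its parts, assuming the comment handling described under <comment> (the function name is illustrative):

def parse_directive(line):
    # returns (directive, data), or None for blank and comment-only lines
    if line.lstrip().startswith("#"):
        return None
    for i in range(1, len(line)):
        if line[i] == "#" and line[i - 1].isspace():
            line = line[:i]            # strip a trailing comment
            break
    line = line.strip()
    if not line or ":" not in line:
        return None
    directive, _, data = line.partition(":")
    return directive.strip(), data.strip()

# parse_directive("Disallow: * # we don't like this bot") is ("Disallow", "*")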

User-agent
Format:
User-agent: <general> # Version 1.0, 2.0
Comment:
This is the same format as in Version 1.0, with the added note that the default rule set should follow the Version 1.0 standard format of:

User-agent: *
Disallow: /

See the original robot exclusion standard for more information.

Robot-version
Format:
Robot-version: <version> # Version 2.0
Comment:
The version is a two part number, separated by a period.

The first number indicates major revisions to the robots.txt standard. The second number represents clarifications or fixes to the robots.txt standard.

My intent is to follow the Linux kernel numbering convention here and have even second numbers be stable (or agreed upon) standards, and odd second numbers be experimental, with possibly differing interpretations of headers.

This will follow the User-agent: header. If it does not immediately follow, or is missing, then the robot is to assume the rule set follows the Version 1.0 standard.

Only one Robot-version: header per rule set is allowed.

A version number of 1.0 is allowed.

When checking the version number, a robot can assume (if the second number is even) that a higher version number than it is looking for is okay (i.e. if a robot is looking for version 2.0 and comes across 2.2, then it can still use the rule set).

If a robot comes across a lower version number, then it will have to correctly parse the headers according to that version.

A robot, if it comes across an experimental version number, should probably ignore that rule set and use the default.

It has been suggested that the version number present is more for documentation purposes than for content negotiation. This is still being decided, but a version number should be included.
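
A minimal sketch of the version check described above, assuming the two-part numbering and the even/odd convention for the second number (the function name and return values are illustrative):

def how_to_treat(found, wanted="2.0"):
    # decide how a robot written for `wanted` treats a rule set labelled `found`
    f_major, f_minor = (int(x) for x in found.split("."))
    w = tuple(int(x) for x in wanted.split("."))
    if f_minor % 2 == 1:
        return "ignore"                 # experimental: use the default rule set
    if (f_major, f_minor) <= w:
        return "parse as " + found      # equal or lower: parse per that version
    return "parse as " + wanted         # a higher even version is still okay

# how_to_treat("2.2") is "parse as 2.0"
# how_to_treat("1.0") is "parse as 1.0"
# how_to_treat("2.1") is "ignore"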

Allow
Format:
Allow: <general> # 2.0
Allow: <explicit> # 2.0
Allow: <regex> # 2.0
Comment:
Pending discussion, this directive may not make it into Version 2.0 of the robot exclusion standard.

A new RFC draft for the robot exclusion protocol (which is a clarification of Version 1.0 of the robot exclusion standard) has added the Allow: directive.

Version 1.0
See the new RFC draft for the Version 1.0 behavior, except to note that a general match can be turned into a regular expression match by adding a "*" to the end of the string.

Version 2.0
Pending discussion, Version 2.0 semantics of this directive may not be implemented.

This directive (if included) and the Disallow: directive are to be processed in the order they appear in the rule set. This is to simplify the processing, avoid ambiguity and allow more control over what is and isn't allowed.

If a URL is not covered by any allow or disallow rules, then the URL is to be allowed (as per the Version 1.0 spec).

An explicit match string has the highest precedence and grants the robot the explicit permission to retrieve the URL stated.

A regular expression has the lowest precedence and grants the robot permission to retrieve the matching URLs only if the disallow rules do not filter out the URL (see Disallow).

If there are no disallow rules, then the robot is only allowed to retrieve the URLs that match the explicit and/or regular expressions given.

Disallow
Format:
Disallow: <general> # 1.0
Disallow: <explicit> # 2.0
Disallow: <regex> # 2.0
Comment:
Version 1.0
See the current robots.txt standard for the Version 1.0 behavior, except to note that a general match can be turned into a regular expression match by adding a "*" to the end of the string.

Version 2.0
Pending discussion, Version 2.0 semantics of this directive may not be implemented.

This directive and the Allow: directive (if included) are to be processed in the order they appear in the rule set. This is to simplify the processing, avoid ambiguity and allow more control over what is and isn't allowed.

If a URL is not covered by any allow or disallow rules, then the URL is to be allowed (as per the Version 1.0 spec).

Any URL matching the explicit match or the wild card/regular expression is not to be retrieved.

If there are no allow rules, then any URL not matching the rule(s) can be retrieved by the robot.

If there are allow rules, then explicit allows have a higher precedence than a disallow rule. Disallow rules have a higher precedence than regular expression allow rules. Any URL not matching the disallow rules then has to pass (any) regular expression allow rules. If there are no allow rules, then anything not covered by the disallow rule set is allowed.
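
A minimal sketch of one reading of these precedence rules, assuming sh-style patterns via Python's fnmatch (the in-order processing described earlier is a possible alternative reading; the function and parameter names are illustrative):

from fnmatch import fnmatchcase

def is_allowed(url_path, explicit_allows, pattern_allows, disallows):
    # explicit allows have the highest precedence
    if url_path in explicit_allows:
        return True
    # disallow rules (explicit or pattern) come next
    for rule in disallows:
        if url_path == rule or fnmatchcase(url_path, rule):
            return False
    # pattern allows have the lowest precedence
    if pattern_allows:
        return any(fnmatchcase(url_path, p) for p in pattern_allows)
    return True                        # not covered by any rule: allowed

# is_allowed("/images/index.html", ["/images/index.html"], ["*index.html"], ["/images*"]) is True
# is_allowed("/images/logo.gif", ["/images/index.html"], ["*index.html"], ["/images*"]) is False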

Visit-time
Format:
Visit-time: <time> '-' <time> # 2.0
Comment:
The robot is requested to only visit the site between the given times. If the robot visits outside of this time, it should notify its author/user that the site only wants it between the times specified.

More than one can appear in a rule set, allowing several windows of access to a robot.
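
A minimal sketch of checking a Visit-time window, assuming times in UT and allowing a window that wraps past midnight; it accepts both the HH:MM form defined above and the compact HHMM form used in the examples below:

from datetime import datetime, timezone

def in_visit_window(start, end, now=None):
    # start and end are "HH:MM" (or "HHMM") strings in UT
    now = now or datetime.now(timezone.utc)
    current = now.hour * 60 + now.minute

    def minutes(hhmm):
        return int(hhmm[:2]) * 60 + int(hhmm[-2:])

    s, e = minutes(start), minutes(end)
    if s <= e:
        return s <= current <= e
    return current >= s or current <= e   # the window wraps past midnight

# in_visit_window("06:00", "08:45", datetime(2002, 11, 5, 7, 0, tzinfo=timezone.utc)) is True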

Request-rate
Format:
Request-rate: <rate> # 2.0
Request-rate: <rate> <time> '-' <time> # 2.0
Comment:
If a time range is given, then the robot is to use the given rate (and no faster) when the current time is between the times given.

If more than one Request-rate: directive is given without a time range, use the one that requests the fewest documents per normalized unit of time.

A normalized rate is one document per X seconds. For example, a rate of 100/24h can be normalized as:

  • 100/24h = 100 documents per 24 * 60 * 60 seconds
  • 100/24h = 100 documents per 24 * 3600 seconds
  • 100/24h = 100 documents per 86400 seconds
  • 100/24h = 1 document per 864 seconds

Or, 1 document every 864 seconds (about 14 minutes).

If no Request-rate: is given, then the robot is encouraged to use the following rule of thumb for time between requests:

  • twice the amount of time it took to retrieve the document
  • 10 seconds

whichever is slower.
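
A minimal sketch of the normalization above and of this rule of thumb; the rate is given as the (documents, seconds) pair produced by the <rate> parsing sketch earlier (function names are illustrative):

def normalized_interval(documents, seconds):
    # seconds per document, e.g. 100 documents per 86400 seconds -> 864.0
    return seconds / documents

def polite_delay(last_retrieval_seconds, rate=None):
    # seconds to wait before the next request
    if rate is not None:
        return normalized_interval(*rate)
    # no Request-rate: given -- twice the time the last retrieval took,
    # or 10 seconds, whichever is slower
    return max(2 * last_retrieval_seconds, 10)

# normalized_interval(100, 24 * 60 * 60) is 864.0 (about 14 minutes per document)
# polite_delay(3.2) is 10 (10 seconds is slower than 6.4 seconds)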

Comment
Format:
Comment: <text> <end-of-line> # 2.0
Comment:
These are comments that the robot is encouraged to send back to the author/user of the robot. All Comment:'s in a rule set are to be sent back (at least, that's the intention). This can be used to explain the robot policy of a site (say, that one government site that hates robots).

An empty /robots.txt file has no associated semantics; it will be treated as if it were not present, i.e. all robots will consider themselves welcome.

Examples

Please note that these examples use the Allow: and Disallow: directives as defined in this document. These directives may or may not be in the final draft as defined here.

The following examples are based upon a fictitious site called www.frommitz.biz with the following structure:

Given the lack of the Robot-version: directive, the following rule set automatically defaults to the Version 1.0 robot exclusion standard. Therefore, the only files that fredsbot, pandabot, and chives will be able to retrieve will be /index.html, /products.html, /overview.html and /thankyou.html. Everything else is off limits to these robots.

#-----------------------------------------------------------
# The following robots only understand the 1.0 spec, so
# really limit where they can go
#-----------------------------------------------------------

User-agent: fredsbot
User-agent: pandabot
User-agent: chives
Disallow: /images		# anything starting with /images
Disallow: /order.shtml		# don't go there!
Disallow: /order.cgi		# nor there either!
Disallow: /blackhole		# there be bad karma here

Again, due to the lack of the Robot-version: directive, the following rule set follows the Version 1.0 robot exclusion standard. Note that if we only want certain pages indexed, we need to explicitly exclude almost everything that isn't allowed. As the site grows, this can create large rule sets which may break certain robots.

#------------------------------------------------------------------
# The following robot only understands the 1.0 spec,
# but since the search engine it represents is really popular, we want more
# pages to be indexed.	That means we need a longer rule set for this
# particular robot.
#-------------------------------------------------------------------

User-agent: popularsite
Disallow: /images
Disallow: /order.shtml
Disallow: /order.cgi
Disallow: /blackhole/info98.shtml
Disallow: /blackhole/info99.html
Disallow: /blackhole/info99.gif
Disallow: /blackhole/info8.html
Disallow: /blackhole/info/page2.shtml

While this is a slightly larger rule set than the last example, it reasonably covers all the cases so that as the site grows, this rule set doesn't have to. Also note that alfred, newchives and oscarbot are allowed to retrieve /images/index.html, since it is explicitly stated, but that other references under /images are not allowed. Also any references to server side include files are not allowed, nor are any images or CGI scripts.

#-------------------------------------------------------------
# The following robots understand the 2.0 spec, so we
# can allow them a bit more freedom about what to do
#--------------------------------------------------------------

User-agent: alfred
User-agent: newchives
User-agent: oscarbot
Robot-version: 2.0		# uses 2.0 spec
Allow: *index.html		# allow any index pages
Allow: /images/index.html	# make sure we index this
Allow: /blackhole/index.html	# and we allow this page to be indexed
Allow: /blackhole/info* 	# as well as these
Disallow: *			# nothing else will be allowed
Disallow: *.shtml		# don't index server include files
Disallow: *.cgi 		# don't attempt to access cgi scripts
Disallow: *.gif 		# no images
Disallow: *.jpg
Disallow: /images*		# don't index here generally
Disallow: /blackhole/info99*	# these we don't want indexed
Disallow: /blackhole/info8.html # nor this one

This robot is only allowed to retrieve documents at the rate of one document every 30 minutes. Also, it is only allowed to retrieve index.html files. Everything else is not allowed.

#-----------------------------------------------------------------------
# the following robot understands the 2.0 spec, but we
# don't like the people running it, so let's limit how fast it can retrieve
# documents.
#-----------------------------------------------------------------------

User-agent: hackerbot
Robot-version: 2.0		# uses 2.0 spec
Request-rate: 1/30m		# one document every 30 minutes
Allow: *index.html		# allow any index pages
Disallow: *			# but nothing else

The following robot is instructed to only retrieve HTML documents (and only HTML documents) between the hours of 6:00 am and 8:45 am UT (GMT), which, in this example, is 1:00 am and 3:45 am EST (the location of the fictitious web site).

#------------------------------------------------------------------------
# the following robot also understands the 2.0 spec, but
# we want to limit when it can visit the site
#------------------------------------------------------------------------

User-agent: suckemdry
Robot-version: 2.0
Allow: *.html			# only allow HTML pages
Disallow: *			# and nothing else
Visit-time: 0600-0845		# and then only between 1:00 am and 3:45 am EST

The following robots can retrieve any HTML document, but depending upon the time they visit, are limited to how fast they are to retrieve the documents. Also, a comment is given explaining why they're being limited the way they are.

#-----------------------------------------------------------------------
# okay robots - but they seem to keep trying over and over again,
# so let's limit them and attempt to keep them accessing us during slow
# times.
#------------------------------------------------------------------------

User-agent: vacuumweb
User-agent: spanwebbot
User-agent: spiderbot
Robot-version: 2.0
Request-rate: 1/10m 1300-1659		# 8:00 am to noon EST
Request-rate: 1/20m 1700-0459		# noon to 11:59 pm EST
Request-rate: 5/1m  0500-1259		# midnight to 7:59 am EST
Comment: because you guys try all the time, I'm gonna limit you
Comment: to how many documents you can retrieve.  So there!
Allow: *.html
Disallow: *

The next example states that no robot should visit the site at all. This follows Version 1.0 of the spec and all robots should understand this format.

#--------------------------------------------------------------------
# go away you bother us
#--------------------------------------------------------------------

User-agent: *
Disallow: /


Example Code

Unfortunately, there is no official example code to parse Version 2.0 of the robot exclusion standard.
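
As an informal illustration only (and not part of this proposal), the following Python sketch shows one way a robot might group a /robots.txt file into rule sets and pick the one that applies to it. Everything beyond the directive names defined in this document (function names, data structures) is an assumption of the sketch.

def strip_comment(line):
    # "#" counts as a comment only at the start of a line or after whitespace
    if line.lstrip().startswith("#"):
        return ""
    for i in range(1, len(line)):
        if line[i] == "#" and line[i - 1].isspace():
            return line[:i]
    return line

def parse_rule_sets(text):
    # return a list of (user_agents, directives) pairs
    rule_sets, agents, directives = [], [], []
    for raw in text.splitlines():
        line = strip_comment(raw).strip()
        if not line or ":" not in line:
            continue
        name, _, value = line.partition(":")
        name, value = name.strip().lower(), value.strip()
        if name == "user-agent":
            if directives:                         # a new rule set begins
                rule_sets.append((agents, directives))
                agents, directives = [], []
            agents.append(value)
        else:
            directives.append((name, value))
    if agents or directives:
        rule_sets.append((agents, directives))
    return rule_sets

def rules_for(robot_name, rule_sets):
    # pick the rule set naming this robot, falling back to the "*" default
    default = None
    for agents, directives in rule_sets:
        lowered = [a.lower() for a in agents]
        if robot_name.lower() in lowered:
            return directives
        if "*" in lowered:
            default = directives
    return default

A robot would then apply the Robot-version, Allow, Disallow, Visit-time and Request-rate sketches shown earlier to the directives selected by rules_for().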

Author Information

Sean Conner <sean@conman.org> is now an independent software programmer. Back in 1996 when this document was being developed he was the Vice President of Research and Development for Armigeron Information Services, Inc., which allowed him the time and resources to produce and host this document.

Home


Copyright © 1996 by Sean Conner. All Rights Reserved.