The 3 laws of robots.txt

Black Duck Editorial Staff

Mar 22, 2015 / 4 min read

Today, I will discuss how the robots.txt file can be used by attackers to gain a foothold in your environment, and how a low-risk finding in the robots.txt file can lead to further compromise. The robots.txt file is the de facto standard website developers use to tell web robots which parts of a site they may crawl. For the purposes of this post, I will use content management systems (CMS) such as WordPress and Drupal as examples to demonstrate how an attacker can gain that foothold. We'll explore this topic using the three laws of robots.txt.

Robots.txt Law #1

A robot may not injure a human being or, through inaction, allow a human being to come to harm.

According to the robots exclusion protocol (REP), the robots.txt file is used by website developers to provide instructions about their site to indexing web robots.

What is a robot?

Web robots, crawlers, or spiders are programs that visit pages on the Internet and recursively retrieve all the linked pages to be indexed for future searches. The REP was developed to reduce unwelcome or excessive robot requests; it provides a method by which website developers can direct robots to the parts of their sites they wish to be accessed. A compliant robot checks the robots.txt rules and traverses or indexes the site accordingly.
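
For a concrete picture of that check, here is a minimal Python sketch of how a well-behaved robot consults robots.txt before fetching a page, using the standard library's urllib.robotparser (the site URL, path, and user-agent string below are placeholders):

from urllib import robotparser

# Load the site's robots.txt (placeholder URL)
rp = robotparser.RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")
rp.read()

# A compliant robot checks the rules before crawling a path
path = "https://www.example.com/wp-admin/"
if rp.can_fetch("ExampleBot", path):
    print("robots.txt allows crawling", path)
else:
    print("robots.txt disallows", path, "- a polite robot skips it")

Of course, nothing but good manners enforces this check, which is exactly the problem discussed next.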

Robots.txt Law #2

A robot must obey the orders given it by human beings, except where such orders would conflict with Law #1.

In his recent article “When risk management goes bad,” Gary McGraw provides guidelines for risk analysis and risk management, using software risk analysis as an example to illustrate how medium-risk issues can become high-risk issues. Dr. McGraw points out that it’s important to strike a balance in your security program between security and good business practice: weigh up your assets and place them into three simple categories, high, medium, and low risk. He goes on to note that security controls should be spread across your entire portfolio, with no areas left naked.

In this example, I illustrate how a low-risk finding, such as the presence of a robots.txt file, can quickly become a high-risk one (e.g., gaining administrator access to the CMS). As a robot explores a website, it checks for the existence of the robots.txt file and finds allow and disallow parameters. Those parameters can contain useful information for an attacker, and nothing forces a robot to honor them: malware and email-harvester bots can ignore the robots.txt file completely while harvesting email addresses or probing for website vulnerabilities. In addition, the file sits at a known location and is publicly accessible, making it easy for anyone, including attackers, to see what you’re trying to hide. The following examples are commonly found in WordPress and Drupal robots.txt files; both are open source CMSs that run millions of popular websites.

WordPress example:

User-Agent: *
Disallow: /extend/themes/search.php
Disallow: /themes/search.php
Disallow: /support/rss
Disallow: /archive/
Disallow: /wp-content/plugins/
Disallow: /wp-admin/ <— Check this to confirm a WordPress installation

Drupal example:

# robots.txt
# This file is to prevent the crawling and indexing of certain parts
# of your site by web crawlers and spiders run by sites like Yahoo!
# and Google. By telling these "robots" where not to go on your site,
# you save bandwidth and server resources.
# Directories
Disallow: /includes/
Disallow: /misc/
Disallow: /modules/
# Files
Disallow: /INSTALL.pgsql.txt
Disallow: /INSTALL.sqlite.txt
Disallow: /install.php <— Check this to confirm a Drupal installation
Disallow: /INSTALL.txt

The robots.txt file provides an attacker with a single place to check for information like this. Once it’s determined which CMS is in use, an attacker can focus the attack, enumerating vulnerabilities known to affect specific versions. An example of this is a known user enumeration vulnerability in WordPress, which reveals the account names of the users on the site. Iterating through the user IDs will reveal the users’ logins (e.g., wordpressexample.com/?author=1).
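
As a rough illustration of how that enumeration works, the sketch below (Python with the third-party requests library; the domain is taken from the example above and the ID range is an arbitrary placeholder) follows the redirect that many WordPress sites issue from /?author=N to /author/<login>/:

import requests

BASE = "https://wordpressexample.com"  # placeholder domain from the example above

# On many WordPress sites, /?author=N redirects to /author/<login>/,
# revealing the login name for user ID N.
for author_id in range(1, 6):
    resp = requests.get(f"{BASE}/?author={author_id}", timeout=10)
    if "/author/" in resp.url:
        login = resp.url.rstrip("/").split("/")[-1]
        print(f"user ID {author_id} -> login '{login}'")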

Once the attacker has determined the user account names, they can begin to perform brute-force attacks against those accounts. Recently in the news, a WordPress search engine optimization (SEO) plugin was found to be vulnerable to blind SQL injection, and installations like this can easily be discovered by running Nmap against a target:

nmap -sV --script http-wordpress-enum <target>

Robots.txt Law #3

A robot must protect its own existence as long as such protection does not conflict with Laws #1 and #2.

So what is the best use of REP to prevent vulnerabilities like these? Disallow everything?

User-agent: *
Disallow: /

This is not ideal: your site will no longer be properly indexed by search engines, which is terrible for your site’s SEO. So be specific, and only allow robots to index the bare minimum. For finer control, use page-level indexing directives. The robots meta tag, placed in a page’s HTML, lets you control how that page is indexed, and the X-Robots-Tag is its equivalent sent as an HTTP response header. Any directive that can be used in a robots meta tag can also be specified in an X-Robots-Tag header, which also works for non-HTML resources.

Update your WordPress core with a no-index patch that sends an X-Robots-Tag header; this will remove WordPress directories from search engine results. Another good solution is the use of robots meta tags within your HTML, which keep a page out of the search listings while still allowing the links on it to be followed.
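
For example, the same noindex directive can be expressed either way:

HTTP response header:
X-Robots-Tag: noindex

HTML meta tag (in the page's head):
<meta name="robots" content="noindex">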

Resolving other security issues in the CMS

If you use a CMS, there are some steps you should take to resolve other security issues that may still be present. Scan your site for vulnerabilities, rename the default administrator account, and lock down CMS admin access with .htaccess (see the sketch below). To prevent future security issues, regularly back up your CMS installation, keep your plugins up to date, and use encrypted methods such as SFTP to transfer your files. Restrict access to your database and configuration files, and tighten file permissions. Finally, enable logging and log monitoring, and use strong passwords and two-factor authentication where possible.
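
As one illustration of that .htaccess lockdown, if your site runs on Apache 2.4 you could place a file like this in the wp-admin directory to restrict it to a trusted address (a sketch only; the IP is a placeholder):

# .htaccess inside /wp-admin/ -- only the listed address may reach the admin area
Require ip 203.0.113.10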

This serves as an example of how a seemingly harmless file like robots.txt can be leveraged by an attacker to further attacks against your environment. Strike a balance in your security program by weighing up your assets and giving each the appropriate attention, build security into your policies and procedures, and soon those low-risk findings will be a thing of the past.
