What is a robots.txt file?

A robots.txt file is simply a plain-text file that lives in your site’s root directory and tells search engines which parts of your site they are allowed to crawl.

More importantly, it tells web robots which areas of the site you don’t want them to access.
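
For example, a bare-bones robots.txt (the folder name below is just a placeholder) might look like this:

User-agent: *
Disallow: /example-folder/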

A few historical and practical notes about the robots.txt file standard:

  • The /robots.txt file is a de facto standard and is not owned by any standards body. There are two historical descriptions: the original 1994 document, A Standard for Robot Exclusion, and a 1997 Internet Draft specification, A Method for Web Robots Control.
  • Some additional external resources: HTML 4.01 specification, Appendix B.4.1, and Wikipedia – Robots Exclusion Standard
  • Different crawlers may interpret syntax differently
  • The /robots.txt standard is not actively developed.
  • Web robots can choose to ignore your site’s /robots.txt. This is especially common with malware robots looking for security vulnerabilities.
  • The /robots.txt file is a publicly available file. This means that anyone can see what sections of your server you don’t want robots to use, so don’t try to use /robots.txt to hide information.
  • Your robots.txt directives can’t prevent other sites from referencing or linking to your URLs.

Where can you find your robots.txt file?

In the root directory of your site. As an example, mine lives at https://jacobstoops.com/robots.txt (its full contents appear in the WordPress section at the end of this post).

Why is this file so important?

The robots.txt file directly controls how your site is crawled (and, by extension, indexed), and if handled improperly it can cause your site (or parts of your site) to be completely de-indexed, or to display sub-optimally in Search Engine Results Pages (SERPs).

While it’s certainly not the most glamorous aspect of SEO, your robots.txt file has the potential to be pretty impactful to SEO performance.

Do you technically need it?

No, the absence of a robots.txt file will not prohibit search engines from crawling your website.

However, I’d highly recommend having one: it has been a web standard for 20+ years, it lets you control how your site is crawled and indexed, and you can use it to point search engines to your XML sitemap.

Robots.txt formatting & rules

User agents

First, every robots.txt file must specify a user-agent, which defines which web robots the rules that follow apply to.

Using User-agent: * applies the rules to all web robots.

User-agent: *

However, you can target specific bots with specific rules. Here’s an example of how to target Googlebot:

User-agent: Googlebot

See a list of all user agents/web robots. As a note, you can include multiple user-agents with unique rules for each in a single robots.txt file.
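
For example, the following file lets Googlebot crawl everything while keeping all other robots out of one folder (the folder name is purely illustrative):

User-agent: Googlebot
Disallow: 

User-agent: *
Disallow: /example-folder/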

Directives

Below are some examples of how to use the robots.txt file to control crawler access to your entire site or to specific folders and pages.

To block crawlers from accessing all site content

User-agent: * 
Disallow: /

To allow crawlers access to all site content

User-agent: * 
Disallow: 

To exclude crawlers from accessing specific folders & pages

User-agent: * 
Disallow: /wp-content/
Disallow: /wp-plugins/
Disallow: /example-folder/example.html

To exclude a single robot from crawling, but allow others

User-agent: * 
Disallow: 

User-agent: Baiduspider 
Disallow: /

To exclude access to a specific folder, but allow certain file types

User-agent: * 
Disallow: /example/
Allow: /example/*.jpg
Allow: /example/*.gif
Allow: /example/*.png
Allow: /example/*.css
Allow: /example/*.js

Although Allow was not part of the original robots.txt standard, major crawlers such as Googlebot support it; I’ve used this method and it does work.

In fact, here’s an example screenshot of it working on my site (according to Google’s Robots.txt tester tool).

Robots.txt testing tool example

To allow access to a specific folder, but exclude certain file types

Just the opposite of the above method.

User-agent: *
Allow: /wp-content/plugins/
Disallow: /wp-content/plugins/*.png

To disallow access to a specific file type

User-agent: *
Disallow: /*.gif$

There are certainly more ways that you can customize the robots.txt file, but these should get you started.
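
As one more illustration, Google also supports wildcards inside a path, which can be used to keep crawlers away from URLs containing a given query parameter (the parameter name below is hypothetical):

User-agent: *
Disallow: /*?sessionid=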

Sitemap Link

Below is the format for adding a link to your site’s XML sitemap.

User-agent: * 
Disallow: 
Sitemap: https://jacobstoops.com/sitemap_index.xml

Best practices for configuring your robots.txt file for SEO?

1. It must be named robots.txt, must be a TXT file, and must be located at the root directory of your site (e.g. example.com/robots.txt)

These conventions matter because crawlers only look for the file in your site’s root directory; if you save it in a sub-directory (e.g. example.com/directory/robots.txt), they won’t find or use it.

2. Only exclude files and site directories that you don’t want crawled or indexed

For example, if you have directories that would lead to duplicate content issues, you can use the robots.txt file to keep crawlers out of them.

Additionally, it’s probably wise to disallow crawling of files that would contain sensitive data such as phone numbers, transaction information, etc. (although these things are probably better protected via HTTPS).
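
For instance, if printer-friendly copies of pages and internal search results create duplicate content (both folder names below are hypothetical), the file could look like this:

User-agent: *
Disallow: /print/
Disallow: /internal-search/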

3. Don’t disallow access to your entire site, unless you really don’t want it crawled

This is probably the biggest “no-no” on the SEO side. Unfortunately, it does happen.

Client examples: In my ~10 years of experience, I can remember this happening at least twice – once with a small site and once with a large e-commerce site. In both cases, the sites were almost fully de-indexed for a period, and in the case of the large e-commerce site, the revenue ramifications from natural search were extremely severe.

Also in both cases, the change came from someone outside of the SEO team, which hints that maybe they didn’t fully understand what they were doing. Both cases were unfortunate setbacks for the respective SEO programs.

4. Do not block CSS, JavaScript, or image files with the robots.txt file (unless there is a specific reason to do so)

In October of 2014, Google updated their technical webmaster guidelines around indexing of CSS, JavaScript, images, etc.

Here’s what they said:

Disallowing crawling of Javascript or CSS files in your site’s robots.txt directly harms how well our algorithms render and index your content and can result in suboptimal rankings.

To see if you’re blocking any critical resources, you can use Google Search Console’s Blocked Resources report:

Google Search Console Blocked Resources report
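
If a directory such as a plugin folder has to stay disallowed, one approach (the same one used in my own WordPress configuration later in this post) is to carve out Allow exceptions for the resource types Google needs for rendering:

User-agent: *
Disallow: /wp-content/plugins/
Allow: /wp-content/plugins/*.css
Allow: /wp-content/plugins/*.js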

5. Always include a link to the location of your primary XML sitemap or sitemap index

This is a great way for search engines to discover your site’s XML sitemap, especially if you haven’t already submitted it via Google Search Console.

6. Review proposed robots.txt rules using Google’s robots.txt testing tool prior to publishing them live to production

This will help ensure that no important pages are accidentally blocked from search engine access.

See: https://support.google.com/webmasters/answer/6062598?hl=en&ref_topic=6061961

7. Review (and update as necessary) the robots.txt file regularly to ensure no issues exist

A lot of things can happen in the course of deployments, code releases, etc. Reviewing your robots.txt file in combination with tools such as Google Search Console can help ensure that you’re handling your site’s crawling, indexing, and robots.txt configuration appropriately.

8. Use in conjunction with the on-page noindex tag for best control of indexing

Blocking URLs via the robots.txt file may not prevent those pages/files from showing up as URL-only listings in SERPs – especially if the pages were indexed prior to being excluded.

If a page is already indexed but blocked by robots.txt, the following message will show in SERPs:

Example of robots.txt description in SERP's

The best solution for keeping a particular page out of the index entirely is to use a robots meta noindex tag on a per-page basis, in combination with a thoughtfully configured robots.txt file.

This is the best way to stop pages from getting into the index in the first place, or to get already-indexed pages removed. Keep in mind that a crawler has to be able to fetch a page to see its noindex tag, so don’t block that page in robots.txt until it has dropped out of the index.
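
For reference, the robots meta tag is placed in a page’s <head> and, in its generic form, looks like this:

<meta name="robots" content="noindex">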

Recommended configuration for WordPress?

If you’re using WordPress, it’s pretty easy to edit your robots.txt file using either FTP or a plugin such as Yoast SEO for WordPress. Simply follow these steps to make changes to your robots.txt file.

In terms of the best way to configure, there are many ways to skin a cat.

Here’s how I’ve configured mine:

User-agent: *

Disallow: /wp-content/plugins/
Allow: /wp-content/plugins/*.jpg
Allow: /wp-content/plugins/*.gif
Allow: /wp-content/plugins/*.png
Allow: /wp-content/plugins/*.css
Allow: /wp-content/plugins/*.js

Disallow: /wp-admin/

Sitemap: https://jacobstoops.com/sitemap_index.xml

I’ve ensured that key content will be crawled; this configuration blocks key directories while still allowing resources such as JavaScript, CSS, and images to be crawled and indexed.

I’ve also included a link to my XML sitemap index file.

Additional Robots.txt Resources

Image credit: Drive E-News