What is a robots.txt file?
A robots.txt file is simply a text file that lives at your site’s root and tells search engines which parts of your site they are allowed to crawl.
More importantly, it tells web robots which areas of the site you don’t want them to access.
A few historical and practical notes about the robots.txt file standard:
- The /robots.txt is a de facto standard and is not owned by any standards body. There are two historical descriptions: the original 1994 A Standard for Robot Exclusion document, and a 1997 Internet Draft specification, A Method for Web Robots Control.
- Some additional external resources: HTML 4.01 specification, Appendix B.4.1, and Wikipedia – Robots Exclusion Standard
- Different crawlers may interpret syntax differently
- The /robots.txt standard is not actively developed.
- Web robots can choose to ignore your site’s /robots.txt. This is especially common with malware robots looking for security vulnerabilities.
- The /robots.txt file is a publicly available file. This means that anyone can see what sections of your server you don’t want robots to use so don’t try to use /robots.txt to hide information.
- Your robots.txt directives can’t prevent references to your URLs from other sites.
Where can you find your robots.txt file?
In the root directory of your site. For example, mine lives at https://jacobstoops.com/robots.txt.
Why is this file so important?
The robots.txt file directly controls crawling and indexing, and if configured improperly it can allow your site (or parts of your site) to be completely de-indexed, or display sub-optimally in Search Engine Results Pages (SERPs).
While it’s certainly not the most glamorous aspect of SEO, your robots.txt file has the potential to be pretty impactful to SEO performance.
Do you technically need it?
No, the absence of a robots.txt file will not prevent search engines from crawling your website.
However, I’d highly recommend having one: it has been a web standard for 20+ years, it allows you to control the indexation of your site, and you can use it to point search engines to your XML sitemap.
Robots.txt formatting & rules
First, all robots.txt files must begin with a user-agent line, which defines which crawlers the rules that follow apply to.
Using a User-agent: * applies to all web robots.
However, you can target specific bots with specific rules. Here’s an example of how to target Googlebot:

User-agent: Googlebot
Disallow: /example-folder/
See a list of all user agents/web robots. As a note, you can include multiple user-agents with unique rules for each in a single robots.txt file.
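If you want to sanity-check how multiple user-agent groups interact, Python’s standard-library parser can do it locally. This is a sketch with made-up rules, not taken from any real site:

```python
# Check how different user-agents are treated by a robots.txt
# that contains multiple User-agent groups.
from urllib.robotparser import RobotFileParser

rules = """
User-agent: *
Disallow: /private/

User-agent: Googlebot
Disallow: /
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

print(parser.can_fetch("Googlebot", "/page.html"))  # False: the Googlebot group blocks everything
print(parser.can_fetch("Bingbot", "/page.html"))    # True: falls back to the * group
print(parser.can_fetch("Bingbot", "/private/x"))    # False: blocked by the * group
```

Each group’s rules apply only to the matching user-agent; everything else falls through to the `*` group.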
Below are some examples of how to use the robots.txt file to control indexing of your site or specific folders.
To block crawlers from accessing all site content
User-agent: *
Disallow: /
To allow crawlers access to all site content
User-agent: *
Disallow:
To exclude crawlers from accessing specific folders & pages
User-agent: *
Disallow: /wp-content/
Disallow: /wp-plugins/
Disallow: /example-folder/example.html
To exclude a single robot from crawling, but allow others
User-agent: *
Disallow:

User-agent: Baiduspider
Disallow: /
To exclude access to a specific folder, but allow certain file types
User-agent: *
Disallow: /example/
Allow: /example/*.jpg
Allow: /example/*.gif
Allow: /example/*.png
Allow: /example/*.css
Allow: /example/*.js
Although Allow isn’t part of the original robots exclusion standard, major crawlers such as Googlebot support it, and I’ve used this method successfully.
In fact, here’s an example screenshot of it working on my site (according to Google’s Robots.txt tester tool).
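The way these Allow/Disallow rules combine can be sketched in a few lines of Python. This is an illustrative matcher, not Google’s actual implementation: it follows the commonly documented behavior where rules are treated as wildcard patterns and the most specific (longest) matching rule wins, with Allow beating Disallow on ties. (Python’s built-in urllib.robotparser ignores the * and $ wildcards, which is why a custom matcher is sketched here.)

```python
# Sketch of Google-style Allow/Disallow matching with * and $ wildcards.
import re

def rule_to_regex(pattern: str) -> re.Pattern:
    # '*' matches any run of characters; a trailing '$' anchors the end of the URL
    anchored = pattern.endswith("$")
    body = re.escape(pattern.rstrip("$")).replace(r"\*", ".*")
    return re.compile("^" + body + ("$" if anchored else ""))

def is_allowed(path: str, allows, disallows) -> bool:
    best_len, best_allowed = -1, True  # no matching rule at all means allowed
    for patterns, allowed in ((allows, True), (disallows, False)):
        for p in patterns:
            if rule_to_regex(p).match(path):
                # longest matching rule wins; Allow wins a tie
                if len(p) > best_len or (len(p) == best_len and allowed):
                    best_len, best_allowed = len(p), allowed
    return best_allowed

# Rules from the example above: block /example/ but allow images inside it
allows = ["/example/*.jpg", "/example/*.png"]
disallows = ["/example/"]

print(is_allowed("/example/photo.jpg", allows, disallows))  # True
print(is_allowed("/example/page.html", allows, disallows))  # False
print(is_allowed("/other/page.html", allows, disallows))    # True
```

The same matcher also handles end-anchored rules like Disallow: /*.gif$ from the example further below.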
To allow access to a specific folder, but exclude certain file types
Just the opposite of the above method.
User-agent: *
Allow: /wp-content/plugins/
Disallow: /wp-content/plugins/*.png
To disallow access to a specific file type
User-agent: *
Disallow: /*.gif$
There are certainly more ways that you can customize the robots.txt file, but these should get you started.
Below is the format for adding a link to your site’s XML sitemap.
User-agent: *
Disallow:

Sitemap: https://jacobstoops.com/sitemap_index.xml
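Sitemap lines are not tied to any user-agent group and may appear anywhere in the file. If you ever need to pull them back out programmatically, here’s a minimal sketch (the helper name sitemap_urls is mine):

```python
# Extract Sitemap: entries from the body of a robots.txt file.
def sitemap_urls(robots_body: str) -> list[str]:
    urls = []
    for line in robots_body.splitlines():
        key, _, value = line.partition(":")
        if key.strip().lower() == "sitemap" and value.strip():
            urls.append(value.strip())
    return urls

body = """User-agent: *
Disallow:
Sitemap: https://jacobstoops.com/sitemap_index.xml"""
print(sitemap_urls(body))  # ['https://jacobstoops.com/sitemap_index.xml']
```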
Best practices for configuring your robots.txt file for SEO
1. It must be named robots.txt, must be a TXT file, and must be located at the root directory of your site (e.g. example.com/robots.txt)
You must follow these naming and location conventions so that web crawlers can find and identify your robots.txt file. Crawlers only look for this file in the site’s root directory, so if you save it in a sub-directory (e.g. example.com/directory/robots.txt) they won’t use it.
2. Only exclude files and site directories that you wish not to be indexed
For example, if you have directories that would lead to duplicate content issues, you can use the robots.txt file to control the indexation of these.
Additionally, it’s probably wise to disallow crawling of files that contain sensitive data such as phone numbers, transaction information, etc. (although these things are better protected via HTTPS and authentication than via robots.txt).
3. Don’t disallow access to your entire site, unless you really don’t want it crawled
This is probably the biggest “no-no” on the SEO side. Unfortunately, it does happen.
4. Don’t block critical resources such as CSS and JavaScript files
Google renders pages much like a browser does, so blocking CSS, JavaScript, or image files can hurt how your pages are rendered and evaluated.
To see if you’re blocking any critical resources, you can use Google Search Console’s Blocked Resources report:
5. Always include a link to the location of your primary XML sitemap or sitemap index
This is a great way for search engines to access your site’s XML sitemap, especially if you haven’t already submitted it via Google Search Console.
6. Review proposed robots.txt rules using Google’s Robots.txt testing tool prior to publishing live to production
This will help ensure that you aren’t accidentally blocking search engine access to important pages.
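As a complement to Google’s tester, you can run a quick local check with Python’s standard library before deploying: feed the proposed file to the parser and confirm that a list of must-stay-crawlable URLs is still allowed. The rules and paths below are examples only:

```python
# Pre-deploy sanity check: verify critical paths are still crawlable
# under a proposed robots.txt.
from urllib.robotparser import RobotFileParser

proposed = """
User-agent: *
Disallow: /wp-admin/
""".splitlines()

critical_paths = ["/", "/about/", "/blog/"]

parser = RobotFileParser()
parser.parse(proposed)

blocked = [p for p in critical_paths if not parser.can_fetch("*", p)]
print(blocked)  # an empty list means no critical page is blocked
```

A check like this is easy to wire into a CI step so a bad robots.txt never reaches production unnoticed.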
7. Review (and update as necessary) the robots.txt file regularly to ensure no issues exist
A lot of things can happen in the course of deployments, code releases, etc. Reviewing your robots.txt file in combination with tools such as Google Search Console can help ensure that you’re handling your site’s indexing and robots.txt file configuration appropriately.
8. Use in conjunction with the on-page noindex tag for best control of indexing
Blocking URLs via the robots.txt file may not prevent those pages/files from showing up as URL-only listings in SERPs – especially if the pages were indexed prior to excluding them.
If a page is already indexed but blocked, Google has typically shown a message like this in SERPs: “A description for this result is not available because of this site’s robots.txt.”
The best solution for completely blocking the index of a particular page is to use a robots meta noindex tag on a per page basis along with the robots.txt file.
This is the best way to stop pages from getting in the index in the first place, or to get pages already indexed to be removed.
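For reference, the noindex tag itself is a single line in the page’s head (note that the page must remain crawlable for search engines to see the tag at all):

```html
<!-- In the <head> of the page you want kept out of the index -->
<meta name="robots" content="noindex">
```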
Recommended configuration for WordPress?
If you’re using WordPress, it’s pretty easy to edit your robots.txt file using either FTP or a plugin such as Yoast SEO for WordPress. Simply follow these steps to make changes to your robots.txt file.
In terms of the best way to configure, there are many ways to skin a cat.
Here’s how I’ve configured mine:
User-agent: *
Disallow: /wp-content/plugins/
Allow: /wp-content/plugins/*.jpg
Allow: /wp-content/plugins/*.gif
Allow: /wp-content/plugins/*.png
Allow: /wp-content/plugins/*.css
Allow: /wp-content/plugins/*.js
Disallow: /wp-admin/

Sitemap: https://jacobstoops.com/sitemap_index.xml
I’ve also included a link to my XML sitemap index file.
Additional Robots.txt Resources
- Google: Learn about robots.txt file
- Google: Create a robots.txt file
- Google: Robots.txt specifications
- Google: Robots.txt FAQ
Image credit: Drive E-News