A robots.txt file is a plain text document that contains no HTML markup. It is hosted on the web server like other files on the website and tells search engine crawlers which pages on your website they are allowed to crawl. Web crawlers are also known as "robots" or "bots," and well-behaved ones obey the instructions in the robots.txt file. If you have a page on your website that you don't want search engines to visit, you can use robots.txt to block it from being crawled.
You can view it for any website by typing the full URL of the homepage and adding /robots.txt.
For example, https://jhseoagency.com/robots.txt
The file is not linked from anywhere else on the site, so users will rarely come across it. Most web crawler bots, however, look for this file before crawling the rest of the site.
How Does Robots.txt Work?
A robots.txt file is a text file that tells web robots (also known as spiders or crawlers) which pages on your website to crawl and which to ignore.
When a robot crawls a website, it reads the robots.txt file to check for instructions on which pages it should crawl and which it should ignore. The robots.txt file is part of the robots exclusion standard, a set of rules used by websites to communicate with web robots.
Robots use the standard to avoid crawling pages that are not intended for them, and websites use it to keep unwanted robots away from their pages. Keep in mind that compliance is voluntary: reputable crawlers honor the rules, while malicious bots can simply ignore them. The robots exclusion standard is an important part of how the internet works, and the robots.txt file is an essential part of that standard.
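To make this concrete, here is a minimal robots.txt file; the /admin/ path and the sitemap URL are placeholders for illustration, not recommendations for any particular site:
# Applies to every crawler that honors the standard
User-agent: *
# Keep compliant crawlers out of a hypothetical admin area
Disallow: /admin/
# Tell crawlers where the sitemap lives
Sitemap: https://example.com/sitemap.xml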
How Do I Create a Robots.txt File?
Creating a robots.txt file is simple; you only need a text editor like Notepad or TextEdit. Just create a new text document and save it as "robots.txt"; it belongs in the root directory of your website (e.g., example.com/robots.txt).
Once you’ve created your robots.txt file, you can upload it to your website’s root directory using an FTP client or your hosting control panel.
You can check out the detailed article about best practices to create a robots.txt file from Backlinko.
Uses of a Robots.txt File
Now that we’ve gone over the basics of robots.txt, let’s discuss some common uses for this file.
As we mentioned earlier, one of the most common use cases for robots.txt is to block all crawlers from crawling a specific page on your website. This is useful if you have a page that you don't want showing up in search engines. For example, if you have a login page for your website, you may want to block all crawlers from crawling it so that people are less likely to find it through search. Keep in mind, though, that robots.txt is itself publicly readable and only discourages crawling; it is not a security mechanism, and a blocked URL can still be indexed if other sites link to it. For truly sensitive content, use authentication or a noindex directive instead.
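For example, a rule keeping all compliant crawlers away from a login page (the /login/ path here is an assumption for illustration) would look like:
User-agent: *
Disallow: /login/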
Another common use case for robots.txt is to disallow certain pages from being crawled. This is useful if you have pages on your website that are irrelevant to search engines. For example, if you have a page that contains duplicate content, you may want to disallow it from being crawled so that it doesn't get indexed and potentially hurt how search engines evaluate your site.
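As a sketch, assuming the duplicate pages live under a hypothetical /print/ directory, the rule could be:
User-agent: *
Disallow: /print/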
Finally, you can use robots.txt to slow down how often some crawlers visit your website. If you notice that crawlers are straining your server, the non-standard Crawl-delay directive asks them to wait between requests. Note that Google ignores Crawl-delay (Googlebot's crawl rate is managed through Search Console), but other crawlers, such as Bingbot, honor it. Reducing crawl frequency can help improve the performance of your website.
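A minimal example, assuming a crawler that supports the directive:
# Ask supporting crawlers to wait 10 seconds between requests
User-agent: *
Crawl-delay: 10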
What Are the Common Robots.txt Directives?
The most common robots.txt directive is "User-agent." This directive names the robot that the rules following it apply to.
For instance, if you want to block all robots from crawling a specific page, you would use the following directives:
User-agent: *
Disallow: /page-to-block.html
# Example 1: Block only Googlebot
User-agent: Googlebot
Disallow: /
# Example 2: Block Googlebot and Adsbot
User-agent: Googlebot
User-agent: AdsBot-Google
Disallow: /
# Example 3: Block all crawlers except AdsBot (AdsBot crawlers must be named explicitly)
User-agent: *
Disallow: /
These examples also use the "Disallow" directive, which specifies a directory or page, relative to the root domain, that you don't want the named user agent to crawl.
Other common directives include "Allow" and "Sitemap." Allow overrides a broader Disallow rule to permit crawling of specific pages, while Sitemap specifies the location of your website's sitemap.
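For instance, here is a sketch (the /private/ paths are placeholders) of Allow carving an exception out of a broader Disallow rule:
User-agent: *
# Block the whole hypothetical /private/ directory...
Disallow: /private/
# ...except this one page
Allow: /private/public-page.html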
Sitemaps are a good way to indicate which content Google should crawl, as opposed to which content it can or cannot crawl.
Sitemap: https://example.com/sitemap.xml
Sitemap: http://www.example.com/sitemap.xml
Do You Need Robots.txt?
Now that we’ve gone over the basics of robots.txt, let’s discuss whether or not you need this file on your website.
If you have a small website with only a few pages, you probably don't need a robots.txt file. When a bot visits a website that has no robots.txt file, robots meta tags, or X-Robots-Tag HTTP headers, it will simply crawl the site and index pages as it normally would. However, a robots.txt file gives you more control over what is being crawled. Even on a small website, you can use common rules such as the following, which disallows crawling of the entire site:
User-agent: *
Disallow: /
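Conversely, leaving the Disallow value empty (this is standard robots.txt syntax, not specific to any site) explicitly permits crawling of everything:
User-agent: *
Disallow: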
Moreover, WordPress automatically creates a virtual robots.txt file for your site. So if you run a small business site on WordPress, you'll have a robots.txt file even if you never do anything yourself.
However, if you have a large website with hundreds or thousands of pages, then robots.txt can be useful for controlling the crawl rate of crawlers and blocking certain pages from being crawled.
In general, we recommend that most websites include a robots.txt file to ensure their site is crawled and indexed correctly by search engines. However, if you’re unsure whether you need robots.txt, we recommend consulting with a professional SEO company or developer to get their opinion.
Which Method Should I Use to Block Crawlers?
It depends. In short, there are good reasons to use each of these methods:
Robots.txt: If you want to block all crawlers from crawling a specific page, use the following directives:
User-agent: *
Disallow: /page-to-block.html
Robots meta tag: To prevent most search engine web crawlers from indexing a page on your site, place the following meta tag into the <head> section of your page:
<meta name="robots" content="noindex,nofollow">
To prevent only Google web crawlers from indexing a page:
<meta name="googlebot" content="noindex">
X-Robots-Tag header field: You can also use the X-Robots-Tag HTTP response header to control how search engines index your website. For example, if you want to block all crawlers from indexing a specific page, configure your server to send the following header with that page:
X-Robots-Tag: noindex
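As a rough sketch of how that header could be set, assuming an Apache server with mod_headers enabled and a hypothetical file named private.pdf:
# In .htaccess or the server config (requires mod_headers)
<Files "private.pdf">
  Header set X-Robots-Tag "noindex"
</Files>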
Where is the Robots.txt File Located?
The robots.txt file is located in the root directory of your website (e.g., example.com/robots.txt). You can create this file using a text editor and save it in the root directory of your website. Once you’ve done this, you can start adding instructions for crawlers.