Do you want search engine crawlers to have access to certain folders on your site? As an SEO, you may need that.
If you want to block some pages of your website from being crawled, you can certainly do that as well.
You can even keep search engines out of an entire directory on your site.
Webmasters have a variety of tricks up their sleeves for instructing search engine crawlers on how to crawl the pages of their website. One of the most important of these is the robots.txt file.
In this article, I am going to discuss all the nitty-gritty of the robots.txt file.
Let’s explore each of these topics in depth:
What Is a Robots.txt File?
A robots.txt file is a plain text file that follows a strict syntax so that search engine spiders can read it. These spiders are also called robots, hence the name of the file.
Webmasters create the file to tell web robots how to crawl the pages of their website. The robots.txt file is part of the robots exclusion protocol (REP), a group of web standards that determine how robots are supposed to crawl a website.
The REP also covers how robots access and index content and serve that content to users. It includes directives such as subdirectory-, page-, or site-wide instructions for how search engines are supposed to treat links.
Basic Format of a Robots File
User-agent: [user-agent name]
Disallow: /[URL string not to be crawled]
How Does the Robots.txt File Work?
A site owner sometimes needs to give web crawlers directions. To do so, they place a robots.txt file in the root directory of their site, for example https://yoursitename.com/robots.txt.
Bots that follow the REP fetch and read this file before requesting any other file from the site. If the site doesn’t have a robots.txt file, the crawler assumes that the webmaster didn’t give any specific directions and goes on to crawl the entire site.
A robots.txt file is made up of two basic parts: the user-agent line and the disallow directives.
User-Agent in Detail
User-agent is the name of the spider being addressed. The user-agent line always has to come before the directive lines, and you have to follow this order for each set of directives. A very basic robots.txt file looks like this:
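As a minimal sketch, assuming the wildcard value * for the user-agent (which addresses every crawler), such a file could contain:

```text
# Applies to every crawler that follows the REP
User-agent: *
# A single slash matches every URL on the site
Disallow: /
```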
These directives instruct the user-agent to stay away from the entire server. As a result, it won’t crawl any page on the website.
If you want to instruct multiple robots, create a set of user-agent and disallow directives for each one.
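A sketch of such a file, assuming Googlebot and Bingbot as the user-agent names of Google’s and Bing’s crawlers, with one set of directives per robot:

```text
# Set of directives for Google's crawler
User-agent: Googlebot
Disallow: /

# Set of directives for Bing's crawler
User-agent: Bingbot
Disallow: /
```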
The above directives tell both Google’s and Bing’s user-agents to avoid crawling the entire site.
If you want to allow crawling of your entire site, the directives should look like this:
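A minimal sketch, again assuming the wildcard user-agent, relies on an empty disallow value:

```text
# Applies to every crawler
User-agent: *
# An empty value disallows nothing, so the whole site may be crawled
Disallow:
```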
List of most common search engine user-agents
Disallow in Detail
Disallow is the second part of the robots.txt file. The directive forbids spiders from crawling certain webpages. You can include multiple disallow lines in each set of directives, but each set should be introduced by its own user-agent line.
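For illustration, a set with one user-agent line and several disallow lines might look like the sketch below. The paths /cgi-bin/ and /tmp/ are hypothetical placeholders, not paths taken from any particular site.

```text
# One user-agent line introduces the set
User-agent: *
# Each disallow line names one path prefix to keep crawlers out of
Disallow: /cgi-bin/
Disallow: /tmp/
```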
Bots treat an empty disallow value as meaning that you aren’t disallowing anything. As a result, they will crawl the entire site.
To block crawlers from a specific page, use the page’s relative URL in the disallow line:
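A sketch of a single-page rule, where /private-page.html is a hypothetical page name used only for illustration:

```text
User-agent: *
# Path of the page, relative to the site root
Disallow: /private-page.html
```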
You can block access to whole directories in the same way:
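A sketch of a directory rule, where /private-directory/ is a hypothetical folder name:

```text
User-agent: *
# The trailing slash limits the rule to this folder and everything inside it
Disallow: /private-directory/
```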
Furthermore, a robots.txt file can block bots from crawling certain file types. This is done by using a wildcard and a file type in the disallow line:
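A sketch of a file-type rule, using PDF files as an assumed example. The * wildcard and the $ end-of-URL marker are understood by major crawlers such as Google’s and Bing’s, but they were not part of the original standard, so other bots may ignore them.

```text
User-agent: *
# * matches any run of characters; $ anchors the match to the end of the URL
Disallow: /*.pdf$
```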
Pros and Cons of Using Robots.txt
- Search engines allot each website a limited number of pages that their spiders will crawl. SEO experts call this a crawl budget. You can make the best use of this budget by blocking unimportant sections of your site so that it is spent on the sections that matter.
- Although you can block search engine crawlers from accessing certain web pages, you cannot stop the URLs of those pages from showing up in the SERPs.
The robots.txt file is one of the basic ways to tell a search engine where it can and can’t go on your website. In this article, I discussed everything you need to know about this useful file.