How to create the perfect robots.txt file

The robots.txt file implements the robots exclusion protocol (REP), a standard that tells web robots which pages on your site to crawl. The broader REP also includes directives like meta robots, as well as page-, subdirectory-, or site-wide instructions for how search engines should treat links.

Learn robots.txt syntax

The robots.txt file is built around two core directives, User-agent and Disallow. User-agents are search engine robots, also known as web crawler software. Most user-agents are listed in the Web Robots Database.

Disallow is a command that tells a user-agent not to access a particular URL. Google uses numerous user-agents, such as Googlebot for Google Search and Googlebot-Image for Google Image Search.
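For instance, here is a minimal sketch that blocks only Google's image crawler from an images folder (the /images/ path is an illustrative placeholder):

User-agent: Googlebot-Image
Disallow: /images/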

The syntax for these directives is as follows:

User-agent: the specific web crawler to which you are giving instructions. [The name of the robot the following rule applies to]

Disallow: the command that tells the user-agent not to crawl a particular URL. [The URL path you want to block]

Allow: the command that gives a user-agent permission to access a page or subfolder even when its parent page or subfolder is disallowed. [The URL path of a subdirectory, within a blocked parent directory, that you want to unblock]

Crawl-delay: specifies how many seconds a crawler should wait before loading and crawling page content.

Sitemap: calls out the location of any XML sitemap(s) associated with this URL. This command is only supported by Google, Ask, Bing, and Yahoo. A combined example using these directives appears below.
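Here is a minimal sketch that puts these directives together; the domain, paths, and delay value are illustrative placeholders, and remember that not every crawler honors Crawl-delay:

User-agent: *
Disallow: /private/
Allow: /private/public-page.html
Crawl-delay: 10

Sitemap: https://www.example.com/sitemap.xml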

How to create a robots.txt file?

The robots.txt file tells web robots and search engines which pages to crawl and which pages not to crawl.

Before visiting the target page on any site, a search engine checks that site's robots.txt file for instructions.

Let's have a look at an example:

Basic format:

User-agent: [user-agent name]
Disallow: [URL string not to be crawled]

Together, these two lines are considered a complete robots.txt file, though one robots.txt file can contain multiple groups of user-agents and directives (i.e., disallows, allows, crawl-delays, etc.).
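For illustration, here is a sketch of one file with two groups of rules; the crawler names are real, but the paths and delay value are placeholders:

User-agent: Googlebot
Disallow: /example-subfolder/

User-agent: Bingbot
Disallow: /example-folder/
Crawl-delay: 5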

Here are a few examples of robots.txt in action for a www.example.com site:

Robots.txt file URL: www.example.com/robots.txt
  • To exclude all robots from the entire server
User-agent: * 
Disallow: /

Using this syntax in a robots.txt file would tell all web crawlers not to crawl any pages on www.example.com, including the homepage.

Note:

  •   The * (asterisk) after User-agent means the rule applies to all web robots that visit the site.
  •   The / (slash) after Disallow tells the robot not to visit any pages on the site.

You can address all web crawlers (Googlebot, msnbot, Slurp, etc.) at once by listing an asterisk (*) as the user-agent, as in the example below:

User-agent: *
  • To allow all robots complete access
User-agent: * 
Disallow:

(or just create an empty “/robots.txt” file, or don’t use one at all)

Using this syntax in a robots.txt file tells web crawlers to crawl all pages on www.example.com, including the homepage.

  • Blocking a specific web crawler from a specific folder
User-agent: Googlebot 
Disallow: /example-subfolder/

This syntax tells only Google’s crawler (user-agent name Googlebot) not to crawl any pages that contain the URL string www.example.com/example-subfolder/.

  • To exclude all robots from part of the server
User-agent: *
Disallow: /example-subfolder/
Disallow: /example-folder/

Using this syntax in a robots.txt file would tell all web crawlers not to crawl any pages under www.example.com/example-subfolder/ or www.example.com/example-folder/.

  • Blocking a specific web crawler from a specific web page
User-agent: Bingbot
Disallow: /example-subfolder/blocked-page.html

This syntax tells only Bing's crawler (user-agent name Bingbot) to avoid crawling the specific page at www.example.com/example-subfolder/blocked-page.html.

Save the robots.txt file according to the following conventions so that Googlebot and other web crawlers can find and identify it:

  • Save the robots.txt code as a plain text file
  • Place the file in the highest-level (root) directory of your site (see the placement example below)
  • The file must be named robots.txt
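For illustration, assuming the site www.example.com, only the root location works:

Found:     https://www.example.com/robots.txt
Not found: https://www.example.com/pages/robots.txt (crawlers do not look here)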

After creating the robots.txt file, make sure everything is valid and operating the right way. Google provides a free robots.txt tester in its webmaster tools (Google Search Console); sign up there and check that everything is in order.
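If you would rather sanity-check your rules locally first, here is a minimal sketch using urllib.robotparser from Python's standard library; the domain and paths are placeholders:

from urllib.robotparser import RobotFileParser

# Point the parser at the live robots.txt file (placeholder domain).
rp = RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")
rp.read()  # fetch and parse the file

# Ask whether a given user-agent may crawl a given URL.
print(rp.can_fetch("Googlebot", "https://www.example.com/example-subfolder/page.html"))
print(rp.can_fetch("*", "https://www.example.com/"))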


Do you have a robots.txt file?

Not sure how to check for a live robots.txt file? Go to your root domain, then add /robots.txt to the end of the URL (for example, www.example.com/robots.txt).


If no .txt page appears, you do not currently have a live robots.txt file.

How robots.txt works

A search engine has two main jobs:

  • Crawling the web to discover content
  • Indexing that content so that users can find it easily

Other quick robots.txt must-knows:

  • The robots.txt file should be placed in a website's top-level directory
  • The filename is case sensitive: the file must be named robots.txt, not Robots.txt or robots.TXT
  • The robots.txt file is publicly available: anyone can view it by adding /robots.txt to a root domain
  • Each subdomain on a root domain uses its own separate robots.txt file (see the example below)
  • The file can indicate the location of any sitemaps
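For illustration (placeholder domains), each host needs its own file:

https://example.com/robots.txt (rules for example.com only)
https://blog.example.com/robots.txt (rules for blog.example.com only)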

Why is the robots.txt file important?

  • It prevents duplicate content from appearing in SERPs.
  • It keeps entire sections of a website private.
  • It prevents search engines from indexing certain files on the website.
  • It specifies a crawl delay to prevent servers from being overloaded when crawlers load multiple pieces of content at once.

SEO Practices

A major goal of SEO is to get search engines to crawl a site, which helps increase the ranking of its pages.

A search engine crawler scans the site and indexes the content so that users can find it easily. Do not block any content or section of the website that you want crawled.

Do not use robots.txt to protect sensitive data: the file is publicly readable, so listing a path there only advertises it. Use proper access controls, such as password protection, instead.

Robots.txt vs. meta robots vs. x-robots

Robots.txt is an actual text file, whereas meta robots and x-robots are meta directives. Robots.txt dictates crawl behavior for a whole site or directory, whereas meta robots and the x-robots-tag dictate indexation behavior at the individual page level.
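For comparison, here is a sketch of the two page-level forms; noindex and nofollow are common directive values, and how you set the HTTP header depends on your server:

In the page's HTML <head>:
<meta name="robots" content="noindex, nofollow">

As an HTTP response header (useful for non-HTML files such as PDFs):
X-Robots-Tag: noindex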

Conclusion

It is important to update the robots.txt file whenever you add pages, files, or directories to the site. This keeps your crawl rules accurate and gives the best possible SEO results.

What’s your experience creating robots.txt files?  Let us know in the comments below.

Was this guide helpful? Don't forget to share this post!
