What is robots.txt?

The robots.txt file for WordPress is a text file that contains instructions for web crawlers (especially search engine robots) about how to crawl the pages of a particular site.

  • You can see this site's robots.txt here.

 

robots.txt is part of the Robots Exclusion Protocol, a standard used by websites to communicate with web robots. The file tells web robots which pages to crawl and which pages to skip. Web robots consult this file while crawling pages and then indexing them in search engines.

 

When a webmaster (site owner) wishes to give instructions to web robots, they place the robots.txt file in the root directory of the website. Web robots can then reach it easily at ‘www.example.com/robots.txt’.

  • In the absence of this file, web robots assume that the site owner doesn’t want to provide any instructions and crawl the entire site. The robots.txt file is mainly used when a webmaster has a large number of pages on their site and doesn’t want search engines to crawl some outdated pages.
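As a rough sketch of how a well-behaved crawler consults this file before fetching a page (using Python's standard urllib.robotparser and the hypothetical domain “www.example.com”):

from urllib.robotparser import RobotFileParser

# Download and parse the site's robots.txt (hypothetical domain)
robots = RobotFileParser()
robots.set_url("https://www.example.com/robots.txt")
robots.read()

# Ask whether any robot ("*") may fetch a particular page
url = "https://www.example.com/some-old-page/"
if robots.can_fetch("*", url):
    print("Allowed to crawl:", url)
else:
    print("Disallowed by robots.txt:", url)

If the file is missing (a 404), urllib.robotparser treats everything as allowed, which matches the “crawl the entire site” behaviour described above.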

 

What happens if web robots crawl outdated pages?

After the release of Google Panda, Google focuses on indexing sites that offer quality rather than quantity. If a page contains outdated content, the search engine will not index that page, and this drags down the whole site’s indexing signals, resulting in less traffic.

So, webmasters can restrict web robots from crawling particular pages that they don’t want indexed. They can still keep those pages on the site; the quality of the restricted pages will no longer hamper the site’s indexing.

 

What is Google Panda?


Google Panda is a search ranking algorithm used by Google to rank high-quality pages first. If a site contains lots of information but the quality is low, that site will rank lower in the search results.

The name Panda comes from Google engineer Navneet Panda (of Indian origin), who developed the technology that helped Google create this algorithm.

This algorithm targets the whole site rather than a particular page. So, to get better indexing, a site should have rich content (articles that provide quality).

 

Robots.txt file standard

Sites spanning multiple subdomains or protocols must have a separate robots.txt for each. For example,

  1. “www.example.com”
  2. “http://www.example.com”
  3. “https://www.example.com”

All of these must have their own robots.txt. If “www.example.com” has a robots.txt file but “https://www.example.com” does not, then the rules will apply only to “www.example.com” and not to “https://www.example.com”.

robots.txt example

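For instance, a typical WordPress robots.txt looks something like this (illustrative values; the sitemap URL is a placeholder):

User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php

Sitemap: https://www.example.com/sitemap.xml

Here all robots are kept out of the admin area, except for the admin-ajax.php endpoint that WordPress themes and plugins call from the front end.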

What is User-agent?

User-agent is the name of the browser (or other client) that wants to connect to the server. It can be understood as a way of telling the web server, “Hi, I am Mozilla Firefox” or “Hi, I am Google Chrome.” When you send a page request, the web server responds with a page that is viewable in and compatible with your browser, based on the information it got from the User-Agent.

Similarly, web robots have their own User-Agents and send them when they visit a site. Based on these user agent names, we can change how the site is crawled for particular web robots. If we want to tell all the robots that they may crawl all the files, we write:

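User-agent: *
Disallow:

Here the empty Disallow value means nothing is blocked, so every robot may crawl every file on the site.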

 

What is crawl delay?

With the Crawl-delay parameter, a site owner can specify the number of seconds a robot should wait before crawling again.

This is mostly relevant for large websites like Twitter and Facebook, where data keeps updating every second. If robots crawled the site every moment, the server might get overloaded with too many requests at the same time. By specifying Crawl-delay, we can avoid such situations.

NOTE: Google ignores a Crawl-delay specified in the robots.txt file. If we want to control the crawl rate for Googlebot, the setting is in the Google Webmaster dashboard, under the Site Settings section.
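As a sketch of the crawler side (again using Python's urllib.robotparser and the hypothetical “www.example.com”; the parser exposes the directive through crawl_delay()):

import time
from urllib.robotparser import RobotFileParser

robots = RobotFileParser()
robots.set_url("https://www.example.com/robots.txt")
robots.read()

# Use the site's Crawl-delay if one is declared, otherwise fall back to 1 second
delay = robots.crawl_delay("*") or 1

for url in ("https://www.example.com/page-1/", "https://www.example.com/page-2/"):
    if robots.can_fetch("*", url):
        print("Fetching", url)  # a real crawler would download the page here
        time.sleep(delay)       # wait before the next request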

 

Some more variations of robots.txt

Blocking a specific web robot from a specific folder

User-agent: Googlebot
Disallow: /directory/
Crawl-delay: 20

This syntax restricts only Google’s robot (Googlebot) from crawling the “/directory/” folder, and asks it to wait 20 seconds between requests.

 

Blocking all web robots from all files and folders

User-agent: *
Disallow: /
Crawl-Delay: 10

This syntax is commonly referred to as “robots.txt disallow all,” which means no robot is allowed to visit any page on this site. Robots also have to wait 10 seconds before making another request.
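A quick offline check of this rule (no network needed; the lines are fed straight to Python's urllib.robotparser) confirms that every path is blocked:

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /",
])

print(rp.can_fetch("*", "/"))               # False
print(rp.can_fetch("*", "/any/page.html"))  # False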

 

Blocking a web robot from crawling a specific page

User-agent: Bingbot
Disallow: /example-subfolder/blocked-page.html

This syntax restricts only the Bingbot robot from crawling the “/example-subfolder/blocked-page.html” page.

 

Blocking two web robots from crawling a specific folder

User-agent: BadBot
User-agent: Googlebot
Disallow: /private/

This syntax restricts both BadBot and Googlebot from crawling the “/private/” folder.

 

Summary for robots.txt file
  • The file must be placed in a site’s top-level directory (root directory).
  • The file name is case sensitive: only “robots.txt” is allowed (“Robots.txt”, “robots.Txt”, and other variations are not).
  • This file is publicly available. Just type www.example.com/robots.txt to see your site’s file (or fetch it with the short script below).
  • Each subdomain on a root domain uses a separate robots.txt file.
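Because the file is public, you can also fetch and print any site's robots.txt from a short script (hypothetical domain shown; substitute your own):

from urllib.request import urlopen

with urlopen("https://www.example.com/robots.txt") as response:
    print(response.read().decode("utf-8"))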

 

NOTE: It’s generally a good practice to point to your sitemap from the robots.txt file, like this:

User-agent: BadBot
User-agent: Googlebot
Disallow: /private/
Sitemap: https://www.example.com/sitemap.xml

Please share if the article was helpful.

Apoorv Sukumar

A blogger and a software developer, exploring trending technologies in the market. A philomath and web explorer who learns many things and wants to deliver them to the world. Founder of wantextra.com.
