Robots.txt (also referred to as the robots exclusion protocol) is a plain text file that websites use to communicate with search engine robots (commonly known as crawlers or spiders). Before we discuss the contents of robots.txt, it is important to understand what these crawlers are, what exactly they do, and how robots.txt relates to them.
The first thing you need to know about crawlers is that they, like most entities of the digital age, are simply programs designed by programmers with a particular job in mind. In the case of crawlers and spiders, that job is to help search engines index web pages into their databases.
Given the sheer enormity of the internet, it simply isn't possible for a team of people to take on this daunting task by themselves. Hence, to make things easier for search engines, spiders and crawlers were created that are capable of visiting millions of websites every 24 hours. They record their findings in the search engine's database so that the pages they visit can be added to its search result pages.
Now that we have covered the definition and purpose of search engine spiders, it's time to look at their relationship with the robots exclusion protocol. We know that search engines send out their crawlers to web pages across the internet to index them into the search engine's database. When these crawlers visit a domain, the first thing they look for is the robots exclusion protocol (robots.txt). This file (just a simple text file on your web server) tells the crawler which parts of the site it may visit and which parts are forbidden.
An important point to note here is that the word 'forbidden' doesn't carry the weight you'd expect. Compliance with robots.txt is entirely voluntary: well-behaved crawlers, such as those of Google and Bing, do honor it, but nothing technically stops a bot from ignoring the 'forbidden' label and going about its business as usual.
This can be dangerous if you have broken links or unoriginal content on your pages, because whatever a crawler finds on your domain, good or bad, is reported back to its parent search engine and recorded in its database. This can inhibit your website's progress and make it difficult to bring in more visitors. It is therefore important to make sure your site has proper search engine optimization (SEO) before it goes online, because crawlers can record your flaws long before you even start.
See also: On-Page SEO Techniques You Must Know
You may be thinking that the robots exclusion protocol isn't worth the trouble if it can't actually enforce anything. That is simply not the case: the major search engine bots do respect robots.txt, so it remains an effective way to steer legitimate crawlers away from the sections of your server you don't want indexed. What it cannot do is stop bots that choose not to cooperate.
Thus, you should not use robots.txt to hide sensitive sections of your server, as the file can simply be ignored by malware robots that report back to spammers. Furthermore, since robots.txt is publicly available on your server, listing confidential locations in it is the digital equivalent of telling a burglar where the jewel safe is in your house.
How to Write a Robots Exclusion Protocol
Now that we have understood what crawlers are, how they help programmers and search engines, and how they should and should not be used (with respect to robots.txt), it is time to turn our attention to the main question: how do you write an effective robots exclusion protocol? Most people think that writing a robots.txt file is serious business that takes hours of effort in a very complex application. That simply isn't the case: a robots exclusion protocol can be written in even the most basic of text editors (e.g. Notepad) and doesn't take very long at all. All you need to know when writing a robots.txt file is exactly which parts of your site you want search engine robots to visit and index, and which robots should actually have that privilege.
It is worth noting that if your site doesn't have a robots.txt file, all sections of your server are considered fair game and will be visited and indexed by search engine robots, no matter what state they are in or how much confidential data they hold. So it is probably a good idea for you to pay attention as I teach you how to write a basic robots exclusion protocol.
Before we begin, it is important to understand that there are two main fields in any robots exclusion protocol: User-Agent and Disallow. The User-Agent field specifies which crawlers a group of rules is addressed to; most robots.txt files use the wildcard '*' (asterisk) in this field to address all bots that may find their way onto your server. The Disallow field, as its name suggests, lists the paths on your server that those crawlers are not welcome to visit. If the Disallow field is left empty, every single part of your server is open for these bots to crawl and index.
Now that we know the contents of a robots exclusion protocol, it is time to see some examples and analyze how you can optimize your server's potential with an efficiently written robots.txt file. The following are some examples of commonly used robots.txt files.
1. To Make Everything Accessible to All Crawlers
This script is useful if you want all crawlers to access and index every single part of your server. If you have something to hide, it is probably a good idea to skip this one. As discussed above, to address all bots we use the wildcard (*), so your robots exclusion protocol looks something like this.
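A minimal version of this rule set; the empty Disallow field means nothing is off limits:

```
User-agent: *
Disallow:
```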
2. To Ban Every Crawler from Every Section of Your Server
The exact opposite of what we learned to do above: if you want all bots to be forbidden from visiting and indexing your server or domain (for whatever reason), write the following robots.txt script.
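A minimal version of this rule set; the single `/` in the Disallow field covers the entire site:

```
User-agent: *
Disallow: /
```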
3. To Allow Specific Robots Access to Everything on Your Server While Forbidding Every Other Crawler
This can be one of the most important tools in your arsenal as far as search engine optimization is concerned. If you only want a specific crawler (or crawlers) to have access to everything on your domain while simultaneously stopping all other robots from doing so, this is what the script will look like.
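Using Googlebot as the privileged crawler (as in the discussion that follows), such a file combines an allow-everything group with a ban-everything group:

```
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
```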
Notice that for the first time we haven't used the wildcard '*' in the User-Agent field of the first rule group. This field names the specific bot being addressed; for the purpose of this example, I am addressing Google's crawler, the Googlebot, and telling it that all sections of my server are open to it. The second half of the script is the same as the example above (where I forbade all crawlers from all parts of my domain). Thus, it is possible to combine different rule groups to achieve your ultimate goal.
4. To Forbid One Particular Crawler from All Parts on Your Server
If you want to stop a particular crawler from spidering or crawling on your server, all you need to do is copy the following script into your robots.txt file. Note that for the purpose of this example I have used a hypothetical crawler called XYZbot; if you want to ban some other crawler, say Googlebot, from crawling on your server, all you need to do is replace XYZbot with Googlebot, and you're good to go.
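With the hypothetical XYZbot, the file would read:

```
User-agent: XYZbot
Disallow: /
```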
This script is quite helpful in fending off misbehaving crawlers that at least identify themselves honestly, such as aggressive scrapers. Keep in mind, though, that as discussed earlier, a truly malicious bot is free to ignore robots.txt altogether, so this is a polite request rather than an enforceable ban.
5. To Block Crawlers from Indexing or Visiting Particular Sites and Directories on Your Server
If you feel there is something on your server that is not exactly crawler-friendly, you can use this script to fend off those pesky bots. However, always keep in mind that you should NEVER hide confidential or private material behind a simple robots.txt file.
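For instance, to block all crawlers from two hypothetical directories, /temp/ and /junk/, one Disallow line per path:

```
User-agent: *
Disallow: /temp/
Disallow: /junk/
```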
As you can see from the example above, I have used the wildcard '*' in the User-Agent field, which means I'm banning all crawlers from visiting the temporary and junk directories on my server. If I wanted to block a particular bot from crawling a particular directory, I would just replace the asterisk with the name of that crawler.
6. To Stop All Crawlers from Indexing or Viewing All Files of a Particular Type
If you want to stop all crawlers from visiting your web page and indexing all files of a particular type, then all you need to do when writing a robots exclusion protocol is to enter the asterisk wildcard in the User-Agent field and, in the Disallow field, an asterisk, then a dot, and then the extension of the file. These three components should have no spaces between them, as demonstrated in the example below.
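One caveat worth knowing: path wildcards like this are not part of the original robots exclusion standard, but the major crawlers such as Googlebot and Bingbot do honor them. To block all MP3 files for all crawlers:

```
User-agent: *
Disallow: /*.mp3
```

Crawlers that support wildcards also generally accept a trailing `$` (e.g. `Disallow: /*.mp3$`) to anchor the match to the end of the URL.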
As you can see, I have forbidden all crawlers from indexing the MP3 files on my server. If I wanted to stop all bots from indexing .jpeg or .png files instead, I would replace the 'mp3' extension in the robots.txt file with the required extension. Similarly, as seen in the previous examples, I can stop a particular bot from indexing a particular file type by replacing the asterisk wildcard in the User-Agent field with the name of the crawler that I don't fancy. It is also possible to stop crawlers from indexing one particular file (e.g. User-agent: *, Disallow: /FolsomPrisonBlues.mp3). A combination of the two is also easily possible.
Robots Exclusion Protocol and Search Engine Optimization
It is important to understand the ramifications of the robots.txt file for your domain's search engine optimization. There are important points to consider when speaking of the impact of the robots exclusion protocol on your on-page SEO. Though it is just a plain text file, robots.txt has the potential to cause a significant amount of damage to any given domain if used carelessly.
For example, through careless use of the Disallow field you could do something as damaging as blocking all crawlers from the mobile version of your site (with a rule like User-agent: *, Disallow: /mobile/), effectively keeping your pages out of mobile search results. This can be devastating to your search engine optimization, as you stand to lose a significant portion of your web traffic if you're not careful. Thus, when writing a robots exclusion protocol, it is important to keep both eyes open and make sure that every rule written in the file is there for a justifiable reason. It may seem unlikely at first, but the robots.txt file can make or break your online platform.
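The careless rule described above, laid out as an actual robots.txt file (assuming the mobile version of the site lives under a hypothetical /mobile/ directory):

```
User-agent: *
Disallow: /mobile/
```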