ROBOTS.TXT DISALLOW: 20 Years of Mistakes to Avoid
If you’re reading this article, you’re probably already familiar with robots.txt. If you need a refresher, though, the information provided here will be useful and a good reminder of the mistakes to avoid.
Formally known as the “Robots Exclusion Standard,” robots.txt is the way a website communicates with web crawlers and other web robots. The text file contains brief instructions that steer crawlers toward or away from specific sections of the website. Well-behaved robots look for a robots.txt file as soon as they reach a website and comply with its directives. That said, some robots don’t adhere to this standard, including malware robots, spambots and email harvesters that don’t have good intentions when they land on your website.
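For reference, here is a minimal sketch of what such a file looks like. The paths and the bot name are illustrative examples, not recommendations:

```
# Apply to all crawlers: keep them out of two directories
User-agent: *
Disallow: /tmp/
Disallow: /scripts/

# Block one specific (hypothetical) crawler from the whole site
User-agent: ExampleBot
Disallow: /
```

Each `User-agent` line names which robots the rules below it apply to, and each `Disallow` line lists a path prefix those robots should not crawl.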
A Brief History
In 1994, a misbehaving web crawler caused an accidental denial of service on Martijn Koster’s server. In response, Koster proposed the “Robots Exclusion Standard” to guide web crawlers and effectively block them from certain areas. Over the years, robots.txt files have evolved and can now carry more information and serve more uses.
The most important thing to keep in mind when it comes to robots.txt is that it can be responsible for making or breaking a website’s connection with search engines.
Not Disallowing URLs in Advance
Google used to check robots.txt files only once a week, upping that to once a day in 2000. At present, Google typically (but not always) rechecks robots.txt files every 24 hours. Regardless, it’s entirely possible for content disallowed by robots.txt to be crawled in the gap between robots.txt checks. What this means is that if you hope to keep URLs from being crawled by using a robots.txt disallow, the rules need to be added no less than 24 hours in advance.
Disallowing Confidential Information
The only way to keep search engines from accessing confidential information online, and from displaying it to users in search results pages, is to put that content safely behind a login. Bottom line: don’t use robots.txt to block access to sensitive sections of your website; password-protect them instead. There are several reasons for this. For instance, humans and rogue bots that don’t respect the robots protocol will still be able to access disallowed areas if they are not password protected. Moreover, robots.txt is a publicly accessible file, so everyone can see exactly what you are trying to hide when it appears in a disallow rule. If something needs to remain completely private, don’t put it online at all.
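To see why, consider this hypothetical robots.txt. Anyone who requests yoursite.com/robots.txt can read it, so the disallow rules double as a public map to the very content you hoped to hide:

```
User-agent: *
Disallow: /admin/
Disallow: /customer-invoices/
Disallow: /staging/
```

If those directories aren’t behind a password, a curious human or a rogue bot can simply browse straight to them.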
Using robots.txt has always been debated among webmasters because it can be a strong tool when well written; at the same time, you can end up shooting yourself in the foot with it. While the advantages of a well-written robots.txt file are impressive, including improved crawl efficiency and less useless content for crawlers to wade through, one little mistake can cause a lot of harm.