Many people think SEO is all about increasing a website’s exposure to search engines: making the site easier to crawl, building back links so it ranks higher, and so on.
However, there are times when you shouldn’t expose your site to search engines, and many people tend to forget about this.
- if you have certain Search Engine Marketing pages designed specifically for paid search and not organic search, you don’t want those pages to be crawled by search engines
- if you have a test site on a staging/production server, you don’t want search engines to crawl that content and make it public to users
- if your site has an onsite search function that generates a lot of search result pages with dynamic URL parameters, you should block those pages because they are likely duplicate content
- if you want to prevent the Admin section pages of a site from showing up in search results
and the list can go on and on depending on your situation.
So how do you keep certain sections of the site or certain pages out of search engines? You should apply Noindex in the robots meta tag.
How to apply Noindex meta tag?
It is pretty simple and very similar to other meta tags, as you can see below. You can combine it with Nofollow so that search engines will not follow and crawl the links on the page.
<html>
<head>
<title>...</title>
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
</head>
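If you want to double-check that a page really carries the tag, a small script can read it back for you. The following is only a minimal sketch in Python using the standard library; the names RobotsMetaParser and robots_directives are made up for this illustration, and the sample page is a stand-in for your own HTML.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives found in <meta name="robots"> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, so <META NAME=...> works too.
        if tag != "meta":
            return
        attr_map = {name: (value or "") for name, value in attrs}
        if attr_map.get("name", "").strip().lower() != "robots":
            return
        # The content attribute is a comma-separated list such as "NOINDEX, NOFOLLOW".
        for directive in attr_map.get("content", "").split(","):
            directive = directive.strip().lower()
            if directive:
                self.directives.add(directive)


def robots_directives(html):
    """Return the set of robots meta directives found in an HTML string."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


# Sample page for illustration only.
page = (
    "<html><head><title>Test</title>"
    '<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">'
    "</head></html>"
)
print(sorted(robots_directives(page)))  # ['nofollow', 'noindex']
```

An empty result means the page has no robots meta tag at all, which is worth knowing before you assume a page is hidden from search results.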
How about adding “Disallow” in Robots.txt? Wouldn’t that be good enough?
This is one of the common mistakes that people make. Adding “Disallow” as shown below in the Robots.txt will only stop search engines from crawling the pages (or directories) you specify when they visit your site. If someone has already created a back link to your web pages on their website, search engines can still discover those URLs through the link and index them.
User-agent: *
Disallow: /
Additionally, even if Google respects the Robots.txt rules and doesn’t crawl your site, Google can still get information from other sites, such as the Open Directory Project, and add your site to its search results. (Watch the video below for more information)
Apparently some websites such as BMW, the NY Times, and eBay used Robots.txt to block their sites at one point. I am guessing that Google did not like the fact that these pages were not showing up in its search results, so it came up with this alternative method.
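To see what a Disallow rule does and does not cover, you can ask Python’s built-in urllib.robotparser how a well-behaved crawler would read it. This is just a rough sketch: the rules, the example.com URLs, the “ExampleBot” user agent and the is_crawl_allowed helper are all invented for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules that block an admin section and on-site search results.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /search
"""


def is_crawl_allowed(robots_txt, url, user_agent="ExampleBot"):
    """Return True if the given rules let this user agent fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_crawl_allowed(ROBOTS_TXT, "https://example.com/admin/login"))  # False
print(is_crawl_allowed(ROBOTS_TXT, "https://example.com/about"))        # True
```

Notice that the answer is only about fetching. Nothing in Robots.txt says whether a URL may appear in search results, which is why a blocked page can still be listed when other sites link to it.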
Are there other ways to prevent search engines from indexing my web pages?
I have found a really good video by Matt Cutts, who explains multiple methods and their pros and cons.
According to him, using an .htaccess file to password-protect pages (or directories) is the best way to prevent search engines from crawling them.
Conclusion for Robots.txt vs. Noindex Robots Meta Tag
If you want to be sure your web pages are not indexed by search engines, adding Noindex / Nofollow in the robots meta tag is the way to go, combined with Robots.txt where it makes sense. Keep in mind that a crawler has to fetch a page to read its meta tag, so a Noindex tag on a page that Robots.txt blocks may never be seen. You can also use the “Remove URLs” tool in Google Webmaster Tools. Bing Webmaster Tools also provides a URL removal tool if you want to be thorough.