Tuesday, June 21, 2011

Robots.txt file and robots meta tags


In order to get a good ranking on different search engines, you need to be aware of the importance of the robots.txt file. A lot of sites do not have a robots.txt file. This file regulates which parts of your site are closed to spiders and other search bots. It is useful for keeping out unwanted crawlers such as email harvesters and image scrapers, and for steering bots away from confidential files or directories.

Robots.txt format and structure

The robots.txt file is a plain-text file that should be located in the website's root directory on the Web server. It restricts Web spiders (robots) from visiting undesirable places on the website.

It tells them which parts of the site they have permission to crawl and which they do not. In other words, robots.txt defines rules for search engine robots: where they have access and where they do not. Robots are not obligated to respect robots.txt, but most of them comply with the rules it defines.

The robots.txt file must be written in a specific format. It consists of a number of records, and each record is made up of two kinds of fields: a "User-agent" field and one or more "Disallow" fields, each written in the form "<field>: <value>". An empty robots.txt file is treated the same as if it were not present at all.

The robots.txt file should be created in a Unix-style editor, or at least saved with Unix line endings. (Windows editors normally end a line in a text file with a "\r\n" pair, while Unix uses "\n" only, and classic Mac editors used "\r" only.) Most Windows text editors have a Unix mode, and transferring the file in ASCII mode with an FTP client will also translate line endings to the Unix convention for you.
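Putting those pieces together, a minimal robots.txt might look like the following sketch (the directory names are only illustrative):

```
User-agent: googlebot
Disallow: /private/

User-agent: *
Disallow: /images/
```

Each blank-line-separated record applies to the named robot; the "*" record covers every robot not matched by a more specific User-agent line.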

The User-agent field names the robot. Most popular search engines have specific names for their robots. You can find the user-agent names in your server logs by searching for requests for robots.txt.
For example:
User-agent: googlebot
If you want to address all robots, use the "*" wildcard:
User-agent: *

The "Disallow" part of a record consists of directive lines; the robots.txt format requires at least one Disallow line for each User-agent field. These lines describe the directories and files that the robot should not visit. For example, the following "Disallow" line keeps robots out of the images directory:

Disallow: /images/

And the following line instructs the robot to skip the terms.html file in the root directory:

Disallow: /terms.html

Similarly, "Disallow: /cgi-bin/" would block spiders from your cgi-bin directory. If the Disallow line is left empty, it means that all files on your site may be crawled.
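If you want to check how such rules are interpreted, Python's standard urllib.robotparser module can parse robots.txt lines and answer per-URL queries. The rules and URLs below are just illustrations of the examples above (example.com is a placeholder host):

```python
from urllib import robotparser

# Example rules, matching the Disallow lines discussed above
rules = [
    "User-agent: *",
    "Disallow: /images/",
    "Disallow: /terms.html",
]

parser = robotparser.RobotFileParser()
parser.parse(rules)  # parse() accepts the file's contents as a list of lines

# Paths under /images/ and the terms.html file are blocked; the rest is allowed
print(parser.can_fetch("googlebot", "https://example.com/images/logo.png"))  # False
print(parser.can_fetch("googlebot", "https://example.com/index.html"))       # True
```

Because the record uses the "*" wildcard, the same answers come back for any user-agent name you pass to can_fetch().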

The "robots" meta tags

An alternative to robots.txt is to use robots meta tags to restrict robots' access to individual pages on your website.

Here's a list of the values you can specify within the "content" attribute of these tags:

Value: Description
(no)index: Controls whether the robot should index this page. Possible values: "noindex" or "index".
(no)follow: Controls whether the robot should follow the links on this page and crawl them. Possible values: "nofollow" or "follow".
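As a sketch, a hypothetical page that should be neither indexed nor crawled for links would carry the tag in its head section like this:

```
<html>
<head>
  <!-- Ask compliant robots not to index this page or follow its links -->
  <meta name="robots" content="noindex, nofollow">
  <title>Private page</title>
</head>
<body>...</body>
</html>
```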
Unfortunately, not all search engines read meta tags, so robots meta tags may be ignored. The preferable way to inform search engines that certain files are off-limits is therefore a robots.txt file.
