Preventing page indexing. Fewer junk pages in the index can mean more traffic

From the author: Do you have pages on your website that you don't want search engines to see? In this article you will learn in detail how to prevent page indexing in robots.txt, whether that is the right approach, and how to block access to pages in general.

So, you need to prevent certain pages from being indexed. The easiest way is to add the necessary lines to the robots.txt file. Note that folder addresses are written relative to the site root; specify the URLs of individual pages the same way, or use an absolute path.

Let's say my blog has a few pages: contacts, about me, and my services. I wouldn't want them to be indexed. Accordingly, we write:

User-agent: *
Disallow: /kontakty/
Disallow: /about/
Disallow: /uslugi/

Another option

Great, but this is not the only way to block a robot's access to certain pages. The second is to place a special meta tag in the HTML code of the pages you want to close. It looks like this:

<meta name="robots" content="noindex,nofollow">

For the tag to work correctly, it must be placed inside the head section of the HTML document. As you can see, it has two attributes. The name attribute is set to robots and indicates that these instructions are intended for web crawlers.
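For clarity, here is a minimal page skeleton (the title and content are placeholders) showing where the tag belongs:

<!DOCTYPE html>
<html>
<head>
    <title>Contacts</title>
    <!-- the robots meta tag must sit inside head -->
    <meta name="robots" content="noindex,nofollow">
</head>
<body>
    <!-- page content that should not be indexed -->
</body>
</html>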

The content attribute takes two values, separated by a comma. The first prohibits or allows indexing of the text on the page; the second indicates whether the links on the page may be followed.

Thus, if you want a page not to be indexed at all, specify the values noindex, nofollow: do not index the text and do not follow any links. There is a rule that a page with no indexable text will not be indexed. So if all of the text is closed with noindex, there is nothing left to index, and nothing will be included in the index.

In addition, there are the following values:

noindex, follow – prohibits indexing the text but allows following the links;

index, nofollow – can be used when the content should be indexed but all links on the page should be closed;

index, follow – the default value; everything is allowed.
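Written out as tags, these variants look like this:

<meta name="robots" content="noindex,follow">
<meta name="robots" content="index,nofollow">
<meta name="robots" content="index,follow">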

The robots.txt file is a set of directives (rules for robots) with which you can block or allow search robots to index certain sections and files of your site, and also pass them additional information. Initially, robots.txt could only be used to prohibit indexing of sections; the ability to explicitly allow indexing appeared later and was introduced by the search leaders Yandex and Google.

Robots.txt file structure

First, the User-agent directive is written; it specifies which search robot the instructions apply to.

A small list of well-known and frequently used User-agents:

  • User-agent:*
  • User-agent: Yandex
  • User-agent: Googlebot
  • User-agent: Bingbot
  • User-agent: YandexImages
  • User-agent: Mail.RU

Next come the Disallow and Allow directives, which respectively prohibit or allow indexing of sections, individual pages, or files. Then these steps are repeated for the next User-agent. At the end of the file there is a Sitemap directive, which specifies the address of your sitemap.

When writing Disallow and Allow directives, you can use the special characters * and $. Here * means "any sequence of characters" and $ means "end of the address". For example, Disallow: /admin/*.php means that indexing is prohibited for all files in the admin folder ending in .php, while Disallow: /admin$ prohibits the address /admin but does not prohibit /admin.php or /admin/new/, if they exist.

If all User-agents use the same set of directives, there is no need to duplicate this information for each of them; User-agent: * is enough. When the information needs to be supplemented for one of the user-agents, duplicate the common rules in a separate section for it and add the new ones there, as in the example below.
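For example (the blocked paths here are assumptions), a file where the Yandex robot gets one extra rule on top of the common set might look like this:

# common rules for all robots
User-agent: *
Disallow: /admin/

# the same rules duplicated for Yandex, plus an extra one
User-agent: Yandex
Disallow: /admin/
Disallow: /search/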

Example robots.txt for WordPress:
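The example itself is not reproduced in this copy of the article, so here is only an illustrative sketch (the blocked paths are typical WordPress service directories, assumed rather than taken from the original):

User-agent: *
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /?s=

User-agent: Yandex
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /?s=
Host: site.ru

Sitemap: http://site.ru/sitemap.xml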

*Note for User-agent: Yandex

Checking robots.txt

Old version of Search console

To check that robots.txt is correct, you can use Google's webmaster tools: go to the "Crawling" section, then "Fetch as Google", and click the "Fetch and render" button. The scan results show two screenshots of the site: how users see it and how search robots see it. Below you will see a list of files whose blocking from indexing prevents search robots from reading your site correctly (you will need to allow the Google robot to crawl them).

Typically these are stylesheet files (CSS), JavaScript, and images. After you allow these files to be crawled, both screenshots in the webmaster tools should look identical. The exceptions are files that are hosted remotely, for example the Yandex.Metrica script, social network buttons, and so on; you won't be able to block or allow them. For more information on how to resolve the "Googlebot cannot access CSS and JS files on the site" error, read our blog.
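A common way to open such resources is to add Allow rules that are more specific than the blocking Disallow (the theme path below is an assumption; substitute your own folders):

# the blocked theme folder stays closed, but its styles and scripts are re-allowed
User-agent: *
Disallow: /wp-content/themes/
Allow: /wp-content/themes/*.css
Allow: /wp-content/themes/*.js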

New version of Search console

In the new version there is no separate menu item for checking robots.txt. Now you just need to paste the address of the page you want to check into the search bar.

In the next window, click "View crawled page".

In the window that appears, you can see the resources that, for one reason or another, are inaccessible to the Google robot. In this particular example there are no resources blocked by the robots.txt file.

If there are such resources, you will see messages like the following:

Every site's robots.txt is unique, but some common things to block can be identified in the following list:

  • Authorization, registration, password-recovery, and other technical pages.
  • The resource's admin panel.
  • Sorting pages and pages that change how information is displayed on the site.
  • For online stores: cart and favorites pages. You can read more in the advice for online stores on indexing settings on the Yandex blog.
  • The site search page.

This is just an approximate list of what can be blocked from search engine robots; each case needs to be considered individually, and there may be exceptions to the rules.
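As a rough illustration (the paths are invented for this example and will differ from site to site), such a block might look like this:

User-agent: *
Disallow: /admin/      # admin panel
Disallow: /login/      # authorization
Disallow: /register/   # registration
Disallow: /cart/       # shopping cart
Disallow: /favorites/  # favorites
Disallow: /search/     # site search
Disallow: /*?sort=     # sorting pages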

Conclusion

The robots.txt file is an important tool for regulating the relationship between the site and the search engine robot; it is important to spend time setting it up.

The article contains a large amount of information about Yandex and Google robots, but this does not mean that you need to create a file only for them. There are other robots - Bing, Mail.ru, etc. You can supplement robots.txt with instructions for them.

Many modern cms create a robots.txt file automatically, and they may contain outdated directives. Therefore, after reading this article, I recommend checking the robots.txt file on your website, and if they are present there, it is advisable to delete them. If you don't know how to do this, please contact

Robots.txt is a service file that serves as a recommendation for restricting access to the content of web documents for search engines. In this article we will look at setting up Robots.txt, describing the directives and composing it for popular CMSs.

This file is located in the root directory of your site and can be opened and edited with a simple text editor; I recommend Notepad++. For those who don't like to read, there is a VIDEO at the end of the article 😉

Why do we need robots.txt?

As I said above, using the robots.txt file we can limit the access of search bots to documents, i.e. directly influence the indexing of the site. Most often the following are blocked from indexing:

  • Service files and CMS folders
  • Duplicates
  • Documents that are not useful to the user
  • Non-unique pages

Let's look at a specific example:

An online store selling shoes is implemented on one of the popular CMSs, and not in the best way. I can tell right away that the search results will include search pages, pagination, the shopping cart, some engine files, and so on. All of these are duplicates and service files that are useless to the user. Therefore, they should be closed from indexing, and if there is also a "News" section into which various interesting articles are copied and pasted from competitors' sites, then there is nothing to think about: close it right away.
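For this store, a first pass might look like the sketch below (the paths are invented for the illustration and depend on the CMS):

User-agent: *
Disallow: /search/   # internal search pages
Disallow: /*?page=   # pagination duplicates
Disallow: /cart/     # shopping cart
Disallow: /admin/    # engine/service files
Disallow: /news/     # the copy-pasted "News" section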

So we make sure to create a robots.txt file so that no garbage gets into the search results. Don't forget that the file should be accessible at http://site.ru/robots.txt.

Robots.txt directives and configuration rules

User-agent. This addresses a specific search engine robot or all robots at once. If a specific robot name is specified, for example "YandexMedia", the general user-agent directives are not used for it. Writing example:

User-agent: YandexBot # will be used only by the main Yandex indexing robot
Disallow: /cart

Disallow/Allow. These prohibit or allow indexing of a specific document or section. The order in which they are written does not matter: the robot sorts the rules by prefix length, from shortest to longest, and if a Disallow and an Allow match with prefixes of equal length, Allow takes precedence. If you need to block indexing of a page, simply enter the relative path to it (Disallow: /blog/post-1).

User-agent: Yandex
Disallow: /
Allow: /articles
# we prohibit indexing of the site, except for the /articles section

Regular expressions with * and $. An asterisk means any sequence of characters (including an empty one). The dollar sign marks the end of the address. Examples of use:

Disallow: /page*    # prohibits all pages starting with http://site.ru/page
Disallow: /articles$ # prohibits only the page http://site.ru/articles, allowing pages like http://site.ru/articles/new

Sitemap directive. If you use it, then in robots.txt it should be indicated like this:

Sitemap: http://site.ru/sitemap.xml

Host directive. As you know, sites can have mirrors. This rule points the search bot to the main mirror of your resource. It applies to Yandex. If your main mirror is without WWW, write:

Host: site.ru

Crawl-delay. Sets the delay (in seconds) between the bot downloading your documents. It is written after the Disallow/Allow directives.

Crawl-delay: 5 # timeout in 5 seconds

Clean-param. Tells the search bot that there is no need to download additional duplicate information (session identifiers, referrers, user IDs). Clean-param should be specified for dynamic pages:

Clean-param: ref /category/books
# we indicate that http://site.ru/category/books is the main page, and
# http://site.ru/category/books?ref=yandex.ru&id=1 is the same page, but with parameters

Main rule: the robots.txt file name must be in lowercase and the file must be located in the site root. Example file structure:

User-agent: Yandex
Disallow: /cart
Allow: /cart/images
Sitemap: http://site.ru/sitemap.xml
Host: site.ru
Crawl-delay: 2

Meta robots tag and how it is written

This way of blocking pages is respected more reliably by the Google search engine. Yandex handles both options equally well.

It has 2 directives: follow/nofollow and index/noindex, i.e. permission/prohibition of following links and permission/prohibition of indexing the document. The directives can be written together, see the example below.

For any individual page you can add the following tag:
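For example, to close a page completely (combining both directives, as mentioned above):

<meta name="robots" content="noindex,nofollow">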

Correct robots.txt files for popular CMS

Example Robots.txt for WordPress

Below you can see my version from this SEO blog.

User-agent: Yandex
Disallow: /wp-content/uploads/
Allow: /wp-content/uploads/*/*/
Disallow: /wp-login.php
Disallow: /wp-register.php
Disallow: /xmlrpc.php
Disallow: /template.html
Disallow: /cgi-bin
Disallow: /wp-admin
Disallow: /wp-includes
Disallow: /wp-content/plugins
Disallow: /wp-content/cache
Disallow: /wp-content/themes
Disallow: /wp-trackback
Disallow: /wp-feed
Disallow: /wp-comments
Disallow: */trackback
Disallow: */feed
Disallow: */comments
Disallow: /tag
Disallow: /archive
Disallow: */trackback/
Disallow: */feed/
Disallow: */comments/
Disallow: /?feed=

I prohibit trackbacks because they duplicate a piece of the article in the comments. And if there are a lot of trackbacks, you end up with a bunch of identical comments.

I try to close the service folders and files of any CMS, because I don't want them to end up in the index (although search engines don't take them anyway, it won't hurt).

Feeds should be closed because they are partial or complete duplicate pages.

We close tags if we don’t use them or if we’re too lazy to optimize them.

Examples for other CMS

To download the correct robots for the desired CMS, simply click on the appropriate link.

Whenever a site is accessed, search robots first look for and read the robots.txt file. It contains special directives that control the robot's behavior. A hidden danger for any site can come both from the absence of this file and from its incorrect configuration. I propose to study the question of setting up robots.txt in more detail, both in general and for the WordPress CMS in particular, and also to look at common mistakes.

Robots.txt file and robot exception standard

All search engines understand instructions written in a special file according to the robots exclusion standard. For this purpose a regular text file called robots.txt is used, located in the root directory of the site. If it is placed correctly, the contents of this file can be viewed on any website by simply adding /robots.txt after the domain address.

Instructions for robots allow you to prohibit the scanning of files/directories/pages, limit the frequency of access to the site, and specify a mirror and an XML sitemap. Each instruction is written on a new line in the following format:

[directive]: [value]

The entire list of directives is divided into sections (entries), separated by one or more empty lines. A new section begins with one or more User-agent instructions. The entry must contain at least one User-agent and one Disallow directive.

Text after the # (hash) symbol is considered a comment and is ignored by search robots.

User-agent directive

User-agent is the first directive in a section; it names the robots for which the following rules are intended. An asterisk as the value means any name; only one section with instructions for all robots is allowed. Example:

# instructions for all robots
User-agent: *
...

# instructions for Yandex robots
User-agent: Yandex
...

# instructions for Google robots
User-agent: Googlebot
...

Disallow directive

Disallow is the basic directive; it prohibits crawling of URLs/files/directories whose names fully or partially match what is specified after the colon.

Advanced search robots like Yandex and Google understand the special character * (asterisk), which denotes any sequence of characters. It is not advisable to use this wildcard in the section for all robots.

Examples of the Disallow directive:

# an empty value allows everything to be crawled
User-agent: *
Disallow:

# prohibits crawling files and/or directories whose names start with "wp-"
User-agent: *
Disallow: /wp-

# prohibits crawling the files page-1.php, page-vasya.php, page-news-345.php
# (the * stands for any sequence of characters)
User-agent: *
Disallow: /page-*.php

Allow directive (unofficial)

Allow permits crawling of the specified resources. Officially this directive is not part of the robots exclusion standard, so it is not advisable to use it in the section for all robots (User-agent: *). An excellent example of its use is to allow crawling of resources in a directory that is otherwise blocked by a Disallow directive:

# prohibits crawling resources starting with /catalog
# but allows crawling the page /catalog/page.html
User-agent: Yandex
Disallow: /catalog
Allow: /catalog/page.html

Sitemap (unofficial)

Sitemap is a directive indicating the address of the sitemap in XML format. It is also not described in the exclusion standard and is not supported by all robots (it works for Yandex, Google, Ask, Bing, and Yahoo). You can specify one or more maps; all of them will be taken into account. It can be used without a User-agent, after an empty line. Example:

# one or more maps in XML format; the full URL is specified
Sitemap: http://sitename.com/sitemap.xml
Sitemap: http://sitename.com/sitemap-1.xml

Host directive (Yandex only)

Host is a directive for the Yandex robot that indicates the main mirror of the site. The topic of mirrors is covered in more detail in the Yandex help. This instruction can be given either in the section for Yandex robots or as a separate entry without a User-agent (the instruction is cross-sectional and will be taken into account by Yandex in any case, while other robots ignore it). If Host is specified several times in one file, only the first occurrence is taken into account. Examples:

# specify the main mirror in the section for Yandex
User-agent: Yandex
Disallow:
Host: sitename.com

# main mirror for a site with an SSL certificate
User-agent: Yandex
Disallow:
Host: https://sitename.com

# or separately, without a User-agent, after an empty line
Host: sitename.com

Other directives

Yandex robots also understand the Crawl-delay and Clean-param directives. Read more about their use in the help documentation.

Robots, robots.txt directives and search engine index

Previously, search robots followed the directives of robots.txt and did not add resources “prohibited” there to the index.

Today things are different. While Yandex obediently excludes from its index the addresses blocked in the robots file, Google acts quite differently. It will add them to the index anyway, but the search results will show a note saying that a description of the page is not available because of restrictions in robots.txt.

Why does Google add pages that are prohibited in robots.txt to the index?

The answer lies in a little Google trick. If you carefully read the webmaster help, everything becomes more than clear:

Google shamelessly states that directives in robots.txt are recommendations, not direct commands to act on.

This means that the robot takes the directives into account but still acts in its own way. It can add a page blocked in robots.txt to the index if it encounters a link to it.

Adding an address to robots.txt does not guarantee that it will be excluded from Google's search engine index.

Google index + incorrect robots.txt = DUPLICATES

Almost every guide on the Internet says that closing pages in robots.txt prevents them from being indexed.

That used to be the case. But we already know that this scheme no longer works for Google. And what is even worse, everyone who follows such recommendations makes a huge mistake: the blocked URLs end up in the index and are marked as duplicates, the share of duplicate content keeps growing, and sooner or later the site is punished by the Panda filter.

Google offers two genuinely workable options for keeping a site's content out of its index:

  1. password protection (applies to files such as .doc, .pdf, .xls, and others);
  2. adding a robots meta tag with the noindex value to the page's head section (applies to web pages):
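That is, the robots meta tag with noindex placed in the page's head, for example:

<meta name="robots" content="noindex">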

The main thing to consider:

If you add the above meta tag prohibiting indexing to a web page and additionally block crawling of the same page in robots.txt, the Google robot will not be able to read the restricting meta tag and will add the page to the index anyway!
(that is why the search results say that the description is restricted by robots.txt)

You can read more about this problem in the Google Help. And there is only one solution here: open access in robots.txt and set up the indexing ban using the meta tag (or a password, if we are talking about files).

Robots.txt examples for WordPress

If you read the previous section carefully, it becomes clear that today you should not practice excessive blocking of addresses in robots.txt, at least as far as Google is concerned. It is better to manage page indexing through the robots meta tag.

Here is the most basic and at the same time completely correct robots.txt for WordPress:

User-agent: *
Disallow:
Host: sitename.com

Surprised? No wonder! Everything ingenious is simple 🙂 On Western resources, where there is no Yandex, the recommendations for composing robots.txt for WordPress come down to the first two lines, as shown by the authors of WordPress SEO by Yoast.

A properly configured SEO plugin will take care of canonical links and the robots meta tag with the noindex value, and the admin pages are password-protected and do not need to be blocked from indexing (the only exceptions may be the login and registration pages; make sure they have a robots meta tag with the noindex value). It is better to add the sitemap manually in the search engines' webmaster tools and to check there that it is read correctly. The only thing left that matters for the RuNet is to specify the main mirror for Yandex.

Another option, suitable for the less daring:

User-agent: *
Disallow: /wp-admin
Host: sitename.com
Sitemap: http://sitename.com/sitemap.xml

The first section prohibits all robots from indexing the wp-admin directory and its contents. The last two lines specify the site mirror for the Yandex robot and the sitemap.

Before changing your robots.txt...

If you decide to change the directives in robots.txt, then first take care of three things:

  1. Make sure that there are no additional files or directories in the root of your site whose contents should be hidden from scanning (these could be personal files or media resources);
  2. Turn on canonical links in your SEO plugin (this will exclude URLs with query parameters like http://sitename.com/index.php?s=word)
  3. Set up output of the robots meta tag with the noindex value on the pages you want to hide from indexing (for WordPress these are the archives by date, tag, and author, and the pagination pages). For some pages this can be done in the SEO plugin settings (All In One SEO has incomplete settings for this). Or you can output the tag yourself with a special code snippet:

/* =====================================================================
 * Output the robots meta tag on pages that should stay out of the index
 * ===================================================================== */
function my_meta_noindex() {
	if (
		//is_archive() OR // any archive pages - by month, by year, by category, by author
		//is_category() OR // category archives
		is_author() OR // author archives
		is_time() OR // post archives by time
		is_date() OR // post archives by any date
		is_day() OR // post archives by day
		is_month() OR // post archives by month
		is_year() OR // post archives by year
		is_tag() OR // post archives by tag
		is_tax() OR // archives for a custom taxonomy
		is_post_type_archive() OR // archives for a custom post type
		//is_front_page() OR // static home page
		//is_home() OR // main blog page with the latest posts
		//is_singular() OR // any post type - single posts, pages, attachments, etc.
		//is_single() OR // any single post of any post type (except attachments and Pages)
		//is_page() OR // any single Page ("Pages" in the admin panel)
		is_attachment() OR // any attachment page
		is_paged() OR // any and all pagination pages
		is_search() // site search results pages
	) {
		// the tag itself was stripped from this copy of the code; a noindex,nofollow robots meta tag is assumed here
		echo '<meta name="robots" content="noindex,nofollow" />' . "\n";
	}
}
add_action("wp_head", "my_meta_noindex", 3);

    In the lines that begin with //, the check is disabled and the meta tag will not be output for those pages (each line's comment describes which type of page the rule covers). By adding or removing the two slashes at the beginning of a line, you control whether the robots meta tag is output for a particular group of pages.

In a nutshell: what to close in robots.txt

When setting up the robots file and indexing pages, you need to remember two important points that put everything in its place:

Use the robots.txt file to control access to server files and directories. The robots.txt file plays the role of an electronic sign “No entry: private territory”

Use the robots meta tag to prevent content from appearing in search results. If a page has a robots meta tag with a noindex attribute, most robots will exclude the entire page from search results, even if other pages link to it.


