Coverage - "index volume"

There is more information on the web today than search engines have time to index. This means that informational chaos is increasing and the existing approaches are not keeping up with the growing information space. At the same time, the more resources of the relevant profile a system's database covers, the higher its completeness should be. It is this consideration that explains the fierce competition over the size of web-document index databases, which has been going on ever since search engines appeared. In information retrieval system (IRS) technology, such a database is called the search engine's index.

Five years ago the world's major search engines were fighting fiercely over precisely this figure. The front pages of search sites such as Altavista, Google, Alltheweb and Yahoo! displayed the number of indexed documents, that is, the index volume. At the beginning of the 21st century Google broke into the lead in coverage of web resources. But in 2002 Alltheweb, a system that today sits in the shadows, unexpectedly took first place in coverage of network resources and was therefore recognized as the world's best web IRS by index volume, having indexed 2.1 billion web pages. Then Google regained the lead, with more than 3.3 billion web pages in 2003.

The last figure placed on Google's front page was just over 8 billion pages (it was given in 2005). After that the figures were no longer published, presumably not for technical reasons, since it would be naive to assume that the owners of the databases do not know their size. From official press releases of that same year it is known that while Google's index stood at 13 billion documents, Yahoo!'s index exceeded that value, having reached 20 billion documents by then. Google's management did not agree with this figure and issued a rebuttal. Nevertheless, the data were removed from Google's home page, although CEO Eric Schmidt said at the same time: "The greater the index, the better the relevance and the more complete review." Yahoo!, in turn, said in a statement: "We congratulate Google on removing from its home page the numbers indicating the size of the index, and on recognizing that they do not mean anything. As we have said, the only thing that matters is that consumers find what they are looking for, and we encourage users to compare the results of our systems."

It seemed that the conflict was over and that no one would return to measuring index size. But time passed, and the next sensation swept through the world of search engines. At the end of July 2008 a new global search engine, Cuil (Fig. 2), was launched with a relatively small budget ($33 million) and an index of 121 billion web pages, which, according to experts, is several times larger than Google's index.

The new search engine shares its roots with Google. The creators of Cuil, Anna Patterson, her husband Tom Costello, and several former Google employees (among them Louis Monier, one of the creators of AltaVista), specialized in search over very large databases. In particular, while working at Google, Patterson registered the corresponding patent (Multiple Index Based Information Retrieval System).

Google responded at once to Cuil's sensational announcement, stating that it had successfully indexed its trillionth web page. But who can verify this? On the whole the statement is very vague and means only that the system has processed a trillion web pages since its inception.

The company said that the search engine has learned to find and remove from the index duplicate pages and pages reachable under different addresses. "Work on the index began with spiders memorizing the contents of a page and following the hyperlinks present on it. The system must constantly follow links from site to site, remembering the content of the pages it studies. In reality, Google has indexed more than a trillion pages, but not all of them are unique stand-alone pages. Many of them have more than one address, others are copies of each other," wrote Nissan Hajaj, one of the search engine's developers, in the official blog. Today, the company says, the index is being replenished without a moment's pause, and thanks to a distributed computing system and rapid updating of information, the entire ranked search index is rebuilt several times a day.
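The idea of counting only unique pages can be illustrated with a minimal Python sketch. It is not Google's actual method: the function names `normalize_url`, `content_fingerprint` and `group_duplicates` are hypothetical, and exact hashing of whitespace-collapsed text is assumed here as the simplest stand-in for real near-duplicate detection.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Lowercase scheme and host, drop the fragment, use '/' for an empty path."""
    parts = urlsplit(url)
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path or "/", parts.query, "")
    )


def content_fingerprint(text: str) -> str:
    """Hash the text after lowercasing and collapsing whitespace (an assumption)."""
    collapsed = " ".join(text.lower().split())
    return hashlib.sha256(collapsed.encode("utf-8")).hexdigest()


def group_duplicates(pages):
    """Map each content fingerprint to the list of addresses that serve it."""
    groups = {}
    for url, text in pages:
        groups.setdefault(content_fingerprint(text), []).append(normalize_url(url))
    return groups


# Hypothetical crawl result: three addresses, only two distinct documents.
crawled = [
    ("http://Example.com/a", "Hello   World"),
    ("http://example.com/b#top", "hello world"),
    ("http://example.com/c", "Another page"),
]
groups = group_duplicates(crawled)
print(f"{len(crawled)} pages crawled, {len(groups)} unique documents")
```

In this sketch the number of crawled addresses and the number of unique documents differ, which is the same gap the blog post describes between pages "indexed" and pages that are unique.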

Despite the enormous size of Google, the most powerful search engine today, the current volume of its search index for some reason remains a mystery. We can only compare Google and Cuil indirectly, by giving both a simple query (Cuil's figures can be trusted, since its founders presented the search index to external experts). As the companies' own materials make clear, neither search engine uses a so-called stop-word list, so queries consisting of simple, frequently used words make it possible to estimate the ratio of the index volumes. And anyone can make such an estimate! For example, enter the word "the" into both systems at the same time. We obtain:

Google: about 22,550,000,000 for the;

Cuil: 22,883,636,124 results for the.

The results are quite similar, which suggests that the search indexes are about the same size. Now enter the Russian word for "for" (to check the Russian-language part of the index). We obtain:

Google: about 546,000,000 for for;

Cuil: 368,508,113 results for for.

The Russian-language part of Google's index turned out to be somewhat larger. Searches for other words also point to the small volume of the Russian-language part of Cuil's index.

One could stop at these results, but let us enter one more word, "of", as a check. We get an unexpected answer:

Google: about 22,760,000,000 for of;

Cuil: 121,000,000,000 results for of.

So Cuil's result is more than 5 times larger. But given the results for the word "the" (and for other words, not only English ones), one can draw a different conclusion. Whatever the outcome of such comparisons, the fact remains: Google is the most popular search engine and the most expensive brand in the world, while Cuil is a little-known project with the budget of a regional search engine. Indeed, one has to agree that the size of the search index decides little.
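The arithmetic behind this comparison can be reproduced with a short Python sketch. It only uses the hit counts quoted above; the dictionary layout and the name `cuil_to_google_ratio` are illustrative assumptions, and Google's figures are the rounded "about" values it reported.

```python
# Hit counts quoted in the text; the "for" query was a Russian word.
hit_counts = {
    "the": {"google": 22_550_000_000, "cuil": 22_883_636_124},
    "for": {"google": 546_000_000, "cuil": 368_508_113},
    "of": {"google": 22_760_000_000, "cuil": 121_000_000_000},
}


def cuil_to_google_ratio(counts: dict) -> float:
    """Ratio of Cuil's reported hits to Google's for one query word."""
    return counts["cuil"] / counts["google"]


ratios = {word: cuil_to_google_ratio(c) for word, c in hit_counts.items()}

for word, ratio in ratios.items():
    print(f"{word!r}: Cuil/Google = {ratio:.2f}")
```

The three ratios come out at roughly 1.01, 0.67 and 5.32. Estimates that differ this much from one query word to the next cannot all reflect the same ratio of index volumes, which is why the comparison supports no firm conclusion.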

© 2010 - 2019 D@nVitLabs