What part of the web is archived?

The Wayback Machine of the Internet Archive is the largest and best-known archive, preserving web pages since 1995. Besides it, there are a dozen other services that also archive the web: the indexes of search engines and specialized archives such as Archive-It, UK Web Archive, WebCite, ArchiefWeb, and Diigo. It would be interesting to know how many web pages end up in these archives relative to the total number of documents on the Internet.

It is known that as of 2011 the Internet Archive database contains more than 2.7 billion URIs, many of them in multiple copies taken at different points in time. For example, the Habrahabr home page has been "photographed" 518 times since July 3, 2006.
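Numbers like these can be checked against the Wayback Machine's public CDX API, which lists every capture of a given URL. The Python sketch below assumes the `requests` package and the documented endpoint at web.archive.org/cdx/search/cdx; it is only an illustration of how such counts can be obtained, not the method used in the article.

```python
# Query the Wayback Machine CDX API for all captures of a page and report
# how many there are and when the first one was taken.
import requests

CDX = "http://web.archive.org/cdx/search/cdx"

def capture_stats(url, to_date=None):
    """Return (number_of_captures, first_capture_timestamp) for url."""
    params = {"url": url, "output": "json", "fl": "timestamp"}
    if to_date:
        params["to"] = to_date  # e.g. "20110101" to reproduce a 2011-era count
    resp = requests.get(CDX, params=params, timeout=60)
    resp.raise_for_status()
    rows = resp.json() if resp.text.strip() else []
    captures = rows[1:]  # with output=json the first row is a header
    if not captures:
        return 0, None
    # By default captures are listed oldest first, so row 0 is the first snapshot.
    return len(captures), captures[0][0]

if __name__ == "__main__":
    count, first = capture_stats("habrahabr.ru", to_date="20110101")
    print(count, first)
```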

It is also known that Google's link index passed the mark of one trillion unique URLs five years ago, although many of those documents are duplicates. Google cannot crawl every URL, so the company decided to treat the number of documents on the Internet as infinite.

As an example of this "infinity," Google cites a web calendar: there is no need to download and index all of its pages millions of years ahead, because each page is generated on demand.

Nevertheless, scientists want to know at least roughly what part of the web is archived and preserved for future generations. Until now, nobody could answer this question. Researchers from Old Dominion University in Norfolk conducted a study and obtained a rough estimate.

For data processing they used the Memento web framework, which operates with the following concepts:

  1. URI-R to identify the address of the original resource.
  2. URI-M to identify the archived state of that resource at time t.

Accordingly, each URI-R can have zero or more URI-M states.
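In practice these concepts map directly onto HTTP requests: asking a Memento TimeMap for a URI-R returns the list of its URI-Ms with their datetimes. A minimal Python sketch is shown below; it assumes the `requests` package and the public Memento aggregator endpoint at timetravel.mementoweb.org, which the article does not mention and is assumed here.

```python
# Count the URI-M entries (mementos) that archives hold for a given URI-R,
# by fetching a Memento TimeMap in the application/link-format serialization.
# The aggregator URL below is an assumption; any Memento-compliant TimeMap
# endpoint returning link-format should work the same way.
import re
import requests

TIMEMAP = "http://timetravel.mementoweb.org/timemap/link/"

def memento_count(uri_r):
    """Return the number of archived snapshots (URI-Ms) known for uri_r."""
    resp = requests.get(TIMEMAP + uri_r, timeout=30)
    if resp.status_code == 404:
        return 0  # no archive known to the aggregator holds a copy
    resp.raise_for_status()
    # Every URI-M appears as a link whose rel contains "memento"
    # ("memento", "first memento", "last memento", ...); other links
    # (rel="original", "timegate", "self") are not snapshots.
    return len(re.findall(r'rel="[^"]*\bmemento\b[^"]*"', resp.text))

if __name__ == "__main__":
    print(memento_count("http://example.com/"))
```

Counting mementos over a whole sample of URI-Rs gives the kind of per-address copy counts discussed below.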

From November 2010 to January 2011 an experiment was run to determine the share of publicly accessible pages that end up in the archives. Since the number of URIs on the Internet is infinite (see above), it was necessary to find an acceptable sample representative of the entire web. Here the researchers used a combination of several approaches:

  1. A sample from the Open Directory Project (DMOZ).
  2. A random sample of URIs from search engine indexes, as described by Ziv Bar-Yossef and Maxim Gurevich in "Random sampling from a search engine's index" (Journal of the ACM (JACM), 55(5), 2008).
  3. The most recently added URIs on the social bookmarking service Delicious, obtained with the Delicious Recent Random URI Generator.
  4. URIs from the link-shortening service Bitly, selected by generating random hashes (a sketch of this idea follows the list).
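The Bitly part of the sampling can be approximated by drawing random short hashes and keeping only the ones that actually redirect somewhere. The sketch below illustrates the idea; the hash length, the alphabet, and the use of plain HTTP HEAD requests are assumptions, since the article gives no details of the generator.

```python
# Sample random URIs through the bit.ly shortener by generating random
# hashes and keeping those that resolve to a redirect. The 6-character
# base-62 hash format is an assumption; real hashes vary in length.
import random
import string
import requests

ALPHABET = string.ascii_letters + string.digits  # base-62, case-sensitive

def random_bitly_uris(n, hash_len=6, seed=None):
    """Collect n target URIs by probing randomly generated bit.ly hashes."""
    rng = random.Random(seed)
    found = []
    while len(found) < n:
        short = "http://bit.ly/" + "".join(rng.choice(ALPHABET)
                                           for _ in range(hash_len))
        try:
            # Do not follow the redirect: the Location header is the
            # original URI-R we want to sample.
            resp = requests.head(short, allow_redirects=False, timeout=10)
        except requests.RequestException:
            continue  # network error, try another hash
        if resp.status_code in (301, 302) and "Location" in resp.headers:
            found.append(resp.headers["Location"])
    return found

if __name__ == "__main__":
    for uri in random_bitly_uris(5, seed=42):
        print(uri)
```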

For practical reasons, the size of each sample was limited to a thousand addresses. The results of the analysis are shown in a summary table for each of the four samples.



The study showed that from 35% to 90% of all URIs on the Internet have at least one copy in an archive, the range reflecting the spread across the four samples. From 17% to 49% of URIs have 2 to 5 copies, from 1% to 8% have been "photographed" 6 to 10 times, and from 8% to 63% have 10 or more copies.

With reasonable confidence we can say that no less than 31.3% of URIs are archived once a month or more often, and at least 35% of all pages have at least one copy in the archives.

Of course, the figures above do not apply to the so-called Deep Web, the term used for dynamically generated pages served from databases, password-protected directories, social networks, paid archives of newspapers and magazines, Flash websites, digital books, and other resources hidden behind firewalls, kept in closed access, and/or unavailable for indexing by search engines. By some estimates, the Deep Web may be several orders of magnitude larger than the surface layer.
Article based on information from habrahabr.ru
