TBone
August 2nd, 2004, 16:01
I wasn't quite sure where to post this since there isn't really a general forum here, but I figured this is probably where it would do the most good, since it pertains to the "physics" of searching the web. I think the article is vague enough about the actual searching process that I won't get banned for giving out too many details about our big top secret: that you can search the web for information.
I think this paper was written in 2000, so if it's old information that everyone already knew about, then I guess I'm just an idiot. It's a research paper called "Graph structure in the web" by 8 researchers from AltaVista, IBM, and Compaq. They did an exhaustive study on the structure of the web from a hyperlinking standpoint, using a huge data set acquired from AltaVista. The paper itself is really detailed and includes mathematical models that I've grown far too rusty to fully appreciate, but some of the most interesting conclusions include:

Quote:
In other recent work, [Albert, Jeong, and Barabasi 99] report the intriguing finding that most pairs of pages on the web are separated by a handful of links, almost always under 20, and suggest that this number will grow logarithmically with the size of the web. This is viewed by some as a "small world" phenomenon. Our experimental evidence reveals a rather more detailed and subtle picture: significant portions of the web cannot at all be reached from other (significant) portions of the web, and there is a significant number of pairs that can be bridged, but only using paths going through hundreds of intermediate pages.
Quote:
Our analysis reveals an interesting picture (Figure 9) of the web's macroscopic structure. Most (over 90%) of the approximately 203 million nodes in our crawl form a single connected component if hyperlinks are treated as undirected edges. This connected web breaks naturally into four pieces. The first piece is a central core, all of whose pages can reach one another along directed hyperlinks -- this "giant strongly connected component" (SCC) is at the heart of the web. The second and third pieces are called IN and OUT. IN consists of pages that can reach the SCC, but cannot be reached from it - possibly new sites that people have not yet discovered and linked to. OUT consists of pages that are accessible from the SCC, but do not link back to it, such as corporate websites that contain only internal links. Finally, the TENDRILS contain pages that cannot reach the SCC, and cannot be reached from the SCC. Perhaps the most surprising fact is that the size of the SCC is relatively small -- it comprises about 56M pages. Each of the other three sets contain about 44M pages -- thus, all four sets have roughly the same size.
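If you want to play with that "bow-tie" idea, here's a minimal Python sketch of the same decomposition on a made-up toy graph. The edge list and the assumption that node "a" sits in the core are mine, and this obviously isn't how the authors processed a 200-million-page AltaVista crawl - it's just to show what SCC/IN/OUT/TENDRILS mean in code.
Code:
# Illustrative sketch only: classify the nodes of a small directed graph into
# the bow-tie pieces described above -- SCC, IN, OUT, and the leftovers.
from collections import deque

edges = [
    ("a", "b"), ("b", "c"), ("c", "a"),   # a small strongly connected core
    ("x", "a"),                           # x can reach the core (IN)
    ("c", "y"),                           # y is reachable from the core (OUT)
    ("t", "u"),                           # t, u never touch the core (tendril-ish)
]

def bfs(start_nodes, adj):
    """Return every node reachable from start_nodes by following adj edges."""
    seen, queue = set(start_nodes), deque(start_nodes)
    while queue:
        node = queue.popleft()
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

nodes = {n for e in edges for n in e}
fwd = {}   # page -> pages it links to
rev = {}   # page -> pages linking to it
for src, dst in edges:
    fwd.setdefault(src, set()).add(dst)
    rev.setdefault(dst, set()).add(src)

# Grow the SCC around a seed node assumed to sit in the core ("a" here):
# the SCC is everything that both reaches the seed and is reached by it.
seed = "a"
reachable_from_seed = bfs({seed}, fwd)
reaching_seed = bfs({seed}, rev)
scc = reachable_from_seed & reaching_seed

out_part = reachable_from_seed - scc      # reachable from the core, no way back
in_part = reaching_seed - scc             # reaches the core, not reachable from it
rest = nodes - scc - out_part - in_part   # tendrils / disconnected bits

print("SCC:", scc, "IN:", in_part, "OUT:", out_part, "other:", rest)

Running it prints the core {a, b, c}, IN {x}, OUT {y}, and the leftover nodes {t, u}.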
Quote:
We show that for randomly chosen source and destination pages, the probability that any path exists from the source to the destination is only 24%. We also show that, if a directed path exists, its average length will be about 16.
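And here's roughly the kind of measurement that 24% / length-16 claim comes from: sample random (source, destination) pairs, check whether a directed path exists, and average the path length over the pairs that are connected. This is my own sketch on a random toy graph (the parameters are arbitrary), so don't expect it to reproduce the paper's numbers.
Code:
# Rough sketch (mine, not the paper's code): estimate the chance that a random
# page can reach another random page, and the average directed path length.
import random
from collections import deque

def directed_distance(adj, src, dst):
    """BFS distance from src to dst along directed edges, or None if unreachable."""
    if src == dst:
        return 0
    seen, queue = {src}, deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        for nxt in adj.get(node, ()):
            if nxt == dst:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None

# Toy web: 2000 "pages", each linking to 3 random others.
n, out_links = 2000, 3
adj = {page: random.sample(range(n), out_links) for page in range(n)}

samples, connected, total_length = 1000, 0, 0
for _ in range(samples):
    src, dst = random.randrange(n), random.randrange(n)
    dist = directed_distance(adj, src, dst)
    if dist is not None:
        connected += 1
        total_length += dist

print("P(path exists) ~", connected / samples)
if connected:
    print("average path length over connected pairs ~", total_length / connected)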
When you check out the model they finally wind up with, you can see that only about 50% of the web (the SCC core plus the OUT pages, roughly 100M of the 203M pages in their crawl) is connected in a way that you could find it by following links, assuming you had access to the core. Even more surprising: if you didn't have a search engine at all and had to start following links from a randomly chosen page instead of the core, you would probably only be able to reach about 25% of the web (that's the 24% figure quoted above).
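The back-of-the-envelope arithmetic behind that 50% figure, using the component sizes quoted above (my own rough calculation, not a number taken from the paper):
Code:
# Component sizes quoted from the paper, in millions of pages
scc, in_part, out, tendrils = 56, 44, 44, 44
total = 203  # approximate size of the crawl

# Pages you could find by following links if you start inside the core
print((scc + out) / total)   # ~0.49, i.e. about half the web

# The "no search engine" case is the sampled 24% figure quoted above:
# the chance that one random page has any directed path to another.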
The other paper I ran across is called "Searching the World Wide Web" by Steve Lawrence and C. Lee Giles, for NEC Research. This is an earlier study from 1998, so it's getting pretty out of date, but some of the statistics may still hold true. Basically they concluded that individual search engines generally only covered 3-35% of the indexable web. They don't really define what they mean by "indexable", but I think the other paper gives you some idea of what subset of pages that might include.
Okay, so it's probably antithetical, but:
http://www.almaden.ibm.com/cs/k53/www9.final
http://www.neci.nec.com/~lawrence/science98.html