
How do search engines work

by Sangam Adhikari

Search engines work by creating snapshots of the parts of the Web that are accessible to their crawlers, and they can certainly create and store snapshots that are very comprehensive.
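To make the crawl-and-snapshot idea concrete, here is a minimal sketch in Python: fetch a page, keep its raw HTML as the "snapshot", and queue the links found on it for later fetching. The seed URL and the page limit are purely illustrative, and a real crawler would add politeness rules, robots.txt handling, and deduplication.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed, max_pages=10):
    snapshot = {}                       # url -> raw HTML (the "snapshot")
    queue, seen = deque([seed]), {seed}
    while queue and len(snapshot) < max_pages:
        url = queue.popleft()
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
        except Exception:
            continue                    # unreachable or non-HTML page
        snapshot[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)   # resolve relative links
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return snapshot

# Placeholder seed URL for illustration only.
pages = crawl("https://example.com")
```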

The main point is that they index and store static pages, while most of what we see on our screens is dynamic, transitory, and ephemeral. Think of all the ads, but also the huge quantities of content hidden behind online query forms, and walled gardens such as Facebook, where so much content is created and uploaded so fast that it is not feasible for crawlers to access it.

The only way to index it all in time would be for Facebook and others to provide feeds to Google, and why would Facebook do that? Twitter experimented with such a feed as a revenue stream earlier, but its focus is now firmly on advertising.

As of 2020, the Web has actually been stagnating for years:

The glory days of exponential growth are long gone. The situation is masked by increases in Internet traffic, but that growth is largely about serving ads and rich media content at higher and higher resolutions. Good new material is still being added to the main (Google) index, but that is offset by older content going stale and becoming hard, or impossible, to reach through user queries, where freshness is a big component of ranking.

The size of the main index has always been a closely guarded secret, but it has been known for years that a good comprehensive estimate is about 100 billion pages, with about 10 billion or so forming the most important and relevant part. Such an index fits in a few hundred terabytes, which nowadays means roughly 10 servers. There is no real need for sizable intermediate storage, since all the algorithms have to be linear in order to scale at these numbers. That includes index construction and ranking; PageRank and other link-based analyses converge fast, in a hundred or so iterations. It all runs in RAM and can easily be block-partitioned, since it is linear in nature.
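As a rough sanity check, 100 billion pages at a few kilobytes of index data per page works out to a few hundred terabytes, consistent with the figure above. The "linear per pass" nature of the ranking step is easiest to see in PageRank's power iteration, sketched below on a toy link graph; the damping factor and the graph itself are illustrative, not Google's actual parameters.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Power-iteration PageRank. `links` maps each page to its outlinks;
    every page in the graph must appear as a key. Each iteration is a
    single linear sweep over the edges, which is why it scales."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
        for page, outlinks in links.items():
            if not outlinks:
                # Dangling page: spread its rank evenly over all pages.
                share = damping * rank[page] / len(pages)
                for p in pages:
                    new_rank[p] += share
            else:
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
        rank = new_rank
    return rank

# Toy graph: A and B link to each other, C links to A, B links to C.
print(pagerank({"A": ["B"], "B": ["A", "C"], "C": ["A"]}))
```

In practice around a hundred iterations is ample for the ranks to settle, which is the convergence behaviour the paragraph above refers to.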

Alternative search engines such as DuckDuckGo and Gigablast confirm all of this, as they do not have the luxury of throwing petabytes around just for the sake of it.

The main reason search engines might need more servers is the amount of search traffic they handle. A single search cluster as described above might be able to serve a few hundred queries per second. DuckDuckGo, for instance, handles about 500 queries per second as of February 2020 (see https://duckduckgo.com/traffic). Google handles hundreds of times more, which drives its need for hundreds of petabytes. But that is for hundreds of copies of the index, distributed across its data centers around the world for the quickest query responses.
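A back-of-envelope sketch of that capacity math, using the figures from the paragraph above; the per-cluster throughput, the index size, and the Google query rate are rough assumptions rather than published numbers, and in reality replicas are also added for geographic latency, not just raw throughput.

```python
index_size_tb = 300      # one full copy of the index (rough estimate from above)
cluster_qps = 300        # queries/second one cluster can serve (assumed)

def copies_needed(total_qps):
    """Full index replicas needed to absorb total_qps, rounded up."""
    return -(-total_qps // cluster_qps)   # ceiling division

for engine, qps in [("DuckDuckGo", 500), ("Google (illustrative)", 100_000)]:
    copies = copies_needed(qps)
    print(f"{engine}: ~{copies} index copies, ~{copies * index_size_tb} TB total")
```

With these assumed numbers, a few hundred copies of a few-hundred-terabyte index lands in the range of a hundred petabytes or so, which is consistent with the "hundreds of petabytes" figure above.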

In summary, search engines can certainly store a comprehensive index of the accessible content of the entire Web. It is not all that big, and as of 2020 it has actually been stagnating for years.
