Understanding Google's Search Engine Data Structures
Measuring Speed: How Fast is Fast?
When we talk about data structures and algorithms, we need a way to measure exactly how fast they are. Computer processors get faster over time, but that doesn’t mean that an algorithm that was invented in the 80s is better now, even if we CAN search many times faster!
In computer science, we measure algorithms by how many steps they take relative to the size of the input. To do this, we use what is called Big O (pronounced “big oh”) notation, which counts the number of steps and then throws away everything except the largest factor. To see what that means, let’s take an example.
Consider the problem of sorting a list of integers. One of the easiest ways (to program, at least) is called bubble sort. You compare the first two elements in the list, and swap them if the first one is higher. Then you compare the 2nd and 3rd elements in the list and swap them if element two is higher. You continue this way until you’ve worked your way through the entire list, at which point the largest element is at the end of the list. You then start over and do the same thing again, but stop at the 2nd to last position (which now contains the second largest element); continue the process until the entire list is sorted.
How long does this take? If the list has n elements, the first time through requires n-1 comparisons. The second time through requires n-2 comparisons, then n-3, and so on. If you add all of this up, you get (n2-n)/2 comparisons, or 1/2 n2 - 1/2 n. We throw away everything except the largest factor, and say that bubble sort runs in O(n2) time.
Other algorithms run much more quickly; for example, if we know that all of the numbers are between 1 and 100, we can use a bucket sort, which runs in linear, or O(n) time. For small sorts, we can’t tell which algorithm is faster - the constant factors we threw out (such as the 1/2 multiplier for bubble sort) may overpower everything else. But with larger sorts, the algorithm with the lower big-oh runtime will perform better. If we have a million items to sort, for example, bubble sort will take approximately a million times as long as bucket sort!
Sizing Your Filing Cabinet
Finding the best algorithm often requires figuring out the best data structure. Not only can a clever data structure save time (by not requiring as much information to be copied), but it can also save space; for example, an interval graph on n vertices may have _n_2 edges, but it can be represented as n pairs of endpoints.
When searching text, we often use a particular data structure called a DAWG, for Directed Acyclic Word Graph. To build this, we read in a text file backwards and create a tree structure that allows us to do a simple n-ary search (where n is the size of the alphabet) to find any desired string in the text. The interesting thing about this structure is that once you build the tree, you can find every instance of a search string in time proportional to the size of the string plus the number of times the string appears in the file, regardless of the size of the file itself; that is, if you were to search for the term “jabberwock” and it appeared three times in the text, you would have to do approximately 13 operations, no matter how long the text was.
How Google Works
Of course, indexing billions of documents, as Google does, is considerably more complicated than indexing a single file. Google’s search
engine data structures aren’t entirely public, of course; the company has released a considerable amount of information (and a few academic papers), but much of how the search works is considered proprietary information.
What we do know is that they use several different indexes to hold the information. While they have the entire internet (minus whatever’s been deindexed) backed up, they know that most people will only look at the first few results, which will largely come from a small subset of their data. As a result, when you do a search, they don’t actually return the entire 157,000 results, but only the first few dozen (which come from that subset); as you move deeper into the results, they bring up more as needed. A combination of fast hardware, advanced data structures, and only retrieving a few results at a time allows them to give an answer to your search in a mere fraction of a second.
About the Author & References
While working on his PhD in computer science at Colorado State University, William’s research interests included graph theory and algorithms. His advisor wrote the chapter on DAWGs in a major computer science textbook.
Image provided by Google - google.com
Information comes from the author’s own experiences.