How can Google let such obvious crap rank?
I’ve had those moments.
I’ve read even more forum rants about competitors ranking using tactics that go against Google guidelines, tactics that Google is supposed to detect now.
I’ve even seen a network of thousands of auto-generated spam sites on new domains, interlinking on totally unrelated topics, simply dominate, as in 90-100 percent of the top 20 spots for thousands of long-tail queries.
Why does this work?
Google says they don’t let new domains rank.
Google says they can spot link networks.
Google says large spikes in links get filtered out.
Google says links between unrelated sites get filtered out.
So has Matt Cutts been lying to you all along?
Before running off to the forums to praise directory links and proclaim yourself an SEO genius because you found a site ranking only with those links, you need to understand that there are two Googles.
Smart Google and Dumb Google.
And the search results are a combination of the two.
To understand why, you need a basic understanding of the technologies that underpin Google Search.
Taking the time to understand the underpinnings of search will help you explain to clients why they are being outranked by spam, or help you avoid shady agencies whose tactics will only benefit your sites in the short term.
Batch Processing with MapReduce
The first really big challenge of building a search engine is how to process data so massive that even the best single machine can’t come close to handling the load.
Easy, split it up.
That is basically what MapReduce is all about. It is a system for splitting up massive computational tasks into small portions, balancing the tasks across many machines and recovering from machine failure.
MapReduce works in two steps: 1. turning each input record into key/value pairs (Map) and then 2. merging all of the values that share a key (Reduce), with the framework sorting and grouping the pairs by key in between. Google abstracted a lot of complex web-scale processing into a series of MapReduces.
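As a rough illustration of that model, here is a toy, single-machine sketch in Python. It is not Google's implementation: the `map_reduce` helper, the `pages` data and the `.example` URLs are all made up for this example. It counts inbound links per page by emitting a pair for every link and then merging the pairs that share a key.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Imitate one MapReduce pass on a single machine (illustrative only)."""
    groups = defaultdict(list)
    # Map phase: each input record emits zero or more (key, value) pairs.
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)  # group the emitted values by key
    # Reduce phase: merge all values that share a key into one result.
    return {key: reducer(key, values) for key, values in sorted(groups.items())}


# Hypothetical toy crawl: (page URL, list of URLs it links to).
pages = [
    ("a.example", ["b.example", "c.example"]),
    ("b.example", ["c.example"]),
    ("c.example", ["a.example"]),
]


def emit_links(page):
    _source, targets = page
    for target in targets:
        yield target, 1  # one inbound link for the target page


def count_links(_target, ones):
    return sum(ones)


inbound_counts = map_reduce(pages, emit_links, count_links)
print(inbound_counts)
```

In a real deployment the map and reduce calls would be spread across many machines, which is where the load balancing and failure recovery come in.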
MapReduce is optimised for processing data on the scale of the entire web all at one time.
And therein lies the problem.
It is great when you need to process massive amounts of data, but it is not suitable for processing just a chunk of that data.
The problem, as described by Google engineers Daniel Peng and Frank Dabek in their paper on Percolator, titled Large-scale Incremental Processing Using Distributed Transactions and Notifications, is that “MapReduce and other batch-processing systems cannot process small updates individually as they rely on creating large batches for efficiency.”
Peng and Dabek go on to say:
“It’s not sufficient to run the MapReduces over just the new pages since, for example, there are links between the new pages and the rest of the web. The MapReduces must be run again over the entire repository, that is, over both the new pages and the old pages.”
Up until 2010, if a web-scale computation could be split into a MapReduce then that’s how Google split it up.
Search Index? MapReduce.
Link Graph? MapReduce.
Toolbar PR? MapReduce.
And we’re not talking about a single MapReduce to generate the index but rather a chain of MapReduces processing the entire web with each step needing to be completed on all of the distributed machines before the next MapReduce can begin.
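To make the "chain" idea concrete, here is a small hypothetical sketch in Python. The stage names (`parse`, `tokenize`, `measure`) and the sample documents are invented for illustration and have nothing to do with Google's actual 100 stages. The point is only that each stage runs over the whole data set and finishes before the next one starts.

```python
def run_chain(stages, documents):
    """Run batch stages in order; each stage sees the entire previous output."""
    data = list(documents)
    completed = []
    for name, stage in stages:
        # Barrier: the full output list is built before the next stage begins.
        data = [stage(item) for item in data]
        completed.append((name, len(data)))
    return data, completed


def parse(doc):
    url, html = doc
    return {"url": url, "text": html.lower()}


def tokenize(doc):
    return {**doc, "tokens": doc["text"].split()}


def measure(doc):
    return {**doc, "length": len(doc["tokens"])}


# Hypothetical two-document crawl.
crawl = [
    ("a.example", "Cheap Widgets"),
    ("b.example", "Widget Reviews And Guides"),
]
stages = [("parse", parse), ("tokenize", tokenize), ("measure", measure)]
indexed, completed = run_chain(stages, crawl)
print(completed)
```

No document is finished until the last stage has run over every document, so one slow stage delays everything behind it.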
From Peng and Dabek:
“Each day we crawled several billion documents and fed them along with a repository of existing documents through a series of 100 MapReduces.”
All of these MapReduce computations had to compete for the same processing resources so, if a new index needed to be generated, then all of the less important tasks like toolbar PR were put on hold.
Google couldn’t just edit the index because every page it adds changes the link-graph.
So they had to reprocess the entire web every time they wanted to update the index. And that required a lot of processing resources.
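A minimal sketch of why that is expensive, assuming a made-up repository of three pages (the `build_link_index` helper and the URLs are hypothetical): adding one new page forces the batch job to scan every page again, because the new page's links change the inbound links of old pages.

```python
def build_link_index(repository):
    """Batch build: scan every page to compute inbound links for all pages."""
    inbound = {url: set() for url in repository}
    pages_scanned = 0
    for source, targets in repository.items():
        pages_scanned += 1
        for target in targets:
            inbound.setdefault(target, set()).add(source)
    return inbound, pages_scanned


# Hypothetical repository: page URL -> URLs it links to.
repository = {
    "old-1.example": ["old-2.example"],
    "old-2.example": ["old-1.example"],
}
index, scanned = build_link_index(repository)

# One new page arrives; the batch approach rebuilds from the full repository.
repository["new.example"] = ["old-1.example"]
index, scanned_again = build_link_index(repository)
print(scanned, scanned_again)
```

Here the cost of the second run is three pages for one new document. At web scale the same ratio is billions of pages per update.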
Google has moved on from MapReduce for their web index, but that doesn’t mean that it is not around anymore. It is still the best solution for processing web-scale data all at once.
Incremental Processing with Percolator
In 2010, Google introduced Caffeine, which was underpinned by a replacement for MapReduce called Percolator.
Percolator let Google replace the large batch reprocessing of MapReduce with a more firehose-like system that lets them make incremental changes.
“We built Percolator to create Google’s large “base” index, a task previously performed by MapReduce. In our previous system, each day we crawled several billion documents and fed them along with a repository of existing documents through a series of 100 MapReduces. The result was an index which answered user queries. Though not all 100 MapReduces were on the critical path for every document, the organization of the system as a series of MapReduces meant that each document spent 2-3 days being indexed before it could be returned as a search result.”
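For contrast with batch rebuilding, here is a toy sketch of an incremental update in Python. It is not Percolator's API (the paper describes distributed transactions and observers on top of Bigtable); the `IncrementalLinkIndex` class and its `upsert` method are hypothetical names that only show the idea of touching just the entries a changed page affects.

```python
class IncrementalLinkIndex:
    """Toy link index that updates only the entries a changed page touches."""

    def __init__(self):
        self.outbound = {}  # url -> set of URLs it links to
        self.inbound = {}   # url -> set of URLs linking to it
        self.entries_touched = 0

    def upsert(self, url, targets):
        old = self.outbound.get(url, set())
        new = set(targets)
        for target in old - new:  # links that disappeared from the page
            self.inbound[target].discard(url)
            self.entries_touched += 1
        for target in new - old:  # links that are new on the page
            self.inbound.setdefault(target, set()).add(url)
            self.entries_touched += 1
        self.outbound[url] = new
        self.inbound.setdefault(url, set())


live_index = IncrementalLinkIndex()
live_index.upsert("old-1.example", ["old-2.example"])
live_index.upsert("old-2.example", ["old-1.example"])

live_index.entries_touched = 0  # start counting from the new page only
live_index.upsert("new.example", ["old-1.example"])
print(live_index.entries_touched)
```

Adding the new page touches one entry instead of rescanning the repository, which is the trade that makes near-immediate indexing possible.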
Percolator was designed specifically for Google’s core index.
Unlike the MapReduce paper, which mentions multiple applications (MapReduce also underpins data processing at Facebook, among other large sites), the Percolator paper uses indexing examples exclusively.
It is safe to assume that the link graph is part of the core index: in the Percolator paper, Peng and Dabek cite the fact that “each link is also inverted so that the anchor text from each outgoing link is attached to the page the link points to” as an obstacle to quickly indexing new content with MapReduce.
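Link inversion is easy to picture with a few lines of Python. This is a sketch under assumed data, with a hypothetical `invert_anchor_text` helper and invented URLs and anchor texts: each outgoing link is flipped so that its anchor text ends up stored against the page it points to.

```python
from collections import defaultdict


def invert_anchor_text(pages):
    """Attach each link's anchor text to the page the link points to."""
    anchors = defaultdict(list)
    for source, links in pages.items():
        for target, text in links:
            anchors[target].append((source, text))
    return dict(anchors)


# Hypothetical crawl: page URL -> list of (target URL, anchor text).
crawled = {
    "blog.example": [("shop.example", "cheap widgets")],
    "forum.example": [
        ("shop.example", "widget store"),
        ("blog.example", "widget blog"),
    ],
}
anchor_index = invert_anchor_text(crawled)
print(anchor_index["shop.example"])
```

Every new page can add entries for pages that were indexed long ago, which is why new content cannot simply be processed in isolation.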
What This Means for Smartass SEOs
Thanks for the history lesson, Gramps. Now tell me why I care.
The Caffeine update switched indexing to Percolator. A whole host of other processes were no doubt left to be run on the old systems.
Every month or so since Panda launched, there has been an update. Sites drop or recover.
Pretty much any time you see periodic big changes in Google, you can assume that whatever caused the change is underpinned by MapReduce, and you would probably be right most of the time.
Google representatives also strongly hinted that Panda is underpinned by MapReduce or a MapReduce-like process in the Inside Search blog’s 50 Changes for March post:
“Like many of the changes we make, aspects of our high-quality sites algorithm depend on processing that’s done offline and pushed on a periodic cycle.”
Sounds an awful lot like MapReduce but it doesn’t matter if it is or not.
We just need to establish that some processes are run offline and pushed periodically while indexing is incremental and immediate.
The difference between the two is the difference between Dumb Google that ranks new documents quickly and Smart Google that critically evaluates ranking signals to see if they should actually count.
Google admits in the Percolator paper that Percolator is not always the answer for web-scale processing:
“Computations where the result can’t be broken down into small updates (sorting a file, for example) are better handled by MapReduce.”
So when a new site is ranking using a spammy SEO tactic, it most likely means that the process to devalue that tactic runs offline on MapReduce and it will continue to rank until the next time Google can spare the resources to run the process.
There are two Googles.
Dumb Google that ranks pages based on links and a select few other signals, but does so very quickly.
And Smart Google that critically evaluates ranking signals to see if they should actually count, but at the cost of only evaluating each signal periodically.
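The two-speed behaviour described above can be modelled in a few lines of Python. This is purely an illustration of the argument, not a description of how Google ranks anything: the `TwoSpeedRanker` class, the `looks_like_spam` rule and the domain names are all hypothetical.

```python
class TwoSpeedRanker:
    """Toy model: links count immediately, spam is only filtered periodically."""

    def __init__(self):
        self.links = []                # (source, target) pairs seen so far
        self.devalued_sources = set()  # filled only by the offline pass

    def index_link(self, source, target):
        """Fast path ("Dumb Google"): a link counts as soon as it is indexed."""
        self.links.append((source, target))

    def score(self, target):
        return sum(
            1
            for source, linked in self.links
            if linked == target and source not in self.devalued_sources
        )

    def run_offline_pass(self, is_spam):
        """Slow path ("Smart Google"): re-evaluate every link source in a batch."""
        self.devalued_sources = {source for source, _ in self.links if is_spam(source)}


def looks_like_spam(source):
    # Hypothetical rule for the example: the spam network uses a "spam-" prefix.
    return source.startswith("spam-")


ranker = TwoSpeedRanker()
for n in range(3):
    ranker.index_link(f"spam-{n}.example", "thin-site.example")
ranker.index_link("news.example", "good-site.example")

score_before = ranker.score("thin-site.example")  # spam links still count
ranker.run_offline_pass(looks_like_spam)
score_after = ranker.score("thin-site.example")   # devalued after the batch run
print(score_before, score_after)
```

Between offline passes, the spam-backed site outscores the legitimate one. That window is what forum posters see when they conclude the tactic "works".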
Image Credit: By Tkgd2007 (Own work) [CC-BY-SA-3.0 (http://creativecommons.org/licenses/by-sa/3.0) or GFDL (http://www.gnu.org/copyleft/fdl.html)], via Wikimedia Commons