MapReduce Reading

For my 8-hour trip to Kalamazoo last week, I printed a few Google white papers for some “light reading”. One of these was MapReduce: Simplified Data Processing on Large Clusters, which was recently updated. I read the original version last year and wanted to catch up. From the paper:

Programmers find the system easy to use: more than ten thousand distinct MapReduce programs have been implemented internally at Google over the past four years, and an average of one hundred thousand MapReduce jobs are executed on Google’s clusters every day, processing a total of more than twenty petabytes of data per day.

Meh. 20 petabytes? You should see the JAR files in *my* app, I tell ya…

Seriously, I did not read that paper and think, “hmm… that’s kind of like a database.” I cannot imagine anyone thinking that. Nor is MapReduce an index. You can use it to *create* an index, for example, but MapReduce does not compete with a database in any way. It is an entirely different tool, for an entirely different kind of problem. And yet we have the DeWitt/Stonebraker article, described next.
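To make the “create an index” point concrete, here is a toy sketch of building an inverted index in the MapReduce style. This is not Google’s API: the names `map_fn`, `reduce_fn`, and `run_mapreduce` are made up for illustration, and the “framework” is a single-process stand-in that just groups values by key. The real system spreads this work across a cluster.

```python
from collections import defaultdict


def map_fn(doc_id, text):
    """Map step (hypothetical name): emit one (word, doc_id) pair per word."""
    for word in text.lower().split():
        yield word, doc_id


def reduce_fn(word, doc_ids):
    """Reduce step (hypothetical name): merge doc ids into a sorted posting list."""
    return word, sorted(set(doc_ids))


def run_mapreduce(documents):
    """Single-process stand-in for the framework: map, group by key, reduce."""
    grouped = defaultdict(list)
    for doc_id, text in documents.items():
        for key, value in map_fn(doc_id, text):
            grouped[key].append(value)
    return dict(reduce_fn(key, values) for key, values in grouped.items())


# Two tiny made-up documents, purely for illustration.
docs = {
    "d1": "mapreduce builds an index",
    "d2": "a database uses an index",
}
index = run_mapreduce(docs)
print(index["index"])  # ['d1', 'd2']
```

Notice what comes out: a one-shot batch job that produces a data structure. Nothing here stores data, answers queries, or handles updates, which is the work a database actually does.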

More Interesting Reading

The punchline is that upon my triumphant return to O’Fallon, MO, I discover everyone’s blogging about MapReduce. OK, that’s just a weird coincidence. I think this unbelievably inaccurate article written by David J. DeWitt and Michael Stonebraker (everybody’s saying they are database experts…) sparked a lot of the debate. It’s generally not cool to link to really awful material, but this one is worth it for the sheer entertainment factor. I recommend you read in this order:

The most telling part of all this is the fact that the original authors do not participate in the comments, at all. They are being ripped to pieces — FOR GOOD REASON — and they say nothing.

David J. DeWitt and Michael Stonebraker should retract their highly inaccurate article.


4 Responses to “MapReduce Reading”

Jacques Says:

The so-called “database experts” (who is not an expert today? to make a living in IT you need to be an expert on something…) missed their target, I guess.

If they wanted to attract the attention of the blog community, then their target should have been Bigtable, for example, a distributed storage system that somewhat resembles (but is not!) a database system. But fewer people have heard of Bigtable, so they chose the most prominent piece of Google’s internal infrastructure. What a pity… ;-)

Alex Miller Says:

Our CTO had a few things to say on this too at MapReduce vs. the RDBMS.

Matt Taylor Says:

I’m glad I opened the PDF, because I was thinking MapReduce had something to do with geographical mapping. :)

Don Richter Says:

David J. DeWitt and Michael Stonebraker have posted a response to the critiques of their original article.
