Why MapReduce is successful: it’s the IO!

I’ve done quite a bit of work with Hadoop and MapReduce in the past year or so.  

MapReduce offers an attractive, simple computational model that scales with the data. From the Dean and Ghemawat paper:

Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system.
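
To make the model concrete, the canonical example from the paper is word count. Here is a minimal version as a pair of Hadoop Streaming scripts in Python (a sketch; the paper gives the same computation in C++-style pseudocode). The framework splits the input, runs the mapper over each split, sorts the map output by key, and hands it to the reducer:

    #!/usr/bin/env python
    # mapper.py -- emit a (word, 1) pair for every word on stdin.
    import sys

    for line in sys.stdin:
        for word in line.split():
            print("%s\t%d" % (word, 1))

    #!/usr/bin/env python
    # reducer.py -- streaming delivers the map output sorted by key, so all
    # the counts for a given word arrive on consecutive lines.
    import sys

    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                print("%s\t%d" % (current, total))
            current, total = word, 0
        total += int(count)
    if current is not None:
        print("%s\t%d" % (current, total))

You launch this through the hadoop-streaming jar, pointing -mapper and -reducer at the two scripts; splitting, scheduling, and fault handling are entirely the framework's problem.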

The idea espoused here is that the simplicity of the programming model makes it easy for people to get up and running with MapReduce and do parallel data processing. I have no doubt that this is true for what amounts to parallel log processing to build simple subsets. However, MapReduce should also be able to solve more complicated problems, such as Google's PageRank computation over the web-graph. When I've tried MapReduce for these tasks, or for other things like computing TSQR factorizations, or even just data conversions, I find that it's really slow. We are talking about hours to process a few terabytes of data on tens of machines.
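
To see where the hours go on something like PageRank, here is a sketch of one power-iteration step in the same streaming style (it ignores dangling nodes and normalization, and assumes each input line holds a node, its rank, and its comma-separated out-links). Every iteration is a separate MapReduce job, so the entire on-disk graph is re-read, shuffled, and re-written once per iteration:

    #!/usr/bin/env python
    # pagerank_map.py -- assumed line format: node<TAB>rank<TAB>out1,out2,...
    import sys

    for line in sys.stdin:
        node, rank, neighbors = line.rstrip("\n").split("\t")
        out = neighbors.split(",") if neighbors else []
        print("%s\tG\t%s" % (node, neighbors))    # pass the graph structure through
        for dst in out:                           # scatter rank mass to the out-links
            print("%s\tR\t%.10e" % (dst, float(rank) / len(out)))

    #!/usr/bin/env python
    # pagerank_reduce.py -- sum the incoming mass for each node, apply the
    # teleportation term, and re-emit node lines for the next iteration.
    import sys

    ALPHA = 0.85  # damping factor

    def flush(node, links, mass):
        if node is not None:
            print("%s\t%.10e\t%s" % (node, (1 - ALPHA) + ALPHA * mass, links))

    node, links, mass = None, "", 0.0
    for line in sys.stdin:
        key, tag, value = line.rstrip("\n").split("\t")
        if key != node:
            flush(node, links, mass)
            node, links, mass = key, "", 0.0
        if tag == "G":
            links = value
        else:
            mass += float(value)
    flush(node, links, mass)

Ten iterations means ten full passes over the data plus ten sort-and-shuffle phases, which is exactly the kind of workload that turns into hours on tens of machines.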

I've also read and reviewed many academic papers that basically hack around MapReduce's limited communication and computational framework in order to do something faster. So this makes me think that MapReduce's model is far too simple for most tasks. Given the lengths that people will go to (myself included!) in order to implement algorithms in MapReduce, what problem does it really solve?

I claim it’s the distributed IO problem.

Let me back up and explain why this is a problem. My educational training was all in standard HPC frameworks, which are almost exclusively driven by MPI. In 2004, I wrote a distributed MPI-based computation of Google's PageRank vector on an MPI cluster that Yahoo! Research Labs had just purchased. Conceptually, this was actually pretty easy. I used the PETSc package from Argonne National Labs to do all the parallel communication; all I had to do was load the graph into their internal storage format.

This single task, loading the data, took about 1000 lines of C code, and that was just for reading from one machine and distributing to the entire cluster; it did not include any parallel IO! In MapReduce, this same task is zero lines of code: for a variety of file formats, it is already done for you.
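
For a sense of what that loading code involves, here is a toy version in Python with mpi4py (the real thing was C code assembling PETSc's distributed sparse-matrix structures, which is where most of the 1000 lines went): rank 0 reads an edge list and deals the edges out to every other process.

    #!/usr/bin/env python
    # load_graph.py -- toy "read on one machine, distribute to the cluster".
    # Assumes an edge-list file graph.txt with one "src dst" pair per line.
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    if rank == 0:
        # One process reads everything and partitions the edges by owner.
        buckets = [[] for _ in range(size)]
        with open("graph.txt") as f:
            for line in f:
                src, dst = map(int, line.split())
                buckets[src % size].append((src, dst))  # owner = src mod size
    else:
        buckets = None

    # Each process receives its slice of the graph.
    my_edges = comm.scatter(buckets, root=0)
    print("rank %d owns %d edges" % (rank, len(my_edges)))

Even this toy has to decide who owns which edges; the real version also had to translate them into PETSc's matrix format.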

So my view on MapReduce is as follows:

The key success was using the map-stage to implement the parallel IO routine.  The whole shuffle and reduce business ought to be a pluggable component.  I would love to be able to “map” data from disk into an in-memory MPI job in order to blast through the problem on a small number of nodes.

I'd love to see this paradigm developed into a hybrid MapReduce/MPI system; I think it could enable some really interesting work.
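
To be concrete, the interface I am wishing for would look something like the sketch below. Everything in it is hypothetical: mapmpi is a made-up name for the glue layer, not a real package. The point is that the map stage stays on as the parallel IO layer while the shuffle and reduce are replaced by an in-memory MPI program.

    # Hypothetical sketch only -- "mapmpi" is a made-up name, not a real package.
    # The map stage does the parallel IO off the distributed filesystem; instead
    # of a shuffle, the parsed records stream into the memory of a running MPI job.

    def parse_edge(line):
        # the "map": pure parallel parsing and IO, with no shuffle afterwards
        src, dst = map(int, line.split())
        return (src, dst)

    # job = mapmpi.run(
    #     input="hdfs:///webgraph/part-*",   # hypothetical input path
    #     mapper=parse_edge,                 # runs where the data lives
    #     mpi_program="./pagerank_mpi",      # receives the mapped records in memory
    #     nodes=16)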



8 Responses to Why MapReduce is successful: it’s the IO!

  1. Hmmm. Hybrid MapReduce/MPI? Sounds like an idea for a proposal. Or a startup.

  2. Indra says:

    Hi David! Sorry for the unrelated comment. Recently I stumbled upon your PhD thesis while browsing the web. I think it has very beautiful typesetting. I wonder how you did that (what fonts, packages, etc.). Do you mind sharing the LaTeX template? I'm now in the middle of writing my thesis and I want to make it look as beautiful as yours. Thanks.

  3. Indra says:

    Great! Thanks for your response. I can’t wait to read that article.

  4. aaked says:

    Thanks for sharing!

  5. Lasering says:

    I don't really know if you can get away with this, but you could just do an exec in the map; this exec would launch your MPI job. As for the shuffle and reduce being pluggable, you can simply set the number of reducers to 0.

    • dgleich says:

      True, you could do that. But it’d be a big hack that no one else could really reproduce; you’d have scheduling nightmares as not all the map tasks get started at once … etc. I was hoping for something easier to use.

  6. Diego says:

    Have you checked out the Spark project? http://spark.incubator.apache.org/
