I’ve done quite a bit of work with Hadoop and MapReduce in the past year or so.
MapReduce offers an attractive, easy computational model that is data scalable. From the Dean and Ghemawat paper:
Programs written in this functional style are automati-
cally parallelized and executed on a large cluster of com-
modity machines. The run-time system takes care of the
details of partitioning the input data, scheduling the pro-
gram’s execution across a set of machines, handling ma-
chine failures, and managing the required inter-machine
communication. This allows programmers without any
experience with parallel and distributed systems to eas-
ily utilize the resources of a large distributed system.
The idea espoused here is that the simplicity of the programming model makes it easy for people to get-up-and-running with MapReduce and do parallel data processing. I have no doubt that this is true for what amounts to parallel log processing to build simple subsets. However, MapReduce should also be able to solve more complicated problems such as Google’s PageRank computation over the web-graph. When I’ve tried MapReduce for these tasks — or other things like computing TSQR factorizations, or even just data conversions — I find that it’s really slow. We are talking about hours to process a few terabytes of data on 10s of machines.
I’ve also read and reviewed many academic papers that basically hack around MapReduce’s limited communication and computational framework in order to do something faster. So this makes me think that MapReduce’s model is far to simple for most tasks. Given the lengths that people will go to (myself included!) in order to implement algorithms in MapReduce, what problem does it really solve?
I claim it’s the distributed IO problem.
Let me back up and explain why this is a problem. My educational training was all in standard HPC frameworks. These are almost exclusively driven by MPI. In 2004, I wrote a distributed MPI-based computation for Google’s PageRank vector on an MPI cluster that Yahoo! Research Labs had just purchased. Conceptually, this was actually pretty easy. I used the PETSc package from Argonne National Labs to do all the parallel communication; all I had to do was to load the graph into their internal storage format.
This single task, loading the data, took about 1000 lines of C code, and this was for reading from one machine and distributing to the entire cluster. It did not include any parallel IO! In MapReduce, this same task is zero lines of code. For a variety of file formats, this task is already done for you.
So my view on MapReduce is as follows:
The key success was using the map-stage to implement the parallel IO routine. The whole shuffle and reduce business ought to be a pluggable component. I would love to be able to “map” data from disk into an in-memory MPI job in order to blast through the problem on a small number of nodes.
I’d love to see this paradigm developed into a hybrid MapReduce/MPI hybrid system that I think could enable some really interesting work.