This post is pure speculation, but these are notes on something I’ve been thinking about for a while.
Here’s the computer I want to get for “medium data” matrix computations, on the order of 50TB.
- Quad 8-core Intel E5-48XX processors + motherboard ($10k)
- 1TB memory. ($30k)
- 2x Mushkin Scorpion Deluxe 2TB PCIe Flash Drives ($5k)
- 45x 4TB Hard Drives + SATA expanders ($10k); 80TB after RAID 10.
- 2x Nvidia GPUs or Xeon Phis for extra horsepower ($5k), since the CPUs are a little light.
In total, that’d be about $60k. The machine would be pretty hefty at processing data: it’d run circles around a 30-node Hadoop cluster (8TB per node on 4 disks = 80TB after 3x replication) for any non-trivial task. The IO bandwidth from the flash cards is about 2GB/second each (gigabytes, not gigabits). From the disk array you should be able to get a little less in aggregate; each drive puts out about 100MB/sec.
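A quick back-of-envelope check of those throughput numbers (all figures are the rough ones quoted above, not measurements):

```python
# Rough throughput estimates for the build above, using the
# approximate per-device numbers from the text.

flash_bw = 2 * 2.0      # 2 flash cards x ~2 GB/s each -> GB/s aggregate
disk_bw = 45 * 0.1      # 45 drives x ~100 MB/s each   -> GB/s aggregate
data_tb = 50            # working set in TB

# Time for one full sequential pass over 50 TB from the disk array.
scan_hours = (data_tb * 1000) / disk_bw / 3600
print(f"disk array: ~{disk_bw:.1f} GB/s, 50 TB scan in ~{scan_hours:.1f} h")
print(f"flash: ~{flash_bw:.1f} GB/s aggregate")
```

So even from spinning disks, a full sequential pass over the whole 50TB working set takes roughly three hours.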
I’m not sure how to configure the RAID array. RAID-5 and RAID-6 are frowned upon for arrays with large drives because the rebuild time is too long, so this would be RAID-10, I guess. This config is the part I’m least sure about.
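To make the RAID-10 trade-off concrete, here’s a sketch of the capacity and rebuild arithmetic. The 40-drive figure is my assumption (RAID-10 wants pairs, so 40 of the 45 drives in the array with the rest as hot spares), and the ~100MB/s rebuild rate is the per-drive number from above; real rebuild rates vary with load:

```python
drives = 40          # ASSUMPTION: 40 of the 45 drives paired up, rest as spares
drive_tb = 4
mirror_factor = 2    # RAID-10 mirrors every drive

usable_tb = drives * drive_tb / mirror_factor    # -> 80 TB, matching the list

# A RAID-10 rebuild copies a single mirror, so at ~100 MB/s:
rebuild_hours = (drive_tb * 1e6) / 100 / 3600    # MB to copy / (MB/s) / 3600
print(f"usable: {usable_tb:.0f} TB, rebuild: ~{rebuild_hours:.0f} h")
```

The point is that a RAID-10 rebuild reads one 4TB mirror (roughly half a day), whereas a RAID-5+ rebuild has to read the entire array to reconstruct parity.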
Why not Hadoop? Hadoop is great for 1PB+ ETL tasks, parallel grep!, or anything else that looks like a “read an enormous file and output small data” task. It’d also be a good way to pipe data into this mini computer, where you could work on 50TB chunks and do something real with each of them.
That said, I still think MapReduce would be a great way to program that single machine. Something like Phoenix++ would make it pretty easy to take advantage of all that IO power and optimize it across all the cores and NUMA regions.
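Phoenix++ itself is a C++ runtime, but the programming model is easy to sketch in Python: a word-count-style job where the map phase fans out across cores and a reduce phase merges the partial results (here `multiprocessing.Pool` stands in for the shared-memory runtime; on the real machine you’d also want workers pinned to NUMA regions):

```python
from collections import Counter
from multiprocessing import Pool

def map_chunk(chunk):
    # Map phase: each worker emits partial (word, count) pairs for its chunk.
    return Counter(chunk.split())

def mapreduce_wordcount(chunks, workers=4):
    with Pool(workers) as pool:
        partials = pool.map(map_chunk, chunks)   # map across cores
    total = Counter()
    for partial in partials:                     # reduce: merge partial counts
        total.update(partial)
    return total

if __name__ == "__main__":
    chunks = ["a b a", "b c", "a"]
    print(mapreduce_wordcount(chunks))  # Counter({'a': 3, 'b': 2, 'c': 1})
```

In the real version, each “chunk” would be a multi-gigabyte slice streamed off the flash cards or the disk array rather than a string in memory.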
Why this system? Jim Demmel is right that communication is the dominant bottleneck of modern systems. This computer is designed to optimize the IO pathway to get as much data from ~50TB of secondary storage to main memory as quickly as possible. What to do once it’s there is up to you… You even have a small (4TB at ~4GB/sec) flash write cache for intermediate results.