Abstract
The production environment for analytical data management applications is rapidly changing. Many enterprises are shifting away from deploying their analytical databases on high-end proprietary machines, and moving towards heaper, lower-end, commodity hardware, typically arranged in a shared-nothing MPP architecture, often in a virtualized environment inside public or private “clouds”. At t he same time, the amount of data that needs to be analyzed is exploding, requiring hundreds to thousands of machines to work in parallel to perform the analysis.A cloud OS is responsible for managing the cloud resources and its gives a high level interface to the application programmers in order to hide the infrastructure details. We describe Cloud MapReduce, an implementation of the MapReduce programming model on top of the Amazon cloud OS, which exploits the scalability offered by the cloud OS. Cloud MapReduce enjoys the inherit scalability and resiliency, which greatly simplies its architecture. Cloud MapReduce doesn’t need to design central coordinator components (like the NameNode and JobTracker in the Hadoop environment). They simply store the job progress status information in the distributed metadata store (SimpleDB). Cloud MapReduce doesn’t need to worry about scalability in the communication path and how data can be moved efficiently between nodes, all is taken care by the underlying CloudOS. Cloud MapReduce doesn’t need to worry about disk I/O issue because all storage is effectively remote and being taken care by the Cloud OS. First, it is faster than other implementations (e.g., 60 times faster than Hadoop in one case). Second, it is more scalable because it has no single point of bottleneck. Third, it is dramatically simpler with only 3,000 lines of co de (e.g., two orders of magnitude simpler than Hadoop). A cloud OS’ scalability comes at a price. To scale, the Amazon cloud OS not only relies on horizontal scaling, but it also adopts a weaker consistency model called eventual consistency. We describe how we overcome the limitations presented by horizontal scaling and the weaker consistency guarantee. We believe that building highly-scalable systems on top of a scalable cloud OS is a promising approach, and Cloud MapReduce is a concrete illustration.