Moving to the Cloud–Developing Apps in the New World of Cloud Computing by Dinkar Sitaram & Geetha Manjunath

Author: Dinkar Sitaram & Geetha Manjunath
Language: eng
Format: epub
ISBN: 978-1-59749-725-1
Publisher: Syngress
Published: 2012-03-25T16:00:00+00:00


• partition(key2): Determine the data partition and return the reducer number

The partition function is given the intermediate key along with the number of reducers, and returns the index of the reducer that should process records with that key.
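
As a concrete illustration, the sketch below shows how this hook appears in Hadoop's Java API, where it takes the form of the Partitioner class: getPartition() receives the intermediate key (and its value) together with the number of reducers and returns the reducer index. The hash-modulo strategy shown here is one common choice, and is in fact the same strategy used by Hadoop's default HashPartitioner; the class name is illustrative.

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Routes every record with the same key to the same reducer by
    // hashing the key and taking it modulo the number of reducers.
    public class HashModPartitioner extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text key, IntWritable value, int numReducers) {
            // Mask the sign bit so the resulting index is non-negative.
            return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
        }
    }

The job driver would register it with job.setPartitionerClass(HashModPartitioner.class).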

The combine and partition functions help optimize the execution of the parallel algorithm. The combine function reduces unnecessary communication between the Mapper and Reducer functions by performing local consolidation of co-located data with the same keys. The partition function can be used to partition the input data efficiently for subsequent parallel execution. Typically, different records from the data sources (different files, a set of lines from a given file, or rows of a database) are used as the basis for partitioning. More sophisticated techniques, such as horizontal partitioning in databases and data sharding, described in an earlier section, can also be implemented within this function. Sharding is most effective in a shared-nothing architecture such as that of MapReduce, and it can also use replication of shared data to achieve good performance.
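
As a minimal sketch of such local consolidation, assuming a word-count style job (the class name is illustrative), a combiner in Hadoop is simply a Reducer applied to map output on the map side:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Consolidates co-located map output locally: all (word, 1) pairs
    // produced by one map task collapse into a single (word, partialSum)
    // pair, so far less data crosses the network to the reducers.
    public class WordCountCombiner
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable sum = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values,
                              Context context)
                throws IOException, InterruptedException {
            int total = 0;
            for (IntWritable v : values) {
                total += v.get();
            }
            sum.set(total);
            context.write(key, sum);
        }
    }

It is registered with job.setCombinerClass(WordCountCombiner.class). Because Hadoop may invoke the combiner zero, one, or several times, the operation must be associative and commutative, as summation is here.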

Ideally, communication between the input data and the Mapper task can be minimized by running the Mapper logic at the data split itself, without moving the data. Whether this is possible depends on where the input data is stored and whether Mapper processes can execute on the same node. With HDFS and Cassandra, the Mapper task can run on the storage node itself, and the Job Tracker takes responsibility for co-locating each Mapper with the data split it processes, significantly reducing data movement. On the other hand, pure storage services such as Amazon S3 do not allow Mapper logic to execute on the storage node. When running Hadoop on Amazon, it is necessary to create a Hadoop cluster in EC2, copy the data from S3 to EC2 (a transfer that is free of charge), store intermediate results from MapReduce steps in HDFS on EC2, and write the final results back to S3 [37].
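
The sketch below illustrates this pattern in a job driver, assuming a two-step pipeline; the bucket and path names are hypothetical, and the mapper/reducer classes for each job are omitted for brevity. The first step reads its raw input from S3 but keeps the intermediate output in HDFS on the EC2 cluster, so the next step can exploit data locality; only the final step writes back to S3.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class PipelineDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // Step 1: raw input comes from S3; intermediate results stay
            // in HDFS on the EC2 cluster to preserve data locality.
            Job step1 = Job.getInstance(conf, "step1");
            FileInputFormat.addInputPath(step1,
                    new Path("s3n://my-bucket/input"));
            FileOutputFormat.setOutputPath(step1,
                    new Path("hdfs:///pipeline/intermediate"));
            step1.waitForCompletion(true);

            // Step 2: reads the HDFS intermediate data and writes the
            // final results back to S3.
            Job step2 = Job.getInstance(conf, "step2");
            FileInputFormat.addInputPath(step2,
                    new Path("hdfs:///pipeline/intermediate"));
            FileOutputFormat.setOutputPath(step2,
                    new Path("s3n://my-bucket/output"));
            step2.waitForCompletion(true);
        }
    }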

In general, the MapReduce APIs are very simple to use and allow the parallelism in the application to be specified within the design paradigm of a distributed merge-sort. As far as the application is concerned, the MapReduce platform (Hadoop, for example) is expected to take care of automatic parallelization, fault tolerance, load balancing, data distribution, and network performance when the application runs on a large network of clusters.


