Options
Chisel++: Handling partitioning skew in MapReduce framework using efficient range partitioning technique
Date Issued
01-01-2014
Author(s)
Abstract
Job completion in MapReduce framework depends upon the slowest running reduce task. Inordinate time gap among the completion points of reduce tasks delays a job significantly. Synchronization in reduce task completion not only completes a job faster but also increases resource utilization. Completion time of reduce tasks depends mainly upon the data load given to each of them. Partitioning of keys plays a major role in load balancing input data to the reduce tasks. Range partitioning using sampling technique has been proposed in the literature over hash based partitioning to provide better key distribution. But data sampling is also susceptible to data/partitioning skew since only a small portion of the actual data is scanned. It is very difficult to design a perfect load balancing partitioner without having prior information about the entire data. Chisel, allows data skew to occur during partitioning but handles it dynamically by creating multiple partial reducers and one global reducer for the skewed partition. But this technique suffers from the global reducer becoming a bottleneck when the amount of reducer output is huge. We propose an efficient range partitioning technique which eliminates the use of global reducer and yields better performance. We also discuss various heuristics to implement range partitioning. Chisel++ achieved 2 times improvement in terms of job completion time over Chisel in our experiments. Copyright 2014 ACM.