Publications
Detailed Information
Multi-Dimensional Range Partitioning for Parallel Joins in MapReduce : 맵리듀스에서의 병렬 조인을 위한 다차원 범위 분할 기법
Cited 0 time in
Web of Science
Cited 0 time in Scopus
- Authors
- Advisor
- 이상구
- Major
- 공과대학 전기·컴퓨터공학부
- Issue Date
- 2014-08
- Publisher
- 서울대학교 대학원
- Keywords
- Parallel Join ; Data Skew ; MapReduce ; Multi-Dimensional Range Partitioning
- Description
- 학위논문 (박사)-- 서울대학교 대학원 : 전기·컴퓨터공학부, 2014. 8. 이상구.
- Abstract
- Joins are fundamental operations for many data analysis tasks, but are not directly supported by the MapReduce framework. This is because 1) the framework is basically designed to process a single input data set, and 2) MapReduce's key-equality based data grouping method makes it difficult to support complex join conditions. As a result, a large number of MapReduce-based join algorithms have been proposed.
As in traditional shared-nothing systems, one of the major issues in join algorithms using MapReduce is handling of data skew. We propose a new skew handling method, called Multi-Dimensional Range Partitioning (MDRP), and show that the proposed method outperforms traditional skew handling methods: range-based and randomized methods. Specifically, the proposed method has the following advantages: 1) Compared to the range-based method, it considers the number of output tuples at each machine, which leads better handling of join product skew. 2) Compared with the randomized method, it exploits given join conditions before the actual join begins, so that unnecessary input duplication can be reduced.
The MDRP method can be used to support advanced join operations such as theta-joins and multi-way joins. With extensive experiments using real and synthetic data sets, we evaluate the effectiveness of the proposed algorithm.
- Language
- English
- Files in This Item:
Item View & Download Count
Items in S-Space are protected by copyright, with all rights reserved, unless otherwise indicated.