Detailed Information

Multi-Dimensional Range Partitioning for Parallel Joins in MapReduce : 맵리듀스에서의 병렬 조인을 위한 다차원 범위 분할 기법

DC Field Value Language
dc.contributor.advisor: 이상구
dc.contributor.author: 명재석
dc.date.accessioned: 2017-07-13T07:05:54Z
dc.date.available: 2017-07-13T07:05:54Z
dc.date.issued: 2014-08
dc.identifier.other: 000000021558
dc.identifier.uri: https://hdl.handle.net/10371/119034
dc.description: Thesis (Ph.D.) -- Graduate School, Seoul National University: Department of Electrical and Computer Engineering, 2014. 8. 이상구.
dc.description.abstract: Joins are fundamental operations for many data analysis tasks, but are not directly supported by the MapReduce framework. This is because 1) the framework is designed to process a single input data set, and 2) MapReduce's key-equality-based data grouping method makes it difficult to support complex join conditions. As a result, a large number of MapReduce-based join algorithms have been proposed.

As in traditional shared-nothing systems, one of the major issues in join algorithms using MapReduce is the handling of data skew. We propose a new skew handling method, called Multi-Dimensional Range Partitioning (MDRP), and show that the proposed method outperforms traditional skew handling methods: range-based and randomized methods. Specifically, the proposed method has the following advantages: 1) Compared with the range-based method, it considers the number of output tuples at each machine, which leads to better handling of join product skew. 2) Compared with the randomized method, it exploits the given join conditions before the actual join begins, so that unnecessary input duplication can be reduced.

The MDRP method can be used to support advanced join operations such as theta-joins and multi-way joins. With extensive experiments using real and synthetic data sets, we evaluate the effectiveness of the proposed algorithm.
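
The join product skew mentioned in the abstract can be illustrated with a small sketch. This is not the thesis's MDRP algorithm; it only shows, under assumed toy data, why balancing input tuples is not enough for an equi-join: a key that is frequent in both relations produces a number of output tuples equal to the product of its frequencies. The function name `reducer_loads`, the integer keys, and the modulo partitioner (a stand-in for a hash partitioner) are all assumptions made for illustration.

```python
from collections import Counter

def reducer_loads(r_keys, s_keys, num_reducers):
    """Return per-reducer (input_tuples, output_tuples) for an equi-join
    when both inputs are partitioned by key (hypothetical helper)."""
    r_count, s_count = Counter(r_keys), Counter(s_keys)
    loads = [[0, 0] for _ in range(num_reducers)]
    for key in set(r_count) | set(s_count):
        # Assumption: integer keys, modulo as a stand-in for a hash partitioner.
        reducer = key % num_reducers
        # Input load: every tuple with this key is sent to the same reducer.
        loads[reducer][0] += r_count[key] + s_count[key]
        # Output load: an equi-join emits one tuple per matching pair.
        loads[reducer][1] += r_count[key] * s_count[key]
    return [tuple(load) for load in loads]

# Made-up toy data: key 0 is heavy in both relations.
r_keys = [0] * 6 + [1, 2, 3]
s_keys = [0] * 5 + [1, 2, 3]
loads = reducer_loads(r_keys, s_keys, num_reducers=4)
for i, (inp, out) in enumerate(loads):
    print(f"reducer {i}: input={inp} output={out}")
```

In this toy case the reducer that receives the heavy key handles 11 input tuples but 30 output tuples, while each other reducer handles 2 input tuples and 1 output tuple, so the imbalance in output is much larger than the imbalance in input.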
dc.description.tableofcontents:
Abstract
I. Introduction
1.1 Motivation
1.2 Contribution
1.3 Outline
II. Backgrounds and Related Work
2.1 MapReduce
2.2 Join Algorithms in MapReduce
2.2.1 Two-Way Join Algorithms
2.2.2 Multi-Way Join Algorithms
2.3 Data Skew in Join Algorithms
2.4 Skew Handling Approaches in MapReduce
2.4.1 Hash-Based Approach
2.4.2 Range-Based Approach
2.4.3 Randomized Approach
III. Our Approach
3.1 Multi-Dimensional Range Partitioning
3.1.1 Creation of a Partitioning Matrix
3.1.2 Identifying and Chopping of Heavy Cells
3.1.3 Assigning Cells to Reducers
3.1.4 Join Processing using the Partitioning Matrix
3.2 Theoretical Analysis
3.3 Complex Join Conditions
3.4 Experiments
3.4.1 Scalar Skew Experiments
3.4.2 Zipf's Distribution
3.4.3 Non-Equijoin Experiments
3.4.4 Scalability Experiments
3.5 Discussion
3.5.1 Sampling
3.5.2 Memory-Awareness
3.5.3 Handling of Heavy Cells
3.5.4 Existing Histograms
3.6 Summary
IV. Extensions
4.1 Joining Multiple Relations in a MapReduce Job
4.1.1 Example: SPARQL Basic Graph Pattern
4.1.2 Example: Matrix Chain Multiplication
4.1.3 Single-Key Join and Multiple-Key Join Queries
4.2 Skew Handling for Multi-Way Joins
4.2.1 Skew Handling for SK-Join Queries
4.2.2 Skew Handling for MK-Join Queries
4.3 Combinations of SK-Join and MK-Join
4.3.1 Complex Queries
4.3.2 Iteration-Based Algorithms
4.3.3 Replication-Based Algorithms
4.3.4 Iteration-Based vs. Replication-Based
4.4 Join-Key Selection Algorithms for Complex Queries
4.4.1 Greedy Key Selection
4.4.2 Multiple Key Selection
4.4.3 Hybrid Key Selection
4.5 Experiments
4.5.1 SK-Join Experiments
4.5.2 MK-Join Experiments
4.5.3 Analysis of TV Watching Logs
4.6 Summary
V. Applications
5.1 Algorithms for SPARQL Basic Graph Pattern
5.1.1 MR-Selection
5.1.2 MR-Join
5.1.3 Performance Evaluation
5.1.4 Discussion
5.2 Algorithms for Matrix Chain Multiplication
5.2.1 Serial Two-Way Join (S2)
5.2.2 Parallel M-Way Join (P2, PM)
5.2.3 Serial Two-Way vs. Parallel M-Way
5.2.4 Performance Evaluation
5.2.5 Discussion
5.2.6 Extension: Embedded MapReduce
VI. Conclusion
Bibliography
Index
Abstract in Korean
dc.format: application/pdf
dc.format.extent: 4676232 bytes
dc.format.medium: application/pdf
dc.language.iso: en
dc.publisher: Graduate School, Seoul National University
dc.subject: Parallel Join
dc.subject: Data Skew
dc.subject: MapReduce
dc.subject: Multi-Dimensional Range Partitioning
dc.subject.ddc: 621
dc.title: Multi-Dimensional Range Partitioning for Parallel Joins in MapReduce
dc.title.alternative: 맵리듀스에서의 병렬 조인을 위한 다차원 범위 분할 기법
dc.type: Thesis
dc.description.degree: Doctor
dc.citation.pages: ix, 134
dc.contributor.affiliation: College of Engineering, Department of Electrical and Computer Engineering
dc.date.awarded: 2014-08

Items in S-Space are protected by copyright, with all rights reserved, unless otherwise indicated.
