S-Space College of Engineering/Engineering Practice School (공과대학/대학원) Dept. of Electrical and Computer Engineering (전기·정보공학부) Theses (Ph.D. / Sc.D._전기·정보공학부)
Join Processing with Filtering Techniques on MapReduce Cluster
맵리듀스 클러스터에서 필터링 기법을 사용한 조인 처리
- 공과대학 전기·컴퓨터공학부
- Issue Date
- 서울대학교 대학원
- 학위논문 (박사)-- 서울대학교 대학원 : 전기·컴퓨터공학부, 2014. 2. 김형주.
- The join operation is one of the essential operations for data analysis because it is necessary to join large datasets to analyze heterogeneous data collected from different sources. MapReduce is a very useful framework for large-scale data analysis, but it is not suitable for joining multiple datasets. This is because it may produce a large number of redundant intermediate results, irrespective of the size of the joined records. Several existing approaches have been employed to improve the join performance, but they can only be used in specific circumstances or they may require multiple MapReduce jobs. To alleviate this problem, MFR-Join is proposed in this dissertation, which is a general join framework for processing equi-joins with filtering techniques in MapReduce. MFR-Join filters out redundant intermediate records within a single MapReduce job by applying filters in the map phase. To achieve this, the MapReduce framework is modified in two ways. First, map tasks are scheduled according to the processing order of the input datasets. Second, filters are created dynamically with the join keys of the datasets in a distributed manner. Various filtering techniques that support specific desirable operations can be plugged into MFR-Join. If the performance of join processing with filters is worse than that without filters, adaptive join processing methods are also proposed. The filters can be applied according to their performance, which is estimated in terms of the false positive rate. Furthermore, two map task scheduling policies are also provided: synchronous and asynchronous scheduling. The concept of filtering techniques is extended to multi-way joins. Methods for filter applications are proposed for the two types of multi-way joins: common attribute joins and distinct attribute joins. The experimental results showed that the proposed approach outperformed existing join algorithms and reduced the size of intermediate results when small portions of input datasets were joined.