Large-Scale Array Processing and Management in Distributed Systems

김상철

Large-Scale Array Processing and Management in Distributed Systems : 분산시스템에서 대규모 배열 처리와 관리

DC Field	Value	Language
dc.contributor.advisor	문봉기	-
dc.contributor.author	김상철	-
dc.date.accessioned	2022-12-29T07:44:52Z	-
dc.date.available	2022-12-29T07:44:52Z	-
dc.date.issued	2022	-
dc.identifier.other	000000172538	-
dc.identifier.uri	https://hdl.handle.net/10371/187780	-
dc.identifier.uri	https://dcollection.snu.ac.kr/common/orgView/000000172538	ko_KR
dc.description	Thesis (Ph.D.) -- Seoul National University Graduate School : College of Engineering, Department of Computer Science and Engineering, 2022. 8. 문봉기.	-
dc.description.abstract	Scientific observation, simulation, and experiments produce large amounts of scientific data. Because such data is represented as multi-dimensional arrays, there has been a need to improve the performance of complex analytics over arrays. The array data model is well suited to scientific data management because it physically clusters cells that are close to each other, which preserves the locality of the coordinate system. In this dissertation, we focus on large-scale array processing and management based on the array data model. We present Spangle, implemented on top of Apache Spark, a popular map-reduce framework for complex computation workloads. By adopting the array data model, Spangle facilitates scientific analysis of raster data and machine learning algorithms that rely heavily on linear algebra. Moreover, we employ SciDB, a popular array-based DBMS, to improve array processing and usability. First, we present an efficient approach to query processing in the filter operator, which otherwise examines attributes and coordinates with a full scan regardless of the given conditions. This approach enables the filter operator to perform a selective scan using array indexes for the given spatial information. Next, we propose a scalable loader, SLS. It streamlines the conversion process and modifies the distribution method in the loading stages. It also eliminates two heavy-duty steps, sort and redistribution, which account for a dominant portion of data loading. Last, we propose SDF, which facilitates data sharing and exchange to support complex analytics with minimal integration overhead. By adopting the principles of a federated database system, SDF abstracts away the integration process while retaining the primary authority of each database and preserving system features such as analytics libraries.	-
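dc.description.abstract	Observation, simulation, and experiments in scientific fields generate large-scale scientific data, and there has been a continuing demand for better performance in the complex analysis of scientific data produced as multi-dimensional arrays. The array data model is commonly used to manage scientific data; it stores and manages data physically close to each other, which can guarantee the locality of the coordinate system. This dissertation focuses on data processing and management based on the array data model. First, we introduce Spangle, a system for complex array processing computation on Apache Spark, a map-reduce framework. Spangle adopts the array data model and is designed to readily handle raster data analysis and machine learning algorithms that rely heavily on linear algebra. In addition, we use SciDB, one of the array-based DBMSs, to improve array processing and usability. First, we improve the filter operator, which reads the entire data regardless of the given conditions, and present a method that reads data partially using spatial information according to the conditions. Next, we propose SLS, which provides efficient data loading. It simplifies the conversion process and, in particular, removes sort and redistribution, which account for most of data loading. Finally, we propose SDF, a system that can link data with minimal overhead for complex analysis. SDF abstracts the integration process and enables query operations while preserving the primary authority of each database.	-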
dc.description.tableofcontents	1 Introduction; 1.1 Motivation; 1.2 Contributions; 1.3 Outline; 2 Background; 2.1 Array Data Model; 2.2 Overview of Apache Spark; 2.3 Overview of SciDB; 2.4 Related Work; 3 Spangle: A Distributed In-Memory Processing System for Large-Scale Arrays; 3.1 Architecture of Spangle; 3.1.1 System Overview; 3.1.2 ArrayRDD; 3.1.3 Metadata and Mapper; 3.2 Bitmask; 3.2.1 Null Value in Raster Data; 3.2.2 Chunk Management; 3.2.3 Bitmask Operations; 3.2.4 Space and Time Complexity; 3.3 Programming Interface; 3.3.1 Operators using Bitmasks; 3.3.2 Aggregate Framework; 3.4 Machine Learning; 3.4.1 Local Join for Matrix Multiplication; 3.4.2 Graph Representation and PageRank; 3.4.3 Stochastic Gradient Descent; 3.5 Experiments; 3.5.1 Experimental Setup; 3.5.2 Raster Data Processing; 3.5.3 Machine Learning; 3.6 Summary; 4 Advanced Database Techniques for Large-Scale Arrays; 4.1 Selective Scan for Filter Operator; 4.2 Scalable Loader; 4.2.1 Loading; 4.2.2 Redimensioning; 4.3 Federated Database System for Scientific Data; 4.3.1 Operator Syntax; 4.3.2 Federated Query Processing; 4.3.3 System Architecture; 4.3.4 Connection Model; 4.3.5 System Considerations; 4.4 Experiments; 4.4.1 Selective Scan for Filter Operator; 4.4.2 Scalable Loader; 4.4.3 Federated Database System for Scientific Data; 4.5 Summary; 5 Conclusion; 5.1 Contributions; 5.2 Future Direction; Abstract (In Korean)	-
dc.format.extent	viii, 116	-
dc.language.iso	eng	-
dc.publisher	Seoul National University Graduate School	-
dc.subject	Array data model	-
dc.subject	Scientific data	-
dc.subject	Array processing	-
dc.subject	Array management	-
dc.subject.ddc	621.39	-
dc.title	Large-Scale Array Processing and Management in Distributed Systems	-
dc.title.alternative	분산시스템에서 대규모 배열 처리와 관리	-
dc.type	Thesis	-
dc.type	Dissertation	-
dc.contributor.AlternativeAuthor	Sangchul Kim	-
dc.contributor.department	College of Engineering, Department of Computer Science and Engineering	-
dc.description.degree	Ph.D.	-
dc.date.awarded	2022-08	-
dc.identifier.uci	I804:11032-000000172538	-
dc.identifier.holdings	000000000048▲000000000055▲000000172538▲	-

Appears in Collections:

College of Engineering/Engineering Practice School (공과대학/대학원)
- Dept. of Computer Science and Engineering (컴퓨터공학부)
  - Theses (Ph.D. / Sc.D._컴퓨터공학부)

Files in This Item:

000000172538.pdf 5.77 MB
