Elastic Distributed Training of Deep Neural Networks : 딥 뉴럴 네트워크의 탄력적 분산학습

dc.contributor.advisor: 전병곤
dc.contributor.author: 이경근
dc.date.accessioned: 2022-04-05T05:51:13Z
dc.date.available: 2022-04-05T05:51:13Z
dc.date.issued: 2021
dc.identifier.other: 000000167904
dc.identifier.uri: https://hdl.handle.net/10371/177682
dc.identifier.uri: https://dcollection.snu.ac.kr/common/orgView/000000167904 (ko_KR)
dc.description: Thesis (Master's) -- Seoul National University Graduate School: College of Engineering, Department of Computer Science and Engineering, 2021.8. 이경근.
dc.description.abstract: As the training of Deep Neural Network (DNN) models relies more and more heavily on shared GPU clusters or cloud computing services, elastic training of DNNs offers substantial potential gains for both the users and the managers of shared clusters, such as better idle-resource utilization, shorter job completion time (JCT), and improved responsiveness. However, making a distributed DNN training job elastic is not a trivial problem, because the job's state must be handled appropriately upon scaling events. Moreover, it is even more challenging to achieve both an efficient scaling mechanism and correct job state management, which are two conflicting goals. In this paper, we discuss the problem of state management in elastic distributed DNN training jobs and propose a design for a fast and safe elastic DNN training system that can support various types of training jobs. We implemented an elastic training framework, named Elastic Parallax, and validated our system on data-parallel training workloads.
dc.description.abstract: 딥 뉴럴 네트워크(DNN) 모델들이 점점 공유 GPU 클러스터 또는 클라우드 컴퓨팅 서비스에 의존하게 됨에 따라, 유휴자원 활용, JCT, 반응성 등, 클러스터 사용자와 관리자 모두에게 있어 탄력적 학습을 지원하는 것의 잠재적 이점이 많아지고 있다. 그러나 분산 DNN 학습 작업을 탄력적으로 동작하게 만드는 것은 어려운 일인데, 왜냐하면 DNN 학습 작업을 탄력적이게 만들려면 스케일링 시마다 작업의 상태를 적절하게 관리해 주어야 하기 때문이다. 게다가, 효율적인 스케일링 메카니즘과 적절한 작업 상태 관리는 동시에 이루기 어려운 목표들이다. 따라서 본 논문에서는, 탄력적 분산 DNN 학습 작업의 상태 관리 문제를 논의하고, 이를 바탕으로 다양한 종류의 학습 작업을 지원할 수 있는 빠르고 안전한 탄력적 DNN 학습 시스템 디자인을 제안한다. 또한, 탄력적 학습 프레임워크인 Elastic Parallax를 직접 구현하고, 실제 데이터 병렬 학습 작업들에 대하여 시스템을 검증한다.
dc.description.tableofcontents:
Abstract 1
1 Introduction 5
2 Background 8
2.1 Distributed DNN Training 8
2.2 Elastic Distributed Training 10
3 Problem Statement 12
3.1 Definitions 12
3.1.1 State and State Consistency 12
3.1.2 Elasticity 13
3.2 State Management Problem of Elastic DNN Training 14
4 State Synchronization for Elastic Training 16
4.1 Classification of State Constraints 16
4.2 State Synchronization Operations 17
4.2.1 Replicated States 18
4.2.2 Partitioned States 18
4.2.3 Singleton States 18
4.3 Implication on API Design 19
5 API and System Design 20
5.1 API Design 20
5.2 System Architecture 22
5.3 Implementation 24
5.3.1 Two-Phase Rendezvous 24
5.3.2 Elastic Input Pipeline 25
6 Evaluation 26
6.1 Evaluation Setup 26
6.1.1 Environment 26
6.1.2 Workloads 26
6.2 Replicated Data Parallelism 27
6.3 Partitioned Data Parallelism 28
7 Related Work 33
7.1 Elastic Machine Learning 33
7.2 Elastic DNN Training 33
8 Discussion and Conclusion 35
초록 (Abstract in Korean) 42
dc.format.extent: ii, 42
dc.language.iso: eng
dc.publisher: 서울대학교 대학원 (Seoul National University Graduate School)
dc.subject: Deep Learning
dc.subject: Distributed Training
dc.subject: Elasticity
dc.subject: 딥러닝
dc.subject: 분산학습
dc.subject: 탄력성
dc.subject.ddc: 621.39
dc.title: Elastic Distributed Training of Deep Neural Networks
dc.title.alternative: 딥 뉴럴 네트워크의 탄력적 분산학습
dc.type: Thesis
dc.type: Dissertation
dc.contributor.AlternativeAuthor: Kyunggeun Lee
dc.contributor.department: 공과대학 컴퓨터공학부 (College of Engineering, Department of Computer Science and Engineering)
dc.description.degree: 석사 (Master's)
dc.date.awarded: 2021-08
dc.identifier.uci: I804:11032-000000167904
dc.identifier.holdings: 000000000046▲000000000053▲000000167904▲
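
The abstract and the table of contents above describe re-establishing consistent job state whenever the worker set changes, distinguishing replicated, partitioned, and singleton states. As a rough illustration of what handling replicated state at a scaling event can look like, the sketch below uses plain PyTorch with torch.distributed: after each membership change, the workers rendezvous into a new process group and rank 0 broadcasts the model parameters so that every replica resumes from identical state. This is a generic sketch under those assumptions, not the Elastic Parallax API; the function names and the choice of torch.distributed are illustrative only.

    # Minimal, illustrative sketch (NOT the Elastic Parallax API):
    # re-synchronizing replicated state after a scaling event in
    # data-parallel training. Assumes MASTER_ADDR/MASTER_PORT, RANK, and
    # WORLD_SIZE are provided in the environment by whatever coordinator
    # triggered the scaling event.
    import os

    import torch
    import torch.distributed as dist


    def resync_replicated_state(model: torch.nn.Module) -> None:
        """Broadcast parameters and buffers from rank 0 so all replicas agree."""
        for tensor in model.state_dict().values():
            dist.broadcast(tensor, src=0)


    def on_scaling_event(model: torch.nn.Module) -> None:
        """Re-form the process group for the new worker set, then re-sync state."""
        if dist.is_initialized():
            dist.destroy_process_group()      # leave the old group
        dist.init_process_group(              # rendezvous with the new membership
            backend="gloo",
            rank=int(os.environ["RANK"]),
            world_size=int(os.environ["WORLD_SIZE"]),
        )
        resync_replicated_state(model)        # replicated state: copy rank 0's values

Partitioned states (for example, dataset shards or sharded optimizer state) and singleton states would require different synchronization operations on a scaling event, which is the distinction drawn in Chapter 4 of the table of contents above.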