Efficient Execution of Machine Learning Workloads on GPUs

유경인

Efficient Execution of Machine Learning Workloads on GPUs : GPU 환경에서 머신러닝 워크로드의 효율적인 실행

Authors: 유경인

Advisor: 전병곤

Issue Date: 2023

Publisher: Seoul National University Graduate School

Keywords: machine learning ; deep learning ; scheduling ; inference serving ; generative models ; Transformer ; joint training

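Description: Doctoral thesis (Ph.D.) -- Seoul National University Graduate School : College of Engineering, Dept. of Computer Science and Engineering, 2023. 2. 전병곤.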

Abstract: Machine learning (ML) workloads are becoming increasingly important in many types of real-world applications. We attribute this trend to the development of software systems for ML, which have facilitated the widespread adoption of heterogeneous accelerators such as GPUs. Today's ML software stack has made great improvements in efficiency; however, not all use cases are well supported. In this dissertation, we study how to improve the execution efficiency of ML workloads on GPUs from a software system perspective. We identify workloads where current ML systems use GPUs inefficiently and devise new system techniques that handle those workloads efficiently.
We first present Nimble, an ML execution engine equipped with carefully optimized GPU scheduling. The proposed scheduling techniques improve execution efficiency by up to 22.34×. Second, we propose Orca, an inference serving system specialized for Transformer-based generative models. By incorporating new scheduling and batching techniques, Orca significantly outperforms state-of-the-art systems, achieving a 36.9× throughput improvement at the same level of latency. The last topic of this dissertation is WindTunnel, a framework that translates classical ML pipelines into neural networks, providing GPU training capabilities for classical ML workloads. WindTunnel also allows joint training of pipeline components via backpropagation, resulting in improved accuracy over the original pipeline and neural network baselines.
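Recent trends show that machine learning (ML) workloads are playing an increasingly important role in a wide variety of applications. This is because the development of system software for ML has enabled the widespread use of heterogeneous accelerators such as GPUs. Thanks to the attention of many researchers, the system software stack for ML is clearly improving day by day, but it still does not deliver high efficiency in every case. This dissertation studies how to improve the execution efficiency of ML workloads on GPUs from a system software perspective. Specifically, it aims to identify workloads for which today's ML systems fail to use GPUs efficiently and, further, to devise system techniques that can handle those workloads efficiently.
We first introduce Nimble, an ML execution engine equipped with optimized GPU scheduling. With its new scheduling techniques, Nimble can improve GPU execution efficiency by up to 22.34× over existing systems. Second, we propose Orca, an inference serving system specialized for Transformer-based generative models. Thanks to new scheduling and batching techniques, Orca achieves 36.9× higher throughput than existing systems at the same level of latency. Finally, we introduce WindTunnel, a framework that translates classical ML pipelines that do not use neural networks into neural networks. This makes it possible to train classical ML pipelines on GPUs. In addition, WindTunnel can jointly train multiple pipeline components at once via gradient backpropagation, and we confirm that this further improves the accuracy of the pipeline.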

Language: eng

URI: https://hdl.handle.net/10371/193329

https://dcollection.snu.ac.kr/common/orgView/000000175556

Files in This Item:

000000175556.pdf 6.86 MB

Appears in Collections:

College of Engineering/Engineering Practice School (공과대학/대학원)
- Dept. of Computer Science and Engineering (컴퓨터공학부)
  - Theses (Ph.D. / Sc.D._컴퓨터공학부)
