Software Optimization Techniques for Deep Learning Applications on AI Hardware Platforms

Abstract:

To meet the growing demand for deep learning applications in embedded systems, new embedded devices tend to include multiple heterogeneous processors, including a GPU and a deep learning hardware accelerator called a neural processing unit (NPU). In addition, a software development kit (SDK) is provided for fast and efficient development of deep learning applications. The deep learning SDK includes an optimizer that delivers low latency and high throughput for deep learning inference applications.

Although the deep learning SDK optimizes inference internally, it assumes that inference is performed on a single processing element, either the GPU or the NPU, but not both. However, running inference on a single processing element does not fully utilize the system. Since the system consists of heterogeneous processors, these processors must be used simultaneously to run inference efficiently.

In other words, deep learning applications need to be optimized at the system level. In this context, we address the problem through three main topics: optimization of a single deep learning application, optimization of multiple deep learning applications under real-time constraints, and support for deep learning applications in a model-based embedded software design methodology.
In this work, we target the NVIDIA Jetson embedded platform, which has heterogeneous processors including NPUs, and TensorRT, a leading deep learning SDK for fast inference.

First, we devise systematic optimization techniques and a methodology to increase the throughput of a single deep learning application. We present parallelization techniques for a deep learning application: multi-threading, pipelining, buffer allocation, and network duplication. We also present a framework that supports various optimization parameters to accelerate a deep learning application.
The optimization techniques are parameterized and can be applied to a deep learning application merely by adjusting parameters in a configuration file that is given as input to the framework. Since the design space of assigning layers to different processing elements and optimizing the other parameters is huge, we develop a parameter optimization methodology consisting of a heuristic for balancing pipeline stages among heterogeneous processors and a fine-tuning process for optimizing parameters. This is the first work to partition a deep learning inference application developed with TensorRT and improve its throughput on a heterogeneous processor system that includes NPUs. With nine real-life benchmarks, we achieved a 101% to 680% throughput improvement and up to 55% energy reduction over the baseline GPU-only inference.
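The idea of balancing pipeline stages between two processors can be illustrated with a minimal sketch. It is not the heuristic proposed in the thesis: it simply tries every cut point of a layer sequence and keeps the one with the smallest bottleneck stage. The function name `best_cut` and the per-layer latencies are hypothetical, and the sketch ignores data transfer cost between processors as well as any SDK restrictions on which layers an NPU can run.

```python
from typing import List, Tuple


def best_cut(gpu_ms: List[float], npu_ms: List[float]) -> Tuple[int, float]:
    """Return (cut, bottleneck_ms) for a two-stage pipeline.

    Layers [0, cut) form the GPU stage and layers [cut, n) form the NPU stage.
    In steady state the pipeline throughput is limited by the slower stage,
    so the best cut minimizes the larger of the two stage times.
    """
    assert len(gpu_ms) == len(npu_ms)
    n = len(gpu_ms)
    best = (0, float("inf"))
    for cut in range(n + 1):
        gpu_stage = sum(gpu_ms[:cut])   # time of the layers mapped to the GPU
        npu_stage = sum(npu_ms[cut:])   # time of the layers mapped to the NPU
        bottleneck = max(gpu_stage, npu_stage)
        if bottleneck < best[1]:
            best = (cut, bottleneck)
    return best


# Hypothetical per-layer latencies in milliseconds (not measured values).
gpu_ms = [1.0, 2.0, 2.0, 1.0]
npu_ms = [2.0, 3.0, 3.0, 2.0]

cut, bottleneck = best_cut(gpu_ms, npu_ms)
gpu_only = sum(gpu_ms)
print(f"cut after layer {cut}: bottleneck {bottleneck} ms vs GPU-only {gpu_only} ms")
```

With these made-up numbers the GPU-only latency is 6 ms per inference, while the pipelined version is limited by a 5 ms stage, so steady-state throughput rises even though the NPU is slower on every layer. An exhaustive scan is only practical for one cut point; the abstract's point is that the real design space, which also covers the other optimization parameters, is too large for this and calls for a heuristic plus fine-tuning.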

Second, it is becoming popular to run multiple deep learning applications simultaneously to provide various functionalities. In addition, deep learning applications can have real-time constraints that vary at runtime. While extensive studies have recently been conducted to find an efficient mapping of multiple deep learning applications onto different hardware platforms, they do not consider the constraints imposed by the NPU and its SDK in a real embedded platform. In this work, we propose a novel energy-aware mapping methodology for multiple deep learning applications on a real embedded system with multiple heterogeneous processors. The objective is to minimize energy consumption while satisfying the real-time constraints of all applications. In the proposed scheme, we first select Pareto-optimal mapping solutions for each application. Then, we explore combinations of these mappings that satisfy the constraints, taking into account scenarios that capture the dynamic behavior of the applications. We also reduce energy consumption by tuning the frequency of the processors. This is the first work to consider the concurrent execution of multiple deep learning applications developed with TensorRT on a real hardware platform that includes an NPU. In experiments with real-life applications and different scenarios on a real platform, we could satisfy up to 40% higher deadline constraints and reduce energy consumption by 22% to 31% compared to static mapping methods.
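The two steps named above, per-application Pareto filtering followed by a search over mapping combinations, can be sketched as follows. This is an illustrative brute-force version under simplifying assumptions, not the thesis's algorithm: the `Mapping` fields, the per-processor load model (loads of co-running applications must sum to at most 1.0), and all numbers are hypothetical, and frequency tuning is left out.

```python
from dataclasses import dataclass
from itertools import product
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Mapping:
    name: str
    latency_ms: float
    energy_mj: float
    gpu_load: float  # assumed fraction of GPU time this mapping occupies
    npu_load: float  # assumed fraction of NPU time this mapping occupies


def pareto_front(cands: List[Mapping]) -> List[Mapping]:
    """Keep the mappings that no other mapping beats in both latency and energy."""
    def dominates(a: Mapping, b: Mapping) -> bool:
        return (a.latency_ms <= b.latency_ms and a.energy_mj <= b.energy_mj
                and (a.latency_ms < b.latency_ms or a.energy_mj < b.energy_mj))
    return [m for m in cands if not any(dominates(o, m) for o in cands)]


def select(apps: Dict[str, List[Mapping]],
           deadlines: Dict[str, float]) -> Optional[Tuple[Mapping, ...]]:
    """Pick one Pareto-optimal mapping per app, minimizing total energy."""
    names = list(apps)
    fronts = [pareto_front(apps[n]) for n in names]
    best, best_energy = None, float("inf")
    for combo in product(*fronts):
        # Every application must meet its current deadline.
        if any(m.latency_ms > deadlines[n] for n, m in zip(names, combo)):
            continue
        # No processor may be oversubscribed by the combination.
        if sum(m.gpu_load for m in combo) > 1.0:
            continue
        if sum(m.npu_load for m in combo) > 1.0:
            continue
        energy = sum(m.energy_mj for m in combo)
        if energy < best_energy:
            best, best_energy = combo, energy
    return best


# Hypothetical candidates (illustrative numbers only).
apps = {
    "detector": [
        Mapping("gpu_only", 10.0, 50.0, 0.6, 0.0),
        Mapping("npu_only", 25.0, 20.0, 0.0, 0.7),
        Mapping("split", 14.0, 35.0, 0.3, 0.3),
        Mapping("slow_gpu", 30.0, 60.0, 0.8, 0.0),  # dominated, filtered out
    ],
    "classifier": [
        Mapping("gpu_only", 5.0, 30.0, 0.5, 0.0),
        Mapping("npu_only", 12.0, 12.0, 0.0, 0.5),
    ],
}

relaxed = select(apps, {"detector": 20.0, "classifier": 15.0})
tight = select(apps, {"detector": 20.0, "classifier": 8.0})
print([m.name for m in relaxed], [m.name for m in tight])
```

Calling `select` again with a different deadline dictionary mimics a scenario change at runtime: tightening the classifier deadline from 15 ms to 8 ms forces it onto the GPU and raises the total energy of the chosen combination. Note that filtering on latency and energy alone can discard a mapping whose lower processor load would have made a combination feasible; a fuller treatment would take load into account during filtering.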

Finally, as deep learning applications become more prevalent in embedded systems, supporting them in model-based embedded software design methodologies becomes a challenging problem. One solution so far is to represent each deep learning application with a model. However, translating the specification into a model and achieving good performance by applying optimization techniques to the deep learning application requires considerable effort. In this work, we propose a novel methodology that uses a deep learning SDK for performance optimization. In the proposed method, we first obtain the Pareto-optimal mapping solutions of deep learning applications using the SDK associated with the hardware platform. Then, we jointly perform the mapping of dataflow tasks and the selection of mapping solutions for deep learning applications using a genetic algorithm and a heuristic. Experiments with a real-life example and randomly generated graphs show that we could reduce the maximum utilization of the processing elements by at least 5% compared to our previous work, which maps deep learning applications and dataflow applications sequentially.

Language: eng

URI: https://hdl.handle.net/10371/196507

https://dcollection.snu.ac.kr/common/orgView/000000177232

Appears in Collections:

College of Engineering/Engineering Practice School
- Dept. of Computer Science and Engineering
  - Theses (Ph.D. / Sc.D., Computer Science and Engineering)
