Parallelism Management for Memory-Intensive Applications on Integrated CPU/GPU Architecture

Abstract: Integrated architectures combine CPU and GPU cores onto a single processor die. Thanks to the shared memory architecture, costly memory copies to the GPU device memory can be avoided, and programmers can use both CPU and GPU cores to accelerate their applications.
Especially for memory-intensive applications, however, utilizing all available core resources does not always maximize performance due to congestion in the shared memory system. Tuning an application to use the optimal number of CPU and GPU compute resources is a difficult and machine-dependent task.

This thesis presents an automated system that auto-tunes an OpenCL kernel for a given integrated architecture on the fly. A light-weight compiler extract the characteristics of a kernel as it is submitted to the OpenCL runtime. We then use a software-based technique to automatically rewrite the kernel to only utilize the compute resources that are expected to maximize performance based on a model.
Our analysis shows that the memory access pattern and the cache utilization are the decisive factors when determining the parallelism of a kernel. To accommodate for the varying properties of different integrated architectures, we employ machine learning to create models for different architectures. The models are trained with microbenchmarks that generate a large number of varying memory patterns and evaluated with OpenCL benchmarks from different benchmark suites.

While the presented approach shows good results for older integrated architectures such as Intel Skylake and AMD Kaveri, it currently still does not achieve satisfactory results on the more recent architectures Intel Comet Lake and AMD Picasso. However the analysis performed on memory pattern still has some insight on it.
CPU와 GPU가 메모리를 공유하는 Integrated graphics(Intel) 또는 APU(AMD) 시스템에서 어플리케이션을 동시에 작업하는 것은 두 기기 간의 메모리 복사에 걸리는 시간을 절약할 수 있기 때문에 효율적으로 보인다. 한편으로 메모리 연산이 많은 비중을 차지하는 어플리케이션의 경우, 서로 다른 메모리 접근 특징을 가진 두 기기가 동시에 같은 메모리에 접근하면서 병목현상이 일어난다. 이 때 작업에 필요한 데이터를 기다리면서 코어가 중지되는 상황이 발생하며 퍼포먼스를 낮춘다.

이 논문에서 우리는 소프트웨어 코드의 변형을 통해 제한된 개수의 코어만을 사용하는 방법을 소개한다. 그리고 이 방법을 사용하여 코어 개수를 조절할 때 모든 코어를 사용하는 것보다 더 좋은 결과를 낼 수 있음을 보인다. 이 때 어플리케이션의 메모리 접근 패턴이 최적 코어 개수와 관련이 있음을 확인하고, 코드를 기반으로 해당 패턴을 분석한다.

퍼포먼스는 또한 시스템의 특징과도 관련이 있다. 따라서 우리는 가능한 메모리 패턴들을 담고 있는 마이크로 벤치마크를 생성하여 이를 시스템에서 테스트한 결과를 기계학습 모델의 학습에 사용하였다. 그러나 메모리 패턴의 특징이 충분하지 않아 모델은 좋은 예측 결과를 내지 못했다. 비록 모델 예측 결과가 좋지 못했지만, CPU-GPU 통합 아키텍처의 메모리 접근에 대한 분석으로서 의미가 있을 것이다.

Language: eng

URI: https://hdl.handle.net/10371/178199

https://dcollection.snu.ac.kr/common/orgView/000000168427

Files in This Item:

000000168427.pdf 3.37 MB

Appears in Collections:

College of Engineering/Engineering Practice School (공과대학/대학원)
- Dept. of Computer Science and Engineering (컴퓨터공학부)
  - Theses (Master's Degree_컴퓨터공학부)

Altmetrics

Item View & Download Count

Show Full Item Record

Find it @ SNU

트윗하기

SNS Share