
Practical Optimizations for Conjugate Gradient Method Acceleration using CUDA
A Study on Practical Acceleration of the Conjugate Gradient Method Using CUDA (Korean-language title)

Authors
유동한
Advisor
고형석
Major
College of Natural Sciences, Program in Computational Science and Technology
Issue Date
2016
Publisher
Graduate School, Seoul National University
Keywords
lazy residual evaluation; conjugate gradient method; CUDA
Description
Thesis (Master's) -- Graduate School, Seoul National University: Program in Computational Science and Technology, August 2016. Advisor: 고형석.
Abstract
This dissertation presents a series of optimizations for the preconditioned and non-preconditioned Conjugate Gradient (henceforth, CG) method using CUDA. Each line of the CG algorithm has a data dependency on adjacent lines, but each step is itself a parallelizable operation such as matrix-vector multiplication, dot product, or axpy. Because each step is a well-known parallelizable operation, the overall CG algorithm can be accelerated on GPUs, and meaningful speedups are achieved with the optimization methods presented in this dissertation. First, we describe performance issues in a naïve CUDA-based CG implementation built on a widely adopted CUDA library, cuBLAS. This library provides generic low-level routines that make it possible to implement high-level algorithms without writing hand-tuned CUDA kernels. However, if the implementation is written without care, device-host synchronizations arising from the data dependencies between CG steps limit the performance gains from CUDA acceleration: the GPU can be severely under-utilized between steps and cannot run at full speed. We propose a simple but practical optimization technique to avoid device-host synchronizations: lazy residual evaluation. In this thesis, the overall runtime performance gained by eliminating device-host synchronizations is explained step by step as the number of synchronizations per iteration is reduced, and the resulting changes to the CPU and GPU pipelines are illustrated. The performance gain from the proposed method, along with its advantages and disadvantages, is then compared against other backend implementations with different levels of device-host synchronization. Finally, the importance of minimizing device-host synchronization when accelerating CG-like iterative algorithms on GPUs is discussed in detail.
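The synchronization issue described above arises because the usual convergence test reads the residual norm back to the host every iteration. The following NumPy sketch is not the thesis's CUDA/cuBLAS implementation; it only illustrates the general idea of deferring the residual-norm check (here via an assumed `check_every` parameter, checking only every few iterations), which on a GPU would remove most per-iteration device-host read-backs at the cost of possibly running a few extra iterations.

```python
import numpy as np

def cg_lazy(A, b, tol=1e-8, max_iter=1000, check_every=2):
    """Conjugate gradient with a deferred ("lazy") residual-norm check.

    Illustrative sketch only: on a GPU, reading ||r|| back to the host
    every iteration forces a device-host synchronization; testing
    convergence only every `check_every` iterations keeps the device
    pipeline full. `cg_lazy` and `check_every` are names invented for
    this example, not taken from the thesis.
    """
    x = np.zeros_like(b)
    r = b - A @ x            # initial residual
    p = r.copy()             # initial search direction
    rs_old = r @ r
    for i in range(max_iter):
        Ap = A @ p                     # matrix-vector product step
        alpha = rs_old / (p @ Ap)      # step length (dot product)
        x += alpha * p                 # axpy update of the iterate
        r -= alpha * Ap                # axpy update of the residual
        rs_new = r @ r
        # Lazy check: inspect the residual norm only occasionally.
        if i % check_every == check_every - 1 and np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs_old) * p  # new conjugate direction
        rs_old = rs_new
    return x

# Small symmetric positive-definite test system
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = cg_lazy(A, b)
```

In the cuBLAS setting the thesis targets, the analogous change is keeping the dot-product results in device memory and only occasionally transferring the residual norm to the host for the convergence test, instead of blocking on that transfer every iteration.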
Language
English
URI
http://hdl.handle.net/10371/131255
Appears in Collections:
College of Natural Sciences (자연과학대학) > Program in Computational Science and Technology (협동과정-계산과학전공) > Theses (Master's Degree_협동과정-계산과학전공)

Items in S-Space are protected by copyright, with all rights reserved, unless otherwise indicated.
