Accelerating Transformer-Based Model Inference using Efficient Matrix Multiplications on GPUs

Abstract: Transformer-based models have become the backbone of many state-of-the-art natural language processing (NLP) and computer vision tasks. As these models grow larger, they can learn and represent more complex relationships in data. Increasing the input sequence length can also improve performance on challenging real-world tasks. However, high inference cost hinders the use of powerful Transformers: they have a large memory footprint, their attention layers scale quadratically with input sequence length, and their kernel operations are inefficient.

In this thesis, we propose Transformer optimization methods that reduce inference cost across scenarios defined by model size, input sequence length, and batch size. First, we propose Multigrain, an optimization method for scenarios where the input length (Lin) is significantly greater than the hidden dimension (Dh). Existing sparse attention techniques effectively reduce computation and memory footprint for long input sequences; however, sparse attention is processed inefficiently on GPUs and still accounts for the majority of the execution time. Multigrain takes the compound sparsity pattern of sparse attention into account: it processes the coarse-grained part with a coarse-grained kernel on high-performance tensor cores and the fine-grained part with a fine-grained kernel on CUDA cores. As a result, Multigrain achieves a 2.07x end-to-end speedup over DeepSpeed when running Longformer inference.
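
The coarse/fine split described above can be illustrated with a small NumPy sketch. It is not the Multigrain implementation: the abstract does not state how the pattern is partitioned, so the rule used here (a block counts as coarse-grained only if it is fully dense) and the names `longformer_like_mask` and `split_by_grain` are assumptions made for illustration. The mask imitates a Longformer-style pattern of a sliding window plus global tokens.

```python
import numpy as np

def longformer_like_mask(seq_len, window, global_tokens):
    """Boolean attention mask: sliding window plus global tokens (illustrative)."""
    idx = np.arange(seq_len)
    mask = np.abs(idx[:, None] - idx[None, :]) <= window
    mask[global_tokens, :] = True   # global tokens attend to every position
    mask[:, global_tokens] = True   # every position attends to global tokens
    return mask

def split_by_grain(mask, block):
    """Split a mask into fully dense blocks (coarse) and leftover elements (fine).

    Assumed rule: a block is coarse-grained only if every entry in it is set.
    """
    n = mask.shape[0]
    assert mask.shape == (n, n) and n % block == 0
    nb = n // block
    # dense[i, j] is True when block (i, j) contains no zero entry.
    dense = mask.reshape(nb, block, nb, block).all(axis=(1, 3))
    coarse = np.repeat(np.repeat(dense, block, axis=0), block, axis=1)
    fine = mask & ~coarse
    return coarse, fine

mask = longformer_like_mask(seq_len=64, window=8, global_tokens=[0])
coarse, fine = split_by_grain(mask, block=4)
print("nonzeros in dense blocks:", int(coarse.sum()))
print("nonzeros left for the fine-grained path:", int(fine.sum()))
```

In this sketch the two parts are disjoint and together cover the original mask, so each could in principle be handed to a different kernel: a block-sparse one for `coarse` and an element-wise one for `fine`.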

Second, we propose a tiled singular value decomposition (TSVD) method to reduce inference cost in scenarios where Lin is similar to or smaller than Dh. TSVD divides a matrix into tiles, performs singular value decomposition (SVD) on each tile, and compresses the matrix using low-rank approximation. Matrix multiplication is the fundamental operation of the attention and feed-forward layers in Transformer models; performing it with the low-rank TSVD-matmul reduces both memory footprint and computation, significantly lowering inference cost. When matrices are compressed by 2x to 8x, TSVD-based matrix multiplication is 1.02x to 2.26x faster than uncompressed matrix multiplication. However, applying TSVD to a model reduces execution time at the cost of lower accuracy.
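
The steps named above (tile, per-tile SVD, low-rank truncation, multiplication with the factors) can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the thesis's GPU kernel: square tiles, a single rank for all tiles, and the function names `tsvd_compress` and `tsvd_matmul` are all hypothetical choices.

```python
import numpy as np

def tsvd_compress(W, tile, rank):
    """Factor each tile x tile block of W into a rank-`rank` pair (A, B)."""
    m, n = W.shape
    assert m % tile == 0 and n % tile == 0
    factors = {}
    for i in range(0, m, tile):
        for j in range(0, n, tile):
            U, s, Vt = np.linalg.svd(W[i:i + tile, j:j + tile],
                                     full_matrices=False)
            A = U[:, :rank] * s[:rank]   # (tile, rank), singular values folded in
            B = Vt[:rank, :]             # (rank, tile)
            factors[(i, j)] = (A, B)
    return factors

def tsvd_matmul(X, factors, tile, out_dim):
    """Approximate X @ W using only the per-tile low-rank factors."""
    Y = np.zeros((X.shape[0], out_dim), dtype=X.dtype)
    for (i, j), (A, B) in factors.items():
        # Multiply by the thin factors in sequence; the tile is never rebuilt.
        Y[:, j:j + tile] += (X[:, i:i + tile] @ A) @ B
    return Y

rng = np.random.default_rng(0)
W = rng.standard_normal((128, 256))
X = rng.standard_normal((8, 128))
factors = tsvd_compress(W, tile=64, rank=16)
stored = sum(A.size + B.size for A, B in factors.values())
print("compression ratio:", W.size / stored)
rel_err = np.linalg.norm(tsvd_matmul(X, factors, 64, 256) - X @ W) / np.linalg.norm(X @ W)
print("relative error of the product:", rel_err)
```

With 64 x 64 tiles and rank 16, each tile stores 2 * 64 * 16 = 2048 values instead of 4096, which is 2x compression. A random matrix like the one used here is close to a worst case for low-rank approximation, so the printed error only shows how to measure the accuracy loss and says nothing about trained model weights.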

To address this issue, we propose TSVD-common, a parameter-efficient fine-tuning method based on TSVD. TSVD-common shares one of the two submatrices produced by SVD in each tile across all tiles and fine-tunes only this common submatrix. As a result, when the GPT-2 model is compressed by 2x or 4x, TSVD-common improves accuracy on the E2E NLG task by approximately 2% over full fine-tuning of the uncompressed model.
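
The idea of sharing one factor across tiles can be sketched as follows. The abstract does not say how the common submatrix is obtained or which of the two factors is shared, so this sketch assumes one possible construction: stack all tiles, take the top right singular vectors as the shared factor `B_common`, and project each tile onto it to get a per-tile factor. The function names are hypothetical, and no training loop is shown.

```python
import numpy as np

def tsvd_common_init(W, tile, rank):
    """Per-tile factors A[(i, j)] plus one shared factor B_common (assumed scheme)."""
    m, n = W.shape
    assert m % tile == 0 and n % tile == 0
    tiles = {(i, j): W[i:i + tile, j:j + tile]
             for i in range(0, m, tile) for j in range(0, n, tile)}
    stacked = np.vstack(list(tiles.values()))        # (num_tiles * tile, tile)
    _, _, Vt = np.linalg.svd(stacked, full_matrices=False)
    B_common = Vt[:rank, :]                          # (rank, tile), orthonormal rows
    # Least-squares optimal per-tile factor for a fixed orthonormal B_common.
    A = {key: T @ B_common.T for key, T in tiles.items()}
    return A, B_common

def reconstruct(A, B_common, shape, tile):
    """Rebuild the approximate matrix from the per-tile and shared factors."""
    W_hat = np.zeros(shape)
    for (i, j), A_ij in A.items():
        W_hat[i:i + tile, j:j + tile] = A_ij @ B_common
    return W_hat

rng = np.random.default_rng(0)
W = rng.standard_normal((128, 256))
A, B_common = tsvd_common_init(W, tile=64, rank=16)
frozen = sum(A_ij.size for A_ij in A.values())
print("trainable (shared) parameters:", B_common.size)
print("frozen per-tile parameters:", frozen)
```

Under this assumed scheme, fine-tuning would update only `B_common` (16 * 64 = 1024 values here) while the per-tile factors stay frozen, which is what makes the approach parameter-efficient.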
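
Korean abstract (translated): Transformer-based models have recently shown high performance in a variety of fields, including natural language processing and computer vision. As existing powerful models grow larger, they become able to learn and represent complex relationships in data. Increasing the input sequence length also improves learning from context, so that complex problems can be solved effectively. However, these models incur high inference cost because of their large memory usage, the quadratic complexity of attention layers in the input length, and the lack of kernel optimization.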

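This thesis proposes optimization methods that reduce inference cost according to the size, input sequence length, and batch size of Transformer-based models. First, we propose Multigrain, which optimizes scenarios where the input length (Lin) is greater than the hidden dimension (Dh). Existing sparse attention techniques effectively reduce computation and memory usage for long input sequences, but they are processed inefficiently on GPUs and still account for most of the execution time. Multigrain identifies the compound sparsity patterns of sparse attention, processes the coarse patterns with a kernel that uses high-performance tensor cores and the fine patterns with a kernel that uses CUDA cores, and runs the two concurrently on multiple streams. As a result, it is 2.07x faster than a baseline system that runs Longformer inference with DeepSpeed.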

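This thesis also proposes the tiled singular value decomposition (TSVD) method, which reduces inference cost in scenarios where Lin is similar to or smaller than Dh. TSVD divides a matrix into tiles, applies singular value decomposition (SVD) to each tile, and compresses the matrix using low-rank approximation. Matrix multiplication is the basic operation of the attention and feed-forward layers in Transformer-based models; performing it as a TSVD-based matrix multiplication with low-rank approximation reduces both memory usage and computation, and therefore reduces inference cost considerably. When matrices are compressed by 2x to 8x, TSVD-based matrix multiplication is 1.02x to 2.26x faster than uncompressed matrix multiplication. However, when TSVD is applied to a model, execution time decreases but accuracy drops.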

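To solve this problem, this thesis proposes TSVD-common, a TSVD-based parameter-efficient fine-tuning method. One of the two submatrices produced by SVD in each tile is shared across all tiles, and only this common submatrix is fine-tuned during training. As a result, when the GPT-2 model is compressed by 2x or 4x, the proposed TSVD-common improves accuracy on the E2E task by about 2% over fine-tuning all parameters of the uncompressed model (full fine-tuning), and it achieves accuracy close to that of LoRA, a recent parameter-efficient fine-tuning method.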

Language: eng

URI: https://hdl.handle.net/10371/197058

https://dcollection.snu.ac.kr/common/orgView/000000178073

Files in This Item:

000000178073.pdf 1.76 MB

Appears in Collections:

Graduate School of Convergence Science and Technology (융합과학기술대학원)
- Dept. of Transdisciplinary Studies(융합과학부)
  - Theses (Ph.D. / Sc.D._융합과학부)
