Uniformity Analysis of Parameter-Efficient Transfer Learning Methods

Abstract: Pretrained language models (PLMs) acquire general language ability by being trained to predict words in text. With their help, natural language processing has made remarkable progress across many areas. Fine-tuning on a downstream task from pretrained weights yields higher performance than earlier models, and it also brings independently trained solutions for the same task close to one another. This property makes it possible to average the weights from multiple fine-tuning runs and use the result as a single model for the task, improving performance with no additional computation at inference time.
PLMs tend to perform better as they grow larger, and large language models (LLMs) with more than 100 billion parameters are now being built in pursuit of better models. LLMs outperform earlier PLMs, but they are too large to fine-tune for each downstream task and to store a separate tuned copy per task. Parameter-efficient transfer learning (PETL) methods have therefore been proposed, which train only a small number of parameters instead of updating the entire model.
Previous work on PETL has proposed a variety of methods that match full fine-tuning while training only a small number of parameters. However, whether the similarity among training results is preserved under PETL has received little study. Using LoRA, a representative PETL method, we examine for each dataset the similarity among different training runs. Although the similarity is lower than among fully fine-tuned models, we find that, regardless of the number of trainable parameters, the trained models produce similar output representations for a given input and the final LoRA weights lie in a common basin of the loss landscape. These properties are affected by the number of parameters needed to reach high performance on the task.
Building on this similarity among LoRA training results, we apply to PETL results a simple ensemble method that replaces the model weights with a weighted sum of the weights from different training runs. We show empirically that, when the models to combine are selected with a greedy algorithm, performance improves further while inference cost stays at the level of a single model.
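The greedy weight-combination idea described in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the abstract does not specify the selection procedure or the mixing coefficients, so the sketch assumes a uniform average (one special case of a weighted sum) and a generic "add a run only if the held-out score does not drop" rule. The names `Weights`, `average_weights`, `greedy_soup`, `toy_score`, and the `lora_A` key are hypothetical, and whether the LoRA factors or the merged low-rank updates should be averaged is left open here.

```python
from typing import Callable, Dict, List

# Hypothetical representation: each run is a dict of flat parameter lists.
Weights = Dict[str, List[float]]


def average_weights(runs: List[Weights]) -> Weights:
    """Uniformly average weight dictionaries that share the same keys."""
    n = len(runs)
    return {
        name: [sum(vals) / n for vals in zip(*(run[name] for run in runs))]
        for name in runs[0]
    }


def greedy_soup(runs: List[Weights], score: Callable[[Weights], float]) -> Weights:
    """Greedily grow the averaged set while the held-out score does not drop."""
    # Visit runs from best to worst individual score.
    ranked = sorted(runs, key=score, reverse=True)
    chosen = [ranked[0]]
    best = score(ranked[0])
    for candidate in ranked[1:]:
        trial = score(average_weights(chosen + [candidate]))
        if trial >= best:  # keep the candidate only if it does not hurt
            chosen.append(candidate)
            best = trial
    return average_weights(chosen)


# Toy stand-in for a validation metric (assumption): higher is better,
# measured as negative squared distance to a made-up target vector.
TARGET = [0.5, 0.5]


def toy_score(weights: Weights) -> float:
    return -sum((x - t) ** 2 for x, t in zip(weights["lora_A"], TARGET))


runs = [
    {"lora_A": [1.0, 0.0]},
    {"lora_A": [0.0, 1.0]},
    {"lora_A": [5.0, 5.0]},  # an outlier run that the greedy rule should skip
]
soup = greedy_soup(runs, toy_score)
print(soup)
```

Because the result is a single set of weights with the same shape as any individual run, inference cost matches that of one model, which is the property the abstract emphasizes.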

Language: kor

URI: https://hdl.handle.net/10371/193367

https://dcollection.snu.ac.kr/common/orgView/000000175724

Appears in Collections:

College of Engineering/Engineering Practice School (공과대학/대학원)
- Dept. of Computer Science and Engineering (컴퓨터공학부)
  - Theses (Master's Degree_컴퓨터공학부)
