Integrating Neologisms to Pretrained Language Models: Using Assimilation-Inspired Embedding Synthesis

정소영

Integrating Neologisms to Pretrained Language Models: Using Assimilation-Inspired Embedding Synthesis : 사전 훈련된 언어 모델에 신조어 통합: 동화에서 영감을 받은 임베딩 합성을 활용하여

DC Field	Value	Language
dc.contributor.advisor	전병곤	-
dc.contributor.author	정소영	-
dc.date.accessioned	2023-11-20T04:24:26Z	-
dc.date.available	2023-11-20T04:24:26Z	-
dc.date.issued	2023	-
dc.identifier.other	000000178855	-
dc.identifier.uri	https://hdl.handle.net/10371/196497	-
dc.identifier.uri	https://dcollection.snu.ac.kr/common/orgView/000000178855	ko_KR
dc.description	Thesis (Master's) -- Seoul National University Graduate School : College of Engineering, Dept. of Computer Science and Engineering, 2023. 8. 전병곤.	-
dc.description.abstract	Recent research has shown that pretrained language models (PLMs) can become outdated over time and need to be adapted to new words or concepts. Efficient approaches for adapting PLMs to new vocabularies have been studied in domain adaptation and cross-lingual adaptation, but they have yet to be explored in a setting where vocabulary updates must be timely, periodic, and small in scale (adding on the order of one to tens of new words with about 1 MB of training data). In this setting, such methods either perform unsatisfactorily or overfit to the new vocabulary. Existing model-editing methods for PLMs have also been shown to be ineffective at injecting unseen entities into PLMs. Our paper proposes a tailored method, W-SUM, for adapting PLMs to new vocabulary (i.e., neologisms) by mimicking how humans process their internal knowledge when they encounter a new word or concept. Inspired by assimilation in Piaget's theory of cognitive development, W-SUM leverages the rich knowledge inherent in the PLM's existing token embeddings to find the optimal embedding of a new vocabulary item as a weighted sum of existing token embeddings. The PLM finds the optimal weight distribution via a language modeling objective. We evaluate W-SUM on two language model probing tasks, ECBD and LAMBADA, and validate W-SUM's ability to acquire good embeddings for new vocabulary through semantic analysis.	-
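dc.description.abstract	Because the performance of pretrained language models (PLMs) on new data can degrade over time, the need to adapt PLMs to new words or concepts has emerged. Efficient approaches for adapting PLMs to new vocabulary have been studied in domain adaptation and cross-lingual adaptation, but these methods have not yet been tested in a setting where vocabulary updates must be timely, periodic, and small in scale (adding 1-10 new words with about 1 MB of training data). In this new setting, such methods tended either to show unsatisfactory performance or to overfit to the new vocabulary. Furthermore, existing model-editing techniques for PLMs are also known to be ineffective at injecting information not seen during pretraining into PLMs. This thesis therefore proposes W-SUM, a tailored method for adapting PLMs to new vocabulary (i.e., neologisms) that mimics how humans process their internal knowledge when they encounter a new word or concept. Inspired by assimilation in Piaget's theory of cognitive development, W-SUM finds the optimal embedding of a new vocabulary item through a weighted sum of existing token embeddings, thereby leveraging the rich knowledge inherent in the PLM's existing token embeddings. The weights for the weighted sum are found by the PLM through additional pretraining. We evaluate W-SUM on two language model probing tasks, ECBD and LAMBADA, and show through comparative embedding analysis that W-SUM is effective at acquiring good embeddings for new vocabulary.	-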
dc.description.tableofcontents	Introduction 5 Background 8 Method 11 Evaluation 15 Discussion and Conclusion 28	-
dc.format.extent	34	-
dc.language.iso	eng	-
dc.publisher	Seoul National University Graduate School	-
dc.subject	Pretrained Language Model	-
dc.subject	Temporal Adaptation	-
dc.subject	Neologism	-
dc.subject.ddc	621.39	-
dc.title	Integrating Neologisms to Pretrained Language Models: Using Assimilation-Inspired Embedding Synthesis	-
dc.title.alternative	사전 훈련된 언어 모델에 신조어 통합: 동화에서 영감을 받은 임베딩 합성을 활용하여	-
dc.type	Thesis	-
dc.type	Dissertation	-
dc.contributor.AlternativeAuthor	Soyoung Jung	-
dc.contributor.department	College of Engineering, Dept. of Computer Science and Engineering	-
dc.description.degree	Master's	-
dc.date.awarded	2023-08	-
dc.identifier.uci	I804:11032-000000178855	-
dc.identifier.holdings	000000000050▲000000000058▲000000178855▲	-
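
The abstract describes W-SUM as building a new word's embedding from a weighted sum of existing token embeddings, with the weights learned through a language modeling objective. The following is a minimal sketch of that idea in plain Python, not the thesis implementation. Several things here are assumptions made for illustration: the function names (`softmax`, `synthesize_embedding`, `fit_logits`), the use of a softmax to turn learnable logits into a weight distribution, and above all the squared-error loss toward a fixed target vector, which merely stands in for the language modeling loss that a real PLM would supply.

```python
import math


def softmax(logits):
    """Map unconstrained logits to weights that are positive and sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def synthesize_embedding(logits, token_embeddings):
    """Return the weighted sum of the rows of token_embeddings.

    Normalizing the weights with a softmax is an assumption of this sketch.
    """
    weights = softmax(logits)
    dim = len(token_embeddings[0])
    return [
        sum(w * row[d] for w, row in zip(weights, token_embeddings))
        for d in range(dim)
    ]


def toy_loss(logits, token_embeddings, target):
    """Stand-in objective (assumption): squared distance to a target vector.

    In the setting the abstract describes, this would be the PLM's
    language modeling loss on text containing the new word.
    """
    emb = synthesize_embedding(logits, token_embeddings)
    return sum((e - t) ** 2 for e, t in zip(emb, target))


def toy_loss_grad(logits, token_embeddings, target):
    """Analytic gradient of toy_loss with respect to the logits."""
    weights = softmax(logits)
    emb = synthesize_embedding(logits, token_embeddings)
    grad_emb = [2.0 * (e - t) for e, t in zip(emb, target)]
    # Gradient with respect to each weight: grad_emb dotted with that row.
    grad_w = [sum(g * x for g, x in zip(grad_emb, row)) for row in token_embeddings]
    # Chain through the softmax Jacobian.
    mean = sum(w * g for w, g in zip(weights, grad_w))
    return [w * (g - mean) for w, g in zip(weights, grad_w)]


def fit_logits(token_embeddings, target, steps=1000, lr=0.2):
    """Plain gradient descent on the logits; the embeddings stay frozen."""
    logits = [0.0] * len(token_embeddings)
    for _ in range(steps):
        grad = toy_loss_grad(logits, token_embeddings, target)
        logits = [z - lr * g for z, g in zip(logits, grad)]
    return logits


# Toy usage: three existing 2-d token embeddings and a hypothetical target.
existing = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
target = [0.5, 0.5]
learned = fit_logits(existing, target)
new_embedding = synthesize_embedding(learned, existing)
```

Only the logits are updated in this sketch, so the new embedding is always a convex combination of the existing embeddings and the original vocabulary is left unchanged. Whether the thesis constrains the weights in this way is not stated in the abstract.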

Appears in Collections:

College of Engineering/Engineering Practice School (공과대학/대학원)
- Dept. of Computer Science and Engineering (컴퓨터공학부)
  - Theses (Master's Degree_컴퓨터공학부)

Files in This Item:

000000178855.pdf 2.40 MB
