Publications

Exploring the Effects of Tokenizer Extension on Domain-specific Fine-tuning of Language Models (Korean title: 언어 모델의 전문 분야 미세 조정을 통한 토크나이저 확장의 효과 연구)

Authors

김재윤

Advisor
신효필
Issue Date
2025
Publisher
Graduate School of Seoul National University
Keywords
Tokenizer Extension; Medical Fine-tuning; Language Model; Byte Pair Encoding; WordPiece; SentencePiece BPE; Byte-level BPE; Compression
Description
Thesis (Master's) -- Graduate School of Seoul National University: Department of Linguistics, College of Humanities, February 2025. Advisor: 신효필.
Abstract
Large language models (LLMs) pretrained extensively on massive datasets have become dominant. Since it is highly challenging for individuals to train such models from scratch, it has become common practice to fine-tune shared pretrained models for specific tasks. However, when the vocabulary distribution of the fine-tuning data differs significantly from the vocabulary the existing tokenizer can represent, problems arise: the tokenizer may fail to handle the data properly, or it may fragment words into excessively short tokens.
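As an illustration of this fragmentation (not taken from the thesis), the minimal sketch below assumes the Hugging Face transformers library and a general-purpose WordPiece tokenizer ("bert-base-uncased" is only an example model); a rare medical-style term is split into several short subword pieces.

```python
# A minimal sketch, assuming Hugging Face "transformers" is installed and
# "bert-base-uncased" is used as an example general-purpose tokenizer.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# A rare domain term is typically broken into many short subword tokens,
# e.g. something like ['electro', '##ence', '##pha', ...] (exact split varies).
print(tokenizer.tokenize("electroencephalography"))
```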
Extending the tokenizer by adding new vocabulary items can mitigate these problems; however, there has been little in-depth research on the specific effects of tokenizer extension. This study therefore analyzed the effects of tokenizer extension on domain-specific fine-tuning by training small models with tokenizers extended on medical data and conducting several analyses. The medical domain, characterized by frequent use of specialized terminology, was expected to benefit from tokenizer extension.
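The following sketch shows the general mechanics of such an extension, not the thesis's actual setup: the model name "gpt2" and the added terms are illustrative assumptions, and the approach shown is the standard Hugging Face workflow of adding tokens and resizing the embedding matrix.

```python
# A minimal sketch of tokenizer extension, assuming Hugging Face "transformers";
# model name and added terms are placeholders, not the thesis's configuration.
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Add domain-specific vocabulary items collected from medical text.
new_terms = ["myocardial", "infarction", "electroencephalography"]
num_added = tokenizer.add_tokens(new_terms)

# The embedding matrix must grow to cover the new vocabulary entries;
# the new rows are randomly initialized and learned during fine-tuning.
model.resize_token_embeddings(len(tokenizer))
print(f"Added {num_added} tokens; new vocab size: {len(tokenizer)}")
```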
Experiments were conducted by extending BPE (Byte Pair Encoding)-based tokenizers, namely SentencePiece BPE and Byte-level BPE, as well as WordPiece, which uses a similar algorithm. The results showed that although the tokenizers' compression capability improved slightly, the memory and time required for model training increased. In addition, when evaluated on four-option multiple-choice questions from MultiMedQA, models with extended tokenizers performed worse than those with unextended ones. From these results, it can be concluded that tokenizer extension may not be helpful when fine-tuning a language model on domain-specific data.
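One common way to quantify the compression capability mentioned above is the average number of tokens per word on a reference corpus; the sketch below is an assumed illustration of that measurement (the corpus line and "gpt2" tokenizer are placeholders), not the thesis's evaluation code.

```python
# A rough sketch of measuring tokenizer compression, assuming "transformers";
# the corpus and tokenizer name are illustrative placeholders.
from transformers import AutoTokenizer

def tokens_per_word(tokenizer, texts):
    """Average tokens per whitespace-delimited word; lower means better compression."""
    n_tokens = sum(len(tokenizer.tokenize(t)) for t in texts)
    n_words = sum(len(t.split()) for t in texts)
    return n_tokens / n_words

corpus = ["The patient presented with acute myocardial infarction."]
base = AutoTokenizer.from_pretrained("gpt2")
print(tokens_per_word(base, corpus))
```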
Language
eng
URI
https://hdl.handle.net/10371/220877

https://dcollection.snu.ac.kr/common/orgView/000000190084