Exploring the Effects of Tokenizer Extension on Domain-specific Fine-tuning of Language Models : 언어 모델의 전문 분야 미세 조정을 통한 토크나이저 확장의 효과 연구
Cited 0 times in Web of Science; cited 0 times in Scopus
- Authors
- Advisor
- 신효필
- Issue Date
- 2025
- Publisher
- 서울대학교 대학원
- Keywords
- Tokenizer Extension ; Medical Fine-tuning ; Language Model ; Byte Pair Encoding ; WordPiece ; SentencePiece BPE ; Byte-level BPE ; Compression
- Description
- Thesis (Master's) -- Seoul National University Graduate School : Department of Linguistics, College of Humanities, February 2025. Advisor: 신효필.
- Abstract
- Large language models (LLMs) that undergo extensive pretraining on massive datasets over long periods have become dominant. Since it is highly challenging for individuals to train such models from scratch, it has become common practice to fine-tune shared pretrained models for specific tasks. However, when the vocabulary distribution of the fine-tuning data differs significantly from the vocabulary the existing tokenizer can process, issues can arise, such as the tokenizer failing to handle the data properly or fragmenting words into excessively short tokens.
Extending the tokenizer by adding new vocabulary items can be an effective way to mitigate these problems; however, there has been little in-depth research on the specific effects of tokenizer extension. This study therefore analyzed the effects of tokenizer extension on domain-specific fine-tuning by training small models with tokenizers extended on medical data and conducting several analyses. The medical domain, characterized by its frequent use of specialized terminology, was expected to benefit from tokenizer extension.
Experiments were conducted by extending BPE (Byte Pair Encoding)-based tokenizers, namely SentencePiece BPE and Byte-level BPE, as well as WordPiece, which uses a similar algorithm. The results showed that while the tokenizer's compression capability improved slightly, the memory and time required for model training increased. In addition, when evaluated on 4-option multiple-choice questions from MultiMedQA, models with extended tokenizers performed worse than those with unextended ones. From these results, it can be concluded that tokenizer extension may not be helpful when fine-tuning a language model on domain-specific data.
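The extension step the abstract describes can be illustrated with a minimal sketch, not taken from the thesis, using the Hugging Face `transformers` library; the GPT-2 checkpoint and the medical terms below are illustrative assumptions, not the models or vocabulary actually studied.

```python
# Minimal sketch of tokenizer extension (illustrative; not the thesis setup).
from transformers import AutoTokenizer, AutoModelForCausalLM

text = "The patient presented with hyperlipidemia and nephrolithiasis."

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # Byte-level BPE tokenizer
model = AutoModelForCausalLM.from_pretrained("gpt2")

before = tokenizer(text)["input_ids"]

# Add domain terms to the vocabulary, then grow the embedding matrix so the
# model has rows for the newly assigned token ids.
tokenizer.add_tokens(["hyperlipidemia", "nephrolithiasis"])
model.resize_token_embeddings(len(tokenizer))

after = tokenizer(text)["input_ids"]

# Fewer tokens for the same text means higher compression; the new embedding
# rows are randomly initialized and must be learned during fine-tuning.
print(f"tokens before: {len(before)}, tokens after: {len(after)}")
```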
- Language
- eng