Publications

Exploring the Effects of Tokenizer Extension on Domain-specific Fine-tuning of Language Models (Korean title: 언어 모델의 전문 분야 미세 조정을 통한 토크나이저 확장의 효과 연구)

Authors

김재윤

Advisor
신효필
Issue Date
2025
Publisher
Graduate School of Seoul National University
Keywords
Tokenizer Extension; Medical Fine-tuning; Language Model; Byte Pair Encoding; WordPiece; SentencePiece BPE; Byte-level BPE; Compression
Description
Thesis (Master's) -- Graduate School of Seoul National University: Department of Linguistics, College of Humanities, February 2025. Advisor: 신효필.
Abstract
Large language models (LLMs) pretrained extensively on massive datasets have become dominant. Since it is highly challenging for individuals to train such models from scratch, it has become common practice to fine-tune shared pretrained models for specific tasks. However, when the vocabulary distribution of the fine-tuning data differs significantly from the vocabulary the existing tokenizer can represent, problems arise: the tokenizer may fail to handle the data properly, or it may fragment words into excessively short tokens.
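As an illustration of this fragmentation (not taken from the thesis), the minimal sketch below assumes the Hugging Face transformers library and a general-purpose WordPiece tokenizer ("bert-base-uncased" is only an example model); a rare medical-style term is split into several short subword pieces.

```python
# A minimal sketch, assuming Hugging Face "transformers" is installed and
# "bert-base-uncased" is used as an example general-purpose tokenizer.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# A rare domain term is typically broken into many short subword tokens,
# e.g. something like ['electro', '##ence', '##pha', ...] (exact split varies).
print(tokenizer.tokenize("electroencephalography"))
```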
Extending the tokenizer by adding new vocabulary items can mitigate these problems; however, there has been little in-depth research on the specific effects of tokenizer extension. This study therefore analyzed the effects of tokenizer extension on domain-specific fine-tuning by training small models with tokenizers extended on medical data and conducting several analyses. The medical domain, characterized by frequent use of specialized terminology, was expected to benefit from tokenizer extension.
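The following sketch shows the general mechanics of such an extension, not the thesis's actual setup: the model name "gpt2" and the added terms are illustrative assumptions, and the approach shown is the standard Hugging Face workflow of adding tokens and resizing the embedding matrix.

```python
# A minimal sketch of tokenizer extension, assuming Hugging Face "transformers";
# model name and added terms are placeholders, not the thesis's configuration.
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Add domain-specific vocabulary items collected from medical text.
new_terms = ["myocardial", "infarction", "electroencephalography"]
num_added = tokenizer.add_tokens(new_terms)

# The embedding matrix must grow to cover the new vocabulary entries;
# the new rows are randomly initialized and learned during fine-tuning.
model.resize_token_embeddings(len(tokenizer))
print(f"Added {num_added} tokens; new vocab size: {len(tokenizer)}")
```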
Experiments were conducted by extending BPE (Byte Pair Encoding)-based tokenizers, namely SentencePiece BPE and Byte-level BPE, as well as WordPiece, which uses a similar algorithm. The results showed that although the tokenizers' compression capability improved slightly, the memory and time required for model training increased. In addition, when evaluated on four-option multiple-choice questions from MultiMedQA, models with extended tokenizers performed worse than those with unextended ones. From these results, it can be concluded that tokenizer extension may not be helpful when fine-tuning a language model on domain-specific data.
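One common way to quantify the compression capability mentioned above is the average number of tokens per word on a reference corpus; the sketch below is an assumed illustration of that measurement (the corpus line and "gpt2" tokenizer are placeholders), not the thesis's evaluation code.

```python
# A rough sketch of measuring tokenizer compression, assuming "transformers";
# the corpus and tokenizer name are illustrative placeholders.
from transformers import AutoTokenizer

def tokens_per_word(tokenizer, texts):
    """Average tokens per whitespace-delimited word; lower means better compression."""
    n_tokens = sum(len(tokenizer.tokenize(t)) for t in texts)
    n_words = sum(len(t.split()) for t in texts)
    return n_tokens / n_words

corpus = ["The patient presented with acute myocardial infarction."]
base = AutoTokenizer.from_pretrained("gpt2")
print(tokens_per_word(base, corpus))
```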
Language
eng
URI
https://hdl.handle.net/10371/220877

https://dcollection.snu.ac.kr/common/orgView/000000190084