The Effect of Corpus Size and Context on Task Accuracy in Natural Language Processing

이준호

Authors: 이준호

Advisor: 김청택

Issue Date: 2020

Publisher: Graduate School, Seoul National University (서울대학교 대학원)

Keywords: Corpus ; Context ; Word embedding ; Word2Vec ; GloVe ; fastText

Description: Thesis (Master's) -- Graduate School, Seoul National University : College of Humanities, Program in Cognitive Science, 2020. 8. 김청택.

Abstract: Recent research in natural language processing is dominated by complex models built from very large datasets. However, more data does not always guarantee better performance. This thesis examines how the size and the context of the training corpus each affect task accuracy for the three most widely used word embedding techniques (Word2Vec, GloVe, and fastText), using sentiment analysis of movie reviews as the task.
The analysis showed that, at every corpus size, higher context similarity between the embedding model's training corpus and the task corpus yielded higher performance. Moreover, corpora with lower context similarity performed worse than the corpus with the highest context similarity even when they were 15, 17, and 53 times or more the reference size.
The three word embedding techniques each showed distinct characteristics. Word2Vec trained relatively quickly and performed stably. With GloVe, once the training corpus exceeded a certain size, accuracy decreased and the variation in accuracy between models increased. fastText performed best when the characteristics of Korean were taken into account, but it was computationally less efficient, taking about twice as long as the other techniques.
The contribution of this study is that it quantitatively measures the context similarity between the training corpus and the task corpus in word-embedding-based natural language processing, and examines in detail how task accuracy changes with that similarity and with the size of the training corpus.
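The abstract does not state how context similarity was quantified. As a purely illustrative sketch, and not the method used in the thesis, one simple way to compare a candidate training corpus with a task corpus is the cosine similarity of their term-frequency vectors. The function names, the whitespace tokenization, and the toy sentences below are all assumptions made for this example.

```python
import math
from collections import Counter


def term_frequencies(documents):
    """Count whitespace-separated, lower-cased tokens across all documents."""
    counts = Counter()
    for doc in documents:
        counts.update(doc.lower().split())
    return counts


def cosine_similarity(freq_a, freq_b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    shared = set(freq_a) & set(freq_b)
    dot = sum(freq_a[t] * freq_b[t] for t in shared)
    norm_a = math.sqrt(sum(v * v for v in freq_a.values()))
    norm_b = math.sqrt(sum(v * v for v in freq_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty corpus has no direction to compare
    return dot / (norm_a * norm_b)


# Toy data, invented for illustration only (not from the thesis).
task_corpus = ["the movie was great", "the plot was boring"]
review_like = ["a great movie with a boring ending"]
news_like = ["the central bank raised interest rates"]

task_tf = term_frequencies(task_corpus)
sim_reviews = cosine_similarity(task_tf, term_frequencies(review_like))
sim_news = cosine_similarity(task_tf, term_frequencies(news_like))
print(f"review-like corpus: {sim_reviews:.3f}")
print(f"news-like corpus:   {sim_news:.3f}")
```

In this toy setup the review-like corpus scores higher than the news-like one against the movie-review task corpus. For Korean text, a morphological tokenizer would likely be needed in place of whitespace splitting.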

Language: kor

URI: https://hdl.handle.net/10371/170615

http://dcollection.snu.ac.kr/common/orgView/000000162381

Files in This Item:

000000162381.pdf 1.86 MB

Appears in Collections:

College of Humanities (인문대학)
- Program in Cognitive Science (협동과정-인지과학전공)
  - Theses (Master's Degree_협동과정-인지과학전공)
