Detailed Information

A Comparison of Oversampling Effects on Imbalanced Topic Classification of Korean Texts : 한국어 주제 분류에서 오버 샘플링 효과 비교

DC Field: Value
dc.contributor.advisor: 김청택
dc.contributor.author: 서이레
dc.date.accessioned: 2017-10-31T08:30:45Z
dc.date.available: 2017-10-31T08:30:45Z
dc.date.issued: 2017-08
dc.identifier.other: 000000146451
dc.identifier.uri: https://hdl.handle.net/10371/138054
dc.description: Thesis (Master's) -- Seoul National University Graduate School, College of Humanities, Interdisciplinary Program in Cognitive Science, August 2017. 김청택.
dc.description.abstract: Imbalanced data is a widely acknowledged problem in supervised classification tasks. Oversampling is one way to address it, and many oversampling methods have been proposed. While the effect of oversampling has been widely studied for other languages, studies comparing oversampling methods on Korean texts are scarce. This study compares the effects of oversampling methods on the task of classifying Korean internet news articles by topic. It finds that support vector machines (SVM) and logistic regression responded stably and performed best when paired with borderline-SMOTE2 under imbalanced conditions.
dc.description.tableofcontents: Introduction 1
Machine Learning and Korean Text Classification 1
A Brief Introduction of the Main Classifiers 2
The Problem of Imbalanced Data 5
Approaches to Solve the Problem of Imbalanced Data 6
Literature Review 10
Imbalanced Data in Korean Studies 10
Characteristics of Text Data 11
Characteristics of the Korean Language 13
Research Question 17
Introduction to SMOTE Methods 18
SMOTE 18
Borderline-SMOTE 19
SVM-SMOTE 22
ADASYN 23
A Framework for Comparing the Effectiveness of SMOTE Methods 28
Relevant Factors in Classification Tasks 28
Performance Measures 29
Implementation 31
Text Preparations 31
Method 34
Experiments 36
Study 1: Articles with High Cosine Similarities 36
Study 2: Articles with Low Cosine Similarity 45
Discussion and Conclusion 54
Discussion 54
Conclusion 58
References 59
Appendix 67
Abstract in Korean 80
dc.format: application/pdf
dc.format.extent: 1347027 bytes
dc.format.medium: application/pdf
dc.language.iso: en
dc.publisher: Seoul National University Graduate School
dc.subject: Imbalanced data
dc.subject: Korean text analysis
dc.subject: oversampling
dc.subject: SMOTE
dc.subject: supervised learning
dc.subject: topic classification
dc.subject.ddc: 153
dc.title: A Comparison of Oversampling Effects on Imbalanced Topic Classification of Korean Texts
dc.title.alternative: 한국어 주제 분류에서 오버 샘플링 효과 비교
dc.type: Thesis
dc.contributor.AlternativeAuthor: Yirey Suh
dc.description.degree: Master
dc.contributor.affiliation: College of Humanities, Interdisciplinary Program in Cognitive Science
dc.date.awarded: 2017-08

Items in S-Space are protected by copyright, with all rights reserved, unless otherwise indicated.
