Supervised Feature Representations for Document Classification
문서 분류를 위한 지도학습 기반의 특징 표현
- College of Engineering, Department of Industrial Engineering and Naval Architecture
- Issue Date
- Seoul National University Graduate School
- data mining; text mining; document classification; distributional representations; representational learning
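- Thesis (Ph.D.) -- Seoul National University Graduate School: Department of Industrial Engineering and Naval Architecture, Data Mining major, August 2016. 조성준.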
- Bag-of-words has been the traditional method for deriving document representations, but it suffers from high dimensionality and sparsity. Recently, many methods for obtaining lower-dimensional, dense distributed representations have been proposed. Paragraph vectors is one such algorithm; it extends the word2vec algorithm by treating the paragraph as an additional word. However, it generates a single representation for all tasks, whereas different tasks may require different kinds of representations. In this work we propose supervised paragraph vectors, a task-specific variant of paragraph vectors for situations where class labels exist. Essentially, supervised paragraph vectors trains class labels jointly with words and documents, so that representations of class labels, words, and documents are obtained with respect to the particular classification task. To demonstrate the benefits of the proposed algorithm, three performance criteria are used: interpretability, discriminative power, and computational efficiency. For interpretability, we find the words closest to and farthest from each class vector, and demonstrate that such words are closely related to the corresponding class. We also use principal component analysis to visualize all words, documents, and class labels in a joint space, and show that our method effectively displays the related words and documents for each class label. For discriminative power and computational efficiency, we perform document classification on four commonly used datasets with various classifiers, and achieve classification accuracies comparable to those of bag-of-words and paragraph vectors. The method is further extended to a semi-supervised version. Finally, a score-based lexicon is extracted using supervised paragraph vectors and applied to short document classification tasks.
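The interpretability analysis described in the abstract (ranking words by their proximity to a class vector, then projecting words and class labels into a joint two-dimensional space with principal component analysis) can be illustrated with a minimal sketch. The vectors below are hand-made toy values, not embeddings learned by supervised paragraph vectors. The helper names (`rank_words`, `pca_2d`) and the choice of cosine similarity are assumptions made for illustration; the thesis may use a different proximity measure.

```python
import numpy as np


def cosine_similarities(class_vec, word_vecs):
    """Cosine similarity between one class vector and each row of word_vecs."""
    class_unit = class_vec / np.linalg.norm(class_vec)
    word_units = word_vecs / np.linalg.norm(word_vecs, axis=1, keepdims=True)
    return word_units @ class_unit


def rank_words(class_vec, word_vecs, vocab):
    """Return (word, similarity) pairs sorted from most to least similar."""
    sims = cosine_similarities(class_vec, word_vecs)
    order = np.argsort(-sims)  # descending similarity
    return [(vocab[i], float(sims[i])) for i in order]


def pca_2d(vectors):
    """Project rows onto their first two principal components via SVD."""
    centered = vectors - vectors.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T


# Toy vectors in a shared 3-D space (hypothetical values, for illustration only).
vocab = ["excellent", "great", "boring", "awful"]
word_vecs = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [-0.7, 0.3, 0.1],
    [-0.9, 0.1, 0.0],
])
class_vecs = {
    "positive": np.array([1.0, 0.0, 0.0]),
    "negative": np.array([-1.0, 0.0, 0.0]),
}

# Words closest to and farthest from the "positive" class vector.
ranked = rank_words(class_vecs["positive"], word_vecs, vocab)
closest, farthest = ranked[0][0], ranked[-1][0]

# Joint 2-D projection of words and class labels, ready for plotting.
joint = np.vstack([word_vecs, *class_vecs.values()])
coords = pca_2d(joint)
```

In a real setting, `word_vecs` and `class_vecs` would come from a trained model, and document vectors could be stacked into `joint` in the same way before projection.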