
Supervised Feature Representations for Document Classification
문서 분류를 위한 지도학습 기반의 특징 표현

Authors
박은정
Advisor
조성준
Major
College of Engineering, Department of Industrial Engineering and Naval Architecture
Issue Date
2016-08
Publisher
Graduate School, Seoul National University
Keywords
data mining; text mining; document classification; distributional representations; representational learning
Description
Thesis (Ph.D.) -- Graduate School, Seoul National University: Department of Industrial Engineering and Naval Architecture, Data Mining Major, 2016. 8. 조성준.
Abstract
The traditional method for deriving document representations, bag-of-words, suffers from high dimensionality and sparsity. Recently, many methods for obtaining lower-dimensional, dense distributed representations have been proposed. Paragraph vectors is one such algorithm; it extends the word2vec algorithm by treating the paragraph as an additional word. However, it generates a single representation for all tasks, while different tasks may require different kinds of representations. In this work we propose supervised paragraph vectors, a task-specific variant of paragraph vectors for situations where class labels exist. Essentially, supervised paragraph vectors jointly trains class labels with words and documents, so that representations for each class label, word, and document are obtained with respect to the particular classification task. To demonstrate the benefits of the proposed algorithm, three performance criteria are used: interpretability, discriminative power, and computational efficiency. For interpretability, we find the words that are closest to and farthest from each class vector, and demonstrate that such words are closely related to the corresponding class. We also use principal component analysis to visualize all words, documents, and class labels in a joint space, and show that our method effectively displays the related words and documents for each class label. For discriminative power and computational efficiency, we perform document classification on four commonly used datasets with various classifiers, and achieve classification accuracies comparable to those of bag-of-words and paragraph vectors. The method is further extended to a semi-supervised version. Finally, a score-based lexicon is extracted using supervised paragraph vectors and applied to short document classification tasks.
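The interpretability check described in the abstract, finding the words closest to and farthest from a class vector, can be illustrated with a minimal sketch. It assumes cosine similarity as the distance measure, which the abstract does not specify, and it uses hypothetical two-dimensional vectors invented for illustration; the names `cosine`, `rank_words_by_class`, `toy_word_vecs`, and `toy_class_vec` are not from the thesis, and the vectors are not trained embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_words_by_class(class_vec, word_vecs):
    """Return the words sorted from most to least similar to class_vec."""
    return sorted(
        word_vecs,
        key=lambda w: cosine(class_vec, word_vecs[w]),
        reverse=True,
    )


# Hypothetical 2-d embeddings, invented for illustration only.
# In the setting the abstract describes, words and class labels
# would share one jointly trained vector space.
toy_word_vecs = {
    "excellent": [0.9, 0.1],
    "boring": [-0.8, 0.2],
    "movie": [0.1, 0.9],
}
toy_class_vec = [1.0, 0.0]  # stands in for a "positive" class vector

ranking = rank_words_by_class(toy_class_vec, toy_word_vecs)
closest, farthest = ranking[0], ranking[-1]
```

With real embeddings, `ranking[:k]` and `ranking[-k:]` would give the k nearest and k farthest words for a class, which is the kind of list the interpretability analysis inspects.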
Language
English
URI
https://hdl.handle.net/10371/118292
Appears in Collections:
College of Engineering/Engineering Practice School (공과대학/대학원) > Dept. of Industrial Engineering (산업공학과) > Theses (Ph.D. / Sc.D._산업공학과)

