Supervised Feature Representations for Document Classification

박은정


Supervised Feature Representations for Document Classification : 문서 분류를 위한 지도학습 기반의 특징 표현

Authors: 박은정

Advisor: 조성준

Major: College of Engineering, Department of Industrial Engineering and Naval Architecture

Issue Date: 2016-08

Publisher: Seoul National University Graduate School

Keywords: data mining ; text mining ; document classification ; distributional representations ; representational learning

Description: Thesis (Ph.D.) -- Seoul National University Graduate School: Department of Industrial Engineering and Naval Architecture, Data Mining major, 2016. 8. 조성준.

Abstract: The traditional method for deriving document representations is bag-of-words, which suffers from high dimensionality and sparsity. Recently, many methods have been proposed for obtaining lower-dimensional, dense distributed representations. Paragraph vectors is one such algorithm; it extends the word2vec algorithm by treating the paragraph as an additional word. However, it generates a single representation for all tasks, while different tasks may require different kinds of representations. In this work we propose supervised paragraph vectors, a task-specific variant of paragraph vectors for situations where class labels exist. Essentially, supervised paragraph vectors jointly trains class labels with words and documents, so that representations of the class labels, words, and documents are obtained with respect to the particular classification task. To demonstrate the benefits of the proposed algorithm, three performance criteria are used: interpretability, discriminative power, and computational efficiency. For interpretability, we find the words that are close to and far from the class vectors, and demonstrate that such words are closely related to the corresponding class. We also use principal component analysis to visualize all words, documents, and class labels in a joint space, and show that our method effectively displays the related words and documents for each class label. For discriminative power and computational efficiency, we perform document classification on four commonly used datasets with various classifiers, and achieve classification accuracies comparable to those of bag-of-words and paragraph vectors. This method is further extended to a semi-supervised version. Finally, a score-based lexicon is extracted using supervised paragraph vectors and is applied to short document classification tasks.
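The interpretability check described in the abstract (finding words close to and far from a class vector in a joint space) can be sketched roughly as follows. This is a minimal illustration and not the thesis's implementation: the 3-dimensional vectors, the vocabulary, the class names, and the helper names `cosine` and `rank_words` are all hypothetical, and cosine similarity is an assumed choice of closeness measure that the abstract does not specify.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_words(class_vec, word_vecs):
    """Return the words sorted from most to least similar to class_vec."""
    return sorted(
        word_vecs,
        key=lambda word: cosine(class_vec, word_vecs[word]),
        reverse=True,
    )


# Hypothetical embeddings in a shared space; the values are made up
# for illustration and are not taken from the thesis.
word_vecs = {
    "excellent": [0.9, 0.1, 0.0],
    "boring": [-0.8, 0.2, 0.1],
    "movie": [0.05, 0.9, 0.3],
}
class_vecs = {
    "positive": [1.0, 0.0, 0.0],
    "negative": [-1.0, 0.0, 0.0],
}

ranking = rank_words(class_vecs["positive"], word_vecs)
closest, farthest = ranking[0], ranking[-1]
print(f"closest to 'positive': {closest}; farthest: {farthest}")
```

With jointly trained vectors in place of the toy values, the same ranking could be run once per class label to list the words most and least associated with each class.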

Language: English

URI: https://hdl.handle.net/10371/118292

Files in This Item:

000000137497.pdf 3.72 MB

Appears in Collections:

College of Engineering/Engineering Practice School
- Dept. of Industrial Engineering
  - Theses (Ph.D. / Sc.D.)
