Publications

Detailed Information

Density-based Representation Learning with Applications to Sentiment Analysis and Domain Adaptation : 밀도표현 학습 방법론과 감성분석, 도메인 적응에의 응용

Cited 0 times in Web of Science · Cited 0 times in Scopus
Authors

박새롬

Advisor
이재욱
Major
Department of Industrial Engineering, College of Engineering (공과대학 산업공학과)
Issue Date
2018-02
Publisher
Seoul National University Graduate School (서울대학교 대학원)
Keywords
representation learning, manifold learning, denoising autoencoder, distributed representation, sentiment analysis, domain adaptation
Description
Doctoral dissertation -- Seoul National University Graduate School: Department of Industrial Engineering, College of Engineering, February 2018. Advisor: 이재욱.
Abstract
As more and more raw data are created and accumulated, it becomes increasingly important to extract information from them. Machine learning and deep learning models are now the main tools for analyzing such data, but their performance depends heavily on how the data are represented. Recent work on representation learning has shown that capturing the input density helps extract useful information from data. This dissertation therefore focuses on density-based representation learning. For high-dimensional data, the manifold assumption is a key concept in representation learning, because such data are in practice concentrated near a lower-dimensional high-density region (the manifold).
Unstructured data must be converted into numerical vectors before machine learning and deep learning models can be applied. For text data, distributed representation learning can effectively capture the information in the input while producing continuous vectors for words and documents. In this dissertation, we address several issues concerning the manifold of input data and the distributed representation of text data from the perspective of density-based representation learning.
First, we examine denoising autoencoders (DAEs) from the perspective of dynamical systems when the input density is defined as a distribution on a manifold. We construct a dynamic projection system associated with the score function, which can be obtained directly from an autoencoder trained on Gaussian-corrupted input data. We derive several analytical results for this system and apply them to develop a nonlinear projection algorithm that recognizes the high-density region and reduces the noise in corrupted inputs. The effectiveness of this algorithm is verified through experiments on toy examples and real image benchmark datasets.
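The projection dynamics described above can be sketched in a few lines. A well-trained DAE's reconstruction r(x) satisfies r(x) − x ≈ σ²·∇log p(x) (the score), so iterating along r(x) − x drives a point toward the high-density region. The snippet below is a minimal one-dimensional illustration: instead of the dissertation's trained model, it substitutes the analytically optimal denoiser for Gaussian data corrupted by Gaussian noise (an illustrative assumption, as are the constants and step size).

```python
import numpy as np

# Data distribution N(MU, S2) corrupted by N(0, SIGMA2) noise.
# For this toy case the optimal DAE reconstruction has a closed form.
MU, S2, SIGMA2 = 2.0, 1.0, 0.25

def optimal_dae(x):
    """E[x | x_noisy]: the reconstruction an ideal DAE would learn."""
    return (S2 * x + SIGMA2 * MU) / (S2 + SIGMA2)

def project(x, eta=1.0, n_steps=50):
    """Dynamic projection: step repeatedly along r(x) - x (the score
    direction, up to the sigma^2 factor) toward the high-density region."""
    for _ in range(n_steps):
        x = x + eta * (optimal_dae(x) - x)
    return x

x0 = 10.0            # a corrupted point far from the data
x_proj = project(x0) # converges toward the density mode MU
```

In this toy case the fixed point of the dynamics is the density mode, which is exactly the "recognize the high-density region" behavior the abstract describes; with real data the score comes from the trained autoencoder rather than a closed form.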
The support vector domain description (SVDD) model can estimate the input density from the trained kernel radius function under mild conditions on the margin and kernel parameters. We propose a novel inductive ensemble clustering method in which kernel support matching is applied to a co-association matrix that aggregates arbitrary basic partitions, constructing a new similarity for the kernel radius function. Experimental results demonstrate that the proposed method achieves high clustering quality and robustly induces clusters for out-of-sample data.
We also develop low-density regularization methods for the DAE model by exploiting the energy of the trained kernel radius function. Illustrative examples show that the regularization effectively pulls up the energy outside the support.
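The kernel radius function at the heart of the two paragraphs above is R²(x) = K(x,x) − 2·Σᵢ αᵢ K(xᵢ,x) + Σᵢⱼ αᵢαⱼ K(xᵢ,xⱼ); with a Gaussian kernel it decreases as the local density increases, which is the density connection the abstract relies on. The sketch below evaluates this function with uniform coefficients αᵢ = 1/n in place of the SVDD dual solution (an illustrative shortcut; the toy data and kernel width are also assumptions).

```python
import numpy as np

def gaussian_kernel(X, Y, gamma=1.0):
    """Pairwise Gaussian kernel matrix K[i, j] = exp(-gamma * ||X_i - Y_j||^2)."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def kernel_radius(x, X_train, gamma=1.0):
    """SVDD kernel radius R^2(x), with uniform a_i = 1/n standing in
    for the dual coefficients from the SVDD quadratic program."""
    n = len(X_train)
    a = np.full(n, 1.0 / n)
    Kxx = 1.0  # Gaussian kernel: K(x, x) = 1
    Kx = gaussian_kernel(x[None, :], X_train, gamma)[0]
    KXX = gaussian_kernel(X_train, X_train, gamma)
    return Kxx - 2.0 * a @ Kx + a @ KXX @ a

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(200, 2))          # toy cluster at the origin
r_center = kernel_radius(np.zeros(2), X)          # inside the support
r_far = kernel_radius(np.array([5.0, 5.0]), X)    # outside the support
```

The radius is small where the density is high and grows toward the kernel-induced ceiling outside the support, which is the "energy" that the low-density regularization pulls up.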
Learning document representations is important when applying machine learning algorithms to sentiment analysis. Distributed representation learning models for words and documents, a class of neural language models, have been used successfully in many natural language processing (NLP) tasks, including sentiment analysis. However, because such models learn embeddings with only a context-based objective, the embeddings struggle to reflect the sentiment of texts. In this research, we address this problem by introducing a semi-supervised sentiment-discriminative objective that uses partial sentiment information about documents. Our method not only reflects the partial sentiment information but also preserves the local structures induced by the original distributed representation learning objectives, by considering sentiment relationships only between neighboring documents. Using real-world datasets, the proposed method is validated through sentiment visualization and classification tasks and achieves consistently superior performance to other representation methods on both the Amazon and Yelp datasets.
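The sentiment-discriminative idea above can be sketched as a pairwise term added on top of the context-based objective (not shown here): neighboring document embeddings are pulled together when their partial sentiment labels agree and pushed apart when they disagree. The neighbor pairs, margin, and contrastive form below are illustrative assumptions, not the dissertation's exact formulation.

```python
import numpy as np

def sentiment_regularizer(D, pairs, labels, margin=1.0):
    """Semi-supervised pairwise term over neighboring labeled documents.

    D: (n_docs, dim) document embeddings.
    pairs: (i, j) index pairs of neighboring documents with known labels.
    labels: partial sentiment labels in {0, 1} (only needed for paired docs).
    """
    loss = 0.0
    for i, j in pairs:
        d2 = ((D[i] - D[j]) ** 2).sum()
        if labels[i] == labels[j]:
            loss += d2                        # same sentiment: attract
        else:
            loss += max(0.0, margin - d2)     # different sentiment: repel
    return loss / len(pairs)
```

Restricting the pairs to neighboring documents is what preserves the local structure of the original embedding objective: documents that were never neighbors are left untouched by the sentiment term.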
NLP is one of the most important application areas for domain adaptation, because the properties of texts depend heavily on their corpus. Many domain adaptation methods for NLP operate on the numerical representation of texts rather than on the textual input itself. We therefore develop a distributed representation learning method for documents and words, aimed at domain adaptation, that addresses the support separation problem, wherein the supports of different domains are separable. We propose a new method based on negative sampling that learns document embeddings under the assumption that the noise distribution depends on the domain. The method has two variants, depending on whether the noise distribution of words also depends on the domain when training the word embeddings. Through experiments on Amazon reviews, we verify that the proposed methods outperform other representation methods in terms of visualization and proxy A-distance. We also perform sentiment classification tasks to validate the effectiveness of the document embeddings, and the proposed methods achieve consistently better results than other methods.
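The domain-dependent noise distribution above can be sketched as follows: rather than a single global noise distribution, negatives for a document are drawn from the unigram distribution of that document's own domain, here smoothed with the 3/4 power common in word2vec-style negative sampling. The toy corpus, smoothing power, and function names are illustrative assumptions.

```python
import numpy as np

def domain_noise_dist(counts, power=0.75):
    """Smoothed unigram noise distribution for one domain."""
    p = np.asarray(counts, dtype=float) ** power
    return p / p.sum()

def sample_negatives(domain_counts, domain, k, rng):
    """Draw k negative word ids from the given domain's noise distribution."""
    p = domain_noise_dist(domain_counts[domain])
    return rng.choice(len(p), size=k, p=p)

# Toy vocabulary of 4 words; the two domains use words very differently,
# so their noise distributions (and hence their negatives) differ.
domain_counts = {
    "books":       [50, 30, 15, 5],
    "electronics": [5, 15, 30, 50],
}
rng = np.random.default_rng(0)
negs = sample_negatives(domain_counts, "books", k=5, rng=rng)
```

Making the word-level noise distribution domain-dependent as well (versus only the document-level one) is what distinguishes the two variants of the method mentioned above.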
Large amounts of data are now available in high-dimensional or textual form, so representation learning is needed both to capture the manifold of high-dimensional data and to obtain numerical vectors of text that reflect useful information. The algorithms proposed here help satisfy these requirements and can be applied to a variety of data analytics tasks.
Language
English
URI
https://hdl.handle.net/10371/140588

Items in S-Space are protected by copyright, with all rights reserved, unless otherwise indicated.
