Subject-Verb-Object Tuple Embedding Model: Explanation and Evaluation

김준혁


Subject-Verb-Object Tuple Embedding Model: Explanation and Evaluation : 주어-동사-목적어 튜플의 임베딩 모형

DC Field	Value	Language
dc.contributor.advisor	박병욱	-
dc.contributor.author	김준혁	-
dc.date.accessioned	2019-05-07T04:34:18Z	-
dc.date.available	2019-05-07T04:34:18Z	-
dc.date.issued	2019-02	-
dc.identifier.other	000000153878	-
dc.identifier.uri	https://hdl.handle.net/10371/151618	-
dc.description	Thesis (Master's) -- Graduate School, Seoul National University: College of Natural Sciences, Department of Statistics, 2019. 2. 박병욱.	-
dc.description.abstract	In NLP, mapping text data to Euclidean space is a crucial step in the modeling process. We implement a tuple embedding model using subject-verb-object tuples extracted from Reuters news headlines. Our experiments with a triplet loss function show clear syntactic relationships among headline vectors. Headline vectors tend to gather around similar verbs and form distinct clusters according to the similarity of their subjects and objects. Interestingly, when we increase the size of the model, existing clusters split further by actor or action. The major weakness is that headline vectors are sensitive to the order, form, and number of words in their subject-verb-object tuples, owing to the model architecture and the use of averaged word vectors. In particular, averaging produces good results only when the subject, verb, and object contain few words; when they contain many words, it produces arbitrary results. Hence, addressing the problem of averaged vectors may be the key to better performance.	-
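dc.description.abstract	In natural language processing (NLP), mapping text data to Euclidean space plays a key role in the modeling process. In this thesis, we apply a model that embeds subject-verb-object tuple data extracted from Reuters news headlines. In particular, using a triplet loss function yielded syntactically clearer relationships than the existing loss functions. Headline vectors first formed large clusters around similar verbs (actions) and were then further separated into distinct clusters according to the similarity of their subjects and objects. Interestingly, increasing the complexity of the model subdivides the existing clusters by actor. The model's major weakness is its sensitivity to the order of the subject, verb, and object and to the number of words composing them. This stems from the model's architecture and the use of averaged word vectors. In particular, averaged vectors perform well when the subject, verb, and object consist of few words. When the number of words grows, however, the relationships change arbitrarily even among words of similar meaning. Overcoming the problems of averaged vectors is therefore the key to improving the model's performance.	-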
dc.description.tableofcontents	1. Introduction; 1.1 Introduction; 1.2 Related Studies; 2. Subject-Verb-Object Tuple Embedding Models; 2.1 Baseline model; 2.2 Incorporating word vectors as learning parameters; 2.3 Some modified loss functions; 2.3.1 Dynamic margin loss function; 2.3.2 Triplet loss function; 3. Data; 3.1 Gathering news data; 3.2 Data preprocessing; 3.3 Extract S-V-O tuples; 3.4 Word embedding; 4. Experiments and Evaluation; 4.1 Data descriptions; 4.2 Generating corrupted data; 4.3 Tuple embedding models; 4.4 Qualitative analysis; 4.5 Selected topics: triplet loss; 4.6 Selected topics: baseline model; 4.7 Random sampling; 4.8 Main verb and its derivative verb phrases; 5. Discussion and Conclusion; 5.1 Weakness: averaging word vectors; 5.2 Loss functions; 5.3 The direction of data preprocessing; 5.4 Conclusion; References	-
dc.language.iso	eng	-
dc.publisher	Graduate School, Seoul National University	-
dc.subject.ddc	519.5	-
dc.title	Subject-Verb-Object Tuple Embedding Model: Explanation and Evaluation	-
dc.title.alternative	주어-동사-목적어 튜플의 임베딩 모형	-
dc.type	Thesis	-
dc.type	Dissertation	-
dc.contributor.AlternativeAuthor	Kim, Jun Hyeok	-
dc.description.degree	Master	-
dc.contributor.affiliation	College of Natural Sciences, Department of Statistics	-
dc.date.awarded	2019-02	-
dc.identifier.uci	I804:11032-000000153878	-
dc.identifier.holdings	000000000026▲000000000039▲000000153878▲	-
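
The abstract refers to two mechanisms: averaging the word vectors of each subject, verb, and object, and training with a triplet loss. The following is a minimal sketch of both ideas, not the thesis implementation. The toy 2-dimensional word vectors, the helper names (`average`, `embed_tuple`, `triplet_loss`), the concatenation of the three averaged parts, and the margin value are all assumptions made for illustration; the thesis combines the parts with its own model architecture, which this record does not describe.

```python
import math

# Hypothetical 2-d word vectors; real models would use pretrained embeddings.
TOY_VECTORS = {
    "bank": [1.0, 0.0],
    "central": [0.5, 0.5],
    "raises": [0.0, 1.0],
    "cuts": [0.0, -1.0],
    "rates": [1.0, 1.0],
}

def average(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def embed_tuple(subject, verb, obj):
    """Assumed tuple embedding: average each part, then concatenate S, V, O."""
    parts = []
    for words in (subject, verb, obj):
        parts.extend(average([TOY_VECTORS[w] for w in words]))
    return parts

def distance(a, b):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """A common triplet loss form: the positive should be closer to the
    anchor than the negative is, by at least `margin`."""
    return max(0.0, distance(anchor, positive) - distance(anchor, negative) + margin)

anchor = embed_tuple(["bank"], ["raises"], ["rates"])
# Same action, but a two-word subject: averaging shifts the subject part.
positive = embed_tuple(["central", "bank"], ["raises"], ["rates"])
# Different action: only the verb part changes.
negative = embed_tuple(["bank"], ["cuts"], ["rates"])

print(triplet_loss(anchor, positive, negative))
print(triplet_loss(anchor, negative, positive))
```

In this toy setup the first triplet incurs no loss because the tuple with the same verb is already much closer to the anchor, while swapping the positive and negative roles produces a positive loss. The two-word subject also shows how averaging moves a phrase vector away from its head word, which is the kind of sensitivity to word count that the abstract describes.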

Appears in Collections:

College of Natural Sciences (자연과학대학)
- Dept. of Statistics (통계학과)
  - Theses (Master's Degree_통계학과)

Files in This Item:

000000153878.pdf 4.80 MB
