
Leveraging More Fine-grained Representation to Reduce Instability within Word Embeddings

Authors

Park, Suzi; Shin, Hyopil

Issue Date
2019-11
Publisher
Korean Society for Language and Information (한국언어정보학회)
Citation
Language and Information (언어와 정보), Vol. 23, No. 3, pp. 1-18
Abstract
This paper investigates the notion of instability in word embeddings, focusing primarily on the Korean language, by analyzing the effects of instability on several tasks using word-level, character-level, and sub-character-level embeddings. Korean has a complex morphological system and is written in units smaller than a single character, making it well suited to comparing the three types of embeddings. We first carry out several intrinsic evaluations, such as word similarity and analogy tests, as well as an extrinsic evaluation of the embeddings on the task of sentiment analysis. The results show that sub-character embeddings outperform the other two types, obtaining both higher and more consistent scores, especially when the corpus is large. Based on these findings, we hypothesize that embeddings trained on smaller units and larger corpora are also a better choice in terms of stability. To investigate this, we implement two stability tests, examining the standard deviation of cosine similarities and the intersection of nearest neighbors across models. Our findings demonstrate that, indeed, the smaller the embedding unit and the larger the corpus, the more stable the embedding type is.
Importantly, our results illustrate the necessity of taking instability into account when using word embeddings in a given experiment.
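The two stability tests described in the abstract can be illustrated with a short sketch. The following Python code is not the authors' implementation; the function names, the choice of k, and the toy random matrices standing in for embeddings trained in separate runs are all assumptions for illustration.

```python
import numpy as np

def cosine_sim_matrix(emb):
    # Row-normalize so dot products equal cosine similarities.
    norms = np.linalg.norm(emb, axis=1, keepdims=True)
    unit = emb / np.clip(norms, 1e-12, None)
    return unit @ unit.T

def similarity_std(embeddings, word_idx, other_idx):
    # Stability test 1: standard deviation, across runs, of the
    # cosine similarity between a fixed pair of words.
    sims = [cosine_sim_matrix(e)[word_idx, other_idx] for e in embeddings]
    return float(np.std(sims))

def neighbor_overlap(embeddings, word_idx, k=10):
    # Stability test 2: fraction of top-k nearest neighbors of a word
    # that are shared by every run (intersection over runs).
    neighbor_sets = []
    for e in embeddings:
        sims = cosine_sim_matrix(e)[word_idx].copy()
        sims[word_idx] = -np.inf  # exclude the query word itself
        topk = np.argsort(sims)[-k:]
        neighbor_sets.append(set(topk.tolist()))
    common = set.intersection(*neighbor_sets)
    return len(common) / k

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    vocab, dim, runs = 1000, 50, 5
    # Toy stand-ins for embedding matrices trained with different seeds.
    embeddings = [rng.standard_normal((vocab, dim)) for _ in range(runs)]
    print(similarity_std(embeddings, word_idx=0, other_idx=1))
    print(neighbor_overlap(embeddings, word_idx=0, k=10))
```

Under this sketch, a lower similarity standard deviation and a higher neighbor overlap both indicate a more stable embedding type, matching the direction of the comparisons reported in the abstract.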
ISSN
1226-7430
URI
https://hdl.handle.net/10371/195079