S-Space Kyujanggak Institute for Korean Studies (규장각한국학연구원) Korean Culture (한국문화) vol.31 (2003)
Quantitative Lexical Analysis and the Problem of Word Spacing (어휘 계량적 분석과 띄어쓰기 문제)
- Issue Date
- Kyujanggak Institute for Korean Studies, Seoul National University (서울대학교 규장각한국학연구원)
- Korean Culture (한국문화), Vol. 31, pp. 49-76
- It is well known that one of the most troublesome problems in word frequency analysis of modern Korean corpora is irregularity in word spacing, especially in multi-word lexical units (MWLUs), including compounds. This has two causes. On the one hand, the rules that regulate spacing in modern Korean contain contradictions and ambiguities. On the other hand, no dictionary, not even a full-size one, can register every MWLU and compound word and so serve as a complete reference for word spacing, yet the lexicons of most Korean language processing tools depend on paper dictionaries. As a result, the compounds in word frequency lists are inconsistent, and this affects the overall results of the frequency analysis of a corpus. It is argued that, to overcome these problems, it is preferable to compile a list of compound words and MWLUs from the corpus to be analysed and to reorganize the lexicon of the language processing tools on the basis of that list. Since this list can in turn be used to supplement a revision of the dictionary originally used for the word frequency analysis, the whole process of word frequency analysis is circular.
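
The effect described in the abstract can be illustrated with a minimal sketch. It is not the paper's procedure. The list `MWLU_LIST`, the helper names `merge_mwlus` and `word_frequencies`, and the three sample sentences are all hypothetical, and the code assumes whitespace-separated tokens with no particles attached, which real Korean text would need morphological analysis to obtain. It shows how the same compound written with and without a space is counted as different items, and how a corpus-derived list of compounds could be used to merge the variants before counting.

```python
from collections import Counter

# Hypothetical corpus-derived list of compounds/MWLUs, stored unspaced.
MWLU_LIST = {"띄어쓰기", "한국문화"}


def merge_mwlus(tokens, mwlus, max_len=3):
    """Greedily join adjacent tokens whose concatenation is a listed MWLU."""
    merged = []
    i = 0
    while i < len(tokens):
        # Try the longest window first, down to two tokens.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            candidate = "".join(tokens[i:i + n])
            if candidate in mwlus:
                merged.append(candidate)
                i += n
                break
        else:
            # No listed MWLU starts here; keep the token unchanged.
            merged.append(tokens[i])
            i += 1
    return merged


def word_frequencies(sentences, mwlus=frozenset()):
    """Count whitespace-separated tokens after optional MWLU merging."""
    counts = Counter()
    for sentence in sentences:
        counts.update(merge_mwlus(sentence.split(), mwlus))
    return counts


# Hypothetical sample: the same compound appears spaced and unspaced.
sentences = ["띄어쓰기 문제", "띄어 쓰기 문제", "한국 문화 연구"]

raw = word_frequencies(sentences)                    # spacing variants split
normalized = word_frequencies(sentences, MWLU_LIST)  # variants merged

print(raw["띄어쓰기"], normalized["띄어쓰기"])
```

In this toy sample the unnormalized count splits the compound into one unspaced occurrence plus two separate fragments, while the normalized count groups both spellings under a single entry. A greedy longest-match merge is only one possible design choice, and it can merge tokens that are adjacent by coincidence.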