Similarity Query Processing Techniques for Text Data

김영훈

서울대학교 중앙도서관

S-Space 소개

My S-Space

로그인이 필요합니다.

S-Space

Publications

Detailed Information

Similarity Query Processing Techniques for Text Data : 텍스트 데이터에 대한 유사도질의 처리기술

Cited 0 time in Web of Science Cited 0 time in Scopus

Export

Authors: 김영훈

Advisor: 심규석

Major: 공과대학 전기·컴퓨터공학부

Issue Date: 2013-02

Publisher: 서울대학교 대학원

Keywords: Similarity queries ; exact substring matching ; top-k approximate substring ; matching ; top-k approximate string joins

Description: 학위논문 (박사)-- 서울대학교 대학원 : 전기·컴퓨터공학부, 2013. 2. 심규석.

Abstract: 인터넷이 널리 사용되기 시작하며 수많은 텍스트 데이터가 생성되고 텍스트 안에서 효율적으로 유사 문자열을 찾는 방법 또한 중요한 문제가 되었다. 여러 응용 프로그램에서 유사도 질의는 필수적이며 매우 유용한 부분을 차지하고 있다. 기존의 유사도 질의는 사용자가 정한 문턱값에 대해서 질의 문자열과 문서의 유사도가 주어진 문턱값보다 작은 모든 문서를 찾는 접근 방법을 사용하였다. 그러나 사용자가 적당한 문턱값을 미리 알 수가 없기 때문에 적당한 값을 찾기 위해 여러 번 다른 값을 이용해 질의를 해봐야 하는 문제가 있다. 따라서 문턱값을 사용하지 않고 유사한 K개의 문서를 찾는 방법이 최근 활발히 연구되고 있다.

본 논문에서 우선 Google, Bing, Yahoo!과 같은 웹 검색엔진을 비롯한 많은 응용 프로그램에서 실질적으로 유용하게 사용되는 유사도 질의에 대한 대표적인 응용 프로그램을 제시한다. 이 유사도 질의를 효율적으로 처리하기 위해서는 부분 문자열 검색, K근사 부분 문자열 검색 그리고 K근사 문자열 조인에 대한 빠른 질의 처리 방법이 필요하다. 따라서 우리는 이러한 유사도 질의들을 효율적으로 처리하는 문제에 대해 연구하고자 한다.

부분 문자열 검색을 위해 가변 길의 Q그램을 사용한 역 색인을 이용하여 최적의 질의 계획을 찾는 최적(optimal) 알고리즘을 우선 소개하였다. 또한 최적의 질의 계획을 찾기 위해 고려해야 하는 경우의 수가 기하급수적으로 크기 때문에 최적에 근사한 질의 계획을 찾는 근사(approximate) 알고리즘도 개발하였다. K근사 부분 문자열 검색을 위해서는 본 연구에서 제안한 부분문자열 최단편집길이(edit distance)의 최소한도(lower bound) 계산법을 이용한 필터링 기술을 개발하였다. 또한 이 필터링 기술을 응용한 효율적인 K근사 문자열 검색 알고리즘을 제안하였다. 다음으로 우리는 K근사 문자열 조인을 위해 조인에 고려해야 하는 다수의 문서 쌍을 미리 제거할 수 있는 분할(partitioning) 기술을 제안하고 이를 이용한 효율적인 알고리즘을 제안하였다. 그리고 실험을 통해 본 논문에서 제안한 알고리즘들이 매우 효과적으로 질의 처리 비용을 줄일 수 있음을 확인하였다.

본 논문에서 연구한 유사도 질의는 웹 검색엔진과 같은 많은 응용 프로그램에 필요하므로 우리가 제안한 알고리즘이 그러한 응용 프로그램의 성능을 효과적으로 향상시킬 수 있을 것이라 생각한다.
With the widespread use of the internet, text-based data sources have become ubiquitous and the demand for effective support of similarity queries in text data continues to increase. While the applications for text similarity queries are diverse, similarity queries are essential and useful in those applications. The traditional approach for similarity queries is to find every similar document satisfying a given minimum similarity threshold which is required to be provided by each user. However, since users have to try repeatedly different minimum similarity threshold to find a proper value, the schemes for top-k similarity query processing are more practical and have been extensively explored recently.

In this dissertation, we first introduce motivating applications utilizing the text similarity queries which are practically applied in many applications including commercial search engines such as Google, Bing and Yahoo!. The useful similarity queries usually consist of exact substring matching, top-k approximate substring matching and top-k approximate string joins. Thus, we investigate the problem of processing these similarity queries efficiently.

For exact substring matching, we propose the optimal algorithm to find the best query plan utilizing inverted variable-length gram indexes. We also develop the approximate algorithms to overcome the exponential search space for finding an optimal plan. For top-k approximate substring matching, we propose effective filtering techniques utilizing our novel lower bounds for substring edit distance. Then, we develop efficient algorithms for top-k approximate substring matching applying our filtering techniques. For top-k approximate string joins, we devise efficient algorithms as well as the essential pair partitioning technique which allows us to effectively ignore many pairs of strings in advance. Our experiments show that the proposed algorithms reduce the query processing cost very efficiently.

Since the similarity queries studied in this dissertation are widely required in many applications such as web search engines, we believe that our proposed algorithms will enhance the performance of those applications practically.

Language: English

URI: https://hdl.handle.net/10371/118899

Files in This Item:

000000009666.pdf 2.24 MB

Appears in Collections:

College of Engineering/Engineering Practice School (공과대학/대학원)
- Dept. of Electrical and Computer Engineering (전기·정보공학부)
  - Theses (Ph.D. / Sc.D._전기·정보공학부)

Altmetrics

Item View & Download Count

Show Full Item Record

Find it @ SNU

트윗하기

SNS Share