Algorithm Engineering on Multikey Quicksort and Multi-byte String Matching

김은상

서울대학교 중앙도서관

S-Space 소개

My S-Space

로그인이 필요합니다.

S-Space

Publications

Detailed Information

Algorithm Engineering on Multikey Quicksort and Multi-byte String Matching : 멀티키 퀵소트와 다중바이트 문자열 검색의 알고리즘 엔지니어링

Cited 0 time in Web of Science Cited 0 time in Scopus

Export

Authors: 김은상

Advisor: 박근수

Major: 전기·컴퓨터공학부

Issue Date: 2012-02

Publisher: 서울대학교 대학원

Abstract: In classical algorithmics, the analysis of algorithms for combinatorial problems focused on the theoretical analysis of asymptotic worst-case run-times. However, asymptotically fast algorithms might not be efficient in practice. In addition, with growing requirements for innovative algorithms, the gap between theory and practice grows larger. Algorithm engineering copes with these problems and bridges the gap between the efficient algorithms developed in algorithmic theory and the algorithms used by practitioners.
In this thesis, we study algorithm engineering on string matching problem and focus on two issues. First, we consider multikey quicksort for sorting strings which is a very important factor of suffix array construction algorithms. Bentley and Sedgewick proposed multikey quicksort with 'split-end' partitioning for sorting strings. But it can be slow in case of many equal element strings, because 'split-end' partitioning moves equal elements to the ends and swaps back to the middle. We improved multikey quicksort in such a case. First, we improve the operation of swapping equal elements from the ends back to the middle in 'split-end' partitioning. Then we improve multikey quicksort by using another partitioning method called 'collect-center' partitioning, instead of 'split-end' partitioning. In case of many equal element strings such as DNA sequences, HTML files, and English texts, multikey quicksort with 'collect-center' partitioning is faster than that with 'split-end' partitioning.
Second, we consider a problem of false matches on multi-byte character set texts. In practice, how the pattern and the text are encoded can make the classic string matching algorithms report false matches. But, there have been few researches on the matching on multi-byte texts and especially there is no result based on suffix approach such as Boyer-Moore algorithm, Horspool algorithm, and its variant which can skip quickly over the text areas that are seen not to contain a match thus have a fast searching speed. We present two algorithms for searching strings on multi-byte texts in this thesis. First, we improve KMP algorithm by adopting a character-based prefix function to avoid false matches on multi-byte texts. Next, we improve Sunday algorithm, a variant of Horspool algorithm, by adopting a character-based shift table. A prefix function with character as a unit and a shift table with character as a unit are proposed for the first time. These algorithms do not only eliminate false matches, but also ensure a fast searching speed without a large slow-down, compared with based algorithms.
고전적인 알고리즘에서는, 조합 문제를 위한 알고리즘의 분석으로써 점근적인 최악 시간 복잡도의 이론적인 분석에 초점을 맞추었다. 그러나 점근적으로 빠른 알고리즘들이 실제로 효율적이지 않을 수 있다. 게다가 혁신적인 알고리즘들에 대한 요구가 늘어남과 동시에, 이론과 실제 사이의 차이는 날로 커져가고 있다. 알고리즘 엔지니어링은 이러한 문제에 대처하여, 알고리즘 이론으로 개발된 효율적인 알고리즘들과 실제로 사용되는 알고리즘들의 차이를 메우는 교량 역할을 한다.
본 논문에서는 문자열 검색 문제에 대한 알고리즘 엔지니어링을 다루고 있으며 다음과 같은 두가지 주제에 초점을 맞추고 있다. 첫째로, 접두사 배열 생성 알고리즘에서 아주 중요한 부분인 문자열 정렬을 위한 멀티키 퀵소트를 개선한 결과를 제시한다. Bentley와 Sedgewick은 'split-end' 분할을 적용한 멀티키 퀵소트를 제안하였다. 그러나 'split-end' 분할 방식은 기준 원소와 동일한 원소들을 배열의 양 끝으로 이동시킨 후에 다시 중앙으로 모으는 작업을 하기 때문에, 문자열에 기준 원소와 동일한 원소들이 많이 있는 경우 느려질 수 있다. 우리는 이러한 경우에 멀티키 퀵소트를 개선한다. 먼저, 'split-end' 분할 작업을 개선한 결과를 제시한다. 그리고 우리는 'split-end' 분할 대신에 새로 제시하는 'collect-center' 분할 방식을 적용하여 멀티키 퀵소트를 개선한다. DNA 염기서열, HTML 파일, 그리고 영문 텍스트와 같이 기준 원소와 동일한 원소가 많은 문자열을 포함하는 경우에 collect-center 분할을 적용한 멀티키 퀵소트는 'split-end' 분할을 적용한 것보다 빠른 속도를 보인다.
둘째로, 우리는 다중바이트 문자집합 텍스트에서 오검색을 제거한 결과를 제시한다. 실제로, 패턴과 텍스트가 어떻게 인코딩되어 있는지에 따라서 기존의 문자열 알고리즘은 오검색을 할 수 있는 가능성이 있다. 그러나 다중바이트 텍스트 상에서의 문자열 검색에 관한 연구는 거의 이루어지지 않았으며, 특히 Boyer-Moore 알고리즘, Horspool 알고리즘 등과 같은 접미사 접근방식에 기반한 결과는 지금까지 알려진 바가 없다. 우리는 본 논문에서 다중바이트 텍스트에서의 문자열 검색을 위한 두가지 알고리즘을 제시한다. 먼저 우리는 오검색을 피할 수 있도록 문자단위 접두사 함수를 적용하여 KMP 알고리즘을 개선한다. 다음으로 우리는 문자단위 이동 테이블을 적용하여 Horspool 알고리즘의 변형인 Sunday 알고리즘을 개선한다. 문자단위의 접두사 함수와 문자단위의 이동 테이블은 처음으로 본 논문에서 제시되는 것이다. 이 알고리즘들은 오검색을 제거할 뿐만 아니라 기반이 되는 알고리즘들과 비교했을 때 큰 속도 저하 없이 빠른 속도를 보장한다.

Language: eng

URI: https://hdl.handle.net/10371/156586

http://dcollection.snu.ac.kr:80/jsp/common/DcLoOrgPer.jsp?sItemId=000000000822

Files in This Item:: There are no files associated with this item.

Appears in Collections:

College of Engineering/Engineering Practice School (공과대학/대학원)
- Dept. of Computer Science and Engineering (컴퓨터공학부)
  - Theses (Ph.D. / Sc.D._컴퓨터공학부)

Altmetrics

Item View & Download Count

Show Full Item Record

Find it @ SNU

트윗하기

SNS Share