S-Space College of Engineering/Engineering Practice School (공과대학/대학원) Dept. of Computer Science and Engineering (컴퓨터공학부) Theses (Ph.D. / Sc.D._컴퓨터공학부)
A Scalable Clustering Algorithm for High-dimensional Data Streams over Sliding Windows
- 공과대학 전기·컴퓨터공학부
- Issue Date
- 서울대학교 대학원
- 학위논문 (박사)-- 서울대학교 대학원 공과대학 전기·컴퓨터공학부, 2017. 8. 이상구.
- Data stream clustering over sliding windows generates clustering results whenever a window moves. However, iterative clustering using all data in a window is highly inefficient in terms of memory and computation time. In this thesis, we address problem of data stream clustering over sliding windows using sliding window aggregation and nearest neighbor search techniques. Our algorithm constructs and maintains temporal group features as a summary of the window using the sliding window aggregation technique. The technique divides a window into disjoint chunks, computes partial aggregates over each chunk, and merges the partial aggregates to compute overall aggregates. To maintain constant size of the summary, the algorithm reduces the size of summary by joining the nearest neighbor. We exploit Locality-Sensitive Hashing for fast nearest neighbor search. We show that Locality-Sensitive Hashing can serve as an effective method for reducing synopses while minimizing the impact on quality. In addition, we also suggest re-clustering policy, which decides whether to append new summary to pre-existing clusters or to perform clustering on whole summary. Our experiments on real-world and synthetic datasets demonstrate that our algorithm can achieve a significant improvement when performing continuous clustering on data streams with sliding windows.