Browse

A Scalable Clustering Algorithm for High-dimensional Data Streams over Sliding Windows

DC Field Value Language
dc.contributor.advisor이상구-
dc.contributor.author연종흠-
dc.date.accessioned2017-10-27T16:40:48Z-
dc.date.available2017-10-27T16:40:48Z-
dc.date.issued2017-08-
dc.identifier.other000000145933-
dc.identifier.urihttps://hdl.handle.net/10371/136790-
dc.description학위논문 (박사)-- 서울대학교 대학원 공과대학 전기·컴퓨터공학부, 2017. 8. 이상구.-
dc.description.abstractData stream clustering over sliding windows generates clustering results whenever a window moves. However, iterative clustering using all data in a window is highly inefficient in terms of memory and computation time. In this thesis, we address problem of data stream clustering over sliding windows using sliding window aggregation and nearest neighbor search techniques. Our algorithm constructs and maintains temporal group features as a summary of the window using the sliding window aggregation technique. The technique divides a window into disjoint chunks, computes partial aggregates over each chunk, and merges the partial aggregates to compute overall aggregates. To maintain constant size of the summary, the algorithm reduces the size of summary by joining the nearest neighbor. We exploit Locality-Sensitive Hashing for fast nearest neighbor search. We show that Locality-Sensitive Hashing can serve as an effective method for reducing synopses while minimizing the impact on quality. In addition, we also suggest re-clustering policy, which decides whether to append new summary to pre-existing clusters or to perform clustering on whole summary. Our experiments on real-world and synthetic datasets demonstrate that our algorithm can achieve a significant improvement when performing continuous clustering on data streams with sliding windows.-
dc.description.tableofcontents1 Introduction 1
2. Preliminaries and Related Work 7
2.1 Data Streams 7
2.2 Window Models 7
2.3 kMeans Clustering 11
2.4 Coreset 12
2.5 Group Features 14
2.6 Related Work 16
2.7 Problem Statement 31
3. GFCS: Group Featurebased Data Stream Clustering with Sliding Windows 35
3.1 2-Level Coresets Construction 35
3.2 2-Level Coresets Maintenance 38
3.3 Clustering on 2-Level Coresets 40
4. CSCS: Coresetbased Data Stream Clustering with Sliding Windows 46
4.1 Coreset Construction based on Nearest Neighbor Search 47
4.2 Coreset Construction based on LocalitySensitive Hashing 60
4.3 Reclustering Policy 66
5. Empirical Evaluation of Data Stream Clustering with Sliding Windows 69
5.1 Experimental Setup 69
5.2 Experimental Results 71
6. Application: Documents Clustering 78
6.1 Vector Representation of Documents 78
6.2 Extension to Other Clustering Algorithms 83
6.3 Evaluation 88
7. Conclusion 95
A. Appendix 109
A.1 Experimental Results of GFCS and CSCS 109
A.2 Experimental Results of Document Clustering 117
-
dc.formatapplication/pdf-
dc.format.extent2498986 bytes-
dc.format.mediumapplication/pdf-
dc.language.isoen-
dc.publisher서울대학교 대학원-
dc.subjectclustering-
dc.subjectdata streams-
dc.subjectsliding windows-
dc.subjectreal-time processing-
dc.subject.ddc621.3-
dc.titleA Scalable Clustering Algorithm for High-dimensional Data Streams over Sliding Windows-
dc.typeThesis-
dc.description.degreeDoctor-
dc.contributor.affiliation공과대학 전기·컴퓨터공학부-
dc.date.awarded2017-08-
Appears in Collections:
College of Engineering/Engineering Practice School (공과대학/대학원)Dept. of Computer Science and Engineering (컴퓨터공학부)Theses (Ph.D. / Sc.D._컴퓨터공학부)
Files in This Item:
  • mendeley

Items in S-Space are protected by copyright, with all rights reserved, unless otherwise indicated.

Browse