Sparse hypergraph clustering and its applications : 희박한 하이퍼그래프의 클러스터링과 그 활용

Cited 0 time in Web of Science Cited 0 time in Scopus


사범대학 수학교육과
Issue Date
서울대학교 대학원
hypergraphclusteringmaximum weight independent setcliquelinkage disequilibriumLD blocksingle nucleotide polymorphismpositive selection
학위논문 (박사)-- 서울대학교 대학원 사범대학 수학교육과, 2017. 8. 유연주.
Clustering is one of the most popular methods to extract meaningful patterns from data. In genomics, increasingly large amounts of DNA sequencing data are being generated. Developing effective clustering tools appropriate for each data structure is a major challenge. In this thesis, we develop the CLQ and CLQ-D clustering algorithm to partition SNP sequence data using theoretical graph-based approaches. Based on these clustering algorithms, the LD block construction method, Big-LD, is able to detect the LD blocks of SNP sequence data using interval graph modeling of the clustering results. A sparse graph is defined as a graph in which the actual number of edges is much less than the possible number of edges. Real-world data including biological, social, and internet network data can be modeled as sparse graphs. Due to the structural characteristic of SNP data, graph models constructed for the clustering algorithm and LD block construction algorithm have a sparse structure, which facilitates the efficient operation of the algorithm in terms of time and memory usage. The Big-LD algorithm detects LD blocks including "holes", which are not allowed in the previous methods. Based on the LD block structure constructed by the Big-LD algorithm, we investigated the relationships between big LD block structure and biological phenomena using the HapMap phase 3 data and phase 1 data of the 1000 Genomes Project. The LD block boundaries detected by the Big-LD algorithm coincided better with the recombination hotspots than previous methods. In addition, we demonstrate that the comparison of LD block structures can provide additional information about positive selection using the results applied to the candidate regions suggested by previous research. By generalizing the Big-LD algorithm, which is designed to partition SNP sequence data into blocks, we suggest four clustering algorithms—PSHSC, PSHRC, PSHSQ, and PSHRQ—for the sparse hypergraph partitioning problem. Simulation experiments demonstrated that the algorithms generate high-quality partitions in terms of global and local quality measures. The partitioning results closely agreed with the true underlying cluster structures of simulated hypergraphs. We also applied the developed algorithm to the problem of predicting protein complexes in yeast protein-protein interaction network data, and confirm its potential as a tool for clustering biological network datasets.
Files in This Item:
Appears in Collections:
College of Education (사범대학)Dept. of Mathematics Education (수학교육과)Theses (Ph.D. / Sc.D._수학교육과)
  • mendeley

Items in S-Space are protected by copyright, with all rights reserved, unless otherwise indicated.