A statistical analysis for next-generation sequencing data with a small number of samples
자료수가 적은 차세대 염기서열자료의 통계적 분석

DC Field Value Language
dc.contributor.advisor 박태성 -
dc.contributor.author 김정수 -
dc.date.accessioned 2017-07-14T06:01:36Z -
dc.date.available 2017-07-14T06:01:36Z -
dc.date.issued 2014-02 -
dc.identifier.other 000000017636 -
dc.identifier.uri https://hdl.handle.net/10371/125373 -
dc.description Thesis (Ph.D.) -- Seoul National University Graduate School : Interdisciplinary Program in Bioinformatics, 2014. 2. 박태성. -
dc.description.abstract As technology advances, new methods must be developed to support more suitable analyses than were previously possible. Since microarray technology was developed, many methods have been introduced, from genome-wide association analysis, which detects causative variants associated with diseases, to differential expression analysis, which identifies genes that differ in abundance. In the early era, when data were generated at great expense, researchers devoted themselves to developing methods for studies with small sample sizes. However, the fast stabilization and the incompleteness of microarray technology led many studies toward larger sample sizes.
The efforts of numerous scientists were concentrated on incorporating revisions into new methods for the analysis of microarray data, and microarray technology therefore stabilized quickly. In microarray technology, the information of interest must be acquired in advance and placed in a limited space as a set of probes. Because of this property, there have been limits to the amount and variety of information that can be accessed. Microarrays are thus better suited to detecting common information than individual-specific information. Consequently, microarray technology has been devoted to large-sample studies that elucidate common phenomena observed across many samples, rather than to small-sample studies.
Next-generation sequencing (NGS) technology is inherently suited to detecting individual-specific information. NGS technology and the concept of personalization have been a good match since the start of the Human Genome Project. However, it is not easy to extract meaningful information from individual data of large volume at 1 base-pair resolution. Furthermore, relatively high cost and limited specimen availability often lead to studies with small numbers of samples (replicates). As a result, obtaining significant results from data with a small number of samples has attracted researchers' attention.
In this thesis, approaches to genomic and transcriptomic data with small sample sizes are presented. Specifically, for genomic data analysis, a new strategy called multiphasic analysis is proposed. Applied to a Mendelian disease, the strategy is shown to efficiently single out a disease-causing variant from various candidates.
For transcriptomic data analysis, a new method is proposed for differential expression analysis between two classes, which is applicable to RNA-Seq data with a small number of replicates, or even with no replicates. The validity of the proposed method is demonstrated by applying it to various real and simulated datasets and comparing the results with those obtained from competing methods.
-
dc.description.tableofcontents ABSTRACT i
CONTENTS v
LIST OF FIGURES viii
LIST OF TABLES ix

1 INTRODUCTION 11
1.1 Background of Omics 11
1.1.1 Genomics 13
1.1.2 Transcriptomics 14
1.1.3 Proteomics 15
1.2 Technologies for High-throughput Omics Data 17
1.2.1 Microarray technology 17
1.2.2 Next-generation Sequencing (NGS) technology 19
1.2.3 Mass Spectrometry 20
1.3 An Analysis of NGS Data with Small Samples 22
1.3.1 Necessity for the Small Sample Analysis 22
1.3.2 Purpose and Novelty of this study 24
1.4 Outline of the thesis 26

2 GENOMIC DATA ANALYSIS 27
2.1 Introduction 28
2.1.1 Genomic variants and diseases association 28
2.1.2 Population-based vs. Family-based studies 31
2.1.3 NGS and Family-based studies 33
2.2 Overview on Existing Approaches with Small Samples 35
2.2.1 Filtering and screening analysis 37
2.2.2 Linkage analysis 38
2.2.3 Copy number variation (CNV) analysis 39
2.3 Prioritizing Disease Causing Variants with Small Deafness Family 41
2.3.1 Background of the study 41
2.3.2 Background of the nonsyndromic hearing loss (NSHL) 43
2.4 Materials and Methods 45
2.4.1 Subjects 45
2.4.2 Audiometric analysis 46
2.4.3 Whole exome sequencing (WES) 46
2.4.4 SNV analysis with WES data 47
2.4.5 Linkage analysis with WES and SNP microarray 48
2.5 Results 50
2.5.1 Clinical features of a NSHL family 50
2.5.2 Copy number analysis using WES data 51
2.5.3 Exome linkage analysis 53
2.5.4 CNV analysis 54
2.6 Conclusion & Discussion 58

3 TRANSCRIPTOMIC DATA ANALYSIS 63
3.1 Introduction 64
3.1.1 From Microarray to RNA-Seq 64
3.1.2 Overview on RNA-Seq Analysis 66
3.1.3 Differential expression (DE) analysis with small samples 68
3.2 A Review of Existing Methods 70
3.2.1 edgeR 71
3.2.2 DESeq 72
3.2.3 NBPSeq 74
3.2.4 NOISeq 75
3.3 Local Pooled Error (LPE) Method 76
3.3.1 A Brief Introduction to LPE Method 77
3.3.2 LPE method revisited 79
3.3.3 Extension of LPEseq 83
3.3.4 A Toy Example 86
3.3.5 Comparison with other methods 89
3.3.6 Additional Extension of the LPE method 90
3.4 Real Data Application 93
3.4.1 Preparing read datasets 93
3.4.2 Results 94
3.5 Simulation Study 99
3.5.1 Generating simulation datasets 99
3.5.2 Simulation results 101
3.6 Conclusion & Discussion 105

4 CONCLUDING REMARKS 108

A USEFUL SCRIPTS 110
A.1 R package LPEseq 110
A.2 LPEseq manual 121
A.3 Example script 123

Bibliography 132
Abstract (Korean) 143
-
dc.format application/pdf -
dc.format.extent 2955749 bytes -
dc.format.medium application/pdf -
dc.language.iso en -
dc.publisher Seoul National University Graduate School -
dc.subject NGS -
dc.subject RNA-Seq -
dc.subject Exome-Seq -
dc.subject Statistical analysis -
dc.subject.ddc 574 -
dc.title A statistical analysis for next-generation sequencing data with a small number of samples -
dc.title.alternative 자료수가 적은 차세대 염기서열자료의 통계적 분석 -
dc.type Thesis -
dc.description.degree Doctor -
dc.citation.pages iii, 144 -
dc.contributor.affiliation College of Natural Sciences, Interdisciplinary Program in Bioinformatics -
dc.date.awarded 2014-02 -
Appears in Collections:
College of Natural Sciences > Program in Bioinformatics > Theses (Ph.D. / Sc.D., Program in Bioinformatics)

Items in S-Space are protected by copyright, with all rights reserved, unless otherwise indicated.
