
Development of prokaryotic taxonomy-based 16S rRNA and genome database
원핵미생물 분류체계에 기반한 16S rRNA 유전자 및 유전체 데이터베이스의 개발

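DC Field | Value | Language
dc.contributor.advisor | Jongsik Chun (천종식) | -
dc.contributor.author | Seok-Hwan Yoon (윤석환) | -
dc.date.accessioned | 2017-10-27T17:12:13Z | -
dc.date.available | 2017-10-27T17:12:13Z | -
dc.date.issued | 2017-08 | -
dc.identifier.other | 000000145201 | -
dc.identifier.uri | https://hdl.handle.net/10371/137145 | -
dc.description | Thesis (Ph.D.) -- Seoul National University Graduate School, College of Natural Sciences, Dept. of Biological Sciences, August 2017. Jongsik Chun. | -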
dc.description.abstract | In prokaryotic taxonomy, the 16S ribosomal RNA (rRNA) gene sequence-based approach has served as an alternative standard method to DNA-DNA hybridization (DDH), with 97% 16S rRNA gene sequence similarity considered equivalent to the 70% DDH value for species demarcation. Although the 16S rRNA gene alone cannot perfectly classify and identify bacterial and archaeal species, it is currently the most generally applicable tool for evaluating the taxonomic position of a prokaryotic strain at the genus or species level. The 16S rRNA-based approach therefore remains important in the classification of prokaryotes, and a database of taxonomically well-curated sequences, such as EzTaxon-e, is essential for accurate species identification.
Recent advances in DNA sequencing technologies, collectively called next-generation sequencing (NGS), have facilitated culture-independent microbial community analysis using the 16S rRNA gene, as well as the use of genome sequencing data for more informative and precise classification and identification of Bacteria and Archaea. Because the current species definition is based on the comparison of genome sequences between the type strain and other strains of a given species, building a genome database with accurate taxonomic information is a pressing need, both for exploring prokaryotic diversity and discovering new species and for routine identification.
In this study, an integrated database called EzBioCloud was constructed to hold the taxonomic hierarchy of Bacteria and Archaea, represented by quality-controlled 16S rRNA gene and genome sequences. The bioinformatics pipelines, tools, and algorithms applied during the construction of the database were also developed to make optimal use of its contents. For more efficient 16S rRNA-based analysis, the pairwise sequence alignment algorithm was improved and a high-performance microbial community analysis pipeline was newly developed to facilitate the analysis of massive NGS data and to produce better results than conventional methods. For whole-genome-based analyses, quality assessment methods for genome assemblies and a genome annotation pipeline were developed and evaluated. A full-length 16S rRNA extraction method and an efficient average nucleotide identity (ANI) calculation algorithm were used to identify public prokaryotic genomes.
To construct the integrated genome database, whole genome assemblies in the NCBI Assembly Database were first screened to detect low-quality genomes and then subjected to a composite identification pipeline that employed gene-based searches followed by the calculation of ANI. The resulting database consisted of 61,700 species/phylotypes, including 13,132 with validly published names, and 62,362 whole genome assemblies that were taxonomically identified at the genus, species, and subspecies levels. Genomic properties, such as genome size and GC content, and occurrence in human microbiome data were calculated for each genus or higher taxon. This comprehensive database of taxonomy, 16S rRNA gene, and genome sequences, together with its accompanying bioinformatics tools, should accelerate genome-based classification and identification of Bacteria and Archaea. The database and related search tools are available at http://www.ezbiocloud.net/.
-
dc.description.tableofcontents | CHAPTER 1 General introduction 1
1.1. Taxonomy of prokaryotes 2
1.1.1. Principle of prokaryotic taxonomy 2
1.1.2. Prokaryotic species concept 4
1.2. Next generation sequencing (NGS) 8
1.2.1. 454 Pyrosequencing 8
1.2.2. Illumina-Solexa sequencing 10
1.2.3. Pacific Bioscience SMRT sequencing 11
1.3. Use of 16S rRNA gene in microbiology 13
1.4. Prokaryotic genomics 17
1.5. Objectives of this study 21
CHAPTER 2 Development of bioinformatics pipelines and tools for EzBioCloud database 23
2.1. Introduction 24
2.1.1. 16S rRNA based prokaryote identification algorithm 25
2.1.2. Microbial community analysis 27
2.1.3. 16S rRNA sequence in genome with short-read sequencing data 31
2.1.4. Public genome data of prokaryotes 31
2.1.5. Quality of genome assembly 32
2.1.6. Average nucleotide identity 33
2.2. Materials and method 36
2.2.1. Improvement of 16S rRNA sequence based identification algorithm 36
2.2.2. Development of microbial taxonomic profiling (MTP) pipeline 38
2.2.3. Method for extracting full-length 16S rRNA genes from short-read sequencing data 42
2.2.4. Pipeline for prokaryotic whole genome analysis 44
2.2.5. Methods for the quality assessment of genome 48
2.2.6. Efficient calculation method for average nucleotide identity 52
2.3. Results 54
2.3.1. Advanced microbial taxonomic profiling (MTP) pipeline 54
2.3.2. Comparison of full length 16S rRNA extraction methods 62
2.3.3. Annotation of public genomes 66
2.3.4. Quality of bacterial genomes 68
2.3.5. Evaluation of algorithms for average nucleotide identity 75
2.4. Discussion 81
CHAPTER 3 Development of EzBioCloud: A taxonomically united database of 16S rRNA and whole genome assemblies 84
3.1. Introduction 85
3.2. Methods 87
3.2.1. Data collection 87
3.2.2. Identification of genome sequences 90
3.2.3. Calculation of genomics features for each taxon 93
3.2.4. Bacterial community analysis of human microbiome 93
3.2.5. Operating system and software development 95
3.3. Results 96
3.3.1. Comparison of databases 96
3.3.2. Hierarchical taxonomic backbone 99
3.3.3. Identification of genome projects 103
3.3.4. Genome-derived information 107
3.4. Discussion 108
CHAPTER 4 General conclusions 111
REFERENCES 115
Abstract in Korean 130
-
dc.format | application/pdf | -
dc.format.extent | 3762145 bytes | -
dc.format.medium | application/pdf | -
dc.language.iso | en | -
dc.publisher | Seoul National University Graduate School | -
dc.subject | Bioinformatics | -
dc.subject | Pipeline | -
dc.subject | Database | -
dc.subject | 16S rRNA | -
dc.subject | Genome | -
dc.subject | Taxonomy | -
dc.subject | Microbiome | -
dc.subject | Next-generation sequencing | -
dc.subject | Prokaryote | -
dc.subject.ddc | 570 | -
dc.title | Development of prokaryotic taxonomy-based 16S rRNA and genome database | -
dc.title.alternative | 원핵미생물 분류체계에 기반한 16S rRNA 유전자 및 유전체 데이터베이스의 개발 | -
dc.type | Thesis | -
dc.contributor.AlternativeAuthor | Seok-Hwan Yoon | -
dc.description.degree | Doctor | -
dc.contributor.affiliation | College of Natural Sciences, Dept. of Biological Sciences | -
dc.date.awarded | 2017-08 | -
Appears in Collections:
College of Natural Sciences (자연과학대학) > Dept. of Biological Sciences (생명과학부) > Theses (Ph.D. / Sc.D._생명과학부)

Items in S-Space are protected by copyright, with all rights reserved, unless otherwise indicated.
