Development of prokaryotic taxonomy-based 16S rRNA and genome database
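Korean title (translated): Development of a 16S rRNA gene and genome database based on the prokaryotic taxonomic system
- College of Natural Sciences, Dept. of Biological Sciences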
- Issue Date
- Graduate School, Seoul National University
- Bioinformatics; Pipeline; Database; 16S rRNA; Genome; Taxonomy; Microbiome; Next-generation sequencing; Prokaryote
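- Thesis (Ph.D.) -- Graduate School, Seoul National University, College of Natural Sciences, Dept. of Biological Sciences, 2017. 8. Jongsik Chun (천종식).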
- In prokaryotic taxonomy, 16S ribosomal RNA (rRNA) gene sequence comparison has served as an alternative standard method to DNA-DNA hybridization (DDH), with 97% 16S rRNA gene sequence similarity considered equivalent to the 70% DDH value for species demarcation. Although the 16S rRNA gene alone cannot perfectly classify and identify bacterial and archaeal species, it is currently the most widely used tool for evaluating the taxonomic position of a prokaryotic strain at the genus or species level. The 16S rRNA-based approach therefore remains important in the classification of prokaryotes, and a database of taxonomically well-curated sequences, such as EzTaxon-e, is essential for accurate species identification.
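The 97% similarity criterion described above can be illustrated with a minimal sketch. It assumes a plain Needleman-Wunsch global alignment with arbitrary scoring values and counts gap columns as differences; these choices, and the function names `global_align`, `percent_similarity`, and `meets_species_threshold`, are illustrative assumptions and not the alignment method developed in the thesis.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment (scoring values are assumptions)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right cell to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def percent_similarity(a, b):
    """Percent of identical columns in the global alignment (gaps count as differences)."""
    aligned_a, aligned_b = global_align(a, b)
    if not aligned_a:
        return 0.0
    matches = sum(x == y for x, y in zip(aligned_a, aligned_b))
    return 100.0 * matches / len(aligned_a)


def meets_species_threshold(similarity, threshold=97.0):
    """True when similarity reaches the 97% value cited for species demarcation."""
    return similarity >= threshold
```

Real 16S rRNA gene sequences are roughly 1,500 bp long, so this quadratic-time sketch is usable for a single pair but would be far too slow for searching a whole reference database.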
Recent advances in DNA sequencing technologies, collectively called next-generation sequencing (NGS), have facilitated culture-independent microbial community analysis using the 16S rRNA gene, as well as the use of genome sequencing data for more informative and precise classification and identification of Bacteria and Archaea. Because the current species definition is based on comparing the genome sequences of the type strain and other strains of a given species, building a genome database with accurate taxonomic information is a pressing need, both for exploring prokaryotic diversity and discovering new species and for routine identification.
In this study, an integrated database, called EzBioCloud, was constructed to hold the taxonomic hierarchy of Bacteria and Archaea, represented by quality-controlled 16S rRNA gene and genome sequences. Various bioinformatics pipelines, tools, and algorithms were developed, both for constructing the database and for making optimal use of its contents. For more efficient 16S rRNA-based analysis, the pairwise sequence alignment algorithm was improved, and a high-performance microbial community analysis pipeline was developed to handle massive NGS data and to produce better results than conventional methods. For whole-genome-based analyses, quality assessment methods for genome assembly and a genome annotation pipeline were developed and evaluated. A full-length 16S rRNA gene extraction method and an efficient average nucleotide identity (ANI) calculation algorithm were used to identify public prokaryotic genomes.
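As a rough illustration of what an ANI calculation involves, the sketch below follows the commonly described fragment-based scheme: the query genome is cut into fixed-size fragments, each fragment is searched against the subject genome, and the identities of sufficiently well-aligned fragments are averaged. The fragment size (1,020 bp) and the cutoffs (30% identity, 70% aligned fraction) follow a widely cited convention and are assumptions here, as are the `FragmentHit` type and function names; the search step is left to an external aligner, and this is not the efficient algorithm developed in the thesis.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class FragmentHit:
    """Best hit of one query fragment against the subject genome (hypothetical type)."""
    identity: float          # percent identity, 0-100
    aligned_fraction: float  # fraction of the fragment that aligned, 0-1


def fragment_genome(sequence: str, size: int = 1020) -> List[str]:
    """Cut a genome sequence into consecutive full-length fragments; the tail is dropped."""
    return [sequence[i:i + size] for i in range(0, len(sequence) - size + 1, size)]


def ani_from_hits(hits: Iterable[FragmentHit],
                  min_identity: float = 30.0,
                  min_aligned_fraction: float = 0.7) -> Optional[float]:
    """Mean identity of fragments passing both cutoffs, or None if none pass."""
    kept = [h.identity for h in hits
            if h.identity >= min_identity
            and h.aligned_fraction >= min_aligned_fraction]
    return sum(kept) / len(kept) if kept else None
```

In practice the hits would come from a nucleotide search tool run on the output of `fragment_genome`; only the averaging logic is shown here.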
To construct the integrated genome database, whole genome assemblies in the NCBI Assembly Database were first screened to identify low-quality genomes and then subjected to a composite identification pipeline that employed gene-based searches followed by the calculation of ANI. The resulting database consisted of 61,700 species/phylotypes, including 13,132 with validly published names, and 62,362 whole genome assemblies that were taxonomically identified at the genus, species, and subspecies levels. Genomic properties, such as genome size and GC content, and the occurrence in human microbiome data were calculated for each taxon at the genus level or higher. This comprehensive database of taxonomy, 16S rRNA gene sequences, and genome sequences, together with its accompanying bioinformatics tools, should accelerate genome-based classification and identification of Bacteria and Archaea. The database and related search tools are available at http://www.ezbiocloud.net/.
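The genomic properties mentioned above, genome size and GC content, can be computed from an assembly in FASTA format along the following lines. This is a minimal sketch: the function names are hypothetical, genome size is taken as the total length of all contigs, and ambiguous bases such as N are assumed to be excluded from the GC denominator.

```python
from typing import Dict, Tuple


def read_fasta(text: str) -> Dict[str, str]:
    """Parse FASTA-formatted text into {contig_id: sequence}."""
    contigs: Dict[str, str] = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the ID.
            parts = line[1:].split()
            current = parts[0] if parts else ""
            contigs[current] = ""
        elif current is not None:
            contigs[current] += line.upper()
    return contigs


def genome_stats(contigs: Dict[str, str]) -> Tuple[int, float]:
    """Return (genome size in bp, GC content in percent of unambiguous bases)."""
    size = sum(len(seq) for seq in contigs.values())
    gc = sum(seq.count("G") + seq.count("C") for seq in contigs.values())
    at = sum(seq.count("A") + seq.count("T") for seq in contigs.values())
    gc_percent = 100.0 * gc / (gc + at) if (gc + at) else 0.0
    return size, gc_percent
```

Averaging such per-genome values over all genomes assigned to a taxon would give a per-taxon summary of the kind the abstract describes.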