Machine learning techniques for decoding and utilizing high throughput RNA sequencing data

김민수

Machine learning techniques for decoding and utilizing high throughput RNA sequencing data : RNA 시퀀싱 데이터의 해독과 활용을 위한 기계학습 기법

Authors: 김민수

Advisor: 김선

Issue Date: 2019-08

Publisher: Seoul National University Graduate School

Keywords: RNA-seq ; RNA editing ; Alternative splicing ; Gene expression ; Machine learning ; Information theory ; Graph embedding ; Dimension reduction ; Autoencoder

Description: Thesis (Ph.D.) -- Seoul National University Graduate School: College of Natural Sciences, Interdisciplinary Program in Bioinformatics, 2019. 8. 김선.

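Abstract: In eukaryotic cellular systems, an mRNA molecule passes through several stages of post-transcriptional regulation between being transcribed and being fully processed and translated into protein. These stages include RNA editing, alternative splicing, and alternative polyadenylation. At any given moment, therefore, the transcriptome is a mixture of diverse intermediates. This complex regulatory system makes the transcriptome difficult to understand as a whole. This dissertation studies machine learning techniques for decoding and utilizing RNA sequencing data, and consists of three studies carried out from the perspectives of RNA editing, alternative splicing, and gene expression.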

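RNA editing is a post-transcriptional RNA sequence regulation mechanism catalyzed by two enzymes, ADAR (A-to-I) and APOBEC (C-to-U). It is an important intracellular regulatory system known to control diverse cellular mechanisms, including protein activity, alternative splicing, and miRNA target regulation. Detecting RNA editing events with RNA sequencing is very important for understanding their biological function. The problem is that this process produces a considerable number of false positives. Because the tens of thousands or more RNA editing residues called per sample cannot all be validated experimentally, a computational model is required to filter them. RDDpred is a model that uses machine learning to identify the false-positive residues that arise when detecting RNA editing events from RNA sequencing data. RDDpred was validated using data from two previously published RNA editing studies.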

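Another complex problem to which RNA sequencing can be applied is measuring intratumor heterogeneity (ITH) at the spliceome level. ITH is an indicator of the diversity of the cell populations that make up a cancer tissue, and recently published studies suggest that transcriptome-level ITH measured from gene expression data is useful for predicting the prognosis of cancer patients. Together with gene expression, the spliceome is one of the main components of the transcriptome, so measuring ITH at the spliceome level is a natural step toward studying transcriptomic ITH more comprehensively. Measuring spliceome-level ITH in cancer from RNA sequencing data faces serious technical obstacles, including complex splicing patterns, widespread intron retention, and short sequencing read lengths. SpliceHetero is a tool for measuring spliceome-level ITH (i.e., sITH) with these problems in mind, and it uses information theory internally. SpliceHetero was extensively validated using simulated data, xenograft tumor data, and TCGA pan-cancer data, and was confirmed to reflect ITH well. In addition, sITH was found to be highly correlated with cancer progression, patient prognosis, and well-known molecular subtypes such as PAM50.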

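The final research topic is the development of a machine learning algorithm that defines patient subspaces specific to particular cancer phenotypes based on gene expression data. RNA sequencing is a useful tool for obtaining the gene expression profiles of cancer patients, but the data are very high-dimensional, with more than 20,000 dimensions, so the dimensionality must be reduced before the data can be put to practical use. Here one can exploit the fact that genes interact with one another in complex but characteristic ways. A collection of experimentally validated protein-protein interactions assembled into a network is called a protein interaction network (PIN). Using the PIN, the dimensionality of RNA sequencing data can be reduced while biologically meaningful features are extracted from the data. Tumor2Vec uses the PIN-level features extracted in this way to define patient subspaces specific to particular cancer phenotypes. Tumor2Vec was applied in a pilot study to predict lymph node metastasis in early-stage oral cancer; it produced a lymph node metastasis prediction model from dimension-reduced RNA sequencing data and, in the process, preserved PIN-level features that explain the cancer phenotype well.
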
In eukaryotic cells, mRNA molecules undergo several post-transcriptional modification steps, such as RNA editing and alternative splicing, before they are fully matured and translated into proteins. Thus, the transcriptome is a complex mixture of intermediates that are processed in multiple steps. This complex regulatory structure makes it difficult to fully understand the landscape of the transcriptome. My doctoral work consists of three studies that enable RNA-seq data to be decoded and utilized in terms of RNA editing, alternative splicing, and gene expression.

RNA editing is a post-transcriptional RNA sequence modification performed by two catalytic enzymes, ADAR (A-to-I) and APOBEC (C-to-U). RNA editing is considered an important regulatory system that controls a variety of cellular functions such as protein activation, alternative splicing, and miRNA targeting. Detecting RNA editing events in RNA-seq data is therefore important for understanding their biological functions. However, a significant number of false positives are known to occur when detecting RNA editing in RNA-seq data. Since it is not possible to experimentally validate all RNA editing residues called from RNA-seq data, a computational model is needed to filter potential false-positive calls. RDDpred, an RNA editing predictor based on machine learning techniques, was developed to filter out false-positive RNA editing calls in RNA-seq data. It uses prior knowledge bases to collect training instances directly from the input data, and then trains random forest (RF) predictors that are specific to that input. RDDpred was tested on two publicly available datasets from RNA editing studies and showed good performance.
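
The training-instance collection step described above can be sketched roughly as follows. This is a minimal illustration, not RDDpred's actual code: the function name `collect_training_instances`, the two-number feature vectors, and the use of one set of known editing sites and one set of known artifact sites as the "prior knowledge bases" are all assumptions made for the example. The resulting `X` and `y` would then be passed to a random forest trainer, which is omitted here.

```python
from typing import Dict, List, Set, Tuple

# A candidate site is identified by (chromosome, 1-based position).
Site = Tuple[str, int]


def collect_training_instances(
    candidates: Dict[Site, List[float]],
    known_edits: Set[Site],
    known_artifacts: Set[Site],
) -> Tuple[List[List[float]], List[int], List[Site]]:
    """Label candidate sites from the input data using prior knowledge.

    Hypothetical rule (an assumption of this sketch): a candidate found only
    in `known_edits` is a positive example (label 1), one found only in
    `known_artifacts` is a negative example (label 0), and everything else,
    including sites present in both sets, is left unlabeled for prediction.
    """
    features: List[List[float]] = []
    labels: List[int] = []
    unlabeled: List[Site] = []
    for site, vector in candidates.items():
        is_edit = site in known_edits
        is_artifact = site in known_artifacts
        if is_edit and not is_artifact:
            features.append(vector)
            labels.append(1)
        elif is_artifact and not is_edit:
            features.append(vector)
            labels.append(0)
        else:
            unlabeled.append(site)
    return features, labels, unlabeled
```

Because the labeled instances come from the same run as the unlabeled candidates, a predictor trained on them sees the same coverage and error profile it will later be asked to classify, which is the sense in which the predictor is specific to the input data.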

Another complex problem in RNA-seq decoding is spliceomic intratumor heterogeneity (i.e., sITH). Intratumor heterogeneity (ITH) represents the diversity of the cell populations that make up cancer tissue. Recent studies have identified ITH at the transcriptome level and suggested that ITH measured from gene expression is useful for predicting prognosis. Measuring ITH at the spliceome level is a natural extension. There are serious technical challenges in measuring sITH from bulk tumor RNA-seq data, such as complex splicing patterns, widespread intron retention, and short sequencing read lengths. SpliceHetero, an information-theoretic method for measuring the sITH of a tumor, was developed to address these technical problems. SpliceHetero was extensively tested in experiments using synthetic data, xenograft tumor data, and TCGA pan-cancer data, and it measured sITH successfully. In addition, sITH was shown to be closely related to cancer progression and clonal heterogeneity, along with clinically significant features such as cancer stage, survival outcome, and PAM50 subtype.
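
As a rough illustration of what an information-theoretic heterogeneity score can look like, the sketch below compares junction-usage distributions at each splice site between a tumor sample and a reference sample. The abstract does not say which measure SpliceHetero uses; the Jensen-Shannon divergence, the averaging over sites, and the names `jsd` and `mean_site_divergence` are assumptions made for this example only.

```python
import math
from typing import Dict, List, Sequence


def _normalize(counts: Sequence[float]) -> List[float]:
    """Turn junction read counts at one splice site into a usage distribution."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("a splice site needs at least one supporting read")
    return [c / total for c in counts]


def _kl(p: Sequence[float], q: Sequence[float]) -> float:
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jsd(tumor_counts: Sequence[float], ref_counts: Sequence[float]) -> float:
    """Jensen-Shannon divergence (base 2, range 0..1) between two usage profiles."""
    if len(tumor_counts) != len(ref_counts):
        raise ValueError("both profiles must list the same junctions")
    p = _normalize(tumor_counts)
    q = _normalize(ref_counts)
    m = [(a + b) / 2 for a, b in zip(p, q)]  # mixture distribution
    return 0.5 * _kl(p, m) + 0.5 * _kl(q, m)


def mean_site_divergence(
    tumor: Dict[str, Sequence[float]],
    reference: Dict[str, Sequence[float]],
) -> float:
    """Average the per-site divergence over splice sites present in both samples."""
    shared = sorted(set(tumor) & set(reference))
    if not shared:
        raise ValueError("no splice sites in common")
    return sum(jsd(tumor[s], reference[s]) for s in shared) / len(shared)
```

Working with per-site usage distributions rather than raw counts keeps the score independent of sequencing depth, and the base-2 divergence is bounded between 0 (identical usage) and 1 (disjoint usage), which makes sites comparable before averaging.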

The last research topic is the development of a machine learning algorithm that defines patient subspaces specific to particular cancer phenotypes based on gene expression data. Since RNA-seq data are high-dimensional, generally comprising 20,000 or more genes, it is not easy to apply a machine learning algorithm to them directly. A network that collects experimentally verified protein interactions is called a Protein Interaction Network (PIN). Tumor2Vec defines the patient subspace by applying a graph embedding technique to the PIN to identify subnetwork communities that interact with each other. Tumor2Vec was used to propose a clinical model by defining a subspace for patients with differing lymph node metastasis status in early oral cancer, and it found biologically significant features at the level of PIN subnetworks in the process.
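
The dimension-reduction idea can be sketched as follows, assuming the gene-to-community assignment has already been derived from the PIN (for example by graph embedding followed by clustering, which is not shown). The function name `community_features` and the choice of averaging expression within each community are assumptions of this sketch, not the aggregation Tumor2Vec is documented to use.

```python
from collections import defaultdict
from typing import Dict


def community_features(
    expression: Dict[str, float],
    gene_to_community: Dict[str, str],
) -> Dict[str, float]:
    """Collapse one patient's gene-level expression into community-level features.

    Each gene is mapped to its (hypothetical) PIN subnetwork community and the
    expression values within a community are averaged. Genes that are absent
    from the mapping are ignored.
    """
    sums: Dict[str, float] = defaultdict(float)
    counts: Dict[str, int] = defaultdict(int)
    for gene, value in expression.items():
        community = gene_to_community.get(gene)
        if community is None:
            continue  # gene is not part of the network
        sums[community] += value
        counts[community] += 1
    return {c: sums[c] / counts[c] for c in sums}
```

Applied to every patient, this turns a profile with roughly 20,000 gene dimensions into one with as many dimensions as there are communities, and each remaining feature still corresponds to a named subnetwork that can be inspected biologically.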

Language: eng

URI: https://hdl.handle.net/10371/162454

http://dcollection.snu.ac.kr/common/orgView/000000157922

Files in This Item:

000000157922.pdf 1.53 MB

Appears in Collections:

College of Natural Sciences (자연과학대학)
- Program in Bioinformatics (협동과정-생물정보학전공)
  - Theses (Ph.D. / Sc.D._협동과정-생물정보학전공)
