Detailed Information

Representation Learning for Biological Sequence Data : 생물학적 서열 데이터에 대한 표현 학습

DC Field | Value | Language
dc.contributor.advisor | 윤성로 | -
dc.contributor.author | 민선우 | -
dc.date.accessioned | 2022-04-20T07:50:16Z | -
dc.date.available | 2022-04-20T07:50:16Z | -
dc.date.issued | 2021 | -
dc.identifier.other | 000000166769 | -
dc.identifier.uri | https://hdl.handle.net/10371/178924 | -
dc.identifier.uri | https://dcollection.snu.ac.kr/common/orgView/000000166769 | ko_KR
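dc.description | Thesis (Ph.D.) -- Seoul National University Graduate School : College of Engineering, Department of Electrical and Computer Engineering, 2021.8. 윤성로. | -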
dc.description.abstract | We live in the era of big data, and the biomedical domain is no exception. With the advent of technologies such as next-generation sequencing, developing methods to capitalize on the explosion of biomedical data is one of the major challenges in bioinformatics. Representation learning, in particular deep learning, has made significant advances in diverse fields where the artificial intelligence community has struggled for many years. However, although representation learning has also shown great promise in bioinformatics, it is not a silver bullet. Off-the-shelf applications of representation learning do not always provide successful results for biological sequence data. Many challenges and opportunities remain to be explored.

This dissertation presents a set of representation learning methods to address three issues in biological sequence data analysis. First, we propose a two-stage training strategy to address throughput and information trade-offs within wet-lab CRISPR-Cpf1 activity experiments. Second, we propose an encoding scheme to model the interaction between two sequences for functional microRNA target prediction. Third, we propose a self-supervised pre-training method to bridge the exponentially growing gap between the numbers of unlabeled and labeled protein sequences. In summary, this dissertation proposes a set of representation learning methods that can derive invaluable information from biological sequence data. | -
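dc.description.abstract | We are entering the era of big data, and the biomedical field is no exception. With the advent of technologies such as next-generation sequencing, developing methods to exploit the explosive growth of biomedical data is one of the major challenges in bioinformatics. Representation learning techniques, including deep learning, have made considerable progress in diverse fields where the artificial intelligence community has long struggled. Representation learning has also shown great promise in bioinformatics. However, simply applying it does not always yield successful results in the analysis of biological sequence data, and many problems still require further research.

This dissertation proposes a set of methods based on representation learning to address three issues in biological sequence data analysis. First, it proposes a two-stage training technique that can cope with the trade-off between information and throughput inherent in gene-scissors experimental data. Second, it proposes an encoding scheme for learning the interaction between two nucleotide sequences. Third, it proposes a self-supervised pre-training technique for making use of the exponentially growing number of uncharacterized protein sequences. In summary, this dissertation proposes a set of methods based on representation learning that can analyze biological sequence data and derive important information from it. | -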
dc.description.tableofcontents | 1 Introduction 1
1.1 Motivation 1
1.2 Contents of Dissertation 4
2 Background 8
2.1 Representation Learning 8
2.2 Deep Neural Networks 12
2.2.1 Multi-layer Perceptrons 12
2.2.2 Convolutional Neural Networks 14
2.2.3 Recurrent Neural Networks 16
2.2.4 Transformers 19
2.3 Training of Deep Neural Networks 23
2.4 Representation Learning in Bioinformatics 26
2.5 Biological Sequence Data Analyses 29
2.6 Evaluation Metrics 32
3 CRISPR-Cpf1 Activity Prediction 36
3.1 Methods 39
3.1.1 Model Architecture 39
3.1.2 Training of Seq-deepCpf1 and DeepCpf1 41
3.2 Experiment Results 44
3.2.1 Datasets 44
3.2.2 Baselines 47
3.2.3 Evaluation of Seq-deepCpf1 49
3.2.4 Evaluation of DeepCpf1 51
3.3 Summary 55
4 Functional microRNA Target Prediction 56
4.1 Methods 62
4.1.1 Candidate Target Site Selection 63
4.1.2 Input Encoding 64
4.1.3 Residual Network 67
4.1.4 Post-processing 68
4.2 Experiment Results 70
4.2.1 Datasets 70
4.2.2 Classification of Functional and Non-functional Targets 71
4.2.3 Distinguishing High-functional Targets 73
4.2.4 Ablation Studies 76
4.3 Summary 77
5 Self-supervised Learning of Protein Representations 78
5.1 Methods 83
5.1.1 Pre-training Procedure 83
5.1.2 Fine-tuning Procedure 86
5.1.3 Model Architecture 87
5.2 Experiment Results 90
5.2.1 Experiment Setup 90
5.2.2 Pre-training Results 92
5.2.3 Fine-tuning Results 93
5.2.4 Comparison with Larger Protein Language Models 97
5.2.5 Ablation Studies 100
5.2.6 Qualitative Interpretation Analyses 103
5.3 Summary 106
6 Discussion 107
6.1 Challenges and Opportunities 107
7 Conclusion 111
Bibliography 113
Abstract in Korean 130 | -
dc.format.extent | ix, 130 | -
dc.language.iso | eng | -
dc.publisher | Seoul National University Graduate School | -
dc.subject | machine learning | -
dc.subject | deep learning | -
dc.subject | representation learning | -
dc.subject | artificial intelligence | -
dc.subject | biological sequence | -
dc.subject | CRISPR | -
dc.subject | microRNA target | -
dc.subject | protein | -
dc.subject.ddc | 621.3 | -
dc.title | Representation Learning for Biological Sequence Data | -
dc.title.alternative | 생물학적 서열 데이터에 대한 표현 학습 | -
dc.type | Thesis | -
dc.type | Dissertation | -
dc.contributor.AlternativeAuthor | Seonwoo Min | -
dc.contributor.department | College of Engineering, Department of Electrical and Computer Engineering | -
dc.description.degree | Ph.D. | -
dc.date.awarded | 2021-08 | -
dc.identifier.uci | I804:11032-000000166769 | -
dc.identifier.holdings | 000000000046▲000000000053▲000000166769▲ | -
Items in S-Space are protected by copyright, with all rights reserved, unless otherwise indicated.
