S-Space College of Dentistry/School of Dentistry (치과대학/치의학대학원) Dept. of Dental Science(치의과학과) Theses (Ph.D. / Sc.D._치의과학과)
Scalable graph path discovery for biomedical linked data
대용량 의생물학 링크드 데이터를 위한 그래프 경로 탐색
- 치의학대학원 치의과학과
- Issue Date
- 서울대학교 대학원
- 학위논문 (박사)-- 서울대학교 대학원 : 치의과학과, 2017. 2. 김홍기.
- A drug could give rise to an adverse effect when combined with another particular drug. Addressing the underlying causes of the adverse effects is crucial for researchers to develop new drugs and for clinicians to prescribe medicine. Most existing approaches attempt to identify a set of target genes for which drugs are most effective, which provides insufficient information regarding these causes in terms of biological dynamics. Drugs should instead be considered as participants in activating a sequence of pathways that lead to some effects. I believe that the causes can better be understood by such linked pathways. Therefore, the purpose of this thesis is to develop algorithms and tools that can be used to discover a sequence of pathways that is activated by a particular drug combination. Furthermore, these algorithms are required to be scalable to manage massive biomedical Linked Data because up-to-date results of biomedical research are increasingly available in Linked Data.
My hypothesis is that for a drug combination, when a drug up-regulates particular pathways in one direction and another drug down-regulates the same pathways in an opposite direction, adverse effects may occur by the drug combination. In this regard, the problem of revealing the causes of adverse effects of drug combinations is cast into the problem of discovering paths of a sequence of linked pathways that begins and ends at the genes that the given drugs target. Therefore, the scalable graph path discovery and matching algorithms are devised such that they work with a distributed computing environment. A pathway graph model is defined to integrate diverse biomedical datasets and a visualization tool is implemented to provide biomedical researchers and clinicians with intuitive interfaces for revealing the causes of the adverse effects.
An algorithm for the shortest graph path discovery is proposed. An existing relational database approach is adapted to address the shortest graph path discovery in a distributed computing framework, in particular, Spark. The 2-hop reachability index is exploited to prune non-reachable paths during discovery computation. A vertex re-labeling technique is proposed to reduce the size of the 2-hop reachability index. Experimental results show that the proposed approach can successfully manage a large graph, which previous studies have failed to do.
The discovered shortest graph path can be transformed into a graph path query to find another similar graph path. To achieve this, a MapReduce algorithm for graph path matching, based on multi-way joins, is proposed. A signature encoding technique is devised to prune intermediate data that is not relevant to the given query. Experiments against RDF (Resource Description Framework) datasets show that SPARQL query processing is faster than the state-of-the-art approaches.
To adapt these algorithms into the problem of drug combinations causing adverse effects, a novel pathway graph model is proposed. In particular, a pathway relationship model is described
directed links between pathways are established using protein–protein interactions and up/down regulations between genes. A prototype system based on a visualization framework is implemented and applied to a pathway graph that is built on the basis of several biomedical Linked Data (e.g. Reactome, KEGG, BioGrid, STRING and etc). A list of candidate drug combinations is obtained using the proposed system, which is compared with known drug-drug combinations available in DrugBank.
A scalable graph path discovery solution is proposed in this thesis. Distributed computing frameworks and several index structures are exploited to efficiently handle massive graphs. A pathway graph model is defined and a prototype system for biomedical researchers is implemented to apply the algorithms to the problem of drug combinations causing adverse effects. In future works, the solution will be generalized to address the temporal organization of signaling pathways, thereby enabling the causes of adverse effects of drug combination to be better understood.