Machine Learning-Based Classification of Proximal and Distal Gastric Cancer in The Cancer Genome Atlas Database

Authors

이은주

Advisor
박도중
Issue Date
2023
Publisher
Seoul National University Graduate School
Keywords
Gastric cancer; TCGA; Genetics
Description
Thesis (Master's) -- Seoul National University Graduate School : College of Medicine, Department of Medicine, 2023. 8. 박도중.
Abstract
Background: Gastric cancer is a major global health concern, with different classifications based on its histological subtypes or anatomical location. Proximal gastric cancer (PGC) and distal gastric cancer (DGC) are two anatomically distinct subtypes with different risk factors, and understanding their clinicopathological and genetic characteristics is important for accurate diagnosis and treatment. This study investigated the genetic differences between PGC and DGC using machine learning (ML) approaches and data from The Cancer Genome Atlas (TCGA) program, and focused on identifying differences in DNA copy number variation and RNAseq.

Methods: The TCGA-Stomach Adenocarcinoma (STAD) dataset was used to investigate genetic differences between PGC and DGC. Classical bioinformatic approaches were first applied to distinguish PGC from DGC, using a volcano plot and a heatmap of the selected features. For the ML algorithms, preprocessing used statistical tests to select noteworthy features, with false discovery rate correction to address the multiple testing problem. The ML algorithms were evaluated with 10-fold cross-validation, predicting the location of gastric cancers from the selected features.
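
The feature-selection step described above can be illustrated with a minimal sketch. The abstract does not name the specific false discovery rate procedure, so the Benjamini-Hochberg method is assumed here; the function names, the 0.05 threshold, and the toy p-values are all hypothetical and are not taken from the thesis.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order.

    Assumption: BH is used as the FDR procedure; the abstract does not say which.
    """
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so the adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def select_features(p_by_feature, alpha=0.05):
    """Keep features whose adjusted p-value is at most alpha (hypothetical cutoff)."""
    names = list(p_by_feature)
    q_values = benjamini_hochberg([p_by_feature[n] for n in names])
    return [n for n, q in zip(names, q_values) if q <= alpha]


# Toy p-values for illustration only; these are not results from the study.
toy_p = {"feature_a": 0.001, "feature_b": 0.5, "feature_c": 0.02}
selected = select_features(toy_p)
```

Adjusting before thresholding matters because thousands of genes are tested at once; an unadjusted 0.05 cutoff would be expected to pass many features by chance alone.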

Validation was performed on three subsets of the data that differed in how Fundus/Body samples were handled: in Group 1, the Fundus/Body samples were excluded; in Group 2, they were classified as proximal gastric cancer; and in Group 3, they were analyzed as a separate third class. The best-performing algorithm was then chosen, and its results were interpreted using the top 30 features by importance and EnrichR analysis.
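
The three grouping schemes can be expressed as a small labeling function. This is a sketch under stated assumptions: only the "Fundus/Body" category is named in the abstract, so the other site strings ("Cardia/Proximal", "Antrum/Distal") and the function name are hypothetical placeholders.

```python
# Assumed site names; only "Fundus/Body" appears in the abstract.
PROXIMAL_SITES = {"Cardia/Proximal"}
DISTAL_SITES = {"Antrum/Distal"}


def label_for_group(site, group):
    """Map an anatomical site to a class label, or None to drop the sample.

    group 1: Fundus/Body excluded
    group 2: Fundus/Body treated as PGC
    group 3: Fundus/Body kept as its own class
    """
    if group not in (1, 2, 3):
        raise ValueError("group must be 1, 2, or 3")
    if site == "Fundus/Body":
        return {1: None, 2: "PGC", 3: "Fundus/Body"}[group]
    if site in PROXIMAL_SITES:
        return "PGC"
    if site in DISTAL_SITES:
        return "DGC"
    return None  # Unknown or unannotated sites are dropped.


# Toy site list for illustration only.
sites = ["Cardia/Proximal", "Fundus/Body", "Antrum/Distal"]
group1_labels = [label_for_group(s, 1) for s in sites]
```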

Results: The study utilized ML techniques to identify potential genetic features in copy number variation and RNAseq data for classifying PGC and DGC within the TCGA-STAD dataset. Among the ML algorithms, gradient-boosting algorithms such as CatBoost and LightGBM consistently achieved high performance as measured by the Area Under the Curve (AUC), regardless of the differences between datasets. When the Fundus/Body samples were classified as PGC (Group 2), the AUC of the ROC curve was 0.75. However, when the Fundus/Body samples were excluded (Group 1), the AUC of the ROC curve improved to 0.89. Furthermore, we identified the top 30 important features of CatBoost for classifying the tumor location, including LRRC8D and GULP1, and used them to perform EnrichR analysis, which provided information regarding their relationship with gastric cancer.
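
For readers unfamiliar with the metric, the AUC of a ROC curve equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The sketch below computes it that way in plain Python; the function name and the toy labels and scores are hypothetical and do not reproduce the study's figures.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    labels: 1 for the positive class, 0 for the negative class.
    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example: three of the four positive/negative pairs are ordered correctly.
toy_auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation of the two classes.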

Conclusion: By applying ML to the TCGA-STAD database, this study identified candidate genetic features that distinguish PGC from DGC, suggesting that their genetic profiles differ.
Language
eng
URI
https://hdl.handle.net/10371/197142

https://dcollection.snu.ac.kr/common/orgView/000000179476
Items in S-Space are protected by copyright, with all rights reserved, unless otherwise indicated.