Natural Language Processing Algorithm for Extraction and Classification of Smoking and Drinking Information from Bilingual Clinical Notes

배예슬


이종언어 임상기록 내 흡연 음주 정보 추출 및 분류를 위한 자연어처리 알고리즘 개발 : Natural Language Processing Algorithm for Extraction and Classification of Smoking and Drinking Information from Bilingual Clinical Notes


Authors: 배예슬

Advisor: 윤형진

Issue Date: 2023

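Publisher: Seoul National University Graduate School

Keywords: Natural language processing ; Smoking ; Drinking ; Electronic health records ; Bilingual language processing ; Transformer

Description: Thesis (Ph.D.) -- Seoul National University Graduate School : College of Medicine, Interdisciplinary Program in Medical Informatics, 2023. 8. 윤형진.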

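Abstract: Smoking and drinking are major disease risk factors and essential variables in medical research. Obtaining information on a patient's current smoking and drinking status is therefore important, but because Korean and English are used together in electronic medical records in Korea, few studies have automatically obtained this information from free-text clinical notes. In addition, no large-scale medical corpus containing Korean exists, which makes it difficult to fine-tune transformers, a self-supervised learning approach. This thesis therefore uses two different methods, Shifted Positive Pointwise Mutual Information (SPPMI) and transformers, to extract keywords from and classify unstructured clinical notes written in mixed Korean and English.
The smoking study used a total of 4,711 clinical notes. The Python package Soynlp was used to replace acronyms and normalize the notes. Each word in a note was vectorized using SPPMI, and smoking status was classified by computing cosine similarity. Compared with other methods such as word co-occurrence, pointwise mutual information (PMI), and normalized pointwise mutual information (NPMI), this study improved keyword extraction accuracy (macro F1-score) by more than 20%. The proposed SPPMI method also showed the highest precision in extracting smoking-related keywords from documents and classified smoking status with a high accuracy of 91.49%.
To extract drinking-related information and classify documents, a total of 4,996 clinical notes were used and a multilingual transformer was fine-tuned. The proposed cross-lingual language model - robustly optimized BERT pretraining approach (XLM-RoBERTa) model showed higher keyword extraction accuracy (macro F1-score 84.7%) than other methods and performed well, with a precision of 78.23% and a recall of 95.33%. In addition, when auxiliary information on the type of alcohol and the alcohol consumption pattern was also learned, document classification performance for drinking information improved by 9.87% for multilingual BERT and by 11.7% for XLM-RoBERTa.
This is the first study to extract meaningful information from and classify free-text clinical notes written in mixed Korean and English. It is also the first study to apply the currently widely used transformer to Korean-English mixed clinical notes to obtain meaningful information. The proposed methods could be used to obtain not only smoking and drinking information but also various other information in clinical notes. Validation studies on clinical notes generated by various departments and medical institutions will be needed.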
Smoking and alcohol consumption are important variables for clinical research, but few studies have addressed automatically obtaining and classifying this information from unstructured bilingual clinical notes. This study aims to develop algorithms that classify smoking and alcohol consumption status from unstructured clinical notes using natural language processing (NLP).
A total of 4,711 clinical notes were used for smoking information extraction and classification. Acronyms were replaced and the notes were normalized using Soynlp, a Python package. Each word in a note was vectorized using Shifted Positive Pointwise Mutual Information (SPPMI), and smoking status was classified by calculating cosine similarity. Compared with other methods such as word co-occurrence, pointwise mutual information (PMI), and normalized pointwise mutual information (NPMI), this study improved keyword extraction accuracy (macro F1-score) by more than 20%. Additionally, the proposed SPPMI method showed the highest precision in extracting smoking-related keywords from documents and achieved a high accuracy of 91.49% in classifying smoking status.
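The SPPMI vectorization and cosine-similarity step described above can be illustrated with a minimal sketch. It uses the standard definition SPPMI(w, c) = max(PMI(w, c) - log k, 0); the window size, the shift k, the toy tokenized notes, and all function names are assumptions made for illustration, and the abstract does not state how the thesis maps similarities to smoking classes.

```python
import math
from collections import Counter

def cooccurrence(sentences, window=2):
    """Count (word, context) pairs within a symmetric window (assumed size)."""
    pairs = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    pairs[(word, tokens[j])] += 1
    return pairs

def sppmi_vectors(pairs, shift_k=1.0):
    """Build sparse word vectors with SPPMI = max(PMI - log(k), 0)."""
    total = sum(pairs.values())
    word_counts, ctx_counts = Counter(), Counter()
    for (word, ctx), n in pairs.items():
        word_counts[word] += n
        ctx_counts[ctx] += n
    vectors = {}
    for (word, ctx), n in pairs.items():
        pmi = math.log(n * total / (word_counts[word] * ctx_counts[ctx]))
        value = max(pmi - math.log(shift_k), 0.0)
        if value > 0.0:  # keep only positive entries (sparse representation)
            vectors.setdefault(word, {})[ctx] = value
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(x * v.get(key, 0.0) for key, x in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical, already-tokenized mixed Korean-English notes (not thesis data).
notes = [
    ["current", "smoker", "1", "pack", "per", "day"],
    ["ex", "smoker", "quit", "10", "years", "ago"],
    ["never", "smoker"],
    ["흡연", "1", "pack", "per", "day"],
]

vectors = sppmi_vectors(cooccurrence(notes), shift_k=1.0)
similarity = cosine(vectors.get("흡연", {}), vectors.get("current", {}))
```

With k = 1 the shift term vanishes and SPPMI reduces to positive PMI; larger k discards weaker associations and yields sparser vectors.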
For drinking-related information extraction and classification, a total of 4,996 clinical notes were used and a multilingual transformer was fine-tuned. The proposed cross-lingual language model - robustly optimized BERT pretraining approach (XLM-RoBERTa) model showed higher keyword extraction accuracy (macro F1-score 84.7%) than other methods and performed well, with a precision of 78.23% and a recall of 95.33%. Furthermore, when auxiliary information about the type of alcohol and the alcohol consumption pattern was included in training, document classification performance for drinking information improved by 9.87% for multilingual BERT and 11.7% for XLM-RoBERTa.
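The metrics reported above (precision, recall, macro F1-score) can be computed as in the following sketch. The label names and the toy predictions are hypothetical and are not taken from the thesis; the code only illustrates that macro F1 is the unweighted mean of per-class F1 scores.

```python
def per_class_prf(y_true, y_pred, label):
    """Return (precision, recall, F1) for one class, treating it as positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(per_class_prf(y_true, y_pred, lab)[2] for lab in labels) / len(labels)

# Hypothetical gold labels and model predictions for four notes.
y_true = ["drinker", "drinker", "non-drinker", "unknown"]
y_pred = ["drinker", "non-drinker", "non-drinker", "unknown"]

score = macro_f1(y_true, y_pred)
```

Because every class contributes equally, macro F1 is sensitive to performance on rare classes, which matters when status categories are imbalanced.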
This study shows the potential of SPPMI and transformers for classifying smoking status and alcohol consumption from bilingual clinical notes. Moreover, it is the first study to extract meaningful information from, and classify, free-text clinical notes written in both Korean and English.
In the future, the proposed method can be used to capture various information in clinical notes as well as smoking and drinking information. Future validation studies with clinical notes generated by various departments and medical institutions will be necessary.

Language: kor

URI: https://hdl.handle.net/10371/197172

https://dcollection.snu.ac.kr/common/orgView/000000177754

Files in This Item:

000000177754.pdf 1.13 MB

Appears in Collections:

College of Medicine/School of Medicine (의과대학/대학원)
- Program in Medical Informatics (협동과정-의료정보학전공)
  - Theses (Ph.D. / Sc.D_협동과정-의료정보학전공)
