Development of a Construction Specialized Pretrained Language Model

Abstract: The construction industry generates a wide variety of unstructured text data, and many studies use natural language processing to analyze it. However, previous studies have mostly relied on language models that were not pretrained, so each study had to build its own model and needed a large amount of labeled data to train it. A pretrained language model, by contrast, is first pretrained on unlabeled data to produce a base model, which can then perform a variety of tasks with only simple fine-tuning, without building a separate model for each task.
Some recent studies have used pretrained language models, but those models were trained on general-domain vocabulary rather than the terminology mainly used in construction, which limited their accuracy when analyzing construction terms.
To address these limitations, this research collected text data used in the construction field, built a construction corpus, and pretrained on that corpus to develop and verify a construction-specialized pretrained language model.
The research consists of two main stages. First, a construction-specialized pretrained language model was developed through data collection and a comparison of pretrained language models trained on different corpora. Second, the superiority of the developed model was verified through experiments comparing it, in terms of accuracy, efficiency, and adaptability, with the non-pretrained language models mainly used in earlier work.
The experiments show that the pretrained language model developed in this research outperforms the non-pretrained language model in accuracy, efficiency, and adaptability, and that it is more accurate than a language model pretrained on a general corpus. The developed construction-specialized pretrained language model is expected to be useful for a variety of natural language processing tasks in the construction field.

Language: eng

URI: https://hdl.handle.net/10371/181165

https://dcollection.snu.ac.kr/common/orgView/000000171390

Files in This Item:

000000171390.pdf 1.56 MB

Appears in Collections:

College of Engineering/Engineering Practice School (공과대학/대학원)
- Dept. of Civil & Environmental Engineering (건설환경공학부)
  - Theses (Master's Degree_건설환경공학부)
