Automated Construction Specification Review based on Semantic Textual Analysis

Abstract: Risk management of a construction project requires a clear and objective understanding of the construction specifications in the early phases, to ensure that the requirements are appropriate to the site environment. However, the review process is hindered by the tight schedule of the bidding process, the shortage of available experts, and the large volume of content (generally several thousand pages). Moreover, because the review relies mainly on human cognitive abilities, it takes considerable time and is vulnerable to errors such as subjective interpretation, misunderstanding, and omission of requirements. Previous approaches to automating the analysis of construction documents and the extraction of useful information have shown promising results, but they need technical improvement because they do not consider the semantic textual conflicts between different documents. Since every construction project issues its own specification and even updates the document periodically, the review must analyze documents with different semantic features, such as different vocabulary, different sentence structures, and differently organized clauses. Addressing these semantic textual conflicts is a key challenge in automating the construction specification review process with a sufficient level of applicability and in supporting project risk management.

This dissertation aims to develop an automated construction specification review method based on semantic textual analysis. First, the author developed a semantic construction thesaurus that reconciles the different vocabulary of the specifications, using Word2Vec embedding and the PageRank algorithm. Second, the author recognized construction keywords of qualitative requirements in natural language sentences by developing a Named Entity Recognition (NER) model that uses Word2Vec embedding and the Bi-directional Long Short-Term Memory (Bi-LSTM) architecture with a Conditional Random Field (CRF) layer. Third, the author proposed a relevant clause pairing model that identifies, for every clause in the construction specification, the most relevant clause in the standard specification, using Doc2Vec embedding and semantic similarity calculation. Ultimately, the proposed method provides a table of clauses that lists, for each clause, the most relevant clause and the recognized keywords related to construction requirements.
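As a rough illustration of the thesaurus step, the sketch below picks a pivot term for one group of interchangeable words by running PageRank over a word-similarity graph. It is only a sketch under stated assumptions: the abstract does not say how the graph is built, so the edge weights, the word group, and the names `pagerank`, `pivot_term`, and `SIMILARITY` are hypothetical. In the dissertation the similarities come from a Word2Vec model; here they are hard-coded toy values.

```python
def pagerank(weights, damping=0.85, iterations=100):
    """Power-iteration PageRank on a weighted graph.

    `weights` maps each node to {neighbour: edge_weight}. A node with no
    outgoing edges contributes nothing (dangling mass is not redistributed).
    """
    nodes = sorted(weights)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new_rank = {}
        for n in nodes:
            incoming = 0.0
            for m in nodes:
                w = weights[m].get(n, 0.0)
                if w > 0.0:
                    # m passes its rank on in proportion to edge weight
                    incoming += rank[m] * w / sum(weights[m].values())
            new_rank[n] = (1.0 - damping) / len(nodes) + damping * incoming
        rank = new_rank
    return rank


def pivot_term(weights):
    """Return the highest-ranked word of the group (assumed pivot rule)."""
    rank = pagerank(weights)
    return max(rank, key=rank.get)


# Hypothetical similarity values standing in for Word2Vec cosine scores.
SIMILARITY = {
    "contractor": {"constructor": 0.9, "builder": 0.8},
    "constructor": {"contractor": 0.9, "builder": 0.3},
    "builder": {"contractor": 0.8, "constructor": 0.3},
}

pivot = pivot_term(SIMILARITY)
# Every other word in the group is rewritten to the pivot term.
replacement_rules = {w: pivot for w in SIMILARITY if w != pivot}
```

Choosing the top-ranked node as the pivot is one plausible reading of "pivot term for each closed network"; other selection rules would fit the same description.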

First, to achieve the first research objective, the author used the Word2Vec model to analyze words that are distributed similarly within sentences and determined the pivot term for each closed network of interchangeable words. After analyzing 346,950 words (i.e., 19,346 sentences) from 56 construction specifications, the construction thesaurus covered 208 word-replacement rules. Second, to achieve the second research objective, five information types that are crucial in the risk management process (persons and organizations in charge, activities required, construction and installation items, quality standards and criteria, and relevant references) were determined through in-depth collaboration with experienced contractors. The NER model was then developed with 4,659 labeled sentences; its input was word vectors embedded by Word2Vec and its output was the word categories corresponding to the five information types. The model showed satisfactory results, with an F1 score of 0.917 in classifying the word categories within sentences. The robustness of the model was verified with 30 different sets of randomly split training and validation data. Third, to achieve the third research objective, the manually extracted text of 2,527 clauses was embedded by Doc2Vec so that semantic features could be used in the pairing process. Clause relevance was then calculated as the cosine similarity between the text vectors to identify the most relevant clause. As a result, the relevant clauses were paired with an average accuracy of 81.8%.
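The clause pairing step can be sketched as a nearest-neighbour search under cosine similarity. This is a minimal illustration, not the dissertation's implementation: the three-dimensional vectors, the clause identifiers, and the names `cosine_similarity` and `pair_clauses` are hypothetical stand-ins for Doc2Vec embeddings of real clauses.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as unrelated to everything
    return dot / (norm_u * norm_v)


def pair_clauses(project_vectors, standard_vectors):
    """Map each project clause to its most similar standard clause."""
    pairs = {}
    for clause_id, vec in project_vectors.items():
        best_id = max(
            standard_vectors,
            key=lambda sid: cosine_similarity(vec, standard_vectors[sid]),
        )
        score = cosine_similarity(vec, standard_vectors[best_id])
        pairs[clause_id] = (best_id, score)
    return pairs


# Toy vectors; a real run would use Doc2Vec embeddings of clause text.
PROJECT = {
    "P-1": [0.9, 0.1, 0.0],
    "P-2": [0.0, 0.2, 0.8],
}
STANDARD = {
    "S-concrete": [1.0, 0.0, 0.0],
    "S-earthwork": [0.0, 1.0, 0.0],
    "S-asphalt": [0.0, 0.0, 1.0],
}

pairs = pair_clauses(PROJECT, STANDARD)
for clause_id, (standard_id, score) in pairs.items():
    print(f"{clause_id} -> {standard_id} (similarity {score:.3f})")
```

Because cosine similarity ignores vector length, a short clause and a long clause on the same topic can still be paired, which suits specifications whose clauses are organized differently.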

To validate the proposed approaches, the author conducted experiments. The validation indicators were time efficiency, the accuracy of detecting erroneous provisions, and robustness to subjectivity. In the experiments, the proposed method outperformed the manual review process: it reduced working hours, improved performance, and provided more consistent results. The results also demonstrated the necessity and practical usefulness of the proposed method for automated specification review. By using the automated method of semantic text comparison, users can address the semantic textual conflicts among specifications (i.e., different vocabulary, different sentence structures, and differently organized clauses), which enables an adequate review of the project requirements.

In conclusion, this dissertation developed an automated construction specification review method by analyzing semantic textual properties. In particular, the author identified the semantic textual conflicts among construction specifications (i.e., different vocabulary, different sentence structures, and differently organized clauses) that make the review process difficult to automate, and developed machine learning-based NLP models to facilitate automated construction specification review. To the best of the author's knowledge, this is the first attempt to handle semantic textual conflict in the field of construction document analysis. The developed method benefits contractors who review specifications in the early phases of a construction project, field engineers who analyze the requirements during the construction phases, and clients who write a new specification for a project. The proposed approaches enhance the applicability of automated construction specification review and can be quickly customized for other types of construction documents, including contract documents, non-conformance reports, accident reports, and inspection reports. In addition, the research facilitates an in-depth understanding of diverse and complicated construction specifications and of their review process, which could bring further opportunities for improvement in construction automation and risk management.
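To manage the risk of a construction project, it is important to review in advance whether the construction criteria in the construction specification suit the site conditions. However, the tight schedule of the contract phase, the shortage of available experts, and the large amount of information to be reviewed make the specification review difficult. In addition, because the review is carried out manually, it takes a long time and is vulnerable to errors such as subjective interpretation, mistakes, and omissions. Although many studies that analyze construction documents and provide the information users need have shown satisfactory performance, they require technical improvement in that they did not consider the semantic ambiguity of text across different documents. Because a construction specification is written for every construction project and is updated periodically, practitioners must analyze anew, each time, a document with different vocabulary, sentence structures, and clause organization. Research that analyzes these textual characteristics is needed to automate construction specification review and to support project risk management.

This study proposes an automated construction specification review methodology based on semantic comparative text analysis. First, to resolve the problem that the same object is expressed with different words in different specifications, a construction thesaurus is built using the Word2Vec embedding technique and the PageRank algorithm. Second, to extract construction criteria information from sentences written in different forms, an NER model is developed using the Word2Vec embedding technique and the Bi-LSTM and CRF architecture. Third, to match highly relevant clauses across different specifications, a clause pairing model is developed using the Doc2Vec embedding technique and a semantic similarity analysis method. For every clause of a construction specification, the result of this study provides the user, in tabular form, with the most relevant clause and the construction criteria information of that clause.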

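First, to achieve the first research objective, the Word2Vec embedding technique was applied to analyze words that are used similarly, and a pivot term was selected for converting each group of words. Analysis of 346,950 words (19,346 sentences) from the 56 specifications collected in the study yielded a thesaurus with a total of 208 word conversion rules. Next, to achieve the second research objective, five information types considered important from a risk management perspective (responsible party, work content, construction objects, construction criteria, and references) were selected in collaboration with construction industry practitioners. Using experimental data of 4,659 sentences, an NER model was developed that takes Word2Vec vectors as input and classifies each word into the five information types; the model achieved strong performance, with a class-averaged F1 score of 0.917. In addition, 30 randomly split training and validation datasets were used to show that the NER model was not overfitted to a particular set of training data. Finally, to achieve the third research objective, semantic features were extracted with the Doc2Vec embedding technique from 2,527 manually constructed clauses. Clause relevance was calculated based on cosine similarity to find the clause corresponding to each clause, and the final results reduced the time required for specification review, improved the quality of the review results, and reduced reviewer subjectivity.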

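To validate the proposed methodology, this study compared the specification review process and results of the automated review model with those of construction practitioners. To evaluate the model's automated review capability, various indicators were used, including the time required to review a specification, the accuracy of detecting erroneous clauses, and the objectivity of the review results. The validation confirmed that the semantic comparative text analysis methodology can resolve the review difficulties that arise from the ambiguous characteristics of different specifications.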

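In conclusion, this dissertation analyzed the semantic ambiguity of text in order to automate the construction specification review process. It defined the semantic ambiguity of text, which hinders the automation of construction specification review, and addressed each problem by applying machine learning-based natural language processing techniques. This is the first attempt in the research field of automated construction document analysis to consider the semantic characteristics of different documents. The proposed method can be used from various perspectives: by practitioners who review specifications in the early phase of a construction project, by builders who analyze the content of each clause during the construction phase, and by clients who prepare specifications to procure a new project. With simple processing, the research results can be applied to various kinds of text data in the construction field, such as contract documents, non-conformance reports, safety accident reports, and detailed inspection reports. Furthermore, by analyzing the structure and review process of construction specifications in depth, the study contributes to construction automation and thereby can effectively support risk response in construction projects.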

Language: eng

URI: https://hdl.handle.net/10371/169097

http://dcollection.snu.ac.kr/common/orgView/000000161776

Files in This Item:

000000161776.pdf 7.74 MB

Appears in Collections:

College of Engineering/Engineering Practice School (공과대학/대학원)
- Dept. of Civil & Environmental Engineering (건설환경공학부)
  - Theses (Ph.D. / Sc.D._건설환경공학부)
