Hierarchical Context Encoder for Natural Language Processing via Leveraging Contextual Information and Memory Attention

윤현구

Cited 0 times in Web of Science; cited 0 times in Scopus.

Authors: 윤현구

Advisor: 정교민

Issue Date: 2022

Publisher: Graduate School, Seoul National University

Keywords: deep learning; natural language processing; Transformer; context representation; representation similarity; multi-modal learning

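Description: Thesis (Ph.D.) -- Graduate School, Seoul National University: Department of Electrical and Computer Engineering, College of Engineering, August 2022. Advisor: 정교민.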

Abstract:
Recently, the standard architecture for Natural Language Processing (NLP) has evolved from Recurrent Neural Networks to the Transformer architecture. The Transformer consists of attention layers, which excel at capturing correlations between tokens and at incorporating that correlation information to generate appropriate outputs. While many studies built on the Transformer report new state-of-the-art performance on various NLP tasks, these advances pose a new challenge to the deep learning community: exploiting additional context information. Because human intelligence perceives everyday signals together with rich contextual information (e.g., additional memory, visual information, and common sense), exploiting such context information is a step toward the ultimate goal of Artificial Intelligence.
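
As an illustration of what an attention layer computes, the following is a minimal NumPy sketch of scaled dot-product attention. It is not code from the dissertation: it omits the learned query/key/value projections, multiple heads, and masking, and the toy sizes and the `memory` array are hypothetical. The second call applies the same operation to a separate set of context vectors, which is the general pattern behind cross-attention, not the specific memory attention proposed in this work.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    # Subtract the row max for numerical stability before exponentiating.
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(q: np.ndarray, k: np.ndarray, v: np.ndarray):
    """q: (n_q, d), k: (n_k, d), v: (n_k, d_v) -> output (n_q, d_v), weights (n_q, n_k)."""
    d = q.shape[-1]
    # Pairwise query-key similarity, scaled by sqrt(d) as in the standard Transformer.
    scores = q @ k.T / np.sqrt(d)
    weights = softmax(scores, axis=-1)  # each row is a distribution over the keys
    return weights @ v, weights

rng = np.random.default_rng(0)
tokens = rng.normal(size=(5, 8))  # 5 input tokens, model width 8 (toy sizes)
memory = rng.normal(size=(3, 8))  # 3 hypothetical context vectors

# Self-attention: queries, keys and values all come from the input tokens.
self_out, self_w = scaled_dot_product_attention(tokens, tokens, tokens)
# Cross-attention: the input tokens query a separate set of context vectors.
cross_out, cross_w = scaled_dot_product_attention(tokens, memory, memory)
print(self_out.shape, cross_out.shape)  # (5, 8) (5, 8)
```

Each row of a weight matrix sums to 1 over the keys, which is how the layer expresses how strongly one token relates to every other token (or to every context vector).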

In this dissertation, I propose novel methodologies and analyses that improve the context-awareness of the Transformer architecture, focusing on the attention mechanism, for various natural language processing tasks. The proposed methods exploit additional context information, which is not limited to the natural-language modality, beyond the given input. First, I propose the Hierarchical Memory Context Encoder (HMCE), which efficiently embeds contextual information from preceding sentences through a hierarchical Transformer architecture and fuses the embedded context representation into the input representation through a memory attention mechanism. On various context-aware machine translation tasks, HMCE outperforms the original Transformer, which does not leverage additional context information, and it also achieves the best BLEU score among the baselines that use additional context. Then, to improve the attention mechanism between the context representation and the input representation, I analyze the representational similarity between the two in depth. Based on these analyses of representational similarity inside the Transformer architecture, I propose a method for optimizing Centered Kernel Alignment (CKA) between internal representations of the Transformer. The proposed CKA optimization method improves the performance of the Transformer on various machine translation and language modelling tasks. Lastly, I extend the CKA optimization method to a Modality Alignment method for multi-modal scenarios in which the context information takes the form of visual information. The Modality Alignment method enhances the cross-modality attention mechanism by maximizing the representational similarity between visual and natural language representations, yielding accuracy improvements of more than 3.5% on video question answering tasks.
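
The CKA objective can be made concrete with linear CKA, the linear-kernel form of Centered Kernel Alignment: for column-centered representations X (n × p1) and Y (n × p2) of the same n examples, CKA(X, Y) = ‖YᵀX‖²_F / (‖XᵀX‖_F ‖YᵀY‖_F). The NumPy sketch below assumes the linear kernel; the dissertation's exact kernel choice, which representations it compares, and how the term is weighted during training are not stated in this abstract. `alignment_loss` is a hypothetical name for a 1 - CKA penalty of the kind a similarity-maximizing method could minimize, and the "text" and "visual" arrays are random stand-ins.

```python
import numpy as np

def center(x: np.ndarray) -> np.ndarray:
    # Subtract each feature's mean over the n examples (column centering).
    return x - x.mean(axis=0, keepdims=True)

def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear CKA between x: (n, p1) and y: (n, p2); rows are the same n examples."""
    x, y = center(x), center(y)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))

def alignment_loss(x: np.ndarray, y: np.ndarray) -> float:
    # Hypothetical auxiliary loss: minimizing 1 - CKA maximizes representational similarity.
    return 1.0 - linear_cka(x, y)

rng = np.random.default_rng(0)
text = rng.normal(size=(32, 16))                # 32 examples, 16-dim stand-in "text" features
q, _ = np.linalg.qr(rng.normal(size=(16, 16)))  # random orthogonal matrix
rotated = 3.0 * text @ q                        # orthogonal transform + isotropic scaling
unrelated = rng.normal(size=(32, 24))           # independent stand-in "visual" features

print(round(linear_cka(text, text), 6))     # 1.0
print(round(linear_cka(text, rotated), 6))  # 1.0
print(linear_cka(text, unrelated) < 1.0)    # True
```

The checks show why CKA suits comparing representations of different widths or modalities: it is invariant to orthogonal transforms and isotropic scaling, so a rotated, rescaled copy scores 1, while unrelated features score lower. To actually optimize it, the same expression would be written in an autodiff framework so that gradients of 1 - CKA flow into the encoders; NumPy here only evaluates the value.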

Language: eng

URI: https://hdl.handle.net/10371/187726

https://dcollection.snu.ac.kr/common/orgView/000000172829

Files in This Item:

000000172829.pdf 4.81 MB

Appears in Collections:

College of Engineering/Engineering Practice School
- Dept. of Electrical and Computer Engineering
  - Theses (Ph.D. / Sc.D., Dept. of Electrical and Computer Engineering)
