Hardware/Software Techniques for Efficient Embedding Operations in AI Applications

Abstract: The capacity of deep neural networks (DNNs) is rapidly growing to capture intricate connections between entities within a dataset, while the size of the dataset is also growing large and complex to encompass complicated information from the real world. To represent such complex nature of the dataset, embedding is commonly employed. Embedding is a vector projection of a high-dimensional sparse feature space to a low-dimensional dense space that preserves the semantics of the original features. This facilitates the ability of DNNs to process the dataset efficiently as an input.

The growth in the size of both models and datasets results in a high overhead of embedding operations leading to the overall slowdown in emerging AI applications such as recommender systems, NLP, and computer vision. As a result, these applications are facing three primary challenges: 1) an increase in memory bandwidth requirement 2) an increase in computational demand 3) inefficiency of the hardware. This dissertation explores three different AI applications experiencing these challenges and identifies different optimization opportunities for different embedding operations. Based on the findings, this dissertation introduces three proposals to address these challenges.

First, we present a new memoization framework called MERCI to optimize the embedding reduction operation in recommender systems. This novel memoization framework utilizes the co-appearing structure among features in a real-world dataset. To accomplish this, we introduce Correlation-Aware Variable-Sized Clustering, which identifies frequently co-appearing clusters of features with high coverage and small memoization table size. We also propose a feature remapping scheme to efficiently locate a partially reduced embedding using a small number of instructions.

Second, we introduce a software-hardware co-design work called ELSA for efficient self-attention operation which is widely used in natural language processing (NLP), recommender systems, and computer vision and etc. ELSA proposes a novel approximate self-attention algorithm that efficiently identifies relationships that are unlikely to impact the final outcome and skips computations related to them. We also introduce a hardware accelerator for our approximation algorithm to gain substantial speedup and energy saving by reduced computation.

Third, we present ANNA, a specialized architecture designed to accelerate compression-based approximate nearest neighbor search (ANNS). ANNA addresses the inefficiencies of commodity hardware (e.g., CPUs and GPUs) with carefully designed hardware modules and fine-grained pipelining. ANNA flexibly supports various search configurations and different ANNS libraries from large companies like Meta and Google. Also, ANNA employs a memory traffic optimization technique to support large-scale datasets more efficiently.

These proposals are highly relevant for practical use cases where models are deployed in data centers to provide services to end users. They can assist in meeting strict throughput and energy constraints associated with such use cases. Additionally, the findings of this research can serve as a valuable reference for future research and development in the field of AI.
심층신경망 (DNN)은 데이터 내에 객체간 복잡한 관계를 파악하기 위해서 빠르게 크기가 증가하고 있고, 데이터셋 또한 현실 세계에 존재하는 복잡한 정보들을 담기위해서 크기와 복잡도가 증가하고 있다. 이러한 복잡한 정보들을 담기위해서 임베딩이라는 자료구조가 가장 흔히 쓰이고 있다. 임베딩은 고차원 희소 공간을 원래 특징의 의미를 보존하는 저차원 밀도 공간으로 벡터 투영시킨 자료구조이다. 이를 통해 심층신경망이 데이터를 효율적으로 처리할 수 있다.

이렇게 모델과 데이터셋의 크기가 모두 증가하면서 임베딩 관련 연산들에 큰 오버헤드가 발생하여 추천 시스템, NLP 및 컴퓨터 비전과 같은 인공지능 애플리케이션들의 성능 저하를 야기시킨다. 이에 따라, 이 어플리케이션들은 크게 세가지의 어려움에 처해있다. 첫번째는 메모리 대역폭 사용량 증가, 두번째는 계산량의 증가, 세번째는 하드웨어의 비효율성이다. 이 논문은 이러한 문제를 갖고 있는 세 가지 다른 인공지능 애플리케이션을 분석하여 다양한 최적화 기회를 찾아낸다. 그리고 이에 근거하여 각 문제점을 해결하기 위한 세 가지 제안을 소개한다.

첫째로, 추천 시스템에서 임베딩 리덕션 연산을 최적화하기 위해 MERCI라는 새로운 메모아이제이션 프레임워크를 제시한다. 이 새로운 메모아이제이션 프레임워크는 실제 데이터셋의 특징들 중에 서로 같이 자주 등장하는 특징들의 집합 구조를 활용한다. 이를 위해 적은 추가 용량으로 높은 커버리지를 얻을 수 있는 특징 집합 구조를 식별하는 Correlation-Aware Variable-Sized Clustering을 제안한다. 또한, 다양한 테크닉을 도입하여 줄인 메모리 접근량의 효과를 최대화할 수 있도록 한다.

둘째로, 자연어 처리, 추천 시스템, 컴퓨터 비전 등에서 널리 사용되는 셀프 어텐션 연산의 효율적인 계산을 위한 ELSA라는 소프트웨어 하드웨어 최적화 연구를 소개한다. ELSA는 최종 결과에 영향을 미칠 가능성이 적은 관계들을 효율적으로 식별하고 관련된 계산들을 생략할 수 있는 새로운 근사 알고리즘을 제시한다. 또한, 이렇게 줄인 계산량으로 많은 속도 향상과 에너지 절약을 얻을 수 있는 하드웨어 가속기 디자인을 소개한다.

셋째로, 압축 기반 근접 이웃 검색 (ANNS)을 하드웨어 가속기를 통해 최적화하는 ANNA라는 연구를 소개한다. ANNA는 신중하게 설계된 모듈과 세분화된 파이프라이닝을 통해 기존 하드웨어 (예: CPU 및 GPU)의 비효율성을 해결한다. ANNA는 메타, 구글과 같은 대기업의 라이브러리들 다양한 검색 세팅을 유연하게 지원하고 메모리 트래픽 최적화 기술을 사용하여 대규모 데이터셋을 보다 효율적으로 지원한다.

본 논문에서 제시된 제안들은 모델이 데이터 센터에 채택되어 최종 사용자에게 서비스를 제공하는 현실적인 사용 사례에서 매우 관련성이 높다. 본 논문의 제안들은 이러한 사용 사례와 관련된 엄격한 처리량 및 에너지 제약 조건을 충족하는 데 도움이 될 수 있다. 또한, 이 연구 결과는 AI 분야에서 미래 연구 및 개발에 유용한 참고 자료로 활용될 수 있다.

Language: eng

URI: https://hdl.handle.net/10371/196494

https://dcollection.snu.ac.kr/common/orgView/000000178650

Files in This Item:

000000178650.pdf 7.99 MB

Appears in Collections:

College of Engineering/Engineering Practice School (공과대학/대학원)
- Dept. of Computer Science and Engineering (컴퓨터공학부)
  - Theses (Ph.D. / Sc.D._컴퓨터공학부)

Altmetrics

Item View & Download Count

Show Full Item Record

Find it @ SNU

트윗하기

SNS Share