On-device Efficient Acoustic Modeling with Simple Gated Convolutional Networks

이루카스

Authors: 이루카스

Advisor: 성원용

Major: College of Engineering, Department of Electrical and Computer Engineering

Issue Date: 2019-02

Publisher: Seoul National University Graduate School

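Description: Thesis (Master's) -- Seoul National University Graduate School: College of Engineering, Department of Electrical and Computer Engineering, 2019. 2. 성원용.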

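Abstract: Today, neural-network-based algorithms are the dominant approach in automatic speech recognition systems. Meanwhile, demand is growing for on-device speech recognition, which runs on smartphones or embedded devices without going through a server. In an on-device system, the user's speech is not sent to the service provider's server; recognition is performed independently on the user's device. This substantially alleviates concerns about privacy violations and security.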

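However, the LSTM-based recurrent neural network (RNN), the model most commonly used in neural-network-based speech recognition systems, is not efficient for on-device speech recognition. An LSTM RNN is difficult to parallelize across a sequence because it has a feedback property: the current time step depends on past time steps. Moreover, this feedback information is too large to fit in cache memory, so DRAM must be accessed at every time step of the sequence to fetch the samples. These per-step DRAM accesses increase not only power consumption but also execution time.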

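In this thesis, we present on-device-friendly neural network models and use them for acoustic modeling in place of the LSTM RNN. We employ the Gated Convolutional Network (Gated ConvNet), the Diagonal LSTM, and the QRNN (quasi RNN). In these models, most operations have no sequential dependency, so they can be parallelized across time steps.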

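Once 1D depthwise convolution was added, these models far outperformed the LSTM RNN in automatic speech recognition. The Gated ConvNet in particular, when given a deep architecture, showed the best performance without an acoustic model. Above all, through time-step parallelization of the sequence, these on-device-efficient models ran at least 5 times faster than the LSTM RNN on an actual embedded device.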

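Going further, we propose the Simple Gated Convolutional Network (Simple Gated ConvNet). It is based on the most simplified form of gated convolution, which drastically reduces the number of parameters. This is an advantage for on-device speech recognition, which is constrained by hardware specifications. In addition, because the Simple Gated ConvNet has no sequential dependency between time steps, it can also be parallelized across time steps. We improved its performance by applying 1D depthwise convolution in several directions.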

Specifically, using the Simple Gated ConvNet we reduced the parameter count to under 3 M. Given the same number of parameters, the Simple Gated ConvNet outperformed both the LSTM RNN and the Gated ConvNet in automatic speech recognition. A Simple Gated ConvNet with under 3 M parameters even outperformed an LSTM with 10 M parameters. Moreover, time-step parallelization yielded a 10-fold increase in execution speed over the LSTM RNN on an ARM CPU.

Automatic speech recognition (ASR) has been widely adopted on smartphones and many embedded devices in recent years, and neural-network-based algorithms show the best performance for ASR. While most ASR systems rely on server-based processing, there is an increasing demand for on-device speech recognition because of privacy concerns and the need for low-latency processing. Reducing power consumption is especially important for on-device speech recognition in order to extend battery life.

Among neural network models, recurrent neural network (RNN) based algorithms are the most widely used for speech recognition, and the long short-term memory (LSTM) RNN is the most popular because of its superior performance over the others. However, executing an LSTM RNN demands many DRAM accesses because the cache size of embedded devices is usually much smaller than the parameter size of the RNN. The multi-time step parallelization technique computes multiple output samples at a time from a single fetch of the parameters, and thus reduces the number of DRAM accesses in proportion to the number of time steps computed at a time. However, the LSTM RNN does not permit multi-time step parallelization because of the complex feedback structure of the model.
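
The DRAM-access argument above can be illustrated with a minimal sketch. It is not the thesis's implementation: the helper names `matvec` and `run_blocked`, the toy weight matrix, and the fetch counter are all hypothetical, and the "fetch" is only a counter standing in for loading the parameters from DRAM. The sketch assumes a layer whose output at step t does not depend on the output at step t-1, which is exactly the property the LSTM RNN lacks.

```python
def matvec(W, x):
    """Multiply weight matrix W (a list of rows) by input vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def run_blocked(W, xs, block):
    """Apply W to every time step, counting one parameter fetch per block.

    Returns (outputs, fetches). The fetch counter is a stand-in for DRAM
    traffic (hypothetical model): the parameters are loaded once and reused
    for every step in the block. This is valid only because no step needs
    the output of the previous step.
    """
    outputs, fetches = [], 0
    for start in range(0, len(xs), block):
        fetches += 1  # one notional fetch of W serves the whole block
        for x in xs[start:start + block]:
            outputs.append(matvec(W, x))
    return outputs, fetches

W = [[1.0, 2.0], [0.5, -1.0]]              # toy 2x2 weights (illustrative)
xs = [[float(t), 1.0] for t in range(8)]   # 8 time steps of 2-dim input

seq_out, seq_fetches = run_blocked(W, xs, block=1)  # one step per fetch
par_out, par_fetches = run_blocked(W, xs, block=4)  # four steps per fetch
```

With 8 time steps, the step-by-step run counts 8 parameter fetches and the 4-step blocked run counts 2, while both produce identical outputs. This matches the proportional reduction described in the abstract.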

This thesis presents neural network models that support efficient on-device speech recognition. First, a few models that permit multi-time step parallel processing are evaluated: Gated ConvNet, Diagonal LSTM, and QRNN (quasi RNN). Since the performance of these models is not as good as that of the LSTM, one-dimensional depthwise convolution is added to improve it. The one-dimensional convolution helps find the temporal patterns of the speech signal. Second, the Simple Gated Convolutional Network (Simple Gated ConvNet) is proposed for improved performance when the parameter count is very small. The Simple Gated ConvNet employs the simplest form of Gated ConvNet and relies instead on one-dimensional convolution for temporal observation. It supports low-power on-device speech recognition because it can be executed with multi-time step parallelization. A Simple Gated ConvNet with under 3 million parameters even outperforms an LSTM with 10 million parameters. In addition, the execution speed on an ARM CPU can be increased more than tenfold compared with the LSTM RNN through multi-time step parallelization.
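
The abstract does not give the layer equations, so the following is only a sketch of the two ingredients it names, under stated assumptions. The gate is assumed to be the common gated linear unit, (W x) * sigmoid(V x), applied per frame. The depthwise convolution is assumed to be causal with zero padding. The function names, weights, and inputs are hypothetical.

```python
from math import exp

def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))

def depthwise_conv1d(xs, kernels):
    """Causal 1D depthwise convolution over time (assumed form).

    xs: list of T frames, each a list of C channel values.
    kernels: C filters, each a list of K taps; channel c only sees channel c.
    Frames before t = 0 are treated as zeros.
    """
    out = []
    for t in range(len(xs)):
        frame = []
        for c, taps in enumerate(kernels):
            acc = 0.0
            for k, tap in enumerate(taps):
                if t - k >= 0:
                    acc += tap * xs[t - k][c]
            frame.append(acc)
        out.append(frame)
    return out

def gated_unit(frame, W, V):
    """Gated linear unit on one frame: (W x) * sigmoid(V x), elementwise."""
    lin = [sum(w * v for w, v in zip(row, frame)) for row in W]
    gate = [sigmoid(sum(w * v for w, v in zip(row, frame))) for row in V]
    return [a * g for a, g in zip(lin, gate)]

xs = [[1.0, 0.0], [2.0, 1.0], [3.0, 2.0]]   # 3 frames, 2 channels
kernels = [[1.0, 1.0], [0.5, 0.0]]          # one 2-tap filter per channel
W = [[1.0, 0.0], [0.0, 1.0]]                # identity linear path
V = [[0.0, 0.0], [0.0, 0.0]]                # zero gate weights: gate = 0.5

ctx = depthwise_conv1d(xs, kernels)         # temporal context per channel
outs = [gated_unit(f, W, V) for f in ctx]   # each frame gated independently
```

In this sketch the gate for each frame depends only on that frame's inputs, never on a previous output, so all frames could be processed together. That independence is the property that makes multi-time step parallelization possible.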

Language: eng

URI: https://hdl.handle.net/10371/150758

Files in This Item:

000000155354.pdf 2.53 MB

Appears in Collections:

College of Engineering/Engineering Practice School (공과대학/대학원)
- Dept. of Electrical and Computer Engineering (전기·정보공학부)
  - Theses (Master's Degree_전기·정보공학부)
