
Conformer-based Knowledge Distillation Speech Representation Learning Model: 컨포머 기반 지식증류 기법을 사용한 발화 표현 학습 모델

Authors

안진형

Advisor
이교구
Issue Date
2023
Publisher
서울대학교 대학원
Keywords
self-supervised learning; speech representation learning; knowledge distillation; model compression
Description
Thesis (M.S.) -- Seoul National University Graduate School: College of Engineering, Interdisciplinary Program in Artificial Intelligence, February 2023. Advisor: 이교구.
Abstract
Self-supervised learning (SSL) in speech involves training a network on large-scale unlabeled speech corpora and using the learned representations from the network to perform downstream tasks. Commonly applied downstream tasks for speech SSL models are content-related, such as speech recognition and phoneme recognition, and SSL models have achieved successful results in these tasks.
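For illustration only (not taken from the thesis), the following minimal sketch shows the usage pattern described above: frozen frame-level representations are extracted from a pre-trained SSL model and passed to a small downstream head. The torchaudio wav2vec 2.0 bundle and the 40-class phoneme probe are stand-in assumptions.

import torch
import torchaudio

# Pre-trained SSL model, used here only as a stand-in representation extractor.
bundle = torchaudio.pipelines.WAV2VEC2_BASE
ssl_model = bundle.get_model().eval()

waveform = torch.randn(1, bundle.sample_rate)        # one second of dummy audio
with torch.no_grad():
    layer_outputs, _ = ssl_model.extract_features(waveform)
frame_repr = layer_outputs[-1]                       # (batch, frames, feature_dim)

# Downstream head, e.g. a frame-wise linear phoneme classifier trained on
# labeled data while the SSL encoder stays frozen (40 classes is an assumption).
probe = torch.nn.Linear(frame_repr.size(-1), 40)
logits = probe(frame_repr)                           # per-frame phoneme logits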

Despite their success, state-of-the-art SSL models have limited usefulness for performing ASR in real-world situations. The major downside of using these models in real-world situations is their large parameter size: using SSL models effortlessly requires a resourceful computational environment where GPUs with enough memory are available. The large parameter size further limits the application of SSL models in on-device settings.

To overcome the shortcomings caused by the parameter size and the limited usability of SSL models in speech, we develop a parameter-efficient model using a distillation method and effectively reduce the memory footprint with tolerable degradation in downstream task performance. Specifically, we develop a parameter-efficient network structure based on the Conformer encoder instead of the commonly used Transformer encoder. Our distillation method uses content-informative teacher labels created by quantizing the representations of a pre-trained teacher SSL model. Furthermore, we show that our distillation method suits the Conformer encoder better than the Transformer encoder.
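A rough sketch of this training setup is given below, under assumptions rather than the thesis' exact recipe: the k-means codebook, the log-mel front end, the Conformer configuration, the frame alignment, and the cross-entropy objective are all assumptions. The teacher's representations are quantized into discrete content labels, and a Conformer student is trained to predict them.

import torch
import torchaudio

# Frozen pre-trained teacher SSL model (stand-in for the thesis' teacher).
teacher = torchaudio.pipelines.WAV2VEC2_BASE.get_model().eval()

n_codes, d_model = 512, 256                      # assumed codebook size / student width

# Student: log-mel front end (an assumption) projected into a Conformer encoder.
melspec = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_mels=80)
proj_in = torch.nn.Linear(80, d_model)
encoder = torchaudio.models.Conformer(
    input_dim=d_model, num_heads=4, ffn_dim=1024,
    num_layers=6, depthwise_conv_kernel_size=31,
)
head = torch.nn.Linear(d_model, n_codes)         # predicts discrete teacher labels

def distillation_step(waveform, codebook):
    """One step: quantize teacher features into labels, fit the student to them."""
    with torch.no_grad():
        teacher_feats, _ = teacher.extract_features(waveform)
        t = teacher_feats[-1]                    # (batch, teacher_frames, 768)
        # Quantize each teacher frame to its nearest codebook centroid
        # (centroids assumed to come from offline k-means on teacher features).
        dists = torch.cdist(t, codebook.expand(t.size(0), -1, -1))
        targets = dists.argmin(dim=-1)           # discrete "content" labels

    mel = melspec(waveform).transpose(1, 2).log1p()   # (batch, student_frames, 80)
    out, _ = encoder(proj_in(mel), torch.full((mel.size(0),), mel.size(1)))
    logits = head(out)                           # (batch, student_frames, n_codes)

    # Align student frames to teacher frames before the cross-entropy loss
    # (nearest-frame interpolation here; the thesis may align differently).
    logits = torch.nn.functional.interpolate(
        logits.transpose(1, 2), size=targets.size(1), mode="nearest"
    ).transpose(1, 2)
    return torch.nn.functional.cross_entropy(
        logits.reshape(-1, n_codes), targets.reshape(-1)
    )

A full training run would pre-compute the k-means codebook over teacher representations once and then iterate this step over the unlabeled corpus.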

To measure the performance of our model's representations, downstream tasks in the Speech processing Universal PERformance Benchmark (SUPERB) and ZeroSpeech2021 are used for evaluation. Our model shows on-par or better results than some of the existing SSL models on content-related downstream tasks in SUPERB and outperforms the existing distillation model in phoneme recognition. Moreover, in ZeroSpeech2021, our model shows superior performance to state-of-the-art SSL models such as ContentVec, HuBERT, and wav2vec 2.0 on phonetic tasks.

Furthermore, the techniques used in the distillation method are investigated through an ablation study. Our model, trained with the proposed distillation method, reduces the parameter size of the teacher model by 78% and increases inference speed by 78% when using only CPUs and by 99% when using a single GPU.
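As an illustration of how such efficiency numbers are typically obtained (a sketch, not the thesis' benchmark script; student, teacher, and wav below are hypothetical):

import time
import torch

def count_params(model: torch.nn.Module) -> int:
    """Total number of model parameters."""
    return sum(p.numel() for p in model.parameters())

@torch.no_grad()
def mean_latency(model: torch.nn.Module, example: torch.Tensor,
                 device: str = "cpu", n_runs: int = 20) -> float:
    """Average forward-pass time in seconds on the given device."""
    model = model.to(device).eval()
    example = example.to(device)
    model(example)                               # warm-up run
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n_runs):
        model(example)
    if device == "cuda":
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / n_runs

# Example usage:
# size_reduction = 1 - count_params(student) / count_params(teacher)
# cpu_speed_up   = mean_latency(teacher, wav, "cpu") / mean_latency(student, wav, "cpu")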
Language
eng
URI
https://hdl.handle.net/10371/193426

https://dcollection.snu.ac.kr/common/orgView/000000174795