Conformer-based Knowledge Distillation Speech Representation Learning Model : 컨포머 기반 지식증류 기법을 사용한 발화 표현 학습 모델

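Publisher: Seoul National University Graduate School
Keywords: self-supervised learning, speech representation learning, knowledge distillation, model compression
Thesis (Master's) -- Seoul National University Graduate School: College of Engineering, Interdisciplinary Program in Artificial Intelligence, February 2023. 이교구.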
Self-supervised learning (SSL) in speech involves training a network on large-scale unlabeled speech corpora and using the representations learned by the network to perform downstream tasks. The speech downstream tasks most commonly used with SSL models are content-related, such as speech recognition and phoneme recognition, and SSL models have achieved strong results on them.

Despite their success, state-of-the-art SSL models see limited use for automatic speech recognition (ASR) in real-world settings. The major drawback of using these models in real-world settings is their large parameter count. Using SSL models comfortably requires a well-resourced computational environment in which GPUs with enough memory are available. The large parameter count further limits the application of SSL models in on-device settings.

To overcome the shortcomings caused by the parameter size and the resulting limited usage of SSL models in speech, we develop a parameter-efficient model using a distillation method and effectively reduce the memory footprint with tolerable degradation in downstream task performance. Specifically, we develop a parameter-efficient network structure based on a Conformer encoder instead of the commonly used Transformer encoder. Our distillation method uses content-informative teacher labels created by quantizing the representations of a pre-trained teacher SSL model. Furthermore, we show that our distillation method is better suited to a Conformer encoder than to a Transformer encoder.
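
The label-creation step described above can be illustrated with a minimal sketch. It is not the thesis implementation: the abstract says only that teacher representations are quantized, so the use of k-means as the quantizer, the cross-entropy objective, the cluster count, and all function names here are assumptions made for illustration. Random arrays stand in for the teacher features and the student outputs.

```python
import numpy as np

def assign(features, centroids):
    """Index of the nearest centroid (squared Euclidean) for every frame."""
    dists = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    return dists.argmin(axis=1)

def kmeans(features, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Initialise centroids from k distinct frames (fancy indexing copies).
    centroids = features[rng.choice(len(features), size=k, replace=False)]
    for _ in range(iters):
        labels = assign(features, centroids)
        for c in range(k):
            members = features[labels == c]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[c] = members.mean(axis=0)
    return centroids, assign(features, centroids)

def cross_entropy(logits, labels):
    """Mean cross-entropy between per-frame logits and integer labels."""
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

# Stand-in data (hypothetical sizes): 200 frames of 16-dim teacher features.
rng = np.random.default_rng(0)
num_frames, feat_dim, num_clusters = 200, 16, 8
teacher_feats = rng.normal(size=(num_frames, feat_dim))

# Step 1: quantize the teacher representations into discrete frame labels.
centroids, teacher_labels = kmeans(teacher_feats, num_clusters)

# Step 2: the student predicts one label per frame; random logits stand in
# for the output of a student encoder here.
student_logits = rng.normal(size=(num_frames, num_clusters))
loss = cross_entropy(student_logits, teacher_labels)

# A student that reproduces the teacher labels exactly drives the loss to ~0.
perfect_logits = 50.0 * np.eye(num_clusters)[teacher_labels]
perfect_loss = cross_entropy(perfect_logits, teacher_labels)
```

In a real setup the random arrays would be replaced by frame-level features from the pre-trained teacher and by the logits of the student encoder, and the loss would be minimized with gradient descent.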

To measure the quality of our model's representations, we evaluate on downstream tasks from the Speech processing Universal PERformance Benchmark (SUPERB) and ZeroSpeech2021. Our model shows results on par with or better than some of the existing SSL models on content-related downstream tasks in SUPERB and outperforms the existing distillation model in phoneme recognition. Moreover, on ZeroSpeech2021, our model outperforms state-of-the-art SSL models such as ContentVec, HuBERT, and wav2vec 2.0 on phonetics-related tasks.

Furthermore, the techniques used in the distillation method are investigated through an ablation study. The model trained with the proposed distillation method reduces the parameter size of the teacher model by 78% and increases inference speed by 78% when using only CPUs and by 99% when using a single GPU.
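
To make the reported percentages concrete, the short sketch below converts them into a student parameter count and a relative latency. Two assumptions are made here that the abstract does not confirm: the teacher size of 100 million parameters is a hypothetical round number (the abstract does not state the teacher's size), and "increases inference speed by X%" is read as a throughput gain of X%, so latency scales by 1 / (1 + X/100).

```python
def student_params(teacher_params: int, reduction_pct: int) -> int:
    """Parameters left after removing reduction_pct percent of the teacher's."""
    return teacher_params * (100 - reduction_pct) // 100

def relative_latency(speedup_pct: float) -> float:
    """Latency relative to the teacher, assuming speedup_pct is a throughput gain."""
    return 1.0 / (1.0 + speedup_pct / 100.0)

# Hypothetical teacher size, chosen only to make the arithmetic easy to follow.
HYPOTHETICAL_TEACHER_PARAMS = 100_000_000

student = student_params(HYPOTHETICAL_TEACHER_PARAMS, 78)  # 22% of the teacher
cpu_latency = relative_latency(78)  # fraction of teacher latency on CPU
gpu_latency = relative_latency(99)  # fraction of teacher latency on one GPU
```

Under this reading, a 99% speed increase corresponds to roughly half the teacher's inference time; if the figures instead denote a reduction in inference time, the relative latency would simply be 1 - X/100.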
