Publications


Effect of Adding Positional Information on Convolutional Neural Networks for End-to-End Speech Recognition

Authors

Park, Jinhwan; Sung, Wonyong

Issue Date
2020-10
Publisher
ISCA (International Speech Communication Association)
Citation
Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, pp.46-50
Abstract
Attention-based models with convolutional encoders enable faster training and inference than recurrent neural network-based ones. However, convolutional models often require a very large receptive field to achieve high recognition accuracy, which increases not only the parameter size but also the computational cost and run-time memory footprint. A convolutional encoder with a short receptive field can suffer from looping or skipping problems when the input utterance contains the same words as nearby sentences. We believe that this is due to the insufficient receptive field length, and remedy the problem by adding positional information to the convolution-based encoder. It is shown that the word error rate (WER) of a convolutional encoder with a short receptive field can be reduced significantly by augmenting it with positional information. Visualization results are presented to demonstrate the effectiveness of adding positional information. The proposed method improves the accuracy of attention models with a convolutional encoder and achieves a WER of 10.60% on TED-LIUMv2 for an end-to-end speech recognition task.
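The core idea described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's actual architecture: the function names, the depthwise averaging filter, and the use of standard sinusoidal encodings are assumptions made for the example. It shows a 1-D convolutional feature map (short receptive field) being augmented with absolute positional information before it would be passed to an attention decoder.

```python
import numpy as np

def sinusoidal_positions(num_frames, dim):
    # Standard sinusoidal positional encoding (Vaswani et al., 2017):
    # even dimensions use sin, odd dimensions use cos, with wavelengths
    # forming a geometric progression over the feature dimension.
    pos = np.arange(num_frames)[:, None]          # (T, 1)
    i = np.arange(dim // 2)[None, :]              # (1, dim/2)
    angles = pos / (10000.0 ** (2 * i / dim))     # (T, dim/2)
    pe = np.zeros((num_frames, dim))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

def conv1d_same(x, w):
    # Depthwise 1-D convolution with 'same' padding: each feature channel
    # is filtered independently; the receptive field is len(w) frames.
    pad = len(w) // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    return np.stack(
        [np.convolve(xp[:, c], w, mode="valid") for c in range(x.shape[1])],
        axis=1,
    )

T, D = 100, 16                       # frames, feature dimension (toy sizes)
features = np.random.randn(T, D)     # stand-in for acoustic features

# Short-receptive-field conv encoder output, then add positional information
# so identical local patterns at different positions become distinguishable.
encoded = conv1d_same(features, np.ones(3) / 3)
encoded = encoded + sinusoidal_positions(T, D)
```

With only the 3-frame filter, two identical word segments at different times would produce identical encoder outputs; the added encoding makes each frame's representation position-dependent, which is the property the paper attributes to reduced looping/skipping errors.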
ISSN
1990-9772
URI
https://hdl.handle.net/10371/186299
DOI
https://doi.org/10.21437/Interspeech.2020-3163