A few weeks ago, I posted an article about input-output data alignments for the Automatic Speech Recognition (ASR) task. However, I assumed the input data was the same across the models. This article discusses how we can extend the data preprocessing of speech data with several approaches.
The first approach is handcrafted data, or feature extraction. In this approach, we decompose the data into a spectrogram. This extraction is useful for capturing the pattern of the audio data by looking at its frequency components. Later, we can apply Mel filtering to obtain a representation closer to human hearing. The second approach is to learn the feature representation. This can be done by applying an NN layer directly on top of the raw data.
A. Handcrafted Data: Spectrogram
The first approach is to transform the speech sound wave into a spectrogram. There are several steps for doing this (a sketch follows the list):
- Dithering: Add Gaussian noise to the data to avoid zero values.
- Removing DC offset (centering): Subtract the mean from the values so the signal is zero-mean.
- Pre-emphasis: amplify the high-frequency components so they are not drowned out by the higher-energy low-frequency components.
- Discrete Fourier Transform (DFT): Decompose the signal into its frequency components with the Fourier Transform.
- Short-Time Fourier Transform (STFT): Extract the spectra with a 25 ms window size and a 10 ms hop.
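Below is a minimal NumPy/SciPy sketch of these steps. The dither scale (1e-5), the pre-emphasis coefficient (0.97), and the use of scipy.signal.stft are my own assumptions for illustration; the 25 ms / 10 ms framing follows the list above.

```python
import numpy as np
from scipy.io import wavfile
from scipy.signal import stft

def power_spectrogram(path, dither=1e-5, preemph=0.97):
    sr, x = wavfile.read(path)                 # sampling rate and raw waveform (assumes mono WAV)
    x = x.astype(np.float64)

    x = x + dither * np.random.randn(len(x))       # dithering: avoid exact zero values
    x = x - x.mean()                               # remove the DC offset (zero-mean signal)
    x = np.append(x[0], x[1:] - preemph * x[:-1])  # pre-emphasis of high frequencies

    # STFT with a 25 ms window and a 10 ms hop
    win = int(0.025 * sr)
    hop = int(0.010 * sr)
    freqs, times, Z = stft(x, fs=sr, nperseg=win, noverlap=win - hop)
    return freqs, times, np.abs(Z) ** 2        # power spectrogram: (n_freq_bins, n_frames)
```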
B. Handcrafted Data: Mel Frequency Cepstral Coefficients (MFCCs)
MFCCs provide a data representation that better matches human hearing. Unlike the spectrogram, whose main purpose is to decompose the audio data into its frequency components, MFCC uses Mel filters to map the audio signal onto a scale that is close to the human perception of sound. Here are the details (a sketch follows the list):
- Generate the spectrogram data.
- Extract the Mel spectrogram: This is the dot product between the spectrogram data and the n Mel filters. First, we have to define the number of Mel filters. Then, we perform the dot product with every Mel filter.
- Log transform: Apply a log transformation to get the log-Mel spectrum.
- Discrete Cosine Transform (DCT): Apply the DCT to the log-transformed result to obtain the MFCCs.
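Here is a rough NumPy/SciPy sketch that continues from the power spectrogram above. The triangular filterbank construction, the number of Mel filters (26), and keeping 13 coefficients are common choices I am assuming here, not values fixed by the material linked below.

```python
import numpy as np
from scipy.fft import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(power_spec, sr, n_mels=26, n_mfcc=13):
    # power_spec: (n_freq_bins, n_frames), e.g. the output of the spectrogram sketch above
    n_bins = power_spec.shape[0]

    # Triangular Mel filterbank with n_mels filters between 0 Hz and the Nyquist frequency
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bin_pts = np.round(mel_to_hz(mel_pts) / (sr / 2.0) * (n_bins - 1)).astype(int)
    fbank = np.zeros((n_mels, n_bins))
    for m in range(1, n_mels + 1):
        lo, mid, hi = bin_pts[m - 1], bin_pts[m], bin_pts[m + 1]
        for k in range(lo, mid):
            fbank[m - 1, k] = (k - lo) / max(mid - lo, 1)
        for k in range(mid, hi):
            fbank[m - 1, k] = (hi - k) / max(hi - mid, 1)

    mel_spec = fbank @ power_spec          # dot product with every Mel filter
    log_mel = np.log(mel_spec + 1e-10)     # log transform -> log-Mel spectrum
    return dct(log_mel, type=2, axis=0, norm='ortho')[:n_mfcc]  # DCT -> keep the first MFCCs
```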
Learn more about spectrograms and MFCC: https://jonathan-hui.medium.com/speech-recognition-feature-extraction-mfcc-plp-5455f5a69dd9
Generate a spectrogram using SciPy: https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.spectrogram.html
C. Feature Learning: Convolutional Neural Network (CNN)
In principle, we could apply an RNN layer such as an LSTM or GRU directly on top of the raw audio input. However, the variability of the high-dimensional audio data leads to inefficient model learning. Introducing a Convolutional Neural Network (CNN) layer is therefore preferable on a speech dataset. The CNN layer reduces this variability by transforming the raw waveform into a lower-dimensional feature representation using convolution filters. The main goal is to learn the feature representation of the sound wave and later use this representation as a replacement for the handcrafted features (A & B) in the ASR system.
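As a sketch of this idea (in PyTorch, which I am assuming here), a stack of strided 1-D convolutions can map a raw waveform to a lower-rate sequence of feature vectors. The kernel sizes, strides, and feature dimension below are illustrative choices, giving roughly one feature vector per 10 ms of 16 kHz audio.

```python
import torch
import torch.nn as nn

class ConvFeatureEncoder(nn.Module):
    """Strided 1-D convolutions that map a raw waveform to a lower-rate feature sequence."""
    def __init__(self, feat_dim=512):
        super().__init__()
        # Each (kernel, stride) pair downsamples the signal; the overall factor is
        # 5 * 4 * 2 * 2 * 2 = 160, i.e. about one frame per 10 ms at 16 kHz.
        specs = [(10, 5), (8, 4), (4, 2), (4, 2), (4, 2)]
        layers, in_ch = [], 1
        for kernel, stride in specs:
            layers += [nn.Conv1d(in_ch, feat_dim, kernel, stride), nn.ReLU()]
            in_ch = feat_dim
        self.encoder = nn.Sequential(*layers)

    def forward(self, wav):                  # wav: (batch, samples)
        z = self.encoder(wav.unsqueeze(1))   # -> (batch, feat_dim, frames)
        return z.transpose(1, 2)             # -> (batch, frames, feat_dim)

# Example: one second of 16 kHz audio -> roughly 100 feature frames
feats = ConvFeatureEncoder()(torch.randn(2, 16000))
print(feats.shape)
```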
Read more on Representation Learning with Contrastive Predictive Coding (CPC): https://arxiv.org/abs/1807.03748
Comparison
At the time of writing this article, Oord et al. (2018) had compared handcrafted features (MFCC) with learned features for fully supervised learning. The result showed that the handcrafted features outperformed feature learning (the CNN layer). However, computing MFCCs can be expensive during inference due to the multiple transformations involved, such as the DFT, STFT, Mel filtering, the logarithm, and the DCT. Also, because handcrafted features are rigid, they are suitable only for certain use cases. Other cases, such as a model for a language without a written form, a model for music, or a model for multi-task learning, would not be well served. Hence, feature learning can be a good alternative.
References:
- Oord, A. v. d., Li, Y., & Vinyals, O. (2018). Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748.
- Automatic Speech Recognition (ASR) course by the School of Informatics, University of Edinburgh
- [accessed from ASR slides] Chapters 1–5, Oppenheim, Willsky, and Nawab, "Signals and Systems," 1997
- [accessed from ASR slides] Chapter 2, O'Shaughnessy, "Speech Communications: Human and Machine," 2000