A few weeks ago, I posted an article about input-output data alignment for the Automatic Speech Recognition (ASR) process. In that article, I assumed the input data was identical across models. This article discusses how we can extend the data preprocessing of speech data with several techniques.
The first technique is handcrafted features, i.e., feature extraction. In this technique, we decompose the data into a spectrogram. This extraction is useful for capturing the pattern of the audio data by looking at its frequency components. We can then apply Mel filtering to obtain a representation closer to human hearing. The second technique is to learn the feature representation directly, by applying a neural network layer on top of the raw data.
A. Handcrafted Features: Spectrogram
The first technique is to transform the speech sound wave into a spectrogram. There are several steps:
- Dithering: add Gaussian noise to the data to avoid exact-zero values.
- Removing DC offset (centering): subtract the mean from the signal to make it zero-mean.
- Pre-emphasis: boost the high-frequency components so they are not masked by the higher-energy low-frequency components.
- Discrete Fourier Transform (DFT): decompose the signal into its frequency components with the Fourier Transform.
- Short-Time Fourier Transform (STFT): extract the spectra with a 25 ms window size and a 10 ms hop.
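The steps above can be sketched with NumPy and SciPy as follows (the dither level and pre-emphasis coefficient of 0.97 are illustrative choices, not values fixed by the article):

```python
import numpy as np
from scipy.signal import stft

def preprocess_and_stft(signal, sample_rate=16000, dither_std=1e-5, preemph=0.97):
    """Sketch of the preprocessing pipeline described above."""
    rng = np.random.default_rng(0)
    # Dithering: add low-level Gaussian noise to avoid exact-zero samples.
    x = signal + rng.normal(0.0, dither_std, size=signal.shape)
    # Removing DC offset: subtract the mean so the signal is zero-mean.
    x = x - x.mean()
    # Pre-emphasis: y[t] = x[t] - 0.97 * x[t-1] boosts high frequencies.
    x = np.append(x[0], x[1:] - preemph * x[:-1])
    # STFT: a windowed DFT every 10 ms over a 25 ms window.
    win = int(0.025 * sample_rate)                  # 400 samples at 16 kHz
    hop = int(0.010 * sample_rate)                  # 160 samples at 16 kHz
    _, _, Z = stft(x, fs=sample_rate, window="hann",
                   nperseg=win, noverlap=win - hop)
    return np.abs(Z) ** 2                           # power spectrogram: (freq bins, frames)

# One second of a 440 Hz tone: its energy concentrates in the 440 Hz bin.
t = np.arange(16000) / 16000
spec = preprocess_and_stft(np.sin(2 * np.pi * 440 * t))
# spec has 201 frequency bins (a 400-point FFT keeps bins 0..200).
```

With a 400-point FFT at 16 kHz, each bin spans 40 Hz, so the tone lands exactly in bin 11.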
B. Handcrafted Features: Mel Frequency Cepstral Coefficients (MFCCs)
MFCCs provide a data representation better matched to human hearing. Unlike the spectrogram, whose main purpose is to decompose the audio data into its frequency components, MFCC uses Mel filters to map the audio signal onto a scale close to the human perception of sound. Here are the details:
- Generate the spectrogram data.
- Extract the Mel spectrogram: this is the dot product between the spectrogram data and the n Mel filters. First, we define the number of Mel filters; then we carry out the dot product with each Mel filter.
- Log transform: apply a log transformation to get the log-Mel spectrum.
- Discrete Cosine Transform (DCT): apply the DCT to the log-transformed result to obtain the MFCCs.
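A minimal NumPy sketch of the Mel-filtering, log, and DCT steps is shown below. The triangular filter-bank construction is the common textbook recipe, and the choices of 26 filters and 13 coefficients are conventional defaults, not values fixed by the article:

```python
import numpy as np
from scipy.fft import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters evenly spaced on the Mel scale: (n_filters, n_fft//2 + 1)."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_filters + 2)
    bin_pts = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, center, right = bin_pts[m - 1], bin_pts[m], bin_pts[m + 1]
        for k in range(left, center):                       # rising slope
            fbank[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):                      # falling slope
            fbank[m - 1, k] = (right - k) / max(right - center, 1)
    return fbank

def mfccs(power_spec, sample_rate=16000, n_filters=26, n_mfcc=13):
    """power_spec: power spectrogram of shape (n_fft//2 + 1, frames)."""
    n_fft = (power_spec.shape[0] - 1) * 2
    fbank = mel_filterbank(n_filters, n_fft, sample_rate)
    mel_spec = fbank @ power_spec                  # dot product with each Mel filter
    log_mel = np.log(mel_spec + 1e-10)             # log transform -> log-Mel spectrum
    return dct(log_mel, type=2, axis=0, norm="ortho")[:n_mfcc]  # keep first coefficients

# Stand-in power spectrogram (201 bins from a 400-point FFT, 100 frames).
rng = np.random.default_rng(0)
feats = mfccs(rng.random((201, 100)))              # 13 coefficients per frame
```

In practice the input would be the power spectrogram computed in section A rather than random data.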
Read more on spectrograms and MFCCs: https://jonathan-hui.medium.com/speech-recognition-feature-extraction-mfcc-plp-5455f5a69dd9
Generate a spectrogram using SciPy: https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.spectrogram.html
C. Feature Learning: Convolutional Neural Network (CNN)
In principle, we could apply an RNN layer such as LSTM or GRU directly on top of the raw audio input. However, the variability of the high-dimensional audio data leads to inefficient model learning. Introducing a Convolutional Neural Network (CNN) layer is therefore preferable on a speech dataset. The CNN layer reduces this variability by transforming the input into a lower-dimensional feature representation using convolution filters. The main goal is to learn a feature representation of the sound wave, and later use this representation as an alternative to the handcrafted features (A & B) in the ASR system.
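As a toy illustration of this downsampling, the sketch below runs strided 1-D convolutions over raw audio. The strides and kernel sizes are those of the CPC encoder in Oord et al. (2018), which reduces 16 kHz audio by a factor of 160 (one feature vector per 10 ms); the channel count is shrunk to 8 and the weights are random, so this only shows the shape of the computation, not a trained model:

```python
import numpy as np

def conv1d(x, weight, stride):
    """Naive strided 1-D convolution with ReLU (valid padding).
    x: (in_channels, time), weight: (out_channels, in_channels, kernel)."""
    out_ch, in_ch, k = weight.shape
    n_frames = (x.shape[1] - k) // stride + 1
    out = np.empty((out_ch, n_frames))
    for t in range(n_frames):
        window = x[:, t * stride : t * stride + k]          # (in_ch, k)
        out[:, t] = np.maximum(
            np.tensordot(weight, window, axes=([1, 2], [0, 1])), 0.0)
    return out

rng = np.random.default_rng(0)
strides = [5, 4, 2, 2, 2]        # total downsampling: 5*4*2*2*2 = 160
kernels = [10, 8, 4, 4, 4]
x = rng.normal(size=(1, 16000))  # 1 s of raw audio at 16 kHz
in_ch = 1
for s, k in zip(strides, kernels):
    w = rng.normal(scale=0.1, size=(8, in_ch, k))           # random, untrained filters
    x = conv1d(x, w, s)
    in_ch = 8
# x now has shape (8, 98): roughly one 8-dim feature vector per 10 ms.
```

A real encoder would of course use learned weights (e.g. trained with the CPC objective) and far more channels.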
Read more on Representation Learning with Contrastive Predictive Coding (CPC): https://arxiv.org/abs/1807.03748
Comparison
At the time of writing, Oord et al. (2018) had compared handcrafted features (MFCC) against feature learning in a fully supervised setting. The result showed that the handcrafted features outperformed feature learning (a CNN layer). However, computing MFCCs can be computationally expensive during inference because of the many transformations involved: DFT, STFT, Mel filtering, the logarithm, and the DCT. In addition, handcrafted features are rigid, so they may be appropriate only for certain use cases. Other settings, such as a model for a language without a written form, a model for music, or a model for multi-task learning, would not be well served. Therefore, feature learning can be a good alternative.
References:
- Oord, A. v. d., Li, Y., & Vinyals, O. (2018). Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748.
- Automatic Speech Recognition (ASR) course by the School of Informatics, University of Edinburgh
- [accessed from ASR slides] Chapters 1–5, Oppenheim, Willsky, and Nawab, "Signals and Systems," 1997
- [accessed from ASR slides] Chapter 2, O'Shaughnessy, "Speech Communications: Human and Machine," 2000