A few weeks ago, I posted an article about input-output data alignments for the Automatic Speech Recognition (ASR) task. However, I assumed the input data was the same across the models. This article discusses how we can extend the data preprocessing of speech data with several approaches.
The first approach is handcrafted data, or feature extraction. In this approach, we decompose the data into a spectrogram. This extraction is useful for capturing the pattern of the audio data by looking at its frequency components. Later, we can apply Mel filtering to obtain a representation closer to human hearing. The second approach is to learn the feature representation. This can be done by applying an NN layer directly on top of the raw data.
A. Handcrafted Data: Spectrogram
The first approach is to transform the speech sound wave into a spectrogram. There are several steps for doing this (a sketch follows the list):
- Dithering: Add Gaussian noise to the data to avoid zero values.
- Removing DC offset (centering): Subtract the mean from the values so the signal is zero-mean.
- Pre-emphasis: amplify the high-frequency components so they are not drowned out by the higher-energy low-frequency components.
- Discrete Fourier Transform (DFT): Decompose the signal into its frequency components with the Fourier Transform.
- Short-Time Fourier Transform (STFT): Extract the spectra with a 25 ms window size and a 10 ms hop.
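Below is a minimal NumPy/SciPy sketch of these steps. The dither scale (1e-5), the pre-emphasis coefficient (0.97), and the use of scipy.signal.stft are my own assumptions for illustration; the 25 ms / 10 ms framing follows the list above.

```python
import numpy as np
from scipy.io import wavfile
from scipy.signal import stft

def power_spectrogram(path, dither=1e-5, preemph=0.97):
    sr, x = wavfile.read(path)                 # sampling rate and raw waveform (assumes mono WAV)
    x = x.astype(np.float64)

    x = x + dither * np.random.randn(len(x))       # dithering: avoid exact zero values
    x = x - x.mean()                               # remove the DC offset (zero-mean signal)
    x = np.append(x[0], x[1:] - preemph * x[:-1])  # pre-emphasis of high frequencies

    # STFT with a 25 ms window and a 10 ms hop
    win = int(0.025 * sr)
    hop = int(0.010 * sr)
    freqs, times, Z = stft(x, fs=sr, nperseg=win, noverlap=win - hop)
    return freqs, times, np.abs(Z) ** 2        # power spectrogram: (n_freq_bins, n_frames)
```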
B. Handcrafted Data: Mel Frequency Cepstral Coefficients (MFCCs)
MFCCs provide a data representation that better matches human hearing. Unlike the spectrogram, whose main purpose is to decompose the audio data into its frequency components, MFCC uses Mel filters to map the audio signal onto a scale that is close to the human perception of sound. Here are the details (a sketch follows the list):
- Generate the spectrogram data.
- Extract the Mel spectrogram: This is the dot product between the spectrogram data and the n Mel filters. First, we have to define the number of Mel filters. Then, we perform the dot product with every Mel filter.
- Log transform: Apply a log transformation to get the log-Mel spectrum.
- Discrete Cosine Transform (DCT): Apply the DCT to the log-transformed result to obtain the MFCCs.
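Here is a rough NumPy/SciPy sketch that continues from the power spectrogram above. The triangular filterbank construction, the number of Mel filters (26), and keeping 13 coefficients are common choices I am assuming here, not values fixed by the material linked below.

```python
import numpy as np
from scipy.fft import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(power_spec, sr, n_mels=26, n_mfcc=13):
    # power_spec: (n_freq_bins, n_frames), e.g. the output of the spectrogram sketch above
    n_bins = power_spec.shape[0]

    # Triangular Mel filterbank with n_mels filters between 0 Hz and the Nyquist frequency
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bin_pts = np.round(mel_to_hz(mel_pts) / (sr / 2.0) * (n_bins - 1)).astype(int)
    fbank = np.zeros((n_mels, n_bins))
    for m in range(1, n_mels + 1):
        lo, mid, hi = bin_pts[m - 1], bin_pts[m], bin_pts[m + 1]
        for k in range(lo, mid):
            fbank[m - 1, k] = (k - lo) / max(mid - lo, 1)
        for k in range(mid, hi):
            fbank[m - 1, k] = (hi - k) / max(hi - mid, 1)

    mel_spec = fbank @ power_spec          # dot product with every Mel filter
    log_mel = np.log(mel_spec + 1e-10)     # log transform -> log-Mel spectrum
    return dct(log_mel, type=2, axis=0, norm='ortho')[:n_mfcc]  # DCT -> keep the first MFCCs
```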
Learn more about spectrograms and MFCC: https://jonathan-hui.medium.com/speech-recognition-feature-extraction-mfcc-plp-5455f5a69dd9
Generate a spectrogram using SciPy: https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.spectrogram.html
C. Feature Learning: Convolutional Neural Network (CNN)
In principle, we could apply an RNN layer such as an LSTM or GRU directly on top of the raw audio input. However, the variability of the high-dimensional audio data leads to inefficient model learning. Introducing a Convolutional Neural Network (CNN) layer is therefore preferable on a speech dataset. The CNN layer reduces this variability by transforming the raw waveform into a lower-dimensional feature representation using convolution filters. The main goal is to learn the feature representation of the sound wave and later use this representation as a replacement for the handcrafted features (A & B) in the ASR system.
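As a sketch of this idea (in PyTorch, which I am assuming here), a stack of strided 1-D convolutions can map a raw waveform to a lower-rate sequence of feature vectors. The kernel sizes, strides, and feature dimension below are illustrative choices, giving roughly one feature vector per 10 ms of 16 kHz audio.

```python
import torch
import torch.nn as nn

class ConvFeatureEncoder(nn.Module):
    """Strided 1-D convolutions that map a raw waveform to a lower-rate feature sequence."""
    def __init__(self, feat_dim=512):
        super().__init__()
        # Each (kernel, stride) pair downsamples the signal; the overall factor is
        # 5 * 4 * 2 * 2 * 2 = 160, i.e. about one frame per 10 ms at 16 kHz.
        specs = [(10, 5), (8, 4), (4, 2), (4, 2), (4, 2)]
        layers, in_ch = [], 1
        for kernel, stride in specs:
            layers += [nn.Conv1d(in_ch, feat_dim, kernel, stride), nn.ReLU()]
            in_ch = feat_dim
        self.encoder = nn.Sequential(*layers)

    def forward(self, wav):                  # wav: (batch, samples)
        z = self.encoder(wav.unsqueeze(1))   # -> (batch, feat_dim, frames)
        return z.transpose(1, 2)             # -> (batch, frames, feat_dim)

# Example: one second of 16 kHz audio -> roughly 100 feature frames
feats = ConvFeatureEncoder()(torch.randn(2, 16000))
print(feats.shape)
```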
Read more on Representation Learning with Contrastive Predictive Coding (CPC): https://arxiv.org/abs/1807.03748
Comparison
At the time of writing this article, Oord et al. (2018) had compared handcrafted features (MFCC) with learned features for fully supervised learning. The result showed that the handcrafted features outperformed feature learning (the CNN layer). However, computing MFCCs can be expensive during inference due to the multiple transformations involved, such as the DFT, STFT, Mel filtering, the logarithm, and the DCT. Also, because handcrafted features are rigid, they are suitable only for certain use cases. Other cases, such as a model for a language without a written form, a model for music, or a model for multi-task learning, would not be well served. Hence, feature learning can be a good alternative.
References:
- Oord, A. v. d., Li, Y., & Vinyals, O. (2018). Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748.
- Automatic Speech Recognition (ASR) course by the School of Informatics, University of Edinburgh
- [accessed from ASR slides] Chapters 1–5, Oppenheim, Willsky, and Nawab, "Signals and Systems," 1997
- [accessed from ASR slides] Chapter 2, O'Shaughnessy, "Speech Communications: Human and Machine," 2000