With these various needs in mind, and currently focusing on Quality, I set out to find and evaluate Hebrew ASR models, specifically comparing:
- Amazon Transcribe
- OpenAI Whisper large-v2 and large-v3 (vanilla)
- Ivrit.AI's fine-tuned Whisper (ivrit-ai/whisper-large-v2-tuned)
I ran these comparisons in Amazon SageMaker Studio, with a Jupyter Notebook running on an ml.g5.4xlarge instance (it can run on a smaller instance, but I wanted the extra horsepower to run things in parallel).
Running Amazon Transcribe was the easiest: it has a very well documented API, and with the boto3 SDK (AWS SDK for Python) it was simple to get working. There aren't many parameters to play with for basic transcription; the parameters are mostly feature toggles, e.g. enable diarization, enable custom vocabulary, enable channel identification, etc.
import boto3
import uuid

transcribe_client = boto3.client("transcribe")

file_uri = f's3://{bucket_name}/{object_key}'
media = file_uri.split(".")[-1]  # media format taken from the file extension, e.g. "mp3"
job_name = str(uuid.uuid4())

transcribe_client.start_transcription_job(
    TranscriptionJobName=job_name,
    Media={"MediaFileUri": file_uri},
    MediaFormat=media,
    LanguageCode="he-IL",  # Hebrew; the API requires a language code (or language identification)
)
The code above is all you need to start a basic transcription job with no special options enabled. You then poll for job completion and retrieve the output, a JSON containing the transcription plus timestamps for every word and punctuation mark.
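For reference, polling and retrieval could look roughly like the sketch below. This is my own illustration rather than code from the original notebook; it reuses transcribe_client and job_name from the snippet above and fetches the transcript JSON from the URI that Transcribe returns.

import json
import time
import urllib.request

# Poll until the job reaches a terminal state.
while True:
    job = transcribe_client.get_transcription_job(TranscriptionJobName=job_name)
    status = job["TranscriptionJob"]["TranscriptionJobStatus"]
    if status in ("COMPLETED", "FAILED"):
        break
    time.sleep(10)

if status == "COMPLETED":
    transcript_uri = job["TranscriptionJob"]["Transcript"]["TranscriptFileUri"]
    with urllib.request.urlopen(transcript_uri) as response:
        transcript_json = json.loads(response.read())
    # The full transcript text; per-word items with timestamps live under "items".
    text = transcript_json["results"]["transcripts"][0]["transcript"]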
Whisper wasn't as straightforward; there are several parameters you can tweak and different options for running the model, which I'll cover in the next section.
For Whisper v3 I used the following code, as posted on its HuggingFace page, modified to read a local file and to pass the language kwarg for Hebrew.
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

FILE_LOCAL = "HaTankistiot_n12-mp3.mp3"

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-large-v3"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=30,
    batch_size=16,
    return_timestamps=True,
    torch_dtype=torch_dtype,
    device=device,
)

prediction = pipe(FILE_LOCAL, generate_kwargs={"language": "hebrew"})["text"]
For Whisper v2 I used the Long-Form Transcription example as posted on its HuggingFace page, again modified to read a local file and to pass the language kwarg for Hebrew.
import torch
from transformers import pipeline

FILE_LOCAL = "HaTankistiot_n12-mp3.mp3"

device = "cuda:0" if torch.cuda.is_available() else "cpu"

pipe = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v2",
    chunk_length_s=30,
    batch_size=8,
    device=device,
)

prediction = pipe(FILE_LOCAL, generate_kwargs={"language": "hebrew"})["text"]
I ran the following video through Amazon Transcribe and Whisper v2 & v3, using the most basic settings of each as published in the official documentation.
Then I scored the outputs using JiWER, a simple and fast Python package for evaluating ASR systems. It supports several measures; I chose these two:
- Word Error Rate (WER)
- Word Information Lost (WIL)
I used a reference transcription file that is word-for-word accurate to the clip (manually verified by me). Just before running the JiWER evaluation I ran a small piece of code to remove all punctuation, since it introduces a significant amount of noise into the results. Practically speaking, a customer is fine with bad punctuation; what they mostly care about is the actual word content.
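As an illustration of that evaluation step, a sketch along these lines works; reference and hypothesis are assumed to hold the reference transcription and a model's output as plain strings, and the punctuation-stripping regex is my approximation of the cleanup described above rather than the exact code from the notebook.

import re

import jiwer

def strip_punctuation(text: str) -> str:
    # Drop punctuation but keep letters (including Hebrew), digits and whitespace.
    return re.sub(r"[^\w\s]", "", text)

reference_clean = strip_punctuation(reference)
hypothesis_clean = strip_punctuation(hypothesis)

wer = jiwer.wer(reference_clean, hypothesis_clean)
wil = jiwer.wil(reference_clean, hypothesis_clean)
print(f"WER (Word Error Rate)       [lower is better] : {wer:.2%}")
print(f"WIL (Word Information Lost) [lower is better] : {wil:.2%}")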
############### Amazon Transcribe Error ###############
WER (Word Error Rate)       [lower is better] : 11.19%
WIL (Word Information Lost) [lower is better] : 18.67%

############### Whisper V3 Error ###############
WER (Word Error Rate)       [lower is better] : 39.07%
WIL (Word Information Lost) [lower is better] : 48.39%

############### Whisper V2 Error ###############
WER (Word Error Rate)       [lower is better] : 41.07%
WIL (Word Information Lost) [lower is better] : 51.04%
As can be seen, the WER is significantly better with Amazon Transcribe. I tried tweaking the vanilla models to improve this but didn't succeed. It's a bit hard to believe these results, so I'm looking for Whisper experts out there to review and comment on the code (GitHub link at the end).
First off, kudos to the small team behind Ivrit.AI; they are doing important and challenging work with very limited resources!
The team: Yair Lifshitz, Kinneret Misgav & Yanir Marmor.
Just as I did with the vanilla Whisper models, I followed the model description on HuggingFace and the instructions posted there, again using the Long-Form Transcription example.
import torch
from transformers import pipeline

FILE_LOCAL = "HaTankistiot_n12-mp3.mp3"

device = "cuda:0" if torch.cuda.is_available() else "cpu"

pipe = pipeline(
    "automatic-speech-recognition",
    model="ivrit-ai/whisper-large-v2-tuned",
    chunk_length_s=30,
    batch_size=8,
    device=device,
)

prediction = pipe(FILE_LOCAL, generate_kwargs={"language": "hebrew"})["text"]
As can be seen, it is identical to the Whisper large-v2 code, with only the model name changing.
Running this as per the instructions produces the results below, already an improvement over Whisper v2 and v3. But again this seemed strange to me, so I went on a journey to try to optimize it using instructions I found across several scattered blogs. 🙂
############### Amazon Transcribe Error ###############
WER (Word Error Rate)       [lower is better] : 11.19%
WIL (Word Information Lost) [lower is better] : 18.67%

############### Whisper V3 Error ###############
WER (Word Error Rate)       [lower is better] : 39.07%
WIL (Word Information Lost) [lower is better] : 48.39%

############### Whisper V2 Error ###############
WER (Word Error Rate)       [lower is better] : 41.07%
WIL (Word Information Lost) [lower is better] : 51.04%

############### Whisper Ivrit Error ###############
WER (Word Error Rate)       [lower is better] : 37.40%
WIL (Word Information Lost) [lower is better] : 45.54%
The final code I reached, which produces the best result I could get to, is this:
import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor, pipeline

FILE_LOCAL = "HaTankistiot_n12-mp3.mp3"

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "ivrit-ai/whisper-large-v2-tuned"

model = WhisperForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch_dtype,
    use_safetensors=True,
)
model.to(device)

processor = WhisperProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    chunk_length_s=10,
    stride_length_s=(4, 2),
    batch_size=8,
    torch_dtype=torch_dtype,
    device=device,
)

prediction = pipe(FILE_LOCAL, generate_kwargs={"language": "hebrew"})["text"]
Notice that chunk_length_s is reduced to 10s (vs 30s in the other examples) and that stride_length_s is added; both chunking and striding are explained very well here:
From the many articles I've read and from my own experiments, stride_length_s can be very unstable, sometimes producing very bad results by repeating the same word or two-word phrase over and over. I'm not sure how to 'tame' this parameter or what it means for production systems.
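One way I might probe this further is a small sensitivity sweep over stride_length_s, scoring each run against the reference. This is only a sketch of an experiment I did not run, not code from the original post; it reuses model, processor, torch_dtype, device, FILE_LOCAL, reference and the strip_punctuation helper from the earlier snippets.

import jiwer
from transformers import pipeline

# Try a few stride settings with the same 10s chunks and compare WER for each.
for stride in [(1, 1), (2, 1), (4, 2)]:
    pipe = pipeline(
        "automatic-speech-recognition",
        model=model,
        tokenizer=processor.tokenizer,
        feature_extractor=processor.feature_extractor,
        chunk_length_s=10,
        stride_length_s=stride,
        batch_size=8,
        torch_dtype=torch_dtype,
        device=device,
    )
    hypothesis = pipe(FILE_LOCAL, generate_kwargs={"language": "hebrew"})["text"]
    score = jiwer.wer(strip_punctuation(reference), strip_punctuation(hypothesis))
    print(f"stride_length_s={stride}: WER={score:.2%}")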
These are the final results after doing my best to tune the Ivrit.AI model:
############### Amazon Transcribe Error ###############
WER (Word Error Rate)       [lower is better] : 11.19%
WIL (Word Information Lost) [lower is better] : 18.67%

############### Whisper V3 Error ###############
WER (Word Error Rate)       [lower is better] : 39.07%
WIL (Word Information Lost) [lower is better] : 48.39%

############### Whisper V2 Error ###############
WER (Word Error Rate)       [lower is better] : 41.07%
WIL (Word Information Lost) [lower is better] : 51.04%

############### Whisper Ivrit Error ###############
WER (Word Error Rate)       [lower is better] : 19.87%
WIL (Word Information Lost) [lower is better] : 31.71%
As can be seen, a major win for the Ivrit.ai model over vanilla Whisper v2 & v3.
I spoke with Yair from the Ivrit.ai team and asked whether there are ways to improve the results further, in order to achieve the most accurate comparison; there were none he could suggest at this point.
Big Disclaimer:
1. This was tested on a single video (a good-quality video, but still just one).
2. I'm not expert enough with Whisper to improve the results any further; I'm fully open to comments and feedback to reach the fairest and most complete comparison.
3. My code posted to GitHub is optimized for legibility only; it could be shrunk by ~60%, but that's not the goal. 🙂
4. I did this as a side project