The Making of Whisper: An In-Depth Exploration of its Training Data and Process
A Multi-part series in which we delve deep into whisper, OpenAI's state-of-the-art automatic speech recognition model
Overview
OpenAI's Whisper, a revolutionary Automatic Speech Recognition (ASR) model, emerged in late 2022 as a beacon of innovation, outshining industry giants like Google and Amazon in terms of accuracy and robustness*.
What sets Whisper apart is not just its stellar performance, but OpenAI's unprecedented move to release both its architecture and checkpoints under the Apache 2.0 license, offering enthusiasts the freedom to download, deploy, and fine-tune this cutting-edge model.
In this blog series, we embark on a deep dive into Whisper, aiming to unravel the mysteries behind its groundbreaking success. Our journey begins with an exploration of the research paper OpenAI published, shedding light on the intricate development process that led to Whisper's unparalleled capabilities.
*On some benchmarks
The Training Dilemma
In the realm of machine learning, training models can be approached in two primary ways: Supervised Training and Unsupervised Training.
Unsupervised Training
This type of training doesn't require labels, which means we can train the model on audio alone, without the need for expensive and labor-intensive labeling (creating transcripts of all the audio recordings).
The model is supposed to "learn" from the audio without being shown how to transcribe it. This is why it's called unsupervised; nobody is showing it how to do the task.
Since we don't need labels, we can use large datasets of unlabeled speech. A scale of 1,000,000 (yep, that’s a million!) hours of training data has been achieved in SOTA research
The outcome of this type of training is a pre-trained audio encoder that learns high-quality representations of speech.
These pre-trained encoders lack an equivalently performant decoder that can map those representations to usable outputs (e.g. transcript/translation)
This is why they require a finetuning stage in order to actually perform a task such as speech recognition.
In the finetuning stage, we train the model on a much smaller (on the scale of 10k hours instead of 1M hours) labeled dataset (i.e. (audio, label) pairs)
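To make this pretrain-then-finetune recipe concrete, here is a minimal PyTorch sketch. Everything in it is a hypothetical stand-in: a tiny, randomly initialized encoder plays the role of a real pretrained checkpoint (e.g. a wav2vec2-style model), and a toy one-second clip plays the role of the labeled pair. The point is only to show where the new task head and the labeled data come in.

```python
import torch
import torch.nn as nn

# Stand-in for a pretrained audio encoder. In practice you would load real
# pretrained weights; here it is randomly initialized purely for illustration.
class PretrainedAudioEncoder(nn.Module):
    def __init__(self, hidden_dim=256):
        super().__init__()
        self.feature_extractor = nn.Conv1d(1, hidden_dim, kernel_size=10, stride=5)
        layer = nn.TransformerEncoderLayer(d_model=hidden_dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, audio):                            # audio: (batch, samples)
        x = self.feature_extractor(audio.unsqueeze(1))   # (batch, hidden, frames)
        return self.encoder(x.transpose(1, 2))           # (batch, frames, hidden)

vocab_size = 32                                  # characters + CTC blank at index 0
encoder = PretrainedAudioEncoder()
ctc_head = nn.Linear(256, vocab_size)            # the new, task-specific output layer
criterion = nn.CTCLoss(blank=0)
optimizer = torch.optim.AdamW(
    list(encoder.parameters()) + list(ctc_head.parameters()), lr=1e-4
)

# One fine-tuning step on a toy (audio, label) pair.
audio = torch.randn(1, 16000)                    # one second of fake 16 kHz audio
labels = torch.tensor([[5, 12, 12, 8]])          # fake character indices

log_probs = ctc_head(encoder(audio)).log_softmax(-1)   # (1, frames, vocab)
loss = criterion(
    log_probs.transpose(0, 1),                   # CTCLoss expects (frames, batch, vocab)
    labels,
    torch.tensor([log_probs.shape[1]]),          # input lengths
    torch.tensor([labels.shape[1]]),             # target lengths
)
loss.backward()
optimizer.step()
```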
When fine-tuned on standard benchmarks, this approach has improved the state of the art, especially in low-data settings.
However, this approach has many downsides. First, it can't be used off the shelf. We need a fine-tuning step before we can use it to transcribe audio. This finetuning step requires a skilled practitioner (often a team of data scientists or machine learning engineers).
Second, these models are not robust; they perform well in environments similar to the finetuning data, but they don't generalize well to other datasets and distributions.
A model can achieve "superhuman" performance on the dataset it was trained on, reaching a lower error rate than a human, yet still make many basic errors when evaluated on another dataset with a different distribution.
Supervised Training
This type of training requires human-generated labels, i.e. we need (audio, text) pairs. Researchers often source audio recordings from different domains and label them manually. They also incorporate existing labeled datasets into their training.
Models trained in this manner can be robust; they generalize much more effectively to held-out datasets than models trained on a single source.
These works achieve this by combining as many existing high-quality speech recognition datasets as possible.
However, this approach is very limited. SOTA research (SpeechStew) mixes together 7 datasets totalling 5k hours of labeled data. This is around 208 days of constant audio, which is massive!
But compared to the 1M hours of unlabeled speech used in unsupervised training, this is just a drop in the bucket (more than 100x smaller).
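For concreteness, here is the back-of-the-envelope arithmetic behind those numbers, using the figures quoted above:

```python
labeled_hours = 5_000        # SpeechStew's combined labeled data
unlabeled_hours = 1_000_000  # scale reached by unsupervised pretraining

print(labeled_hours / 24)               # ~208 days of continuous audio
print(unlabeled_hours / labeled_hours)  # 200x gap between the two regimes
```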
This presents an important question: What if we scale supervised data to hundreds of thousands of hours e.g. 100k hours?
Unfortunately, this is a very difficult task, as it would require millions of hours of human labor. Clearly, we need to rethink how we generate labeled data at massive scale. This is where weakly supervised training comes into play!
Weakly Supervised Training
Weakly supervised training aims to combine the scale of unsupervised training with the robustness and ease of use of (human-)supervised training.
To do that, we need hundreds of thousands of hours of audio to meet the scaling criteria but we also need labels for these audio recordings to meet the robustness criteria and eliminate the need for a finetuning step.
How do we obtain massive amounts of labeled data? Well, the answer is to balance both the scale and quality of the training data.
Recent work in computer vision has demonstrated that moving beyond gold-standard crowdsourced datasets such as ImageNet to much larger but weakly supervised datasets significantly improves the robustness and generalization of models.
By relaxing the requirement of gold-standard human-validated transcripts, Chen et al. employ advanced automated pipelines to scale weakly supervised speech recognition to 10,000-30,000 hours of noisier training data, a reasonable trade-off between quality and quantity.
However, these datasets are still around two orders of magnitude smaller than those used in unsupervised training.
The OpenAI team managed to scale their curated dataset to 680k hours of audio, roughly the same order of magnitude as unsupervised datasets.
*Depiction of OpenAI folks balancing quality and scale of training data, circa 2022*
The Whisper Method: Scaling to Unprecedented Heights
Whisper was trained on 680k hours of labeled audio! This is an order of magnitude larger than the previous SOTA (10k-30k hours)
The name Whisper was inspired by "Web-scale Supervised Pretraining for Speech Recognition" (WSPSR). If you drop the second S, it becomes WSPR, which can be pronounced like "whisper".
They created this massive dataset by essentially scraping the internet for (audio, text) pairs. This approach yields diverse audio, since audio on the web comes from different environments, recording setups, speakers, and languages: podcasts, live streams, audiobooks, plays, presidential speeches. The sky is the limit here.
Diversity in audio is good; diversity in transcripts is terrible!
While diverse audio is crucial for building a robust ASR that can generalize across different environments, sub-par labels can massively hamper a model's training process.
Upon inspection, the team found a large number of subpar transcripts in the raw dataset: partial or incomplete transcripts, grammatical errors, or even outright wrong transcripts that have nothing to do with the audio.
If these instances are kept, they will significantly bring down the dataset's quality. And at this scale, the team can't just manually listen to every minute of audio and validate the transcripts themselves.
Solution: Automated Filtering
Many transcripts on the internet are not human-generated; they are the output of existing ASR systems.
Research has shown that training on a mix of human- and machine-generated transcripts can significantly impair the performance of transcription systems.
The Whisper team developed several heuristics to detect and remove machine-generated transcripts from the training dataset.
For example, if a transcript is all UPPERCASE or all lowercase, it's unlikely to be high-quality, human-generated text, since a professional transcriber writes coherent, capitalized, and punctuated text.
Another way is to check whether the language of the audio matches the language of the text; if the audio is in Spanish and the text is in Japanese, it is very unlikely that the transcript is of high quality. To do this, they developed a machine learning model to detect the spoken language in an audio recording.
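The paper doesn't publish the actual filtering code, so here is a simplified Python sketch of these two heuristics. The `langid` library stands in for a text language-ID model, and `detect_spoken_language` is a placeholder for the audio language detector OpenAI trained themselves.

```python
import langid  # pip install langid; stand-in for a text language-ID model

def looks_machine_generated(transcript: str) -> bool:
    """Crude heuristic: all-caps or all-lowercase text is unlikely to come
    from a professional human transcriber."""
    if not any(c.isalpha() for c in transcript):
        return True
    return transcript.isupper() or transcript.islower()

def detect_spoken_language(audio_path: str) -> str:
    # Placeholder: OpenAI trained their own audio language-ID model for this.
    # Swap in a real model here; we return a dummy value so the sketch runs.
    return "en"

def keep_pair(audio_path: str, transcript: str) -> bool:
    """Keep an (audio, text) pair only if the transcript looks human-written
    and its language matches the spoken language."""
    if looks_machine_generated(transcript):
        return False
    text_lang, _ = langid.classify(transcript)
    return text_lang == detect_spoken_language(audio_path)

print(keep_pair("clip.wav", "THIS IS AN ALL CAPS TRANSCRIPT"))          # False
print(keep_pair("clip.wav", "A normal, punctuated English sentence."))  # True
```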
The final step of data preparation is to cut the audio into 30-second segments. This puts a hard limit on how much audio Whisper can process at once. They also kept segments containing no speech, to train Whisper to detect whether anyone is speaking.
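As a rough illustration of that last step (assuming 16 kHz mono audio already loaded as a NumPy array; this is not OpenAI's actual preprocessing code):

```python
import numpy as np

SAMPLE_RATE = 16_000              # Whisper operates on 16 kHz audio
CHUNK_SAMPLES = 30 * SAMPLE_RATE  # 30-second windows

def split_into_segments(audio):
    """Cut a mono waveform into 30-second segments, zero-padding the last one."""
    segments = []
    for start in range(0, len(audio), CHUNK_SAMPLES):
        chunk = audio[start:start + CHUNK_SAMPLES]
        if len(chunk) < CHUNK_SAMPLES:
            chunk = np.pad(chunk, (0, CHUNK_SAMPLES - len(chunk)))
        segments.append(chunk)
    return segments

# Example: 70 seconds of silence -> three 30-second segments
print(len(split_into_segments(np.zeros(70 * SAMPLE_RATE))))  # 3
```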
Manual Inspection is Key
Yet, automation wasn't the sole answer. OpenAI recognized the necessity of manual inspection. On their quest to filter out sub-par labels, they opted for an additional filtering pass.
First, they trained an initial ASR model on the data they have.
Second, they evaluated this model on the different sources used to curate the data. The outcome of this step is an evaluation metric like word error rate (WER) for each data source (e.g. a WER of 13% on source 1 and 27% on source 2).
They sorted the data sources by a combination of high error rate and data source size.
They identified and removed low-quality sources.
This inspection showed a large amount of only partially transcribed or poorly aligned/misaligned transcripts and remaining low-quality machine-generated captions that filtering heuristics did not detect.
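The exact way error rate and source size were combined isn't spelled out in the paper, so the sketch below (with made-up numbers) simply ranks sources by the product of WER and hours, which surfaces large, high-error sources first for manual review:

```python
# Hypothetical per-source evaluation results from the initial model:
# (source name, word error rate, hours of audio contributed)
sources = [
    ("source_1", 0.13, 40_000),
    ("source_2", 0.27, 90_000),
    ("source_3", 0.55, 2_000),
]

# Rank sources so that large AND high-error sources come first; the product is
# just one reasonable choice, not necessarily the combination OpenAI used.
ranked = sorted(sources, key=lambda s: s[1] * s[2], reverse=True)
for name, wer, hours in ranked:
    print(f"{name}: WER={wer:.0%}, hours={hours:,}")
```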
P.S. I'm currently working on a blog post about the importance of manually inspecting your data and how it's crucial to developing powerful ML models. Stay tuned if you're interested!
Data leakage is the plague; avoid it at all costs!
To avoid misleading subsequent evaluation results, they perform de-duplication at a transcript level between the training dataset and the evaluation datasets that they thought were at higher risk of overlap.
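Here is a minimal sketch of what transcript-level de-duplication can look like. The normalization step (lowercasing and stripping punctuation) is my assumption; the paper doesn't specify how matches were defined.

```python
import re

def normalize(text):
    """Lowercase and strip punctuation/extra whitespace so near-identical
    transcripts collide on the same key."""
    return re.sub(r"[^a-z0-9 ]+", "", text.lower()).strip()

def deduplicate_against_eval(train_transcripts, eval_transcripts):
    """Drop any training transcript whose normalized text also appears
    in an evaluation set."""
    eval_keys = {normalize(t) for t in eval_transcripts}
    return [t for t in train_transcripts if normalize(t) not in eval_keys]

train = ["Hello, world!", "An unrelated sentence."]
evals = ["hello world"]
print(deduplicate_against_eval(train, evals))  # ['An unrelated sentence.']
```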
Recap
If you made it to this part, you might need a big-picture view of what happened. Here's a simplified recap of this blog post.
Unveiling Whisper: Scaling ASR Innovation
OpenAI's Whisper ASR model, released in 2022, disrupts the field with unparalleled accuracy. Going beyond industry giants, OpenAI freely shares the model and its architecture under Apache 2.0. The blog series starts by dissecting the research paper, revealing Whisper's secrets.
Training Dilemma: Weakly Supervised Brilliance
Whisper's strength lies in weakly supervised training, balancing unsupervised scale with supervised robustness. Leveraging a colossal 680k hours of labeled audio, it surpasses previous benchmarks.
Data Collection Marvel
Scouring the internet, OpenAI compiles diverse (audio, text) pairs for Whisper. Challenges arise with subpar transcripts, addressed through ingenious automated filtering techniques.
Automated Filtering: Precision Techniques
OpenAI employs heuristics to discern human-generated from machine-generated transcripts. Manual inspection complements automation, ensuring dataset integrity.
Guarding Against Data Leakage
De-duplication at the transcript level prevents data leakage, enhancing the reliability of evaluation outcomes.
Key Insights and Follow-up
OpenAI places a strong emphasis on data quality and approaches data curation with utmost seriousness. They not only prioritize data quality but also go above and beyond to ensure the highest standards in their training datasets.
Moreover, OpenAI is committed to maintaining transparency and integrity in the evaluation process for their trained models. Their vigilance extends to preventing data leakage and contamination during the training phase. It's no wonder they stand out as leaders in the rapidly evolving field of modern AI.
If you found this exploration of OpenAI's commitment to data quality intriguing, stay tuned for the upcoming Part 2, where we delve into the architecture of the Whisper model.