Exploring Whisper's Multitask Interface: A Closer Look at its Speech Transcription and Translation Capabilities
Part 3 of a multi-part series in which we delve deep into Whisper, OpenAI's state-of-the-art automatic speech recognition model
Overview
In the second part of the series, we talked about Whisper's architecture and how it actually generates text given an audio segment. In this part, we will discuss Whisper's multitask interface and how it is capable of transcribing speech in the same language as well as translating it into English.
Whisper: A Versatile AI Tool
1. English Transcription
Whisper is known by most people for being the state-of-the-art (SOTA) AI tool for speech-to-text (STT). It can accurately transcribe audio segments even when they contain a lot of noise or multiple speakers with different, non-native accents (Indian English, for example).
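As a point of reference, here is a minimal sketch of what this looks like with the open-source openai-whisper Python package (assuming it is installed; the checkpoint name and the file english_interview.mp3 are placeholders):

```python
import whisper

# Load one of the pre-trained checkpoints; larger checkpoints are more accurate but slower.
model = whisper.load_model("base")

# Transcribe an English recording (placeholder file name).
result = model.transcribe("english_interview.mp3")
print(result["text"])
```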
2. Multilingual Transcription
This is really impressive. But what's even more impressive is that Whisper can do this for other languages, not just English. It officially supports 96 languages (97 for large-v3), which means it can transcribe speech in any of them.
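A rough sketch of multilingual transcription with the same package (the French audio file is a placeholder); if you already know the spoken language, you can pass it explicitly:

```python
import whisper

model = whisper.load_model("base")

# Transcribe a French recording in French (placeholder file name).
result = model.transcribe("french_speech.mp3", language="fr")
print(result["text"])
```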
3. English Translation
Not only that, but Whisper is also capable of translating speech in these languages into English.
For example, given an audio segment where someone is saying "Je m'appelle Marc", we can instruct it to translate this into English and it can output something like "My name is Marc". The authors refer to this capability as X → En, where X is any supported language other than English.
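In code, switching from transcription to X → En translation is just a matter of changing the task; a sketch with the same placeholder file:

```python
import whisper

model = whisper.load_model("base")

# Translate French speech directly into English text (placeholder file name).
result = model.transcribe("french_speech.mp3", task="translate")
print(result["text"])  # e.g. "My name is Marc"
```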
4. Timestamp Prediction
And if you thought Whisper was all about predicting text, you'd be wrong. It's also capable of predicting the timestamps of spoken utterances. An example of this is the following:
00:02 Hi! My name is Amgad. Nice to meet you! 02:45
02:45 Hello Amgad, my name is Marc. Nice to meet you, too! 05:55
This works with both transcription and translation.
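With the openai-whisper package, these segment-level timestamps come back alongside the text; a minimal sketch (the conversation file is a placeholder):

```python
import whisper

model = whisper.load_model("base")
result = model.transcribe("conversation.mp3")  # timestamps are produced by default

# Each segment carries its start and end time in seconds.
for segment in result["segments"]:
    print(f"{segment['start']:.2f} -> {segment['end']:.2f}: {segment['text']}")
```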
5. Language Prediction
Whisper is also capable of predicting the language spoken in an audio segment. Given an audio segment, it can return a probability distribution over all the possible languages. An example of this is the following: {"en": 0.87, "es": 0.10, "fr": 0.05, …, "sr": 0.0001}. This is a list of 96 values (one for each supported language) that sum to one (or 100%). In this example, we can see that Whisper is pretty confident that the spoken language is English, since it has a probability of 87% while the next most likely language has a probability of only 10%.
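The package exposes this directly through detect_language, which returns a probability for every supported language; a sketch with a placeholder file name:

```python
import whisper

model = whisper.load_model("base")

# Prepare a 30-second log-Mel spectrogram, the input format the encoder expects.
audio = whisper.load_audio("some_speech.mp3")
audio = whisper.pad_or_trim(audio)
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# detect_language returns a probability distribution over the supported languages.
_, probs = model.detect_language(mel)
print(max(probs, key=probs.get))  # most likely language code, e.g. "en"
```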
6. Voice Detection
The last task it can do is voice detection. Given an audio segment, it outputs the probability of that segment containing any spoken words.
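In the package, this shows up as a no_speech_prob field on the decoding result (we will meet the underlying special token later). A sketch using the lower-level decode API, with a placeholder file name:

```python
import whisper

model = whisper.load_model("base")

audio = whisper.pad_or_trim(whisper.load_audio("maybe_silent.mp3"))
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# Decode one 30-second window and inspect the no-speech probability.
result = whisper.decode(model, mel, whisper.DecodingOptions(without_timestamps=True))
print(result.no_speech_prob)  # close to 1.0 means the window likely contains no speech
```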
Recap
To summarize, Whisper can do the following:
Predict the probability of speech in a segment
Predict the language spoken in a segment
Transcribe the speech in the same language without timestamps
Transcribe the speech in the same language with timestamps
Translate the speech into English without timestamps
Translate the speech into English with timestamps
So, how can Whisper actually do all these tasks? Well, the OpenAI team came up with a very clever idea: a multitask interface that guides the decoder to generate the desired results.
The Multitask Interface
As we know from the previous part of this series, Whisper is an encoder-decoder transformer model. We'll briefly revisit how it works here, but if you want to brush up on Whisper's architecture, please take a look at part 2, as it is important for understanding the multitask interface.
The audio segment is sent to the encoder, which outputs a high-dimensional vector that is a good representation of what's spoken in the segment. This vector (the audio encoding) is then passed to the decoder, which generates the output in an autoregressive manner (i.e. token by token).
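To make this concrete, here is a rough sketch of the two stages using the package's model helpers (embed_audio and logits); the file name is a placeholder and the details are illustrative rather than authoritative:

```python
import torch
import whisper
from whisper.tokenizer import get_tokenizer

model = whisper.load_model("base")
tokenizer = get_tokenizer(multilingual=True, language="en", task="transcribe")

# Encoder: turn the 30-second audio window into a high-dimensional representation.
audio = whisper.pad_or_trim(whisper.load_audio("some_speech.mp3"))
mel = whisper.log_mel_spectrogram(audio).to(model.device)
audio_features = model.embed_audio(mel.unsqueeze(0))

# Decoder: given the tokens generated so far, predict a distribution over the next token.
tokens = torch.tensor([list(tokenizer.sot_sequence)], device=model.device)
logits = model.logits(tokens, audio_features)
next_token = logits[0, -1].argmax().item()  # greedy choice for the next token
```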
So how does this multitask interface work? Well, all the magic happens in the decoder.
The team at OpenAI developed special tokens to guide the decoding process to obtain the desired result. Let’s take a look at the following illustration.
The tokens in orange are special tokens. You can think of them as being similar to reserved keywords in a programming language (e.g. the in operator in Python).
They include:
The start-of-transcript token. This is used to tell the decoder to start predicting text. If you're familiar with LLMs, it's similar to the start (BOS) token.
The language token. This token is used to specify the language of the speech in the audio segment.
The no-speech token. This is what the decoder uses for voice detection. The probability assigned to this token can be used to predict whether the audio segment contains any spoken words at all.
The task token. This token is used to specify the task that will be carried out by the decoder. It can be either translate or transcribe. If it’s transcribe, the decoder will return text in the same language. If it’s translate, it will return text in English.
The no-timestamps token. This is used to tell the decoder to predict only the text and not predict the timestamps of the spoken utterances.
The end-of-transcript token. This token is generated by the decoder when it believes it has finished transcribing the audio segment, so that we can exit the decoding loop. It's similar to the EOS token in LLMs.
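If you want to inspect these special tokens yourself, the package's tokenizer exposes them as attributes; a sketch (the attribute names below follow the openai-whisper tokenizer, so treat them as an assumption if you're on a different version):

```python
from whisper.tokenizer import get_tokenizer

# Multilingual tokenizer configured for Hindi transcription.
tokenizer = get_tokenizer(multilingual=True, language="hi", task="transcribe")

print(tokenizer.sot)             # start-of-transcript token ID
print(tokenizer.language_token)  # language token ID ("hi" here)
print(tokenizer.no_speech)       # no-speech token ID
print(tokenizer.transcribe)      # task token ID: transcribe
print(tokenizer.translate)       # task token ID: translate
print(tokenizer.no_timestamps)   # no-timestamps token ID
print(tokenizer.eot)             # end-of-transcript token ID
```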
But how do we actually guide the model’s prediction?
The answer is to add a sequence of special tokens to the decoder's input. For example, if we have an audio segment where someone is speaking in Hindi and we want Whisper to perform task number 3 in our list (i.e. transcribe the speech in the same language without timestamps), we simply pass the following sequence to the decoder:
[“start-of-transcript”, “Hindi”, “transcribe”, “no-timestamps”]
What if we instead want to translate what’s being said in Hindi into English?
Well, we pass the following sequence:
[“start-of-transcript”, “Hindi”, “translate”, “no-timestamps”]
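In practice you rarely build these sequences by hand; with the openai-whisper package, DecodingOptions maps onto them. A sketch showing both tasks on a placeholder Hindi file, plus the underlying token sequence:

```python
import whisper
from whisper.tokenizer import get_tokenizer

model = whisper.load_model("base")
audio = whisper.pad_or_trim(whisper.load_audio("hindi_speech.mp3"))
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# Task 3: transcribe Hindi speech in Hindi, without timestamps.
transcribe_opts = whisper.DecodingOptions(language="hi", task="transcribe", without_timestamps=True)
# Task 5: translate Hindi speech into English, without timestamps.
translate_opts = whisper.DecodingOptions(language="hi", task="translate", without_timestamps=True)

print(whisper.decode(model, mel, transcribe_opts).text)
print(whisper.decode(model, mel, translate_opts).text)

# The special-token sequence the decoder is conditioned on for task 3:
tokenizer = get_tokenizer(multilingual=True, language="hi", task="transcribe")
print(tokenizer.sot_sequence_including_notimestamps)
```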
What if we want it to also predict the timestamps in addition to the text?
Well, we simply remove the "no-timestamps" token from our sequence. It becomes:
For transcription: [“start-of-transcript”, “Hindi”, “transcribe”]
For translation: [“start-of-transcript”, “Hindi”, “translate”]
The decoder will return a sequence that contains the start time of a segment, the text in that segment, and the end time of the segment. This pattern can repeat multiple times if the audio segment is relatively long and contains a lot of speech.
What if we want to transcribe an audio segment in the same language but we actually don’t know the language spoken in the audio?
Well, we don’t specify the language. We replace it with a null object.
For transcription: [“start-of-transcript”, null, “transcribe”]
For translation: [“start-of-transcript”, null, “translate”]
This way, the decoder will predict the spoken language and start transcribing (or translating) accordingly.
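With the package, this is the default behaviour when you leave the language unspecified (i.e. None): the decoder first predicts the language and then proceeds. A sketch with a placeholder file:

```python
import whisper

model = whisper.load_model("base")
audio = whisper.pad_or_trim(whisper.load_audio("unknown_language.mp3"))
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# No language is given, so Whisper detects it before transcribing.
options = whisper.DecodingOptions(task="transcribe", without_timestamps=True)
result = whisper.decode(model, mel, options)
print(result.language, result.text)  # detected language code and the transcription
```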
Prompting Whisper
There is a final important piece to this interface. I intentionally left it out from the previous illustration as I thought it might cause confusion. Now that we have a good idea about the interface, we can take a look at the original illustration given below:
We notice that there's an additional special token: prev. This token is used to pass a prompt to Whisper's decoder. The prompt and this special token are prepended to the sequence that is passed into the decoder. The prompt is similar to the prompts usually passed to LLMs to set the tone or set the stage for the predicted text. For example, we can set the prompt to "The following is a street interview" to tell Whisper that it's being asked to transcribe a street interview.
Additionally, we can pass custom vocabulary in the prompt. For example, if we want to transcribe a YouTube video about Andrej Karpathy, we can pass "Andrej Karpathy" into the prompt to encourage Whisper to correctly transcribe his name as "Andrej Karpathy" instead of generating a more common spelling like "Andrey Karpathy". Similarly, we can pass "ChatGPT" to encourage it to write it this way instead of returning "Chat Gee Pee Tee".
Finally, we can pass the predictions from the preceding audio segment to help in the decoding process. This is helpful, for example, if the previous segment says "What's the letter that comes after P?". If we pass this text in the prompt, Whisper is more likely to transcribe the current segment as "Q" instead of "queue".
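The high-level transcribe API exposes all of this through initial_prompt, and it reuses each window's output as the prompt for the next window when condition_on_previous_text is left at its default. A sketch with placeholder values:

```python
import whisper

model = whisper.load_model("base")

# Bias the decoder towards the spellings and context we care about.
result = model.transcribe(
    "karpathy_talk.mp3",                                        # placeholder file name
    initial_prompt="A talk by Andrej Karpathy about ChatGPT.",  # custom vocabulary / stage setting
    condition_on_previous_text=True,                            # feed each window's output into the next
)
print(result["text"])
```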
Recap
In summary, Whisper is a versatile AI tool that can perform various tasks, including English transcription, multilingual transcription, English translation, timestamp prediction, language prediction, and voice detection.
The model's ability to perform these tasks is made possible by a multitask interface that guides the decoder to generate the desired output. By adding a sequence of special tokens to the decoder's input, Whisper can be instructed to transcribe speech in its original language or translate it into English.
The decoder also returns a sequence containing the start and end times of spoken segments, allowing for timestamp prediction. Furthermore, Whisper can predict the language spoken in an audio segment and perform voice detection by outputting the probability of the audio segment containing spoken words. By using prompts and custom vocabulary, users can guide Whisper's output to better match their desired results.
If you have any questions or feedback, feel free to leave a comment below!