Whisper is designed to transcribe audio, but it requires a specific format for processing: WAV. Whisper may not directly support other formats, so the audio you want to transcribe should be provided as WAV. FFmpeg acts as a bridge by converting various audio formats (such as MP3 or AAC) into a format that Whisper can handle. For example, if the input is an MP3 file, FFmpeg can convert it to WAV as part of the pipeline so that the audio ends up in a format compatible with the requirements of the Whisper model. Without this conversion, Whisper wouldn’t be able to process the audio effectively.
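As an illustration, the following snippet is a minimal sketch of such a conversion, calling FFmpeg from Python’s subprocess module. It assumes the ffmpeg executable is available on your system path, and the filenames customer_call.mp3 and customer_call.wav are placeholders for your own files:
import subprocess
# Convert a compressed file (MP3 here) into 16 kHz mono WAV, which Whisper can consume
subprocess.run([
    'ffmpeg',
    '-i', 'customer_call.mp3',   # placeholder input file (MP3, AAC, and so on)
    '-ar', '16000',              # resample to 16 kHz
    '-ac', '1',                  # downmix to a single (mono) channel
    '-c:a', 'pcm_s16le',         # 16-bit PCM, that is, standard WAV encoding
    'customer_call.wav'          # placeholder output WAV file for Whisper
], check=True)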
In scenarios where real-time transcription is needed (such as streaming a Real-Time Messaging Protocol (RTMP) feed), FFmpeg helps segment the audio stream. It splits the continuous audio into smaller chunks (e.g., 30-second segments) that can be processed individually, and each segment is then passed to Whisper for transcription; a sketch of this chunking appears after the next snippet. First, make sure that Python can find the ffmpeg executable:
import os
# Add the directory containing the ffmpeg executable to the PATH environment variable
os.environ['PATH'] = '/<your_path>/audio-orchestrator-ffmpeg/bin:' + os.environ['PATH']
The code prepends the directory containing the ffmpeg executable to the PATH environment variable so that FFmpeg can be located when audio and video files are handled.
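As a rough illustration of the chunking described earlier, FFmpeg’s segment muxer can cut a recording into fixed-length WAV files that are then transcribed one by one. The snippet below is only a sketch: the input filename stream_capture.aac and the chunk naming pattern are placeholders, and FFmpeg can also read an rtmp:// URL directly as the input:
import subprocess
# Split a long recording into 30-second, 16 kHz mono WAV chunks for Whisper
subprocess.run([
    'ffmpeg',
    '-i', 'stream_capture.aac',  # placeholder input; could also be an rtmp:// URL
    '-f', 'segment',             # use FFmpeg's segment muxer
    '-segment_time', '30',       # start a new output file every 30 seconds
    '-ar', '16000',              # resample to 16 kHz
    '-ac', '1',                  # downmix to mono
    '-c:a', 'pcm_s16le',         # 16-bit PCM (WAV) encoding
    'chunk_%03d.wav'             # chunk_000.wav, chunk_001.wav, ...
], check=True)
Each resulting chunk_*.wav file can then be passed to Whisper’s transcribe function in turn.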
Step 4 – transcribing the YouTube audio using the Whisper model
Now, let’s transcribe the YouTube audio using the Whisper model:
model = whisper.load_model('base')
text = model.transcribe('Mel Spectrograms with Python and Librosa Audio Feature Extraction.mp4')
# Print the transcribed text
text['text']
Here’s the output:
Figure 11.4 – A snippet of the code output
The Whisper model is loaded again to ensure that the base model is used. The transcribe function is called on the model with the filename of the audio file as its argument, and the resulting transcribed text is displayed using text['text'].
Note
The filename provided to model.transcribe is Mel Spectrograms with Python and Librosa Audio Feature Extraction.mp4. Make sure this file exists and is accessible so that the code can transcribe it successfully.
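If you also need timestamps, the dictionary returned by transcribe contains a segments list alongside the plain text. The following sketch reuses the text variable from the previous snippet and prints each segment with its start and end times:
# Each segment carries start/end times (in seconds) and the text spoken in that window
for segment in text['segments']:
    print(f"{segment['start']:.1f}s - {segment['end']:.1f}s: {segment['text']}")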
Now, let’s see another code example on how to transcribe an audio file to text:
model = whisper.load_model('base')
text = model.transcribe('/Users/<username>/PacktPublishing/DataLabeling/Ch11/customer_call_audio.m4a')
# Print the transcribed text
text['text']
Here is the output:
‘ Hello, I have not received the product yet.
I am very disappointed.
Are you going to replace if my product is damaged or missed?
I will be happy if you replace with your product in case I miss the product due to incorrect shipping address.’
Now, let’s perform sentiment analysis to label this text transcribed from a customer call.