Downloading FFmpeg
FFmpeg is a versatile, open source multimedia framework for handling, converting, and manipulating audio and video files (https://ffmpeg.org/download.html).
To download FFmpeg for macOS, select the static FFmpeg binaries for 64-bit macOS from https://evermeet.cx/ffmpeg/. Download ffmpeg-6.1.1.7z, extract it, and copy the extracted ffmpeg file to your <home directory>/<new folder>/bin. In System Preferences | Security & Privacy | General, select Open Anyway. Then, double-click the ffmpeg executable file.
To download FFmpeg for Windows, select the Windows builds by BtbN at https://github.com/BtbN/FFmpeg-Builds/releases. Download ffmpeg-master-latest-win64-gpl.zip, extract it, and add the extracted ffmpeg bin folder to your PATH environment variable.
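Once FFmpeg is installed on either operating system, it can help to confirm that the executable is reachable from your PATH. The following is a minimal sketch that relies only on the Python standard library; the helper names find_ffmpeg and ffmpeg_version_line are hypothetical and are not part of the chapter's code:

```python
import shutil
import subprocess


def find_ffmpeg():
    """Return the full path of the ffmpeg executable on PATH, or None."""
    return shutil.which("ffmpeg")


def ffmpeg_version_line(executable):
    """Run `ffmpeg -version` and return the first line of its output."""
    result = subprocess.run(
        [executable, "-version"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.splitlines()[0]


ffmpeg_path = find_ffmpeg()
if ffmpeg_path is None:
    print("ffmpeg was not found on PATH; recheck the installation steps.")
else:
    print(ffmpeg_version_line(ffmpeg_path))
```

If the script reports that ffmpeg was not found, open a new terminal session so that the updated PATH is picked up, and then run the check again.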
The code for this chapter is available on GitHub at https://github.com/PacktPublishing/Data-Labeling-in-Machine-Learning-with-Python/tree/main/code/Ch11.
Azure Machine Learning
If you want to explore the Whisper model along with other machine learning models available in the Azure Machine Learning model catalog, you can create a free Azure account at https://azure.microsoft.com/en-us/free. Then, you can try Azure Machine Learning for free at https://azure.microsoft.com/en-us/products/machine-learning/.
Real-time voice classification with Random Forest
Real-time voice classification systems have become important tools across many domains. The Python script in this section implements such a system using the Random Forest classifier from scikit-learn.
The primary objective of this script is to harness the power of machine learning to differentiate between positive audio samples, indicative of human speech (voice), and negative samples, representing background noise or non-vocal elements. By employing the Random Forest classifier, a robust and widely used algorithm from the scikit-learn library, the script endeavors to create an efficient model capable of accurately classifying real-time audio input.
The real-world applications of this voice classification system are extensive, ranging from enhancing user experiences in voice-controlled smart devices to enabling automated voice commands in robotics. Industries such as telecommunications, customer service, and security can leverage real-time voice classification to enhance communication systems, automate processes, and bolster security protocols.
Whether it involves voice-activated virtual assistants, hands-free communication in automobiles, or voice-based authentication systems, the ability to classify and understand spoken language in real time is pivotal. This script provides a foundational understanding of the implementation process, laying the groundwork for developers and enthusiasts to integrate similar voice classification mechanisms into their projects and contribute to the evolution of voice-centric applications in the real world.
Let’s look at the Python script, which provides a simple framework for a real-time voice classification system using the Random Forest classifier from scikit-learn. The goal is to capture audio samples, distinguish between positive samples (voice) and negative samples (background noise or non-voice), and train a model for voice classification. The script lets you collect your own voice samples to train and test the model:
Import the Python libraries: First, let’s import the requisite libraries using the following code snippet:
import numpy as np
import sounddevice as sd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
Capture audio samples: The capture_audio function uses the sounddevice library to record real-time audio. The user is prompted to speak, and the function captures audio for a specified duration (the default is five seconds):
# Function to capture real-time audio
def capture_audio(duration=5, sampling_rate=44100):
    print("Recording...")
    audio_data = sd.rec(int(sampling_rate * duration),
                        samplerate=sampling_rate, channels=1, dtype='int16')
    sd.wait()
    return audio_data.flatten()
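To see what capture_audio returns without needing a microphone, the following sketch reproduces only the array arithmetic. It assumes the default arguments shown above and uses an array of zeros as a placeholder for the two-dimensional (frames, channels) array that sd.rec fills; the variable names are illustrative:

```python
import numpy as np

duration = 5           # seconds, the default in capture_audio
sampling_rate = 44100  # samples per second, the default in capture_audio

# Number of frames requested from sd.rec
num_frames = int(sampling_rate * duration)

# Placeholder for a mono recording: shape (frames, 1), dtype int16.
# These zeros are stand-in data, not real audio.
simulated_recording = np.zeros((num_frames, 1), dtype="int16")

# flatten() turns the (frames, 1) array into a 1-D vector
flat = simulated_recording.flatten()

print(num_frames)  # 220500
print(flat.shape)  # (220500,)
print(flat.dtype)  # int16
```

With the defaults, each recording is therefore a vector of 220,500 16-bit integers, which is the form in which the sample is later handed to the classifier.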
Collect training data: The collect_training_data function gathers training data for voice and non-voice samples. For positive samples (voice), the user is prompted to speak, and audio data is recorded using the capture_audio function. For negative samples (background noise or non-voice), the user is prompted to create ambient noise without speaking:
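# Function to collect training data
def collect_training_data(num_samples=10, label=0):
    X = []
    y = []
    # Ask for speech when collecting voice samples (label 1) and for
    # ambient noise otherwise, as described above
    if label == 1:
        prompt = "Press Enter and speak for a few seconds..."
    else:
        prompt = "Press Enter and make ambient noise without speaking..."
    for _ in range(num_samples):
        input(prompt)
        audio_sample = capture_audio()
        X.append(audio_sample)
        y.append(label)
    return np.vstack(X), np.array(y)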
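The return statement of collect_training_data stacks the individual recordings into a feature matrix with one row per sample. The sketch below illustrates this with randomly generated int16 arrays standing in for real recordings; the sample count of 3 and the random data are assumptions made for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)
sample_length = 220500  # 5 seconds at 44100 Hz
num_samples = 3         # hypothetical; the chapter collects 10 per class

# Random int16 vectors as stand-ins for flattened recordings
samples = [
    rng.integers(-32768, 32767, size=sample_length, dtype=np.int16)
    for _ in range(num_samples)
]
labels = [1] * num_samples

X = np.vstack(samples)  # one row per recording
y = np.array(labels)    # one label per recording

print(X.shape)  # (3, 220500)
print(y.shape)  # (3,)

# np.vstack requires every recording to have the same length
try:
    np.vstack([samples[0], samples[1][:1000]])
except ValueError as error:
    print("Recordings of unequal length cannot be stacked:", error)
```

Because every call to capture_audio uses the same duration and sampling rate, all recordings have equal length and the stacking succeeds. If you vary the duration between calls, you would need to trim or pad the recordings first.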
# Main program
class VoiceClassifier:
    def __init__(self):
        self.model = RandomForestClassifier()

    def train(self, X_train, y_train):
        self.model.fit(X_train, y_train)

    def predict(self, X_test):
        return self.model.predict(X_test)

# Collect positive samples (voice)
positive_X, positive_y = collect_training_data(num_samples=10, label=1)
# Collect negative samples (background noise or non-voice)
negative_X, negative_y = collect_training_data(num_samples=10, label=0)
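To try the classifier without recording anything, you can substitute synthetic arrays for the collected samples. The following standalone sketch is one possible way to combine the two classes, split them, and train the VoiceClassifier. The synthetic data, the shortened sample length, the 80/20 split, and the fixed random seeds are all assumptions made for illustration, and the resulting accuracy says nothing about performance on real audio:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


class VoiceClassifier:
    def __init__(self):
        # random_state is fixed here only to make the sketch repeatable
        self.model = RandomForestClassifier(random_state=0)

    def train(self, X_train, y_train):
        self.model.fit(X_train, y_train)

    def predict(self, X_test):
        return self.model.predict(X_test)


rng = np.random.default_rng(42)
num_samples = 10
sample_length = 2000  # shortened stand-in for the 220500-sample recordings

# Synthetic stand-ins: loud noise for "voice", quiet noise for "background"
positive_X = rng.normal(0, 3000, size=(num_samples, sample_length)).astype(np.int16)
positive_y = np.ones(num_samples, dtype=int)
negative_X = rng.normal(0, 100, size=(num_samples, sample_length)).astype(np.int16)
negative_y = np.zeros(num_samples, dtype=int)

# Combine both classes into one feature matrix and one label vector
X = np.vstack([positive_X, negative_X])
y = np.concatenate([positive_y, negative_y])

# Hold out 20% of the samples, keeping both classes in each split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

classifier = VoiceClassifier()
classifier.train(X_train, y_train)
predictions = classifier.predict(X_test)

accuracy = accuracy_score(y_test, predictions)
print(f"Accuracy on synthetic data: {accuracy:.2f}")
```

With only 20 samples in total, the held-out set contains just four recordings, so the accuracy figure is coarse. Collecting more samples per class gives a more meaningful estimate.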