*This notebook requires a working PyTorch GPU environment.*

# OpenAI's Whisper

Original notebook: https://github.com/fastforwardlabs/whisper-openai/blob/master/WhisperDemo.ipynb

Whisper is OpenAI's speech-to-text model.

More information:
- https://openai.com/blog/whisper
- https://github.com/openai/whisper




In [3]:
%%capture
# install dependencies

! pip install git+https://github.com/openai/whisper.git
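
Whisper's `load_audio` calls the `ffmpeg` binary, which `pip` does not install. A minimal sketch of a pre-flight check, using only the standard library (`ffmpeg_available` is a hypothetical helper name, not part of Whisper):

```python
import shutil


def ffmpeg_available() -> bool:
    """Return True if an `ffmpeg` executable is found on PATH."""
    return shutil.which("ffmpeg") is not None


if not ffmpeg_available():
    print("ffmpeg not found on PATH; install it before loading audio.")
```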

In [4]:
%%capture
# import libraries and select the device (CUDA if available, otherwise CPU)
import os
import numpy as np

try:
    import tensorflow
except ImportError:
    pass

import torch
import pandas as pd
import whisper
import torchaudio

from ipywebrtc import AudioRecorder, CameraStream
from IPython.display import Audio, display
import ipywidgets as widgets

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

2023-10-13 09:14:42.948361: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.


In [5]:
# record audio via the browser (microphone only); skip this section if you want to use an existing audio file
camera = CameraStream(constraints={'audio': True,'video':False})
recorder = AudioRecorder(stream=camera)
recorder

AudioRecorder(audio=Audio(value=b'', format='webm'), stream=CameraStream(constraints={'audio': True, 'video': …

In [6]:
# save recording as file and convert to wav
with open('recording.webm', 'wb') as f:
    f.write(recorder.audio.value)
!ffmpeg -i recording.webm -ac 1 -f wav my_recording.wav -y -hide_banner -loglevel panic
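
If nothing has been recorded yet, `recorder.audio.value` is empty and the cell above writes a zero-byte file. A hedged sketch of a guard, plus the same conversion flags built as an argument list for `subprocess.run`; `save_recording` and `ffmpeg_command` are illustrative names, not part of `ipywebrtc` or Whisper:

```python
from pathlib import Path


def save_recording(data: bytes, path: str = "recording.webm") -> Path:
    """Write recorded bytes to disk, refusing to create an empty file."""
    if not data:
        raise ValueError("No audio captured; record something first.")
    target = Path(path)
    target.write_bytes(data)
    return target


def ffmpeg_command(src: str, dst: str) -> list:
    """Build the mono WAV conversion command as an argument list."""
    return [
        "ffmpeg", "-y", "-hide_banner", "-loglevel", "panic",
        "-i", src, "-ac", "1", "-f", "wav", dst,
    ]


# In the notebook this would be followed by, for example:
#   save_recording(recorder.audio.value)
#   subprocess.run(ffmpeg_command("recording.webm", "my_recording.wav"), check=True)
print(ffmpeg_command("recording.webm", "my_recording.wav"))
```

Passing `check=True` to `subprocess.run` would raise an error if the conversion fails, whereas the `!ffmpeg` shell call fails silently at `-loglevel panic`.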

In [7]:
# Whisper can transcribe many languages, though accuracy varies from language to language.
# It can also detect the input language automatically.
# To be on the safe side, we explicitly tell Whisper which language to expect.
language_options = whisper.tokenizer.TO_LANGUAGE_CODE 
language_list = list(language_options.keys())
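
`TO_LANGUAGE_CODE` maps lowercase language names to short language codes. The lookup can be sketched without Whisper installed; the three-entry `SAMPLE_CODES` table below is a stand-in for the real dictionary, and `to_code` is a hypothetical helper:

```python
# Stand-in for whisper.tokenizer.TO_LANGUAGE_CODE (only three entries shown).
SAMPLE_CODES = {"english": "en", "german": "de", "spanish": "es"}


def to_code(language: str, table: dict = SAMPLE_CODES) -> str:
    """Resolve a language name (any case) or an existing code to a code."""
    key = language.strip().lower()
    if key in table:
        return table[key]
    if key in table.values():
        return key
    raise ValueError(f"Unsupported language: {language!r}")


print(to_code("German"))
```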

In [8]:
# Select the spoken language of the audio.
# Whisper supports several tasks: English-only transcription, any-to-English translation,
# and non-English transcription. The task is chosen in the next cell.
lang_dropdown = widgets.Dropdown(options=language_list, value='english')
output = widgets.Output()
display(lang_dropdown)

Dropdown(options=('english', 'chinese', 'german', 'spanish', 'russian', 'korean', 'french', 'japanese', 'portu…

In [9]:
# 'transcribe' keeps the spoken language; 'translate' produces English text
task_dropdown = widgets.Dropdown(options=['transcribe', 'translate'], value='transcribe')
output = widgets.Output()
display(task_dropdown)

Dropdown(options=('transcribe', 'translate'), value='transcribe')

In [11]:
# Load the model (takes a few seconds).
# Hint: Whisper comes in five model sizes,
# four of which also have an optimized English-only version.
# This notebook loads the "base" size (bigger than "tiny" but smaller than the others),
# which requires about 1 GB of RAM.

# If you selected English above, this cell loads the optimized English-only version.
# Otherwise, it loads the multilingual model.

if lang_dropdown.value == "english":
    model = whisper.load_model("base.en", device=DEVICE)
else:
    model = whisper.load_model("base", device=DEVICE)
print(
    f"Model is {'multilingual' if model.is_multilingual else 'English-only'} "
    f"and has {sum(np.prod(p.shape) for p in model.parameters()):,} parameters."
)

Model is English-only and has 71,825,408 parameters.
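
The choice between the English-only and multilingual checkpoints can be wrapped in a small helper. This is a sketch: `pick_model_name` is a hypothetical function, and it assumes the `.en` suffix naming used in the cell above, with English-only variants for the `tiny`, `base`, `small`, and `medium` sizes:

```python
# Sizes assumed to ship an English-only ".en" variant.
ENGLISH_ONLY_SIZES = {"tiny", "base", "small", "medium"}


def pick_model_name(language: str, size: str = "base") -> str:
    """Return e.g. 'base.en' for English, 'base' for any other language."""
    if language.lower() == "english" and size in ENGLISH_ONLY_SIZES:
        return f"{size}.en"
    return size


print(pick_model_name("english"))
print(pick_model_name("german"))
print(pick_model_name("english", "large"))
```

The returned name would then be passed to `whisper.load_model`, which keeps the size in one place if you want to try a larger model.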


In [12]:
# set the options
options = whisper.DecodingOptions(language=lang_dropdown.value, task=task_dropdown.value, without_timestamps=True)
options

DecodingOptions(task='transcribe', language='english', temperature=0.0, sample_len=None, best_of=None, beam_size=None, patience=None, length_penalty=None, prompt=None, prefix=None, suppress_tokens='-1', suppress_blank=True, without_timestamps=True, max_initial_timestamp=1.0, fp16=True)
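
The printed options show `fp16=True`, which was not set explicitly and is therefore the default. Half precision is intended for GPU inference, so on a CPU-only machine it may be safer to turn it off. A sketch under that assumption, with `decoding_kwargs` as a hypothetical helper whose result would be passed as `whisper.DecodingOptions(**kwargs)`:

```python
def decoding_kwargs(language: str, task: str, device: str) -> dict:
    """Collect keyword arguments for whisper.DecodingOptions."""
    if task not in ("transcribe", "translate"):
        raise ValueError(f"Unknown task: {task!r}")
    return {
        "language": language,
        "task": task,
        "without_timestamps": True,
        # Assumption: use half precision only when running on a GPU.
        "fp16": device == "cuda",
    }


kwargs = decoding_kwargs("english", "transcribe", "cpu")
print(kwargs)
```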

In [13]:
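# choose your audio file
# audio = whisper.load_audio("my_recording.wav")  # the recording made above
audio = whisper.load_audio("QA-01.mp3")
# pad or trim to Whisper's 30-second input window (longer audio is cut off)
audio = whisper.pad_or_trim(audio)
# compute the log-Mel spectrogram and move it to the model's device
mel = whisper.log_mel_spectrogram(audio).to(model.device)
# decode the spectrogram into text using the options set above
result = model.decode(mel, options)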
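
`pad_or_trim` fits the audio to Whisper's 30-second input window, so anything beyond the first 30 seconds of a file is dropped before decoding. One way to cover a longer file is to decode it window by window. The sketch below only computes the window boundaries; `window_bounds` is a hypothetical helper, and 16 kHz is the sample rate that `load_audio` resamples to:

```python
SAMPLE_RATE = 16_000      # samples per second after whisper.load_audio
WINDOW_SECONDS = 30       # length of one Whisper input window


def window_bounds(n_samples: int) -> list:
    """Split n_samples into consecutive (start, end) windows of 30 s."""
    step = SAMPLE_RATE * WINDOW_SECONDS
    return [(start, min(start + step, n_samples))
            for start in range(0, n_samples, step)]


# A 70-second clip yields two full windows and one 10-second remainder.
print(window_bounds(70 * SAMPLE_RATE))
```

Each slice `audio[start:end]` would then go through `pad_or_trim`, `log_mel_spectrogram`, and `model.decode` as in the cell above. Fixed boundaries can cut words in half; the higher-level `model.transcribe` method handles long audio itself and may be the simpler choice.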

In [14]:
# print the text
result.text

'How many people are there in your family? There are five people in my family. My father, mother, brother, sister, and me. Does your family live in a house or an apartment? We live in a house in the countryside. What does your father do? My father is a doctor. He works at the local hospital. How old is your mother? She is 40 years old, one year younger than my father.'

In [16]:
# or write it to a text file (the file is closed automatically)
with open("output.txt", "w", encoding="utf-8") as text_file:
    text_file.write(result.text)
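
When several recordings are transcribed, it can help to store the settings next to the text. A minimal sketch using only the standard library; `save_transcript` and the JSON layout are illustrative choices, not something Whisper provides:

```python
import json
import tempfile
from pathlib import Path


def save_transcript(text: str, language: str, task: str, path: str) -> Path:
    """Write the transcript and its settings as UTF-8 JSON."""
    record = {"language": language, "task": task, "text": text.strip()}
    target = Path(path)
    target.write_text(json.dumps(record, ensure_ascii=False, indent=2),
                      encoding="utf-8")
    return target


# Demonstration in a temporary directory so no file is left behind.
with tempfile.TemporaryDirectory() as tmp:
    out = save_transcript(" Hello world. ", "english", "transcribe",
                          str(Path(tmp) / "output.json"))
    loaded = json.loads(out.read_text(encoding="utf-8"))
print(loaded["text"])
```

In the notebook, the call would look like `save_transcript(result.text, lang_dropdown.value, task_dropdown.value, "output.json")`.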

In [17]:
# close all widgets
from ipywidgets import Widget
Widget.close_all()

In [18]:
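# the model takes around 2 GB of GPU memory, so free it when you are done
# (this resets the CUDA context; restart the kernel before running GPU code again)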
from numba import cuda
device = cuda.get_current_device()
device.reset()
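
Resetting the device through Numba tears down the CUDA context, so PyTorch in the same kernel will likely not be usable afterwards. A gentler alternative, sketched here under the assumption that no other references to the model remain, is to drop the model and ask PyTorch to release its cached memory. `release_gpu_cache` is a hypothetical helper, and the import is guarded so the snippet also runs where PyTorch is absent:

```python
import gc

try:
    import torch
except ImportError:  # allows the sketch to run without PyTorch installed
    torch = None


def release_gpu_cache() -> bool:
    """Collect garbage and empty PyTorch's CUDA cache; True if a GPU was cleared."""
    gc.collect()
    if torch is not None and torch.cuda.is_available():
        torch.cuda.empty_cache()
        return True
    return False


# In the notebook, run `del model` first so the weights become collectable.
print(release_gpu_cache())
```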