Repo showcasing an AI meeting transcription tool.
This repo showcases a basic tool for meeting transcription. It's targeted at meetings conducted in English, but with a little tweaking it could be used for other languages as well.
The tool works in a three-step process (sketched in the example after this list):
- It extracts the audio track from a given video file or YouTube link.
- It generates speaker diarization (separating the different speaker tracks) using the `pyannote/speaker-diarization-3.0` model.
- Finally, it generates a transcription using the OpenAI Whisper model. By default it uses the Whisper `base.en` version, but you can select other model sizes. The output is saved to an `output.sub` file in SubViewer format.
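
For orientation, here is a minimal Python sketch of the same three steps, assuming a local video file named `meeting.mp4` and the Hugging Face token in a `HUGGINGFACE_AUTH_TOKEN` environment variable (both names are illustrative; the tool's actual implementation lives in `Transcription.ipynb`):

```python
import os
import subprocess

import whisper
from pyannote.audio import Pipeline

VIDEO = "meeting.mp4"  # hypothetical input file
AUDIO = "audio.wav"

# Step 1: extract a mono 16 kHz audio track from the video with ffmpeg.
subprocess.run(
    ["ffmpeg", "-y", "-i", VIDEO, "-vn", "-ac", "1", "-ar", "16000", AUDIO],
    check=True,
)

# Step 2: speaker diarization with pyannote (needs the Hugging Face token,
# see the setup section below).
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.0",
    use_auth_token=os.environ["HUGGINGFACE_AUTH_TOKEN"],
)
diarization = pipeline(AUDIO)

# Step 3: transcription with Whisper; "base.en" is the default model size.
model = whisper.load_model("base.en")
result = model.transcribe(AUDIO)

# Print the speaker turns and the transcript; the real tool merges them
# into an output.sub file in SubViewer format.
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{turn.start:6.1f}s - {turn.end:6.1f}s  {speaker}")
print(result["text"])
```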
All processing is done locally on the user's machine. The model weights are downloaded to the local `~/.cache` folder (on macOS):
- Speaker Diarization 3.0 model weights: around 6 MB
- Whisper `base.en` model weights: around 300 MB
Install the following dependencies (on macOS):
- `ffmpeg` CLI - `brew install ffmpeg`
- Python 3 installation - e.g. Miniconda or the Homebrew package
- Python packages - `pip3 install -r requirements.txt`
In order to download the models used by this tool you need to:
- Generate a private Hugging Face auth token - instructions here
- Create a `.env` file inside the root repo folder with the following content (see the verification sketch after this list):
  `HUGGINGFACE_AUTH_TOKEN="your token here..."`
- Accept the Speaker diarization 3.0 model terms of service - link here
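
To confirm the token is picked up correctly, a small sketch like the following can be used, assuming the `python-dotenv` and `huggingface_hub` packages are available (whether the tool itself reads `.env` this way is an assumption):

```python
import os

from dotenv import load_dotenv      # pip3 install python-dotenv
from huggingface_hub import HfApi   # ships with the Hugging Face ecosystem

# Read HUGGINGFACE_AUTH_TOKEN from the .env file in the repo root.
load_dotenv()
token = os.environ["HUGGINGFACE_AUTH_TOKEN"]

# whoami() raises an error if the token is missing or invalid.
print("Authenticated as:", HfApi().whoami(token=token)["name"])
```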
In order to run the Web UI, just run `./web-ui.sh` in the repo folder. This should open the Web UI in your browser.
The tool can be used from JupyterLab/Notebook as well: open `Transcription.ipynb` in JupyterLab.
The speaker diarization step is the longest part of model execution. It takes roughly 30 seconds for each 1 minute of meeting audio on an M1 MacBook Pro.
- If you get the following error:
  `"Could not download 'pyannote/segmentation-3.0' model. It might be because the model is private or gated so make sure to authenticate."`
  then make sure you provided the Hugging Face auth token AND accepted the Speaker diarization 3.0 model terms of service. A quick check is sketched below.
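
To tell the two causes apart, this snippet (reusing the same assumed `.env` layout as above) asks the Hub whether your token can access the gated model; it raises a gated-repo error until the terms of service are accepted:

```python
import os

from dotenv import load_dotenv
from huggingface_hub import model_info

load_dotenv()
token = os.environ["HUGGINGFACE_AUTH_TOKEN"]

# Fails with a gated-repo error until the model's terms of service are
# accepted on its Hugging Face page; prints the model id once they are.
print(model_info("pyannote/segmentation-3.0", token=token).id)
```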