
Audio

Compression

  • EnCodec SOTA deep learning based audio codec supporting both mono 24 kHz audio and stereo 48 kHz audio

Multiple Tasks

  • audio-webui A web-based UI for various audio-related Neural Networks with features like text-to-audio, voice cloning, and automatic-speech-recognition using Bark, AudioLDM, AudioCraft, RVC, coqui-ai and Whisper
  • tts-generation-webui for all things TTS, currently supports Bark v2, MusicGen, Tortoise, Vocos
  • Speechbrain A PyTorch-based Speech Toolkit for TTS, STT, etc
  • Nvidia NeMo TTS, LLM, Audio Synthesis framework
  • speech-rest-api for Speech-To-Text and Text-To-Speech with Whisper and Speechbrain
  • LangHelper language learning through text-to-speech + ChatGPT + speech-to-text to practise speaking assessments, memorizing words and listening tests
  • Silero-models pre-trained speech-to-text, text-to-speech and text-enhancement for ONNX, PyTorch, TensorFlow, SSML
  • AI-Waifu-Vtuber an AI Waifu Vtuber that acts as a virtual streamer. Supports multiple languages and uses VoiceVox, DeepL, Whisper, Silero TTS, and VtubeStudio, and now also supports Twitch streaming.
  • Voicebox large-scale text-guided generative speech model using non-autoregressive flow-matching, paper, demo, pytorch implementation, implementation
  • Auto-Synced-Translated-Dubs Automatic YouTube video speech to text, translation, text to speech in order to dub a whole video
  • SeamlessM4T Foundational Models for SOTA Speech and Text Translation
  • Amphion a toolkit for Audio, Music, and Speech Generation supporting TTS, SVS, VC, SVC, TTA, TTM
  • voicefixer restores human speech regardless of how severely it is degraded
  • VoiceCraft clone and edit an unseen voice from a few seconds of example audio, with Text-to-Speech capabilities
  • audapolis an audio/video editor for spoken word media editing like a text editor using speech recognition
  • CosyVoice is a multilingual voice generation model that supports inference, training, and deployment, with zero-shot and cross-lingual voice cloning and instruction-following capabilities, plus features like Flow matching training, Repetition Aware Sampling inference, and streaming inference mode
  • Speech-AI-Forge is a gradio GUI and API server supporting multiple tasks and models such as ChatTTS, FishSpeech, CosyVoice, FireRedTTS for TTS, Whisper for ASR, and OpenVoice for voice conversion, with functionalities like speaker switching, style controls, long text inference, SSML scripting, and voice creation
  • voice-pro Gradio GUI for audio processing using whisper supporting YouTube Downloading, voice separation via UVR5, Speech recognition via Whisper, faster-whisper and whisper-timestamped, voice cloning via F5-TTS and E2-TTS, TTS via Edge-TTS

Speech Recognition

  • Whisper SOTA local open-source speech recognition in many languages and translation into English
    • Whisper JAX implementation runs around 70x faster on CPU, GPU and TPU
    • whisper.cpp C/C++ port for Intel and ARM based macOS, Android, iOS, Linux, WebAssembly, Windows, Raspberry Pi
    • faster-whisper-livestream-translator A buggy proof of concept for real-time translation of livestreams using Whisper models, with suggestions for improvements including noise reduction and dual language subtitles
    • Buzz Mac GUI for Whisper
    • whisperX Fast automatic speech recognition (70x realtime with large-v2) using OpenAI's Whisper, word-level timestamps, speaker diarization, and voice activity detection
    • distil-whisper a distilled version of Whisper that is 6 times faster, 49% smaller, and performs within 1% word error rate (WER) on out-of-distribution evaluation sets, paper, model, hf
    • whisper-turbo a fast, cross-platform Whisper implementation, designed to run entirely client-side in your browser/electron app.
    • faster-whisper Faster Whisper transcription with CTranslate2
    • whisper-ctranslate2 a command line client based on faster-whisper and compatible with the original client from openai/whisper
    • whisper-diarization a speaker diarization tool that is based on faster-whisper and NVIDIA NeMo
    • whisper-standalone-win portable ready to run binaries of faster-whisper for Windows
    • asr-sd-pipeline scalable, modular, end to end multi-speaker speech to text solution implemented using AzureML pipelines
    • insanely-fast-whisper opinionated CLI to transcribe audio to text using whisper v3 on edge devices using optimum and flash attention
    • insanely-fast-whisper-cli The fastest Whisper optimization for automatic speech recognition as a command-line interface
    • WhisperLive real time transcription using voice activity detection and TensorRT or FasterWhisper backends
    • Whisper Medusa speed improvements through multi-token prediction per iteration while maintaining nearly the same quality
  • ermine-ai Whisper in the browser using transformers.js
  • wav2vec2 dimensional emotion model
  • MeetingSummarizer using Whisper and GPT3.5
  • Facebook MMS: Speech recognition of over 1000 languages
  • Moonshine Speech to text models optimized for fast and accurate inference on edge devices outperforming Whisper
  • RealtimeSTT is a low-latency real time speech-to-text library, with advanced voice activity detection, wake word activation, and instant transcription using a combination of WebRTCVAD, SileroVAD, Faster_Whisper, and Porcupine or OpenWakeWord
  • FunASR speech recognition toolkit supports training, fine-tuning of models, offers features like ASR, VAD, Punctuation Restoration, Language Models, Speaker Verification, Diarization, multi-talker ASR, provides pre-trained models including Paraformer-large
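
Several entries above compare models by word error rate (WER). As a rough illustration of what that number measures, here is a minimal sketch that computes WER as the word-level edit distance divided by the number of reference words. It assumes plain whitespace tokenization and no text normalization (published evaluations usually normalize case and punctuation first), and the function name `word_error_rate` is hypothetical, not taken from any of the listed projects.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.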

voice activity detection (VAD):

  • Silero-VAD pre-trained enterprise-grade real-time Voice Activity Detector
  • libfvad fork of WebRTC VAD engine as a standalone library independent from other WebRTC features
  • voice_activity_detection Voice Activity Detection based on Deep Learning & TensorFlow
  • rVADfast unsupervised, robust voice activity detection

subtitle generation:

  • subtitler on-device web app for audio transcribing and rendering subtitles
  • pyvideotrans is a video translation and voiceover tool supporting STT, translation, TTS synthesis and audio separation, capable of translating videos into multiple languages while retaining background audio, and offering functionalities such as subtitle creation, batch translation, and audio-video merging
  • Whisper-WebUI Video Subtitle Generation via Gradio Interface supporting whisper, faster-whisper, insanely-fast-whisper, SRT, WebVTT, translation with Facebook NLLB or DeepL, Preprocessing via Silero VAD, UVR audio separation and speaker diarization via pyannote
  • noScribe Windows Mac and Linux GUI for audio transcription supporting whisper, faster-whisper, pyannote with built in GUI Editor
  • Vibe GUI for audio transcription supporting batch mode, SRT, VTT, HTML, JSON, realtime preview, summarization via Claude or Ollama, Whisper translation to English, custom Whisper models, transcription of system audio or microphone, CLI support, diarization, Swagger API, optimized for GPU (Mac, Windows, Linux) supporting Nvidia, AMD, Intel GPUs and Vulkan or CoreML
  • buzz Mac, Windows and Linux native GUI for whisper, whisper.cpp, faster-whisper and whisper-API supporting audio, microphone, txt, srt, vtt, transcription and translation and CLI
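
Most of the tools above can export SRT subtitles. As a rough sketch of what that output looks like, the snippet below turns timed transcript segments into SRT text. The `(start_seconds, end_seconds, text)` tuple layout and the helper names `srt_timestamp` and `to_srt` are assumptions made for illustration and are not the API of any listed project.

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time offset in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    if seconds < 0:
        raise ValueError("timestamps must be non-negative")
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, millis = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def to_srt(segments) -> str:
    """Render (start_seconds, end_seconds, text) tuples as SRT cues."""
    blocks = []
    for index, (start, end, text) in enumerate(segments, start=1):
        # Each cue: running number, time range, text, then a blank line.
        blocks.append(
            f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text.strip()}\n"
        )
    return "\n".join(blocks)


if __name__ == "__main__":
    print(to_srt([(0.0, 1.25, "Hello"), (1.25, 2.0, "world")]))
```

WebVTT output differs mainly in its `WEBVTT` header line and in using a period instead of a comma before the milliseconds.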

TextToSpeech

Voice Conversion

Video Voice Dubbing

  • weeablind dub multilingual media using modern AI speech synthesis, diarization, and language identification
  • Auto-synced-translated-dubs YouTube audio translation and dubbing pipeline using Whisper speech-to-text, Google/DeepL Translate, Azure/Google TTS
  • videodubber dub video using GCP TTS, Translate, Whisper, Spacy tokenization and syllable counting
  • TranslatorYouTuber Takes a YouTube video, clones the voice and re-creates that video in a different language
  • global-video-dubbing Using Google Cloud Video Intelligence API with Cloud Translation API and Cloud Text to Speech API to generate voice dubbing and translations in many languages automatically
  • wav2lip Lip Syncing from audio
  • video-retalking Audio-based Lip Synchronization for Talking Head Video Editing In the Wild
  • Wunjo AI Synthesize & clone voices in English, Russian & Chinese, real-time speech recognition, deepfake face & lips animation, face swap with one photo, change video by text prompts, segmentation, and retouching. Open-source, local & free
  • YouTranslate Takes a YouTube video, clones the voice with the ElevenLabs API, translates the text with the Google Translate API and re-creates that video in a different language
  • audio2photoreal Photoreal Embodiment by Synthesizing Humans including pose, hands and face in Conversations
  • TurnVoice Dubbing via CoquiTTS, ElevenLabs, OpenAI or Azure Voices, Translation, Speaking Style changer, Precise control via Editor, Background Audio Preservation
  • pyvideotrans is a video translation and voiceover tool supporting STT, translation, TTS synthesis and audio separation, capable of translating videos into multiple languages while retaining background audio, and offering functionalities such as subtitle creation, batch translation, and audio-video merging
  • SoniTranslate is a gradio based GUI for video translation and dubbing, using the OpenAI API for transcription, translation, and TTS, and supporting various output formats and multi-speaker TTS, with features like vocal enhancement, voice imitation, and extensive language support
  • VideoLingo subtitle transcription and audio dubbing using WhisperX, LLMs, GPT-SoVITS

Music Generation

  • audiocraft library for audio processing and generation with deep learning using EnCodec compressor / tokenizer and MusicGen support
    • audiocraft-infinity-webui webui supporting generation longer than 30 seconds, song continuation, seed option, load local models from chavinlo's training repo, MacOS/linux support, running on CPU/gpu
    • musicgen_trainer simple trainer for musicgen/audiocraft
    • audiocraft-webui basic webui with support for long audio, segmented audio and processing queue
    • audiocraft-webui another basic webui, unknown feature set
    • MusicGeneration a streamlit gui for audiocraft and musicgen
    • audiocraftgui with wxPython supporting continuous generation by using chunks and overlaps
    • MusicGen a simple and controllable model for music generation using a Transformer model examples, colab, colab collection
    • audiocraft-infinity-webui generation length over 30 seconds, ability to continue songs, seeds, allows to load local models
    • AudioCraft Plus an all-in-one WebUI for the original AudioCraft, adding multiband diffusion, continuation, custom model support, mono to stereo and more
  • AudioLDM Generate speech, sound effects, music and beyond, with text code, paper, HF demo
  • StableAudio Stability AI's Stable Audio only providing Training and Inference code, no models
  • SoundStorm-Pytorch a Pytorch implementation of Google Deepmind's SoundStorm, applying MaskGiT to residual vector quantized codes from Soundstream, using a Conformer transformer architecture for efficient parallel audio generation from text instructions

Audio Source Separation

  • Separate Anything You Describe Describe what you want to isolate from audio, Language-queried audio source separation (LASS), paper
  • Hybrid-Net Real-time audio source separation, generate lyrics, chords, beat by lamucal.ai
  • TubeSplitter Web application to extract and separate audio stems from YouTube videos using Flask, pytube, and spleeter
  • demucs Hybrid Transformer based source separation
    • streamstem web app utilizing yt-dlp, spotify-api and demucs for an end to end audio source separation pipeline
    • moseca web app for Music Source Separation & Karaoke utilizing demucs
    • MISST native windows GUI for demucs supporting youtube, spotify and files

Research

  • Vocos Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis
  • WavJourney Compositional Audio Creation with LLMs github
  • PromptingWhisper Audio-Visual Speech Recognition, Code-Switched Speech Recognition, and Zero-Shot Speech Translation for Whisper
  • Translatotron 3 Unsupervised speech-to-speech translation from monolingual data

Benchmarks