Microsoft has open-sourced VibeVoice, a frontier voice AI family that transcribes 60 minutes of continuous audio in a single pass, generates up to 90 minutes of speech with multiple speakers, and streams text-to-speech with about 300 ms of latency to first audio. All under the MIT license. All free.
🔍 THE BOTTOM LINE: Voice AI that used to require expensive API subscriptions can now run locally for free. VibeVoice handles transcription, speaker identification, and speech generation — all in one open-source package.
🎙️ What VibeVoice Does
VibeVoice is three models in one family:
VibeVoice ASR — Speech Recognition:
- Processes 60 minutes of continuous audio in a single pass
- No chunking — global context is never lost
- Identifies WHO spoke, WHEN they spoke, and WHAT they said simultaneously
- Custom hotwords for domain-specific accuracy
- 50+ languages natively
- Already integrated into Hugging Face Transformers
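The who/when/what output described above maps naturally onto a list of timestamped, speaker-labelled segments. The sketch below uses a hypothetical schema: the `Segment` class, its field names, and the example data are assumptions for illustration, not VibeVoice's actual output format. It shows how such segments could be rendered as readable notes.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One diarised utterance. Hypothetical schema, not VibeVoice's real output."""
    speaker: str   # WHO spoke
    start: float   # WHEN they started, in seconds
    end: float     # WHEN they stopped, in seconds
    text: str      # WHAT they said


def fmt_time(seconds: float) -> str:
    """Format a time offset in seconds as MM:SS (fractions are truncated)."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"


def to_transcript(segments: list[Segment]) -> str:
    """Render segments as one line per utterance, sorted by start time."""
    ordered = sorted(segments, key=lambda s: s.start)
    return "\n".join(
        f"[{fmt_time(s.start)}-{fmt_time(s.end)}] {s.speaker}: {s.text}"
        for s in ordered
    )


# Made-up example data, deliberately out of order.
example = [
    Segment("Speaker 2", 4.2, 9.8, "Thanks, happy to be here."),
    Segment("Speaker 1", 0.0, 4.0, "Welcome to the show."),
]
print(to_transcript(example))
```

Whatever format the model really emits, converting it into a small structure like this one first keeps downstream steps (meeting notes, subtitles, search) independent of the model.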
VibeVoice TTS — Text to Speech:
- Generates up to 90 minutes of speech in a single pass
- Supports up to 4 distinct speakers in one conversation
- Natural turn-taking and speaker consistency
- Expressive speech capturing emotional nuances
- English, Chinese, and multiple other languages
VibeVoice Realtime — Streaming TTS:
- 300ms first audible latency
- Streams text input in real time
- 0.5B parameters, small enough to deploy on modest hardware
- 10-minute long-form generation
- Lightweight enough for production today
🔬 The Innovation
Most voice AI models slice long audio into short chunks. Every slice loses context. Speaker tracking breaks. Semantic coherence breaks. Accuracy drops.
VibeVoice uses continuous speech tokenizers running at an ultra-low frame rate of 7.5 Hz. Fewer frames per second means a far shorter sequence for the same stretch of audio, which preserves audio fidelity while dramatically boosting computational efficiency. The entire 60 minutes stays in context, so speaker tracking and meaning are not reset at chunk boundaries.
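To get a feel for what 7.5 Hz means in practice, here is a small back-of-the-envelope calculation. It only multiplies duration by frame rate; the 50 Hz comparison value is an assumption chosen for illustration, not a figure for any particular model.

```python
FRAME_RATE_HZ = 7.5  # tokenizer frame rate quoted in the article


def frames_for(minutes: float, frame_rate_hz: float = FRAME_RATE_HZ) -> int:
    """Number of tokenizer frames needed to cover `minutes` of audio."""
    return round(minutes * 60 * frame_rate_hz)


print(frames_for(60))                      # 60-minute ASR input  -> 27000
print(frames_for(90))                      # 90-minute TTS output -> 40500
# Same hour at a hypothetical 50 Hz rate, for comparison only.
print(frames_for(60, frame_rate_hz=50.0))  # -> 180000
```

An hour of audio at 7.5 Hz is 27,000 frames, a sequence length that a long-context model can hold at once without chunking.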
This is a genuine architectural advance, not just a bigger model trained on more data.
💰 What This Replaces
| Feature | VibeVoice (Free) | Commercial Alternative | Cost or Trade-off (USD) |
|---|---|---|---|
| 60-min transcription | ✅ Single pass | Otter.ai Pro | $17/month |
| Speaker diarisation | ✅ Built-in | AssemblyAI | $0.00054/s (usage-based) |
| Multi-speaker TTS | ✅ 4 speakers | ElevenLabs | $5–330/month |
| Real-time streaming | ✅ 300ms | PlayHT | $20+/month |
| On-device | ✅ Local | Most are cloud-only | Privacy risk |
Running all of this through commercial APIs could cost hundreds of dollars per month. VibeVoice does it for free, on your hardware, with no data leaving your machine.
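Per-second pricing is hard to compare with monthly plans at a glance, so here is a rough conversion. It uses only the $0.00054/s rate from the table above; the monthly audio volumes are made-up examples, and real invoices will depend on each provider's current pricing.

```python
PRICE_PER_SECOND_USD = 0.00054  # per-second rate from the table above


def transcription_cost_usd(hours_of_audio: float) -> float:
    """Usage-based cost in USD for a given number of audio hours."""
    return round(hours_of_audio * 3600 * PRICE_PER_SECOND_USD, 2)


# Example volumes (assumed): one episode, a weekly-meeting team, a heavy user.
for hours in (1, 40, 100):
    print(f"{hours:>3} h/month -> ${transcription_cost_usd(hours):.2f}")
```

At this rate one hour of audio costs just under $2, so the savings from running locally only become significant at tens of hours per month or when several paid services are stacked.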
🇳🇿 NZ Applications
For NZ businesses and creators:
- Podcast transcription — Process 60-minute episodes in one pass, with speaker labels, for free
- Meeting notes — Transcribe board meetings, client calls, or team standups locally
- Accessibility — Generate natural speech for content, locally and privately
- Māori and Pasifika languages — 50+ language support means potential for te reo Māori integration through fine-tuning
The 0.5B realtime model is small enough to run on a Mac Mini. The 7B ASR model needs a GPU, but a single consumer card is enough provided it has around 14GB of VRAM (see Limitations).
🛠️ Getting Started
- ASR model on Hugging Face: https://huggingface.co/microsoft/vibevoice-asr-7b
- Realtime TTS on Colab (try it now, via the GitHub repo): https://github.com/microsoft/VibeVoice
- Vibing input method (built on VibeVoice ASR): available on macOS and Windows
GitHub: microsoft/VibeVoice — 40K+ stars and growing.
⚠️ Limitations
- The 7B ASR model needs a GPU with ~14GB VRAM (or Apple Silicon with unified memory)
- Realtime model (0.5B) quality is lower than the full TTS model
- Language coverage is strongest in English and Chinese — other languages may vary
- Fine-tuning code is available but requires technical expertise
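The ~14GB figure lines up with a simple estimate of weight memory. The sketch below assumes 16-bit weights (2 bytes per parameter), which the article does not state explicitly, and it counts weights only: activations and caches for long audio need extra headroom on top.

```python
def weight_memory_gb(params_billions: float, bytes_per_param: float = 2.0) -> float:
    """Approximate memory for model weights alone, in GB (10^9 bytes).

    Assumes every parameter is stored at `bytes_per_param` bytes;
    2.0 corresponds to 16-bit precision.
    """
    return params_billions * 1e9 * bytes_per_param / 1e9


print(weight_memory_gb(7))    # 7B ASR model at 16-bit      -> 14.0
print(weight_memory_gb(0.5))  # 0.5B realtime model, 16-bit -> 1.0
```

The same arithmetic explains why the 0.5B realtime model fits comfortably on small machines: its weights come to roughly 1 GB under the same assumption.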