Indian AI startup Sarvam AI has released a new version of its speech recognition model, Saaras V3, and claims it outperforms several widely used global systems on benchmarks focused on Indian languages and Indian-accented English.
According to the company, Saaras V3 surpasses models including Google’s Gemini 3 Pro, OpenAI’s GPT-4o Transcribe, Deepgram Nova-3, and ElevenLabs Scribe v2 on key datasets designed specifically for Indian speech.
Sarvam AI co-founder Pratyush Kumar shared the benchmark results in a post on X, along with comparison charts showing Saaras V3’s performance against competing models on the IndicVoices and Svarah datasets.
According to him, Saaras V3 achieved a lower word error rate (WER) than the other systems across the most widely used Indian languages in the IndicVoices benchmark.
On a subset covering the 10 most popular languages in the IndicVoices dataset, Saaras V3 recorded a word error rate of about 19.3 per cent, while the competing models (Gemini 3 Pro, GPT-4o Transcribe, Deepgram Nova-3, and Scribe v2) posted higher error rates.
Sarvam also stated that the performance gap widens further on the remaining languages in the dataset, many of which are lower-resource Indian languages.
The Svarah benchmark, which focuses on Indian-accented English from speakers across multiple Indian states, showed similar results.
According to figures shared by Sarvam, Saaras V3 again recorded the lowest word error rate among the compared systems, which the company presents as evidence of its advantage on speech patterns that are common in India but often challenging for globally trained models.
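For readers unfamiliar with the metric, word error rate is the number of word-level substitutions, deletions and insertions needed to turn a model's transcript into the reference transcript, divided by the number of words in the reference. A WER of about 19.3 per cent therefore means roughly 19 such errors for every 100 reference words. The Python sketch below illustrates the calculation on a made-up sentence pair. It is not Sarvam's evaluation code, and it assumes plain whitespace tokenisation with no text normalisation, which benchmark pipelines typically apply before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion (reference word missed)
                curr[j - 1] + 1,                  # insertion (extra hypothesis word)
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Made-up example: one substitution ("rupees" -> "rupee") and one deletion ("my")
# against an eight-word reference.
reference = "please transfer five hundred rupees to my account"
hypothesis = "please transfer five hundred rupee to account"
wer = word_error_rate(reference, hypothesis)
print(f"WER: {wer:.2%}")
```

Because the denominator is the reference length, a transcript with many inserted words can score above 100 per cent.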
Sarvam says Saaras V3 is built on a new architecture and now supports all 22 scheduled Indian languages, along with English. This expanded coverage reflects a deliberate focus on India’s linguistic diversity.
One of the most significant upgrades in Saaras V3 is native support for real-time, streaming speech recognition. Unlike batch-based systems that wait for an entire audio clip to finish before transcribing it, Saaras V3 can begin generating text while audio is still arriving.
Sarvam says this allows the model to maintain accuracy close to batch mode while reducing latency, making it suitable for:
Live captions
Voice assistants
Call-centre tools
Real-time transcription
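The difference between batch and streaming recognition can be shown with a toy Python sketch. It does not call Sarvam's API or any real model. Each hypothetical audio chunk carries its own text so the example runs on its own, and model compute time is ignored. The only point is how much audio must be received before the first text appears.

```python
from typing import Iterable, Iterator, List, Tuple

# Hypothetical stand-in for audio: (duration in seconds, words spoken in it).
# A real system would pass raw samples to a model instead.
Chunk = Tuple[float, str]


def batch_transcribe(chunks: Iterable[Chunk]) -> Tuple[float, str]:
    """Return (seconds of audio received before any text appears, full text)."""
    received = list(chunks)  # batch mode needs the whole clip first
    waited = sum(duration for duration, _ in received)
    return waited, " ".join(text for _, text in received)


def stream_transcribe(chunks: Iterable[Chunk]) -> Iterator[Tuple[float, str]]:
    """Yield (seconds of audio received so far, partial transcript) per chunk."""
    elapsed = 0.0
    words: List[str] = []
    for duration, text in chunks:
        elapsed += duration
        words.append(text)
        yield elapsed, " ".join(words)


clip = [(0.5, "namaste"), (0.5, "aap kaise"), (0.5, "hain")]

batch_wait, batch_text = batch_transcribe(clip)
partials = list(stream_transcribe(clip))

print("batch: first text after", batch_wait, "s of audio")
print("streaming: first text after", partials[0][0], "s of audio")
```

In this toy setup both modes end with the same transcript, but the streaming version shows its first words after the first half-second chunk rather than after the full clip.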
According to Sarvam’s technical blog, Saaras V3 was trained on more than one million hours of multilingual audio, covering:
Multiple Indian languages
Diverse accents
Varied recording conditions
The training process focused heavily on code-mixed speech and noisy audio, which are common in real-world Indian usage.
Training included:
Large-scale pre-training
Supervised fine-tuning
Reinforcement learning
Post-training steps to reduce long-tail errors and improve consistency across languages
Sarvam positions Saaras V3 as more than a simple transcription engine.
The model supports:
Automatic language detection
Word-level timestamps
Speaker diarisation, allowing it to separate and label different speakers in a conversation
These capabilities are designed for use cases such as:
Call analytics
Meeting transcripts
Media subtitling
Customer support and contact-centre workflows
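As an illustration of how word-level timestamps and speaker labels feed a use case like call analytics, the Python sketch below groups labelled words into speaker turns and totals each speaker's talk time. The `Word` and `Turn` structures and their field names are hypothetical and are not Sarvam's response format.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Word:
    text: str
    start: float  # seconds from the start of the recording
    end: float
    speaker: str  # hypothetical diarisation label


@dataclass
class Turn:
    speaker: str
    start: float
    end: float
    text: str


def group_into_turns(words: List[Word]) -> List[Turn]:
    """Merge consecutive words from the same speaker into one turn."""
    turns: List[Turn] = []
    for word in words:
        if turns and turns[-1].speaker == word.speaker:
            turns[-1].end = word.end
            turns[-1].text += " " + word.text
        else:
            turns.append(Turn(word.speaker, word.start, word.end, word.text))
    return turns


def talk_time(turns: List[Turn]) -> Dict[str, float]:
    """Total seconds spoken per speaker, summed over that speaker's turns."""
    totals: Dict[str, float] = {}
    for turn in turns:
        totals[turn.speaker] = totals.get(turn.speaker, 0.0) + (turn.end - turn.start)
    return totals


# Made-up call snippet.
words = [
    Word("hello", 0.0, 0.5, "agent"),
    Word("sir", 0.5, 1.0, "agent"),
    Word("yes", 1.5, 2.0, "customer"),
    Word("thanks", 2.5, 3.0, "agent"),
]
turns = group_into_turns(words)
totals = talk_time(turns)
for turn in turns:
    print(f"[{turn.start:.1f}-{turn.end:.1f}] {turn.speaker}: {turn.text}")
```

The same turn structure could be reused for meeting transcripts or subtitle cues, since each turn already carries a start time, an end time and its text.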
Sarvam has also introduced multiple operating modes that allow developers to balance latency and accuracy. These range from a “fast” mode optimised for low time-to-first-token to more accuracy-focused modes where transcription quality is the top priority.
The Saaras V3 results build on Sarvam’s earlier benchmark disclosures around Sarvam Vision, its document-focused AI system.
Previously, the company said Sarvam Vision outperformed several general-purpose models on tasks such as:
Document OCR
Layout understanding
Multi-script Indian documents
These tests included challenges like reading-order detection, table parsing, and complex page layouts, all areas where models trained primarily on Western and English-language data often struggle.
Sarvam has argued that its task-specific design and training on Indian-language and Indian-format data explain the performance gap. Saaras V3 extends the same philosophy into speech recognition.
Sarvam AI is a Bengaluru-based startup focused on building speech, language, and multimodal AI systems for Indian use cases.
Rather than creating a single general-purpose chatbot, the company develops task-specific models, including:
Saaras – speech recognition
Bulbul – text-to-speech for Indian languages
Saarika – speech-to-text transcription
Mayura – text translation
Sarvam-M – multilingual reasoning language model
On the vision side, Sarvam Vision focuses on document understanding, while Samvaad is a voice-based conversational application built on top of Sarvam’s speech and language stack.
Sarvam AI is also one of the 12 startups selected under the Indian government’s IndiaAI mission to help develop indigenous multilingual and multimodal large language models.