Sarvam AI’s Saaras V3 Beats Gemini and GPT-4o on Indian Speech Benchmarks
News Synopsis
Indian AI startup Sarvam AI has released a new version of its speech recognition model, Saaras V3, and claims it outperforms several widely used global systems on benchmarks focused on Indian languages and Indian-accented English.
According to the company, Saaras V3 surpasses models including Google’s Gemini 3 Pro, OpenAI’s GPT-4o Transcribe, Deepgram Nova-3, and ElevenLabs Scribe v2 on key datasets designed specifically for Indian speech.
Benchmark Results Shared by Sarvam AI
Performance on IndicVoices Dataset
Sarvam AI co-founder Pratyush Kumar shared the benchmark results in a post on X, along with comparison charts showing Saaras V3’s performance against competing models on the IndicVoices and Svarah datasets.
According to him, Saaras V3 achieved a lower word error rate (WER) than the other systems across the most widely used Indian languages in the IndicVoices benchmark.
On a subset covering the 10 most popular languages in the IndicVoices dataset:
- Saaras V3 recorded a word error rate of about 19.3 per cent
- Competing models, including Gemini 3 Pro, GPT-4o Transcribe, Deepgram Nova-3, and Scribe v2, posted higher error rates
Sarvam also stated that the performance gap widens further on the remaining languages in the dataset, many of which are lower-resource Indian languages.
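For context, word error rate is generally computed as the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and the model's output, divided by the number of words in the reference. The Python sketch below shows that standard calculation. It is not Sarvam's evaluation code: the function name and whitespace tokenisation are assumptions made for illustration, and real benchmarks usually normalise punctuation, casing, and numerals before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Illustrative helper (hypothetical name); splits on whitespace only.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from output
            insertion = curr[j - 1] + 1  # extra word in the output
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)
```

Under this definition, a WER of about 19.3 per cent corresponds to roughly 19 word-level errors for every 100 words in the reference transcripts. Because insertions count as errors, the value can exceed 100 per cent on very poor output.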
Strong Results on the Svarah Benchmark
The Svarah benchmark, which focuses on Indian-accented English from speakers across multiple Indian states, showed similar results.
According to figures shared by Sarvam, Saaras V3 again recorded the lowest word error rate among all compared systems, reinforcing its advantage on speech patterns common in India but often challenging for globally trained models.
Saaras V3: What’s New
Support for All 22 Scheduled Indian Languages
Sarvam says Saaras V3 is built on a new architecture and now supports all 22 scheduled Indian languages, along with English. This expanded coverage reflects a deliberate focus on India’s linguistic diversity.
Native Real-Time Streaming Speech Recognition
One of the most significant upgrades in Saaras V3 is native support for real-time, streaming speech recognition. Unlike batch-based systems that wait for an entire audio clip to finish, Saaras V3 can begin generating text while audio is still being received.
Sarvam says this allows the model to maintain accuracy close to batch mode while reducing latency, making it suitable for:
- Live captions
- Voice assistants
- Call-centre tools
- Real-time transcription
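The difference between batch and streaming recognition can be sketched in a few lines of Python. This is a toy illustration only: it does not use Sarvam's API, the function names are hypothetical, and the `recognise` callable is a stand-in for a real model. The streaming version here naively re-runs recognition on everything received so far, whereas a production system would decode incrementally.

```python
from typing import Callable, Iterable, Iterator


def batch_transcribe(chunks: Iterable[bytes],
                     recognise: Callable[[bytes], str]) -> str:
    """Wait for every audio chunk, then transcribe the whole clip once."""
    audio = b"".join(chunks)
    return recognise(audio)


def streaming_transcribe(chunks: Iterable[bytes],
                         recognise: Callable[[bytes], str]) -> Iterator[str]:
    """Yield a partial transcript each time a new audio chunk arrives."""
    buffered = b""
    for chunk in chunks:
        buffered += chunk
        # Naive approach: re-recognise the full buffer after every chunk.
        yield recognise(buffered)


def fake_recognise(audio: bytes) -> str:
    """Stand-in 'model' that treats the audio bytes as UTF-8 text."""
    return audio.decode("utf-8")


if __name__ == "__main__":
    incoming = [b"nam", b"aste"]
    for partial in streaming_transcribe(incoming, fake_recognise):
        print("partial:", partial)
    print("final:", batch_transcribe(incoming, fake_recognise))
```

The streaming caller sees text after the first chunk, which is what makes live captions and voice assistants feasible; the batch caller sees nothing until the clip ends.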
Training at Large Scale
According to Sarvam’s technical blog, Saaras V3 was trained on more than one million hours of multilingual audio, covering:
- Multiple Indian languages
- Diverse accents
- Varied recording conditions
The training process focused heavily on code-mixed speech and noisy audio, which are common in real-world Indian usage.
Training included:
- Large-scale pre-training
- Supervised fine-tuning
- Reinforcement learning
- Post-training steps to reduce long-tail errors and improve consistency across languages
Beyond Basic Speech-to-Text
Sarvam positions Saaras V3 as more than a simple transcription engine.
Advanced Features
The model supports:
- Automatic language detection
- Word-level timestamps
- Speaker diarisation, allowing it to separate and label different speakers in a conversation
These capabilities are designed for use cases such as:
- Call analytics
- Meeting transcripts
- Media subtitling
- Customer support and contact-centre workflows
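As an illustration of how word-level timestamps and speaker diarisation can combine into a speaker-labelled transcript for such workflows, the Python sketch below assigns each word to the speaker turn that contains its midpoint and then merges consecutive words from the same speaker. The `Word` and `SpeakerTurn` structures and the `label_words` function are hypothetical and do not reflect Sarvam's actual response format.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Word:
    text: str
    start: float  # seconds
    end: float    # seconds


@dataclass
class SpeakerTurn:
    speaker: str
    start: float  # seconds
    end: float    # seconds


def label_words(words: List[Word],
                turns: List[SpeakerTurn]) -> List[Tuple[str, str]]:
    """Return (speaker, text) lines, merging consecutive same-speaker words."""
    lines: List[Tuple[str, str]] = []
    for word in words:
        midpoint = (word.start + word.end) / 2
        # Pick the first turn whose time span covers the word's midpoint.
        speaker = next(
            (t.speaker for t in turns if t.start <= midpoint < t.end),
            "unknown",
        )
        if lines and lines[-1][0] == speaker:
            lines[-1] = (speaker, lines[-1][1] + " " + word.text)
        else:
            lines.append((speaker, word.text))
    return lines
```

Given the words "hello" and "there" inside one speaker's turn and "hi" inside another's, this produces two lines, one per speaker, which is the basic shape of a meeting transcript or call-analytics record.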
Flexible Operating Modes
Sarvam has also introduced multiple operating modes that allow developers to balance latency and accuracy. These range from a “fast” mode optimised for low time-to-first-token to more accuracy-focused modes where transcription quality is the top priority.
Earlier Benchmark Claims: Sarvam Vision
The Saaras V3 results build on Sarvam’s earlier benchmark disclosures around Sarvam Vision, its document-focused AI system.
Previously, the company said Sarvam Vision outperformed several general-purpose models on tasks such as:
- Document OCR
- Layout understanding
- Multi-script Indian documents
These tests included challenges like reading-order detection, table parsing, and complex page layouts, which are areas where models trained primarily on Western and English-language data often struggle.
Sarvam has argued that its task-specific design and training on Indian-language and Indian-format data explain the performance gap. Saaras V3 extends the same philosophy into speech recognition.
What Is Sarvam AI?
Sarvam AI is a Bengaluru-based startup focused on building speech, language, and multimodal AI systems for Indian use cases.
Rather than creating a single general-purpose chatbot, the company develops task-specific models, including:
- Saaras – speech recognition
- Bulbul – text-to-speech for Indian languages
- Saarika – speech-to-text transcription
- Mayura – text translation
- Sarvam-M – multilingual reasoning language model
On the vision side, Sarvam Vision focuses on document understanding, while Samvaad is a voice-based conversational application built on top of Sarvam’s speech and language stack.
Sarvam AI is also one of the 12 startups selected under the Indian government’s IndiaAI mission to help develop indigenous multilingual and multimodal large language models.