
Sarvam AI’s Saaras V3 Beats Gemini and GPT-4o on Indian Speech Benchmarks


12 Feb 2026


News Synopsis

Indian AI startup Sarvam AI has released a new version of its speech recognition model, Saaras V3, and claims it outperforms several widely used global systems on benchmarks focused on Indian languages and Indian-accented English.

According to the company, Saaras V3 surpasses models including Google’s Gemini 3 Pro, OpenAI’s GPT-4o Transcribe, Deepgram Nova-3, and ElevenLabs Scribe v2 on key datasets designed specifically for Indian speech.

Benchmark Results Shared by Sarvam AI

Performance on IndicVoices Dataset

Sarvam AI co-founder Pratyush Kumar shared the benchmark results in a post on X, along with comparison charts showing Saaras V3’s performance against competing models on the IndicVoices and Svarah datasets.

According to him, Saaras V3 achieved a lower word error rate (WER) than the other systems across the most widely used Indian languages in the IndicVoices benchmark.

On a subset covering the 10 most popular languages in the IndicVoices dataset:

Saaras V3 recorded a word error rate of about 19.3 per cent
Competing models, including Gemini 3 Pro, GPT-4o Transcribe, Deepgram Nova-3, and Scribe v2, posted higher error rates

Sarvam also stated that the performance gap widens further on the remaining languages in the dataset, many of which are lower-resource Indian languages.
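For readers unfamiliar with the metric, word error rate is conventionally the number of word substitutions, deletions, and insertions needed to turn the model's transcript into the reference transcript, divided by the number of words in the reference. The sketch below shows that standard definition in Python. The function name is illustrative, and the whitespace tokenisation is an assumption: the article does not describe how Sarvam or the benchmark authors normalise text before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER = (substitutions + deletions + insertions) / reference words.

    Assumption: words are split on whitespace with no further normalisation.
    Real benchmark pipelines usually normalise case, punctuation, and numerals.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words (word-level Levenshtein).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)
```

A lower value is better. Under this definition, a word error rate of about 19.3 per cent means roughly one word-level error for every five reference words.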

Strong Results on the Svarah Benchmark

The Svarah benchmark, which focuses on Indian-accented English from speakers across multiple Indian states, showed similar results.

According to figures shared by Sarvam, Saaras V3 again recorded the lowest word error rate among all compared systems, reinforcing its advantage on speech patterns common in India but often challenging for globally trained models.

Saaras V3: What’s New

Support for All 22 Scheduled Indian Languages

Sarvam says Saaras V3 is built on a new architecture and now supports all 22 scheduled Indian languages, along with English. This expanded coverage reflects a deliberate focus on India’s linguistic diversity.

Native Real-Time Streaming Speech Recognition

One of the most significant upgrades in Saaras V3 is native support for real-time, streaming speech recognition. Unlike batch-based systems that wait for an entire audio clip to finish, Saaras V3 can begin generating text while audio is still arriving.

Sarvam says this allows the model to maintain accuracy close to batch mode while reducing latency, making it suitable for:

Live captions
Voice assistants
Call-centre tools
Real-time transcription
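The difference between batch and streaming recognition is mostly one of control flow, which a toy sketch can illustrate. Everything below is hypothetical: `fake_recognizer` is a stand-in in which each audio "frame" is already a word, and re-running recognition over the growing buffer is only one simple way to produce partial results. It does not represent Sarvam's API or how Saaras V3 works internally.

```python
from typing import Callable, Iterable, Iterator, List

# Hypothetical recognizer type: takes the frames seen so far, returns text.
Recognizer = Callable[[List[str]], str]


def fake_recognizer(frames: List[str]) -> str:
    """Stand-in for a real model: each 'frame' is already a word."""
    return " ".join(frames)


def batch_transcribe(frames: Iterable[str], recognize: Recognizer) -> str:
    """Batch mode: wait for all audio, then transcribe once."""
    return recognize(list(frames))


def streaming_transcribe(
    frames: Iterable[str], recognize: Recognizer, chunk_size: int = 2
) -> Iterator[str]:
    """Streaming mode: emit a partial transcript every `chunk_size` frames."""
    seen: List[str] = []
    for frame in frames:
        seen.append(frame)
        if len(seen) % chunk_size == 0:
            yield recognize(seen)
    # Flush any frames left over after the last full chunk.
    if len(seen) % chunk_size != 0:
        yield recognize(seen)
```

In this sketch a caller of `streaming_transcribe` receives text after the first chunk rather than after the whole clip, and the final partial matches the batch result. A real system has to manage that trade-off between latency and accuracy.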

Training at Large Scale

According to Sarvam’s technical blog, Saaras V3 was trained on more than one million hours of multilingual audio, covering:

Multiple Indian languages
Diverse accents
Varied recording conditions

The training process focused heavily on code-mixed speech and noisy audio, which are common in real-world Indian usage.

Training included:

Large-scale pre-training
Supervised fine-tuning
Reinforcement learning
Post-training steps to reduce long-tail errors and improve consistency across languages

Beyond Basic Speech-to-Text

Sarvam positions Saaras V3 as more than a simple transcription engine.

Advanced Features

The model supports:

Automatic language detection
Word-level timestamps
Speaker diarisation, allowing it to separate and label different speakers in a conversation

These capabilities are designed for use cases such as:

Call analytics
Meeting transcripts
Media subtitling
Customer support and contact-centre workflows
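As an illustration of how word-level timestamps and speaker labels feed such workflows, the sketch below groups a flat list of words into speaker turns, the shape a meeting transcript or call-analytics tool typically needs. The `Word` structure and its field names are assumptions made for this example. The article does not describe the output format Saaras V3 actually returns.

```python
from dataclasses import dataclass
from typing import Dict, List, Union


@dataclass
class Word:
    # Hypothetical per-word record; field names are illustrative only.
    text: str
    start: float   # seconds from the start of the audio
    end: float     # seconds from the start of the audio
    speaker: str   # diarisation label, e.g. "S1"


Turn = Dict[str, Union[str, float]]


def group_into_turns(words: List[Word]) -> List[Turn]:
    """Merge consecutive words from the same speaker into one turn."""
    turns: List[Turn] = []
    for word in words:
        if turns and turns[-1]["speaker"] == word.speaker:
            # Same speaker is still talking: extend the current turn.
            turns[-1]["text"] = f'{turns[-1]["text"]} {word.text}'
            turns[-1]["end"] = word.end
        else:
            # Speaker changed (or first word): start a new turn.
            turns.append(
                {
                    "speaker": word.speaker,
                    "start": word.start,
                    "end": word.end,
                    "text": word.text,
                }
            )
    return turns
```

Each resulting turn carries a start time, an end time, and a speaker label, which is enough to render subtitles or measure how long each party spoke on a call.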

Flexible Operating Modes

Sarvam has also introduced multiple operating modes that allow developers to balance latency and accuracy. These range from a “fast” mode optimised for low time-to-first-token to more accuracy-focused modes where transcription quality is the top priority.

Earlier Benchmark Claims: Sarvam Vision

The Saaras V3 results build on Sarvam’s earlier benchmark disclosures around Sarvam Vision, its document-focused AI system.

Previously, the company said Sarvam Vision outperformed several general-purpose models on tasks such as:

Document OCR
Layout understanding
Multi-script Indian documents

These tests included challenges like reading-order detection, table parsing, and complex page layouts, areas where models trained primarily on Western and English-language data often struggle.

Sarvam has argued that its task-specific design and training on Indian-language and Indian-format data explain the performance gap. Saaras V3 extends the same philosophy into speech recognition.

What Is Sarvam AI?

Sarvam AI is a Bengaluru-based startup focused on building speech, language, and multimodal AI systems for Indian use cases.

Rather than creating a single general-purpose chatbot, the company develops task-specific models, including:

Saaras – speech recognition
Bulbul – text-to-speech for Indian languages
Saarika – speech-to-text transcription
Mayura – text translation
Sarvam-M – multilingual reasoning language model

On the vision side, Sarvam Vision focuses on document understanding, while Samvaad is a voice-based conversational application built on top of Sarvam’s speech and language stack.

Sarvam AI is also one of the 12 startups selected under the Indian government’s IndiaAI mission to help develop indigenous multilingual and multimodal large language models.
