WavBench: Benchmarking Reasoning, Colloquialism, and Paralinguistics for End-to-End Spoken Dialogue Models

The WavBench Team
Figure 1: Overview of WavBench results comparing five end-to-end spoken dialogue models across colloquial expression (Basic/Pro), explicit instruction understanding/generation, and implicit dialogue.

Overview

WavBench is a comprehensive benchmark tailored to authentic real-world scenarios, designed to evaluate both the audio-centric colloquial semantics and the paralinguistic fidelity of end-to-end spoken dialogue models. To judge responses rigorously on inherent speech characteristics, we establish a holistic framework comprising three distinct tiers: a Pro subset that challenges reasoning agents with complex, discriminative tasks; a Basic subset that benchmarks adaptation to spoken, colloquial delivery; and an Acoustic subset that assesses comprehensive paralinguistic interaction.

Abstract

With the rapid integration of advanced reasoning capabilities into spoken dialogue models, the field urgently demands benchmarks that transcend simple interactions to address real-world complexity. However, current evaluations predominantly adhere to text-generation standards, overlooking the unique audio-centric characteristics of paralinguistics and colloquialisms, as well as the cognitive depth required by modern agents.

To bridge this gap, we introduce WavBench, a comprehensive benchmark designed to evaluate realistic conversational abilities where prior works fall short. Uniquely, WavBench establishes a tripartite framework: 1) Pro subset, designed to rigorously challenge reasoning-enhanced models with significantly increased difficulty; 2) Basic subset, defining a novel standard for spoken colloquialism that prioritizes "listenability" through natural vocabulary, linguistic fluency, and interactive rapport, rather than rigid written accuracy; and 3) Acoustic subset, covering explicit understanding, generation, and implicit dialogue to rigorously evaluate comprehensive paralinguistic capabilities within authentic real-world scenarios.

Through evaluating five state-of-the-art models, WavBench offers critical insights into the intersection of complex problem-solving, colloquial delivery, and paralinguistic fidelity, guiding the evolution of robust spoken dialogue models.

Key Features of WavBench

Tripartite Framework

Establishes a holistic framework comprising Pro (Reasoning), Basic (Colloquialism), and Acoustic (Paralinguistics) subsets.

Massive Scale

Comprises 17,577 items totaling 76.5 hours of audio, covering 7 cognitive domains and 10 paralinguistic attributes.

Implicit Interaction

Evaluates both explicit instructions and implicit multi-turn dialogues, requiring models to infer acoustic cues without direct prompts.

Illustrative Examples

Colloquial Expression

This set is constructed by aggregating high-quality samples from 15 open-source datasets across seven cognitive domains: Creative Writing, Instruction Following, Code, Math, QA, Safety, and Logic. It is organized into two tiers: Basic and Pro.
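An item in this set can be pictured as a record carrying its source dataset, cognitive domain, and tier. The sketch below is hypothetical: the field names and `ColloquialItem` class are ours for illustration, not the schema of the actual WavBench release.

```python
from dataclasses import dataclass

# Hypothetical item schema for the Colloquial Expression set; the real
# WavBench data may be organized differently.
DOMAINS = ["Creative Writing", "Instruction Following", "Code",
           "Math", "QA", "Safety", "Logic"]   # the 7 cognitive domains
TIERS = ["Basic", "Pro"]                      # the two difficulty tiers

@dataclass
class ColloquialItem:
    item_id: str
    source_dataset: str   # one of the 15 aggregated open-source datasets
    domain: str           # one of the 7 cognitive domains
    tier: str             # "Basic" or "Pro"
    audio_path: str       # spoken rendition of the query

    def __post_init__(self):
        # Guard against items outside the declared taxonomy.
        assert self.domain in DOMAINS and self.tier in TIERS

# Example record (illustrative values only).
item = ColloquialItem("cw-0001", "example-source-set",
                      "Creative Writing", "Pro", "audio/cw-0001.wav")
```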

Figure 2: Examples of Colloquial Expression in WavBench.



Acoustic Interaction

This set comprises three components: Explicit Understanding, Explicit Generation, and Implicit Dialogue. It rigorously examines 10 paralinguistic dimensions spanning speaker information (Age, Gender, Accent, Language), acoustic characteristics (Pitch, Speed, Volume, Emotion), and background sounds (Audio, Music). The specific attribute values are:

- Age: Children, Adolescent, Middle-aged, Elderly
- Gender: Male, Female
- Accent: Indian, Canadian, British, Singaporean, American, Australian
- Language: Chinese, English
- Pitch: Low, Normal, High
- Speed: Slow, Normal, Fast
- Volume: Low, Normal, High
- Emotion: Neutral, Happy, Sad, Angry, Surprised, Disgusted, Fearful
- Audio: Wind noise, People crowd, Thunder, Cap gun shooting, Door slamming
- Music: Piano, Guitar, Drum
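The attribute taxonomy above can be captured in a simple lookup structure, which is handy for validating labels or computing per-dimension coverage. This is a hypothetical sketch, not an official WavBench schema.

```python
# Hypothetical encoding of WavBench's 10 paralinguistic dimensions and their
# attribute values, transcribed from the text (not an official schema).
PARALINGUISTIC_ATTRIBUTES = {
    # speaker information
    "Age": ["Children", "Adolescent", "Middle-aged", "Elderly"],
    "Gender": ["Male", "Female"],
    "Accent": ["Indian", "Canadian", "British",
               "Singaporean", "American", "Australian"],
    "Language": ["Chinese", "English"],
    # acoustic characteristics
    "Pitch": ["Low", "Normal", "High"],
    "Speed": ["Slow", "Normal", "Fast"],
    "Volume": ["Low", "Normal", "High"],
    "Emotion": ["Neutral", "Happy", "Sad", "Angry",
                "Surprised", "Disgusted", "Fearful"],
    # background sounds
    "Audio": ["Wind noise", "People crowd", "Thunder",
              "Cap gun shooting", "Door slamming"],
    "Music": ["Piano", "Guitar", "Drum"],
}

def is_valid_label(dimension: str, value: str) -> bool:
    """Check a (dimension, value) pair against the taxonomy."""
    return value in PARALINGUISTIC_ATTRIBUTES.get(dimension, [])
```

A validator like `is_valid_label` would catch, e.g., an out-of-taxonomy emotion tag before it enters evaluation.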

Figure 3: Examples of Acoustic Interaction in WavBench.

Audio Samples

Explicit Understanding Samples

Explicit Understanding - Accent

Explicit Understanding - Emotion

Explicit Understanding - Age

Explicit Understanding - Audio

Implicit Single-Turn Samples

Implicit Dialogue (Sample 1)

Implicit Dialogue (Sample 2)

Implicit Multi-Turn Samples (Contextual Dialogue)

Turn 1
Turn 2
Turn 3
Turn 4

Evaluation Results

Table 2: Overall Evaluation of WavBench. The evaluation is organized into five panels: (A) Colloquial Expression (Pro Subset); (B) Colloquial Expression (Basic Subset); (C) Explicit Acoustic Understanding; (D) Explicit Acoustic Generation; and (E) Implicit Acoustic Capability.

Panel A: Colloquial Expression (Pro Subset)

| Task | Qwen3-Omni | Kimi-Audio | Mimo-Audio | Step-Audio-2 | GPT-4o Audio |
|---|---|---|---|---|---|
| Code | 39.75 | 30.29 | 28.96 | 31.20 | 53.60 |
| Creativity | 48.39 | 31.78 | 42.86 | 35.00 | 63.00 |
| Instruction | 43.01 | 29.86 | 36.44 | 29.40 | 57.80 |
| Logic | 33.21 | 26.03 | 27.57 | 26.20 | 42.60 |
| Math | 38.55 | 27.30 | 25.68 | 22.40 | 50.20 |
| QA | 50.93 | 42.54 | 41.28 | 40.80 | 72.80 |
| Safety | 60.00 | 56.19 | 56.19 | 52.40 | 67.60 |
| Avg (Pro) | 39.53 | 30.79 | 32.02 | 30.40 | 58.23 |

Panel B: Colloquial Expression (Basic Subset)

| Task | Qwen3-Omni | Kimi-Audio | Mimo-Audio | Step-Audio-2 | GPT-4o Audio |
|---|---|---|---|---|---|
| Code | 53.10 | 40.69 | 42.07 | 37.20 | 58.00 |
| Creativity | 57.44 | 41.57 | 45.29 | 47.20 | 71.20 |
| Instruction | 57.29 | 44.41 | 33.56 | 36.60 | 66.80 |
| Logic | 52.35 | 50.74 | 49.91 | 48.80 | 67.00 |
| Math | 51.05 | 41.27 | 38.73 | 30.20 | 62.40 |
| QA | 57.54 | 49.07 | 49.12 | 48.60 | 75.60 |
| Safety | 59.67 | 58.83 | 62.83 | 60.20 | 81.00 |
| Avg (Basic) | 55.80 | 49.23 | 49.57 | 48.50 | 68.80 |

Panel C: Explicit Acoustic Understanding

| Attribute | Qwen3-Omni | Kimi-Audio | Mimo-Audio | Step-Audio-2 | GPT-4o Audio |
|---|---|---|---|---|---|
| Accent | 37.50 | 11.00 | 27.00 | 20.67 | 15.67 |
| Age | 64.33 | 53.67 | 53.00 | 67.67 | 20.33 |
| Emotion | 92.86 | 77.33 | 77.33 | 75.43 | 85.90 |
| Gender | 21.00 | 44.50 | 20.00 | 68.00 | 61.50 |
| Language | 83.50 | 91.00 | 53.50 | 96.50 | 97.00 |
| Pitch | 32.44 | 23.11 | 24.00 | 34.22 | 23.56 |
| Speed | 46.67 | 54.67 | 48.89 | 44.00 | 48.00 |
| Volume | 33.78 | 38.22 | 31.11 | 50.67 | 41.78 |
| Audio Event | 61.73 | 67.90 | 19.75 | 39.51 | 59.26 |
| Music | 22.22 | 66.67 | 55.56 | 77.78 | 33.33 |
| Avg (Understand) | 49.60 | 52.80 | 41.02 | 57.36 | 48.70 |

Panel D: Explicit Acoustic Generation

| Attribute | Qwen3-Omni | Kimi-Audio | Mimo-Audio | Step-Audio-2 | GPT-4o Audio |
|---|---|---|---|---|---|
| Accent | 37.50 | 3.52 | 23.44 | 22.07 | 74.22 |
| Age | 64.65 | 46.88 | 51.95 | 31.64 | 78.12 |
| Emotion | 90.04 | 50.29 | 57.13 | 66.50 | 95.51 |
| Gender | 72.27 | 45.31 | 67.58 | 59.77 | 98.83 |
| Language | 89.84 | 74.80 | 51.56 | 91.41 | 87.89 |
| Pitch | 76.56 | 47.27 | 80.27 | 55.66 | 85.74 |
| Speed | 43.75 | 47.27 | 51.56 | 69.14 | 66.60 |
| Volume | 56.25 | 64.06 | 59.96 | 57.03 | 82.42 |
| Audio | 27.03 | 10.81 | 9.46 | 32.43 | 45.95 |
| Music | 62.50 | 20.83 | 16.67 | 70.83 | 77.08 |
| Avg (Generation) | 62.03 | 41.10 | 46.93 | 55.65 | 79.23 |

Panel E: Implicit Acoustic Capability

| Setting | Qwen3-Omni | Kimi-Audio | Mimo-Audio | Step-Audio-2 | GPT-4o Audio |
|---|---|---|---|---|---|
| Single-Turn (Text) | 1.85 | 1.84 | 2.23 | 1.12 | 2.43 |
| Single-Turn (Audio) | 3.17 | 3.21 | 2.47 | 3.50 | 2.96 |
| Multi-Turn (Text) | 4.88 | 4.57 | 4.61 | 4.38 | 4.48 |
| Multi-Turn (Audio) | 1.25 | 1.08 | 1.04 | 1.21 | 1.23 |
| Avg (Implicit) | 2.78 | 2.67 | 2.59 | 2.55 | 2.78 |
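For Panel E, the Avg (Implicit) row matches the unweighted mean of the four turn settings to within rounding (the Pro/Basic averages do not match simple means of their rows, suggesting item-count weighting there). A quick check on the Panel E numbers:

```python
# Panel E scores from Table 2, per model:
# [Single-Turn (Text), Single-Turn (Audio), Multi-Turn (Text), Multi-Turn (Audio)]
panel_e = {
    "Qwen3-Omni":   [1.85, 3.17, 4.88, 1.25],
    "Kimi-Audio":   [1.84, 3.21, 4.57, 1.08],
    "Mimo-Audio":   [2.23, 2.47, 4.61, 1.04],
    "Step-Audio-2": [1.12, 3.50, 4.38, 1.21],
    "GPT-4o Audio": [2.43, 2.96, 4.48, 1.23],
}
reported_avg = {"Qwen3-Omni": 2.78, "Kimi-Audio": 2.67, "Mimo-Audio": 2.59,
                "Step-Audio-2": 2.55, "GPT-4o Audio": 2.78}

for model, scores in panel_e.items():
    mean = sum(scores) / len(scores)
    # Reported values are rounded to two decimals, so allow rounding slack.
    assert abs(mean - reported_avg[model]) < 0.008, (model, mean)
```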

BibTeX

@article{wavbench2024,
  title={WavBench: Benchmarking Reasoning, Colloquialism, and Paralinguistics for End-to-End Spoken Dialogue Models},
  author={{WavBench Team}},
  journal={arXiv preprint},
  year={2024}
}