WavBench: Benchmarking Reasoning, Colloquialism, and Paralinguistics

Overview

WavBench is a comprehensive benchmark tailored for authentic real-world scenarios, designed to evaluate both the audio-centric colloquial semantics and the paralinguistic fidelity of end-to-end spoken dialogue models. To judge responses on inherent speech characteristics, we establish a holistic framework comprising three subsets: a Pro subset that challenges reasoning agents with complex and discriminative tasks; a Basic subset that benchmarks spoken adaptation; and an Acoustic subset that assesses comprehensive paralinguistic interactions.

Abstract

With the rapid integration of advanced reasoning capabilities into spoken dialogue models, the field urgently demands benchmarks that transcend simple interactions to address real-world complexity. However, current evaluations predominantly adhere to text-generation standards, overlooking the unique audio-centric characteristics of paralinguistics and colloquialisms, alongside the cognitive depth required by modern agents.

To bridge this gap, we introduce WavBench, a comprehensive benchmark that evaluates realistic conversational abilities where prior work falls short. WavBench establishes a tripartite framework: 1) a Pro subset, designed to challenge reasoning-enhanced models with significantly increased difficulty; 2) a Basic subset, defining a novel standard for spoken colloquialism that prioritizes "listenability" through natural vocabulary, linguistic fluency, and interactive rapport rather than rigid written accuracy; and 3) an Acoustic subset, covering explicit understanding, explicit generation, and implicit dialogue to evaluate paralinguistic capabilities in authentic real-world scenarios.

By evaluating five state-of-the-art models (Qwen3-Omni, Kimi-Audio, Mimo-Audio, Step-Audio-2, and GPT-4o Audio), WavBench offers critical insights into the intersection of complex problem-solving, colloquial delivery, and paralinguistic fidelity, guiding the evolution of robust spoken dialogue models.

Key Features of WavBench

Tripartite Framework

Establishes a holistic framework comprising Pro (Reasoning), Basic (Colloquialism), and Acoustic (Paralinguistics) subsets.

Massive Scale

Comprises 17,577 items totaling 76.5 hours, covering 7 cognitive domains and 10 paralinguistic attributes.

Implicit Interaction

Evaluates both explicit instructions and implicit single-turn and multi-turn dialogues, requiring models to infer acoustic cues without direct prompts.

Illustrative Examples

Colloquial Expression

This set is constructed by aggregating high-quality samples from 15 open-source datasets across seven cognitive domains: Creative Writing, Instruction Following, Code, Math, QA, Safety, and Logic. It is organized into two tiers: Basic and Pro.

Acoustic Interaction

This set comprises three components: Explicit Understanding, Explicit Generation, and Implicit Dialogue. It rigorously examines 10 paralinguistic dimensions, including speaker information (Age, Gender, Accent, Language), acoustic characteristics (Pitch, Speed, Volume, Emotion), and background sounds (Audio, Music). Specific attributes include: Age (Children, Adolescent, Middle-aged, Elderly), Gender (Male, Female), Accent (Indian, Canadian, British, Singaporean, American, Australian), Language (Chinese, English), Pitch (Low, Normal, High), Speed (Slow, Normal, Fast), Volume (Low, Normal, High), Emotion (Neutral, Happy, Sad, Angry, Surprised, Disgusted, Fearful), Audio (Wind noise, People crowd, Thunder, Cap gun shooting, Door slamming), and Music (Piano, Guitar, Drum).
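The attribute taxonomy described above can be written down as a small lookup structure, which is handy when filtering items or tallying labels. The following is only a sketch: the group keys (`speaker`, `acoustic`, `background`) and the names `TAXONOMY` and `flatten` are illustrative assumptions, not identifiers from the WavBench release; the attribute names and labels are copied from the paragraph above.

```python
# Paralinguistic taxonomy as listed in the text above.
# Group keys and variable names are hypothetical, chosen for illustration.
TAXONOMY = {
    "speaker": {
        "Age": ["Children", "Adolescent", "Middle-aged", "Elderly"],
        "Gender": ["Male", "Female"],
        "Accent": ["Indian", "Canadian", "British", "Singaporean",
                   "American", "Australian"],
        "Language": ["Chinese", "English"],
    },
    "acoustic": {
        "Pitch": ["Low", "Normal", "High"],
        "Speed": ["Slow", "Normal", "Fast"],
        "Volume": ["Low", "Normal", "High"],
        "Emotion": ["Neutral", "Happy", "Sad", "Angry",
                    "Surprised", "Disgusted", "Fearful"],
    },
    "background": {
        "Audio": ["Wind noise", "People crowd", "Thunder",
                  "Cap gun shooting", "Door slamming"],
        "Music": ["Piano", "Guitar", "Drum"],
    },
}


def flatten(taxonomy):
    """Return a flat {attribute: labels} mapping across all groups."""
    return {
        attr: labels
        for group in taxonomy.values()
        for attr, labels in group.items()
    }


ATTRIBUTES = flatten(TAXONOMY)

# Print each attribute with the number of labels it defines.
for attr, labels in ATTRIBUTES.items():
    print(f"{attr}: {len(labels)} labels")
```

Flattening the groups gives the 10 paralinguistic dimensions named in the text, with the group level kept separately for analyses that contrast speaker, acoustic, and background attributes.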

Audio Samples

Explicit Understanding samples: Accent, Emotion, Age, Audio.

Implicit single-turn samples: Implicit Dialogue (Sample 1), Implicit Dialogue (Sample 2).

Implicit multi-turn samples (contextual dialogue): Turn 1, Turn 2, Turn 3, Turn 4.

Evaluation Results

Table 2: Overall Evaluation of WavBench. The evaluation is organized into five panels: (A) Colloquial Expression (Pro Subset); (B) Colloquial Expression (Basic Subset); (C) Explicit Acoustic Understanding; (D) Explicit Acoustic Generation; and (E) Implicit Acoustic Capability.

Metrics / Tasks	Qwen3-Omni	Kimi-Audio	Mimo-Audio	Step-Audio-2	GPT-4o Audio
Panel A: Colloquial Expression Capability - Pro Subset
Code	39.75	30.29	28.96	31.20	53.60
Creativity	48.39	31.78	42.86	35.00	63.00
Instruction	43.01	29.86	36.44	29.40	57.80
Logic	33.21	26.03	27.57	26.20	42.60
Math	38.55	27.30	25.68	22.40	50.20
QA	50.93	42.54	41.28	40.80	72.80
Safety	60.00	56.19	56.19	52.40	67.60
Avg (Pro)	39.53	30.79	32.02	30.40	58.23
Panel B: Colloquial Expression Capability - Basic Subset
Code	53.10	40.69	42.07	37.20	58.00
Creativity	57.44	41.57	45.29	47.20	71.20
Instruction	57.29	44.41	33.56	36.60	66.80
Logic	52.35	50.74	49.91	48.80	67.00
Math	51.05	41.27	38.73	30.20	62.40
QA	57.54	49.07	49.12	48.60	75.60
Safety	59.67	58.83	62.83	60.20	81.00
Avg (Basic)	55.80	49.23	49.57	48.50	68.80
Panel C: Acoustic Explicit Understanding
Accent	37.50	11.00	27.00	20.67	15.67
Age	64.33	53.67	53.00	67.67	20.33
Emotion	92.86	77.33	77.33	75.43	85.90
Gender	21.00	44.50	20.00	68.00	61.50
Language	83.50	91.00	53.50	96.50	97.00
Pitch	32.44	23.11	24.00	34.22	23.56
Speed	46.67	54.67	48.89	44.00	48.00
Volume	33.78	38.22	31.11	50.67	41.78
Audio Event	61.73	67.90	19.75	39.51	59.26
Music	22.22	66.67	55.56	77.78	33.33
Avg (Understand)	49.60	52.80	41.02	57.36	48.70
Panel D: Acoustic Explicit Generation
Accent	37.50	3.52	23.44	22.07	74.22
Age	64.65	46.88	51.95	31.64	78.12
Emotion	90.04	50.29	57.13	66.50	95.51
Gender	72.27	45.31	67.58	59.77	98.83
Language	89.84	74.80	51.56	91.41	87.89
Pitch	76.56	47.27	80.27	55.66	85.74
Speed	43.75	47.27	51.56	69.14	66.60
Volume	56.25	64.06	59.96	57.03	82.42
Audio	27.03	10.81	9.46	32.43	45.95
Music	62.50	20.83	16.67	70.83	77.08
Avg (Generation)	62.03	41.10	46.93	55.65	79.23
Panel E: Implicit Acoustic Interaction
Single-Turn (Text)	1.85	1.84	2.23	1.12	2.43
Single-Turn (Audio)	3.17	3.21	2.47	3.50	2.96
Multi-Turn (Text)	4.88	4.57	4.61	4.38	4.48
Multi-Turn (Audio)	1.25	1.08	1.04	1.21	1.23
Avg (Implicit)	2.78	2.67	2.59	2.55	2.78
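
The page does not state how the Avg rows are aggregated. As a rough sanity check, the sketch below parses the Panel C rows and computes a plain unweighted mean per model, assuming each attribute carries equal weight; that assumption and the helper names (`parse_rows`, `unweighted_means`) are illustrative and not part of WavBench. The scores are copied from the table above.

```python
from statistics import mean

MODELS = ["Qwen3-Omni", "Kimi-Audio", "Mimo-Audio", "Step-Audio-2", "GPT-4o Audio"]

# Panel C rows copied from Table 2 (label followed by five scores).
PANEL_C = """
Accent 37.50 11.00 27.00 20.67 15.67
Age 64.33 53.67 53.00 67.67 20.33
Emotion 92.86 77.33 77.33 75.43 85.90
Gender 21.00 44.50 20.00 68.00 61.50
Language 83.50 91.00 53.50 96.50 97.00
Pitch 32.44 23.11 24.00 34.22 23.56
Speed 46.67 54.67 48.89 44.00 48.00
Volume 33.78 38.22 31.11 50.67 41.78
Audio Event 61.73 67.90 19.75 39.51 59.26
Music 22.22 66.67 55.56 77.78 33.33
"""

# Reported "Avg (Understand)" row from the same table.
REPORTED_AVG = dict(zip(MODELS, [49.60, 52.80, 41.02, 57.36, 48.70]))


def parse_rows(block, n_models):
    """Parse 'label s1 ... sN' lines into {label: [scores]}."""
    rows = {}
    for line in block.strip().splitlines():
        # Split from the right so multi-word labels ("Audio Event") stay intact.
        label, *scores = line.rsplit(maxsplit=n_models)
        rows[label] = [float(s) for s in scores]
    return rows


def unweighted_means(rows, models):
    """Average each model's column, giving every attribute equal weight."""
    return {m: mean(r[i] for r in rows.values()) for i, m in enumerate(models)}


rows = parse_rows(PANEL_C, len(MODELS))
macro = unweighted_means(rows, MODELS)

for m in MODELS:
    print(f"{m}: computed {macro[m]:.2f}, reported {REPORTED_AVG[m]:.2f}")
```

On these rows the unweighted mean lands within about 0.1 points of every reported Avg (Understand) value. That is consistent with simple averaging over rounded scores but does not confirm it, and the check covers Panel C only; other panels may be aggregated differently, for example weighted by item count.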

BibTeX

@article{wavbench2024,
  title={WavBench: Benchmarking Reasoning, Colloquialism, and Paralinguistics for End-to-End Spoken Dialogue Models},
  author={WavBench Team},
  journal={arXiv preprint},
  year={2024}
}