FFASR: ASR Evaluation Leaderboard

Moving-source evaluations are a beta feature.

Submit a model for evaluation

Paste a Hugging Face model id. The server loads every checkpoint with an automatic backend, so no per‑submission backend choice is needed. Supported stacks include Whisper, IBM Granite speech, Cohere Transcribe, efficient‑speech / custom Whisper variants, Wav2Vec2 / HuBERT CTC heads, SpeechBrain ASR, and most other ASR stacks on the Hub.

Submissions are queued; when Hub Jobs are enabled, up to four models evaluate in parallel. Otherwise jobs run one at a time on this Space. When moderation is enabled, new requests wait for approval before they run.

Model ID

Complex model recipe (optional)

Pre-fills setup script, evaluate(), and deps for install-heavy stacks (e.g. Mega-ASR).

Optional extra Python requirements (one per line, requirements.txt format)

Optional one-time setup script (shell or Python, runs once per Hub Job)

Optional custom evaluator (Python)

Edit the example below: define evaluate(file: Path) -> str to transcribe one WAV. The function is called once per audio sample inside the eval loop.

Load your model once at module level (as in the Cohere example).
Use soundfile (or numpy) to read each WAV (see the example). Hub Jobs disable torchcodec and fall back to librosa if you use transformers.audio_utils.load_audio on file paths.
Model size: expose your loaded model as a module-level model (or set NUM_PARAMS = <int>) so the leaderboard can report its parameter count; otherwise size is left blank.
Put extra Python dependencies in the requirements box above.
Use the setup script for git clone / weight download (runs once before evaluation).
No maintainer recipe is needed for a new model: if your evaluate() imports a cloned package (e.g. from MyASR.model import MyASR), the worker auto-discovers it under /tmp.
For full control, export paths from your setup script via echo "PYTHONPATH=/tmp/MyASR/src" >> "$FFASR_ENV_FILE" (also accepts FFASR_IMPORT_PATHS and any KEY=VALUE). See the Mega-ASR recipe for an example.
Custom evaluators run on a Hub Job only after a moderator approves them.
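The evaluator contract described above can be sketched as a minimal skeleton. Everything model-specific here is a placeholder: PlaceholderASR and its transcribe() method are hypothetical stand-ins for a real checkpoint, the NUM_PARAMS value is made up, and the standard-library wave module replaces soundfile only so the sketch runs without extra dependencies (it assumes 16-bit PCM WAV input).

```python
import struct
import wave
from pathlib import Path

# Hypothetical value: set this to your real parameter count, or expose a
# module-level `model` so the leaderboard can fill in the size.
NUM_PARAMS = 1_000_000


class PlaceholderASR:
    """Hypothetical stand-in for a real ASR model."""

    def transcribe(self, samples, sample_rate):
        # A real model would decode the waveform here.
        return ""


# Loaded once at import time, not inside evaluate().
asr = PlaceholderASR()


def read_wav_mono(file: Path):
    """Read a 16-bit PCM WAV; return (channel-0 float samples, sample rate)."""
    with wave.open(str(file), "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        channels = wav.getnchannels()
        rate = wav.getframerate()
        raw = wav.readframes(wav.getnframes())
    ints = struct.unpack(f"<{len(raw) // 2}h", raw)
    # Keep the first channel and scale to [-1.0, 1.0).
    return [s / 32768.0 for s in ints[::channels]], rate


def evaluate(file: Path) -> str:
    """Transcribe one WAV file; called once per audio sample."""
    samples, rate = read_wav_mono(file)
    return asr.transcribe(samples, rate)
```

Keeping the model at module level means the load cost is paid once per job rather than once per clip, which also keeps it out of the timed inference path.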

This is a gated repo

Enter a model id (author/name) to open its Hub page.

Optional free-form notes for moderators

Contact email

Contact

Questions about submissions, gated repos, or evaluation issues? Email contact@treble.tech.

Scenario analysis

Charts support standard Plotly interaction (legend, zoom, pan, hover).

Leaderboard order is by Average WER (mean across checked scenario columns; lower is better).

Average WER ranks the top N models by mean WER across live scenarios (lower is better). Speed ranks the top N models by RTFx (audio sec / inference sec, higher is better). Bars are colored per company (e.g. all NVIDIA models share one color).

WER heatmap shows WER (%) by model and scenario for the top N models (sorted by Average WER); lower WER corresponds to greener cells.

Pareto Front plots Average WER on X and RTFx on Y (log scale). Models on the frontier are labeled and connected by the dashed blue line; other models render as faint dots.

WER by scenario compares raw WER across conditions for the selected top models.

Top‑N models (WER bars, heatmap & grouped WER)

Scenarios shown in the heatmap

Near Field Speech: Near Field Speech (close-mic)
Lab Measured: Lab Measured
Lab Simulated: Lab Simulated
High SNR: Realistic conditions: High SNR
Moving High SNR: Moving High SNR
Mid SNR: Realistic conditions: Mid SNR
Moving Mid SNR: Moving Mid SNR
Low SNR: Realistic conditions: Low SNR
Moving Low SNR: Moving Low SNR

Pareto Front: Average WER vs RTFx

Models on the Pareto frontier achieve the best trade-off between WER and speed (RTFx). Names are shown for frontier models; hover over other points to see their names.
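As a rough illustration of how such a frontier can be computed, the sketch below keeps every model that no other model beats on both axes (lower Average WER and higher RTFx). The function name and the scores are made up for the example; this is not the Space's own plotting code.

```python
def pareto_front(points):
    """Return names of models not dominated in (lower WER, higher RTFx).

    `points` maps a model name to an (avg_wer, rtfx) tuple.
    """
    # Sort by WER ascending; break ties by putting the faster model first.
    ordered = sorted(points.items(), key=lambda kv: (kv[1][0], -kv[1][1]))
    front, best_rtfx = [], float("-inf")
    for name, (wer, rtfx) in ordered:
        # Every earlier model has WER <= this one, so this model survives
        # only if it is strictly faster than all of them.
        if rtfx > best_rtfx:
            front.append(name)
            best_rtfx = rtfx
    return front


# Made-up scores for illustration only: (Average WER %, RTFx).
scores = {
    "model-a": (5.0, 10.0),
    "model-b": (6.0, 50.0),
    "model-c": (7.0, 20.0),
    "model-d": (4.0, 2.0),
}
print(pareto_front(scores))  # ['model-d', 'model-a', 'model-b']
```

Here model-c is dropped because model-b is both more accurate and faster.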

About FFASR

Far‑field speech recognition degrades under reverberation, noise, and moving‑source conditions. While speech enhancement and far‑field fine‑tuning can improve robustness, evaluating systems across many acoustic scenarios is difficult to do with measured data alone. Room acoustics simulation has proven to be a practical and scalable approach to train far‑field speech recognition systems [1, 2].

To quantify the sim‑to‑real gap, we measure three room conditions, yielding 48 source–receiver combinations, and simulate the same configurations. The small performance gap between the equivalent measured and simulated sets supports the validity of the other simulated conditions for evaluating far‑field performance.

Scenario columns (WER · lower is better)

The leaderboard reports WER on nine complementary conditions:

Near Field Speech: dry, anechoic speech with very little reverberation (clean baseline).
Lab Measured: measured RIRs from the Treble office room conditions (sim‑to‑real reference; hidden by default).
Lab Simulated: Treble‑simulated RIRs matching the measured office room conditions (hidden by default).
High / Mid / Low SNR: simulated far‑field mixtures grouped by SNR (high > 14 dB, mid 8–12 dB, low < 6 dB).
Moving Low / Mid / High SNR (beta): moving talker with varying geometry at different SNRs (hidden by default).

Lower WER is always better. Every submission uses the same private held‑out set, reference transcripts, and Whisper‑style text normalization (lowercase, punctuation/whitespace normalized).
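To give a feel for what this kind of normalization does, here is a deliberately simplified sketch that lowercases, strips punctuation (keeping apostrophes), and collapses whitespace. It is an assumption-laden approximation, not the normalizer the leaderboard actually runs; the real Whisper normalizer applies further rules that this sketch does not.

```python
import re


def normalize(text: str) -> str:
    """Lowercase, drop punctuation (keeping apostrophes), collapse whitespace."""
    text = text.lower()
    # Replace anything that is not a word character, whitespace, or an
    # apostrophe with a space.
    text = re.sub(r"[^\w\s']", " ", text)
    # split() with no arguments collapses runs of whitespace.
    return " ".join(text.split())


print(normalize("Hello,  World!"))  # hello world
```

Normalizing both reference and hypothesis the same way keeps casing and punctuation differences from being counted as word errors.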

Leaderboard order is by Average WER over the scenario columns currently checked in the column selector (lower is better). By default the aggregate averages the WER of Near Field Speech (dry), High SNR, Mid SNR, and Low SNR; checking or unchecking a column recomputes the average.
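The aggregate can be sketched as a plain mean over the checked columns. The function name, the made-up scores, and the choice to skip columns without a score are all assumptions for illustration; the Space may handle missing values differently.

```python
DEFAULT_COLUMNS = ("Near Field Speech", "High SNR", "Mid SNR", "Low SNR")


def average_wer(wer_by_scenario, checked=DEFAULT_COLUMNS):
    """Mean WER (%) over the checked columns that have a score.

    Returns None when none of the checked columns has a score.
    """
    values = [wer_by_scenario[c] for c in checked if c in wer_by_scenario]
    if not values:
        return None
    return sum(values) / len(values)


# Made-up scores for illustration only.
scores = {
    "Near Field Speech": 2.0,
    "High SNR": 4.0,
    "Mid SNR": 6.0,
    "Low SNR": 12.0,
    "Lab Measured": 8.0,
}
print(average_wer(scores))                                      # 6.0
print(average_wer(scores, DEFAULT_COLUMNS + ("Lab Measured",)))  # 6.4
```

Checking an extra column changes the average, and therefore can change the ranking, without any model being re-evaluated.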

Leaderboard columns

Column	Meaning
Avg WER (%)	Mean WER (percent) over checked scenario columns (lower is better). Primary ranking key.
Near Field Speech	WER (%) on dry / anechoic speech.
Lab Measured	WER (%) on measured Treble office room conditions (hidden by default).
Lab Simulated	WER (%) on simulated Treble office room conditions (hidden by default).
High / Mid / Low SNR	WER (%) on simulated far‑field mixtures at the named SNR.
Moving Low / Mid / High SNR *	WER (%) on moving-source splits (beta).
RTFx	Audio seconds ÷ inference seconds at batch size 1. Higher is faster; >1 means faster than real time.
Params (B)	Trainable parameters (billions).

RTFx depends on the hardware this Space uses. Treat it as a relative comparison between submissions, not an absolute throughput number.
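For reference, RTFx as defined in the table can be sketched as below. The helper names are hypothetical, and summing audio and compute time over all clips before dividing is an assumption; the leaderboard could aggregate per clip instead.

```python
import time


def rtfx(audio_seconds: float, inference_seconds: float) -> float:
    """Inverse real-time factor: audio seconds divided by inference seconds."""
    if inference_seconds <= 0:
        raise ValueError("inference_seconds must be positive")
    return audio_seconds / inference_seconds


def timed_rtfx(transcribe, clips):
    """Time `transcribe` one clip at a time (batch size 1).

    `clips` is an iterable of (audio, duration_seconds) pairs.
    """
    total_audio = 0.0
    total_compute = 0.0
    for audio, duration in clips:
        start = time.perf_counter()
        transcribe(audio)
        total_compute += time.perf_counter() - start
        total_audio += duration
    return rtfx(total_audio, total_compute)


print(rtfx(30.0, 2.0))  # 15.0
```

A value of 15.0 means 30 s of audio was transcribed in 2 s, i.e. fifteen times faster than real time on that hardware.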

Dataset

All simulation datasets are created by convolving dry speech recordings with room impulse responses (RIRs) simulated with Treble Technologies' SDK. For the dry speech, 2000 original samples (~15 s each) are recorded in an anechoic chamber. Recording original speech ensures no test‑set contamination for the evaluated models and that the signals carry very little reverberation.
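The convolution step can be illustrated with a direct (time-domain) implementation on toy signals. This is only a sketch of the operation, not the dataset pipeline: real RIRs run to thousands of samples, where FFT-based convolution (for example scipy.signal.fftconvolve) is the usual choice.

```python
def convolve(dry, rir):
    """Full linear convolution of a dry signal with an impulse response."""
    if not dry or not rir:
        return []
    out = [0.0] * (len(dry) + len(rir) - 1)
    for i, x in enumerate(dry):
        for j, h in enumerate(rir):
            # Each dry sample excites a scaled, delayed copy of the RIR.
            out[i + j] += x * h
    return out


# Toy example: a direct path (1.0) plus one reflection (0.5) a sample later.
print(convolve([1.0, 2.0, 3.0], [1.0, 0.5]))  # [1.0, 2.5, 4.0, 1.5]
```

The output is longer than the dry input by len(rir) - 1 samples, which corresponds to the reverberant tail.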

For the RIRs, a hybrid wave‑based and geometrical‑acoustics approach simulates acoustic scenes in 14 fully‑furnished rooms spanning 20 to 470 m³, including bathrooms, living rooms, meeting rooms, offices, classrooms, and restaurant spaces. The simulation captures phenomena such as diffraction, scattering, interference, and modal behavior that simpler simulations miss. They are similar to the RIRs of the Treble10 dataset [3].

Each acoustic scene includes one target speaker and up to three noise sources from the AID dataset [4], always including a transient noise (e.g. coughing) and a continuous noise (e.g. HVAC). Pink microphone noise is also added to each scene for further realism. Noise conditions are grouped into low, mid, and high SNRs, computed after rendering from the convolved speech and noise signals.
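A post-render SNR of this kind is conventionally the ratio of mean signal powers in decibels. The sketch below assumes that definition and reuses the SNR bins listed in the scenario column descriptions (high > 14 dB, mid 8–12 dB, low < 6 dB); the function names are hypothetical, and returning None for values that fall between bins is an assumption.

```python
import math


def mean_power(signal):
    """Mean squared amplitude of a sample sequence."""
    return sum(s * s for s in signal) / len(signal)


def snr_db(speech, noise):
    """SNR in dB from the convolved speech and the convolved noise."""
    return 10.0 * math.log10(mean_power(speech) / mean_power(noise))


def snr_group(snr):
    """Map an SNR in dB to the leaderboard's group, or None between bins."""
    if snr > 14:
        return "high"
    if 8 <= snr <= 12:
        return "mid"
    if snr < 6:
        return "low"
    return None


# Toy signals: the speech amplitude is ten times the noise amplitude,
# so the power ratio is about 100, i.e. roughly 20 dB.
speech = [1.0, -1.0, 1.0, -1.0]
noise = [0.1, -0.1, 0.1, -0.1]
print(snr_group(snr_db(speech, noise)))  # high
```

Computing the ratio after convolution matters because the room changes the level of each source at the microphone.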

The measured dataset is collected from a room in the Treble office under three considerably different acoustic conditions. (1) Piles of 40 × 40 cm concrete bricks placed around the room introduce geometric complexity, scattering, and diffraction. (2) The room is furnished with typical objects (a couch, an armchair, a shelf, and a table) mimicking a living room. (3) The room is treated with a varying number of acoustic absorbers, yielding three sub‑conditions with reverberation times of 0.3 s, 0.6 s, and 0.9 s. After the measurements, all room conditions are modeled within the Treble SDK to simulate matching synthetic RIRs.

In summary, the following datasets are prepared:

Dataset	Duration
Dry speech	8 hours
High / Mid / Low SNR	8 hours each
Moving sources	8 hours each (same config as high/mid/low SNR)
Treble office rooms (simulated and measured)	2 hours each

The datasets are privately hosted on the Hugging Face Hub to prevent test‑set contamination while participants prepare their solutions. Participants submit via this Hugging Face Space, specifying the model checkpoint, software requirements, and custom evaluation code (which can include speech enhancement). From the maintainer page, the challenge coordinators launch evaluation on a standardized hardware setup (important for latency comparisons). All models are evaluated on an NVIDIA L4 GPU using Hugging Face Jobs [8].

Participants may use the Treble10 datasets [3] or room‑acoustics simulators of their choice [5, 6, 7] for preparing and/or evaluating their solutions. External datasets and transfer learning are allowed. We encourage participants to clearly document their datasets and/or publish the source code that generates them to improve reproducibility.

How to submit

1. Open the Submit tab and paste a Hugging Face model id (e.g. openai/whisper-tiny).
2. Optionally add notes (gated repos, custom inference, eval caveats) and custom evaluation code.
3. Your request is queued and evaluated in the background (up to four parallel Hub Jobs when enabled). Refresh the Leaderboard tab after the job finishes.

Submissions are loaded with an automatic backend (SpeechBrain → Granite speech → transformers ASR pipeline → universal seq2seq loader → CTC). Gated Hub repos require a token with accepted license on the Space.
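An ordered fallback like this is commonly written as a try-each-in-turn loop. The sketch below shows that generic pattern only; the loader callables are hypothetical and nothing here is taken from the Space's actual loading code.

```python
def load_with_fallback(model_id, loaders):
    """Try each (name, loader) pair in order; return (name, loaded_object).

    Raises RuntimeError listing every failure if no loader succeeds.
    """
    errors = []
    for name, loader in loaders:
        try:
            return name, loader(model_id)
        except Exception as exc:  # a failed backend just means "try the next"
            errors.append(f"{name}: {exc}")
    raise RuntimeError(f"no backend could load {model_id}: " + "; ".join(errors))


# Hypothetical loaders standing in for real backends.
def failing_loader(model_id):
    raise ValueError("not a checkpoint this backend understands")


def working_loader(model_id):
    return {"id": model_id}


backend, loaded = load_with_fallback(
    "openai/whisper-tiny",
    [("first-backend", failing_loader), ("second-backend", working_loader)],
)
print(backend)  # second-backend
```

Collecting the per-backend errors makes a total failure easier to diagnose than reporting only the last exception.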

Evaluation method & metrics

Performance on each acoustic condition is measured with word error rate (WER). An aggregate score is computed by averaging the WER of dry speech, high SNR, mid SNR, and low SNR. We also compute the inverse real‑time factor (RTFx) of each system (seconds of audio transcribed divided by compute time in seconds) at batch size 1 to quantify latency. The balance between performance and latency is of practical importance, as many far‑field systems may run as part of a real‑time interaction with users. A Pareto plot helps visualize this trade-off (a single metric balancing Avg WER and RTFx, e.g. area under the curve, could also be considered).

Word Error Rate (WER)

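WER is the number of word substitutions (S), insertions (I), and deletions (D) in the hypothesis relative to the reference, divided by the number of words in the reference (N). Because insertions count as errors, WER can exceed 100%.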

WER = (S + I + D) / N

Example: one substitution and one deletion in six reference words → WER ≈ 0.33 (33%).
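The example above can be reproduced with a word-level edit distance. This is a minimal sketch with hypothetical function names and made-up sentences; it assumes both transcripts are already normalized and is not the leaderboard's scoring code.

```python
def word_errors(reference: str, hypothesis: str):
    """Return (S + I + D, N) using word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] is the distance between the ref words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(
                prev[j] + 1,             # deletion: ref word missing in hyp
                curr[j - 1] + 1,         # insertion: extra word in hyp
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = curr
    return prev[-1], len(ref)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate as a fraction of reference words."""
    errors, n = word_errors(reference, hypothesis)
    if n == 0:
        raise ValueError("reference must contain at least one word")
    return errors / n


# One substitution (sat -> sit) and one deletion (the) in six reference words.
print(round(wer("the cat sat on the mat", "the cat sit on mat"), 2))  # 0.33
```

The two-row dynamic program keeps memory proportional to the hypothesis length.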

WER	Typical quality
< 5%	Excellent
5–10%	Good / production‑ready
10–30%	Usable with post‑processing
> 30%	Poor to unreliable

See the Examples tab for Near Field Speech, High SNR, Mid SNR, and Low SNR audio clips plus a Treble scene screenshot.

Privacy

Only the model id, optional submitter notes, and produced scores are stored.
Custom evaluate() scripts run on Hub Jobs, not in this Space's UI process. When moderation is enabled, they run only after moderator approval.

References

[1] Kim, Chanwoo, et al. "Generation of Large‑Scale Simulated Utterances in Virtual Rooms to Train Deep‑Neural Networks for Far‑Field Speech Recognition in Google Home." Interspeech, 2017.
[2] Doire, Clément and Eric Bezzam. "Reproducing On‑Device Data Accurately for Private‑by‑Design Voice Control." 2023. https://tech-blog.sonos.com/posts/reproducing-on-device-data-accurately-for-private-by-design-voice-control/
[3] Mullins, S. S., Götz, G., Bezzam, E., Zheng, S., and Nielsen, D. G. "Treble10: A high‑quality dataset for far‑field speech recognition, dereverberation, and enhancement." arXiv preprint arXiv:2510.23141, 2025.
[4] Götz, Philipp, et al. "AID: Open‑source anechoic interferer dataset." IEEE IWAENC, 2022.
[5] Scheibler, Robin, Eric Bezzam, and Ivan Dokmanić. "Pyroomacoustics: A python package for audio room simulation and array processing algorithms." IEEE ICASSP, 2018.
[6] Diaz‑Guerra, D., Miguel, A., and Beltran, J. R. "gpuRIR: A python library for room impulse response simulation with GPU acceleration." Multimed. Tools Appl., vol. 80, no. 4, pp. 5653–5671, 2021.
[7] Treble SDK. https://www.treble.tech/software-development-kit
[8] Hugging Face Jobs. https://huggingface.co/docs/huggingface_hub/en/guides/jobs

Programmatic submission

Use the Gradio client against this Space (see the Submit tab for the current API signature):

from gradio_client import Client

client = Client("treble-technologies/FFASR_Leaderboard-storage")
# Illustrative: pass all fields the Submit tab exposes.
result = client.predict(
    "openai/whisper-tiny",  # model id
    "",                      # optional notes
    "you@example.com",       # contact email (required)
    "",                      # extra requirements
    "",                      # setup script
    "",                      # custom evaluator
    "",                      # recipe id
    False,                   # gated repo
    api_name="/submit_model",
)
print(result)
