FFASR
FFASR benchmarks speech‑recognition models on audio representative of far‑field use (noisy rooms, reverberation, adverse SNRs), not only studio‑dry speech. Every model is scored on the same held‑out set with the same text normalization, so numbers are directly comparable. The simulated room impulse responses are similar to those in the Treble10 dataset.
Each submission reports WER for nine scenarios (Near Field Speech, Lab Measured, Lab Simulated, High SNR, Mid SNR, Low SNR, and three Moving SNR splits) plus RTFx and parameter count. The leaderboard ranks models by Average WER over the scenario columns you have checked (lower is better). The Analysis tab visualises the Pareto Front of Average WER vs RTFx for WER/speed trade-offs.
Paste a Hugging Face model id in the Submit tab; scoring runs server‑side and evaluation audio is not exposed to submitters. The Analysis tab provides WER rankings, heatmaps, speed views, and scenario‑wise comparisons.
- Moving-source evaluations are a beta feature.
Submit a model for evaluation
Paste a Hugging Face model id. The server loads every checkpoint with an automatic backend, so Whisper, IBM Granite speech, Cohere Transcribe, efficient‑speech / custom Whisper variants, Wav2Vec2 / HuBERT CTC heads, SpeechBrain ASR, and most other ASR stacks on the Hub are supported without a per‑submission backend choice.
Submissions are queued; when Hub Jobs are enabled, up to four models evaluate in parallel. Otherwise jobs run one at a time on this Space. When moderation is enabled, new requests wait for approval before they run.
Pre-fills setup script, evaluate(), and deps for install-heavy stacks (e.g. Mega-ASR).
Edit the example below:
- Define evaluate(file: Path) -> str to transcribe one WAV. The function is called once per audio sample inside the eval loop.
- Load your model once at module level (as in the Cohere example).
- Use soundfile (or numpy) to read each WAV (see the example). Hub Jobs disable torchcodec and fall back to librosa if you use transformers.audio_utils.load_audio on file paths.
- Model size: expose your loaded model as a module-level model (or set NUM_PARAMS = <int>) so the leaderboard can report its parameter count; otherwise size is left blank.
- Put extra Python dependencies in the requirements box above.
- Use the setup script for git clone / weight download (runs once before evaluation).
- No maintainer recipe is needed for a new model: if your evaluate() imports a cloned package (e.g. from MyASR.model import MyASR), the worker auto-discovers it under /tmp. For full control, export paths from your setup script via echo "PYTHONPATH=/tmp/MyASR/src" >> "$FFASR_ENV_FILE" (also accepts FFASR_IMPORT_PATHS and any KEY=VALUE). See the Mega-ASR recipe for an example.
- Custom evaluators run on a Hub Job only after a moderator approves them.
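As a rough illustration of the evaluate() contract described above, the sketch below shows the overall shape of a custom evaluator module. PlaceholderModel and read_wav are hypothetical names invented for this example, and the stdlib wave module stands in for soundfile only to keep the sketch dependency‑free; a real evaluator would load an actual ASR model instead.

```python
"""Sketch of a custom evaluator module (placeholder model, stdlib only)."""
import struct
import wave
from pathlib import Path


class PlaceholderModel:
    """Hypothetical stand-in for a real ASR model; replace with your own."""

    def transcribe(self, samples: list[float], sample_rate: int) -> str:
        # A real model would decode speech here; this stub returns a fixed string.
        return "hello world"


# Load the model once at module level so it is not rebuilt for every file.
model = PlaceholderModel()

# The placeholder has no parameters to count, so the size is set explicitly.
NUM_PARAMS = 0


def read_wav(file: Path) -> tuple[list[float], int]:
    """Read a 16-bit PCM WAV and return first-channel samples in [-1, 1)."""
    with wave.open(str(file), "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        channels = wav.getnchannels()
        sample_rate = wav.getframerate()
        raw = wav.readframes(wav.getnframes())
    ints = struct.unpack(f"<{len(raw) // 2}h", raw)
    # Samples are interleaved; keep the first channel only.
    first_channel = ints[::channels]
    return [s / 32768.0 for s in first_channel], sample_rate


def evaluate(file: Path) -> str:
    """Transcribe one WAV file; intended to be called once per sample."""
    samples, sample_rate = read_wav(file)
    return model.transcribe(samples, sample_rate)
```

Keeping the model at module level means the load cost is paid once rather than per audio file, which also keeps it out of the per‑sample timing.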
Enter a model id (author/name) to open its Hub page.
Contact
Questions about submissions, gated repos, or evaluation issues? Email contact@treble.tech.
Next models to evaluate
Moderator access
When FFASR_MODERATION=1 is set, new submissions wait here until you approve them. Use per-row Check, Approve, Retry, and Remove buttons — popups open for submission review, dataset selection, and approval.
When FFASR_REMOTE_JOBS=1, use Open Hub Job logs while a job runs. The bucket results/remote_artifacts/ folder only gets a JSON file after a successful run.
Retry re-queues failed, done, or queued jobs (not while running). Uncheck datasets in the retry popup to run only selected packed splits; other leaderboard WER columns are left unchanged on success. A full re-run (all datasets selected) replaces the row when the model is already listed.
Current job progress
Moderator tools are locked. Enter the secret above and click Unlock.
Retry job
Approve job
Pending moderation
Recent job activity
Retry ALL re-queues every failed / done / queued job against the full dataset set (all packed conditions). Running, dispatching, and pending-moderation jobs are skipped. Successful re-runs replace the existing leaderboard row for each model.
Import result from bucket artifact
If a Hub Job finished but the leaderboard CSV was not updated, paste the artifact file name from results/remote_artifacts/ (e.g. a1b2c3d4.json) to merge WER/RTFx into the CSV.
Scenario analysis
Charts support standard Plotly interaction (legend, zoom, pan, hover).
Leaderboard order is by Average WER (mean across checked scenario columns; lower is better).
Average WER ranks the top N models by mean WER across live scenarios (lower is better). Speed ranks the top N models by RTFx (audio sec / inference sec, higher is better). Bars are colored per company (e.g. all NVIDIA models share one color).
WER heatmap shows WER (%) by model and scenario for the top N models (sorted by Average WER); lower WER corresponds to greener cells.
Pareto Front plots Average WER on X and RTFx on Y (log scale). Models on the frontier are labeled and connected by the dashed blue line; other models render as faint dots.
WER by scenario compares raw WER across conditions for the selected top models.
Pareto Front: Average WER vs RTFx
Models on the Pareto frontier achieve the best trade-off between WER and speed (RTFx). Names are shown for frontier models; hover over other points to see their names.
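One common way to identify frontier models is a dominance check: a model is on the frontier if no other model is at least as good on both axes and strictly better on one. The sketch below assumes this definition; the function name and the scores are made up for illustration and are not taken from the leaderboard code.

```python
def pareto_front(models: dict[str, tuple[float, float]]) -> list[str]:
    """Return names of models that no other model dominates.

    Each value is (avg_wer, rtfx): lower WER is better, higher RTFx is better.
    """
    front = []
    for name, (wer, rtfx) in models.items():
        dominated = any(
            o_wer <= wer and o_rtfx >= rtfx and (o_wer < wer or o_rtfx > rtfx)
            for other, (o_wer, o_rtfx) in models.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    # Sort by WER so the frontier can be drawn as one connected line.
    return sorted(front, key=lambda n: models[n][0])


scores = {  # made-up values for illustration
    "model-a": (5.0, 20.0),
    "model-b": (8.0, 120.0),
    "model-c": (9.0, 60.0),   # dominated by model-b (lower WER and faster)
    "model-d": (12.0, 300.0),
}
print(pareto_front(scores))  # ['model-a', 'model-b', 'model-d']
```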

Near Field Speech
High SNR
Mid SNR
Low SNR
About FFASR
Far‑field speech recognition degrades under reverberation, noise, and moving‑source conditions. While speech enhancement and far‑field fine‑tuning can improve robustness, evaluating systems across many acoustic scenarios is difficult to do with measured data alone. Room acoustics simulation has proven to be a practical and scalable approach to train far‑field speech recognition systems [1, 2].
To quantify the sim‑to‑real gap, we measure three room conditions, yielding 48 source–receiver combinations, and simulate the same configurations. The small performance gap between the equivalent measured and simulated sets supports the validity of the other simulated conditions for evaluating far‑field performance.
Scenario columns (WER · lower is better)
The leaderboard reports WER on nine complementary conditions:
- Near Field Speech: dry, anechoic speech with very little reverberation (clean baseline).
- Lab Measured: measured RIRs from the Treble office room conditions (sim‑to‑real reference; hidden by default).
- Lab Simulated: Treble‑simulated RIRs matching the measured office room conditions (hidden by default).
- High / Mid / Low SNR: simulated far‑field mixtures grouped by SNR (high > 14 dB, mid 8–12 dB, low < 6 dB).
- Moving Low / Mid / High SNR (beta): moving talker with varying geometry at different SNRs (hidden by default).
Lower WER is always better. Every submission uses the same private held‑out set, reference transcripts, and Whisper‑style text normalization (lowercase, punctuation/whitespace normalized).
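The exact normalizer is not reproduced here. As a simplified approximation of the steps named above (lowercasing, punctuation and whitespace normalization), a sketch might look like the following. The function name and the choice to keep apostrophes are assumptions of this example; the real Whisper normalizer applies more rules than this.

```python
import re


def normalize(text: str) -> str:
    """Simplified approximation: lowercase, drop punctuation, collapse spaces."""
    text = text.lower()
    # Replace anything that is not a word character, whitespace, or an
    # apostrophe with a space (keeping apostrophes is an assumption).
    text = re.sub(r"[^\w\s']", " ", text)
    # Collapse runs of whitespace and trim both ends.
    return " ".join(text.split())


print(normalize("Hello,  World!"))  # hello world
```

Applying the same normalization to both the reference and the hypothesis before scoring keeps casing or punctuation differences from being counted as word errors.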
Leaderboard order is by Average WER over the scenario columns currently checked in the column selector (lower is better). By default the aggregate averages the WER of Near Field Speech (dry), High SNR, Mid SNR, and Low SNR; unchecking or checking a column updates the recomputed average.
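The recomputed average and the resulting order can be sketched as follows. The column names come from the leaderboard; the function names and the WER values are invented for illustration, and how the Space handles missing values is not covered.

```python
DEFAULT_COLUMNS = ("Near Field Speech", "High SNR", "Mid SNR", "Low SNR")


def average_wer(row: dict[str, float], checked=DEFAULT_COLUMNS) -> float:
    """Mean WER (%) over the checked scenario columns."""
    values = [row[column] for column in checked]
    return sum(values) / len(values)


def rank(rows: dict[str, dict[str, float]], checked=DEFAULT_COLUMNS):
    """Return (model, average WER) pairs sorted best (lowest) first."""
    scored = [(name, average_wer(row, checked)) for name, row in rows.items()]
    return sorted(scored, key=lambda item: item[1])


rows = {  # made-up WER values in percent
    "model-a": {"Near Field Speech": 2.0, "High SNR": 4.0,
                "Mid SNR": 6.0, "Low SNR": 12.0},
    "model-b": {"Near Field Speech": 3.0, "High SNR": 4.0,
                "Mid SNR": 5.0, "Low SNR": 8.0},
}
print(rank(rows))                          # model-b first (5.0 vs 6.0)
print(rank(rows, ("Near Field Speech",)))  # model-a first (2.0 vs 3.0)
```

The second call shows why the order can change when columns are checked or unchecked: a model that is best on clean speech is not necessarily best once the noisy splits are averaged in.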
Leaderboard columns
| Column | Meaning |
|---|---|
| Avg WER (%) | Mean WER (percent) over checked scenario columns (lower is better). Primary ranking key. |
| Near Field Speech | WER (%) on dry / anechoic speech. |
| Lab Measured | WER (%) on measured Treble office room conditions (hidden by default). |
| Lab Simulated | WER (%) on simulated Treble office room conditions (hidden by default). |
| High / Mid / Low SNR | WER (%) on simulated far‑field mixtures at the named SNR. |
| Moving Low / Mid / High SNR | WER (%) on moving-source splits (beta). |
| RTFx | Audio seconds ÷ inference seconds at batch size 1. Higher is faster; >1 means faster than real time. |
| Params (B) | Trainable parameters (billions). |
RTFx depends on the hardware this Space uses. Treat it as a relative comparison between submissions, not an absolute throughput number.
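A minimal sketch of the RTFx definition, assuming wall‑clock timing around a single transcription call. The helper names are hypothetical, and how the Space aggregates timing across files is not specified here.

```python
import time
from typing import Callable


def rtfx(audio_seconds: float, inference_seconds: float) -> float:
    """Inverse real-time factor: seconds of audio per second of compute."""
    if inference_seconds <= 0:
        raise ValueError("inference time must be positive")
    return audio_seconds / inference_seconds


def timed_rtfx(transcribe: Callable[[], str], audio_seconds: float):
    """Run one transcription call and return (text, RTFx) for it."""
    start = time.perf_counter()
    text = transcribe()
    elapsed = time.perf_counter() - start
    return text, rtfx(audio_seconds, elapsed)


# 30 s of audio transcribed in 1.5 s is 20x faster than real time.
print(rtfx(30.0, 1.5))  # 20.0
```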
Dataset
All simulation datasets are created by convolving dry speech recordings with room impulse responses (RIRs) simulated with Treble Technologies' SDK. For the dry speech, 2000 original samples (~15 s each) are recorded in an anechoic chamber. Recording original speech avoids test‑set contamination for the evaluated models and ensures that the signals carry very little reverberation.
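The convolution step can be illustrated with a direct, pure‑Python implementation. This is only a sketch of the operation itself, not the pipeline used to build the datasets; real pipelines typically use an FFT‑based routine such as scipy.signal.fftconvolve for speed.

```python
def convolve(dry: list[float], rir: list[float]) -> list[float]:
    """Direct linear convolution of a dry signal with an impulse response."""
    if not dry or not rir:
        return []
    out = [0.0] * (len(dry) + len(rir) - 1)
    for i, x in enumerate(dry):
        for j, h in enumerate(rir):
            # Each input sample excites a scaled, delayed copy of the RIR.
            out[i + j] += x * h
    return out


# Toy RIR: direct sound followed by one reflection at half amplitude.
print(convolve([1.0, 2.0], [1.0, 0.5]))  # [1.0, 2.5, 1.0]
```

The output is longer than the input by len(rir) - 1 samples, which corresponds to the reverberant tail added by the room.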
For the RIRs, a hybrid wave‑based and geometrical‑acoustics approach simulates acoustic scenes in 14 fully‑furnished rooms spanning 20 to 470 m³, including bathrooms, living rooms, meeting rooms, offices, classrooms, and restaurant spaces. The simulation captures phenomena such as diffraction, scattering, interference, and modal behavior that simpler simulations miss. They are similar to the RIRs of the Treble10 dataset [3].
Each acoustic scene includes one target speaker and up to three noise sources from the AID dataset [4], always including a transient noise (e.g. coughing) and a continuous noise (e.g. HVAC). Pink microphone noise is also added to each scene for further realism. Noise conditions are grouped into low, mid, and high SNRs, computed after rendering from the convolved speech and noise signals.
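A sketch of how an SNR could be computed from rendered speech and noise signals and mapped to the leaderboard's groups (high > 14 dB, mid 8–12 dB, low < 6 dB). The function names are hypothetical, and the treatment of boundary values and of values falling between the published ranges is an assumption of this example.

```python
import math
from typing import Optional


def snr_db(speech: list[float], noise: list[float]) -> float:
    """SNR in dB from the mean power of the speech and noise signals."""
    p_speech = sum(s * s for s in speech) / len(speech)
    p_noise = sum(n * n for n in noise) / len(noise)
    return 10.0 * math.log10(p_speech / p_noise)


def snr_bucket(value_db: float) -> Optional[str]:
    """Map an SNR to 'high', 'mid', or 'low'; None if it fits no range."""
    if value_db > 14.0:
        return "high"
    if 8.0 <= value_db <= 12.0:  # inclusive bounds are an assumption
        return "mid"
    if value_db < 6.0:
        return "low"
    return None  # falls in a gap between the published ranges


speech = [1.0, -1.0] * 4   # toy signal with power 1.0
noise = [0.1, -0.1] * 4    # toy signal with power 0.01
print(round(snr_db(speech, noise), 6), snr_bucket(snr_db(speech, noise)))
```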
The measured dataset is collected from a room in the Treble office under three considerably different acoustic conditions. (1) Piles of 40 × 40 cm concrete bricks placed around the room introduce geometric complexity, scattering, and diffraction. (2) The room is furnished with typical objects (a couch, an armchair, a shelf, and a table) mimicking a living room. (3) The room is treated with a varying number of acoustic absorbers, yielding three sub‑conditions with reverberation times of 0.3 s, 0.6 s, and 0.9 s. After the measurements, all room conditions are modeled within the Treble SDK to simulate matching synthetic RIRs.
In summary, the following datasets are prepared:
| Dataset | Duration |
|---|---|
| Dry speech | 8 hours |
| High / Mid / Low SNR | 8 hours each |
| Moving sources | 8 hours each (same config as high/mid/low SNR) |
| Treble office rooms (simulated and measured) | 2 hours each |
The datasets are privately hosted on the Hugging Face Hub to prevent test‑set contamination while participants prepare their solutions. Participants submit via this Hugging Face Space, specifying the model checkpoint, software requirements, and custom evaluation code (which can include speech enhancement). From the maintainer page, the challenge coordinators launch evaluation on a standardized hardware setup (important for latency comparisons). All models are evaluated on an NVIDIA L4 GPU using Hugging Face Jobs [8].
Participants may use the Treble10 datasets [3] or room‑acoustics simulators of their choice [5, 6, 7] for preparing and/or evaluating their solutions. External datasets and transfer learning are allowed. We encourage participants to clearly document their datasets and/or publish the source code that generates them to improve reproducibility.
How to submit
- Open the Submit tab and paste a Hugging Face model id (e.g. openai/whisper-tiny).
- Optionally add notes (gated repos, custom inference, eval caveats) and custom evaluation code.
- Your request is queued and evaluated in the background (up to four parallel Hub Jobs when enabled). Refresh the Leaderboard tab after the job finishes.
Submissions are loaded with an automatic backend (SpeechBrain → Granite speech → transformers ASR pipeline → universal seq2seq loader → CTC). Gated Hub repos require that the token configured on the Space has accepted the model's license.
Evaluation method & metrics
Performance on each acoustic condition is measured with word error rate (WER). An aggregate score is computed by averaging the WER of dry speech, high SNR, mid SNR, and low SNR. We also compute the inverse real‑time factor (RTFx) of each system (seconds of audio processed divided by seconds of compute) at batch size 1 to quantify latency. The balance between performance and latency is of practical importance, as many far‑field systems may run as part of a real‑time interaction with users. A Pareto plot helps visualize this tradeoff (a single metric balancing Avg WER and RTFx, e.g. area under the curve, could also be considered).
Word Error Rate (WER)
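WER counts the word‑level substitutions (S), insertions (I), and deletions (D) needed to turn the hypothesis into the reference, divided by the number of reference words (N):
WER = (S + I + D) / N
Because insertions are counted, WER can exceed 100%.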
Example: one substitution and one deletion in six reference words → WER ≈ 0.33 (33%).
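The example above can be reproduced with a word‑level edit distance. This is a minimal sketch of the standard dynamic‑programming approach, not necessarily the implementation the leaderboard uses, and it assumes both strings have already been normalized.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: minimum (S + I + D) divided by reference length N."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# One substitution (sat -> sit) and one deletion (the) in six reference words.
print(round(wer("the cat sat on the mat", "the cat sit on mat"), 2))  # 0.33
```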
| WER | Typical quality |
|---|---|
| < 5% | Excellent |
| 5–10% | Good / production‑ready |
| 10–30% | Usable with post‑processing |
| > 30% | Poor to unreliable |
See the Examples tab for Near Field Speech, High SNR, Mid SNR, and Low SNR audio clips plus a Treble scene screenshot.
Privacy
- Only the model id, optional submitter notes, and produced scores are stored.
- Custom evaluate() scripts run on Hub Jobs after moderator approval when moderation is enabled, not directly on this Space UI process.
References
1. Kim, Chanwoo, et al. "Generation of Large‑Scale Simulated Utterances in Virtual Rooms to Train Deep‑Neural Networks for Far‑Field Speech Recognition in Google Home." Interspeech, 2017.
2. Doire, Clément, and Eric Bezzam. "Reproducing On‑Device Data Accurately for Private‑by‑Design Voice Control." 2023. https://tech-blog.sonos.com/posts/reproducing-on-device-data-accurately-for-private-by-design-voice-control/
3. Mullins, S. S., Götz, G., Bezzam, E., Zheng, S., and Nielsen, D. G. "Treble10: A high‑quality dataset for far‑field speech recognition, dereverberation, and enhancement." arXiv preprint arXiv:2510.23141, 2025.
4. Götz, Philipp, et al. "AID: Open‑source anechoic interferer dataset." IEEE IWAENC, 2022.
5. Scheibler, Robin, Eric Bezzam, and Ivan Dokmanić. "Pyroomacoustics: A python package for audio room simulation and array processing algorithms." IEEE ICASSP, 2018.
6. Diaz‑Guerra, D., Miguel, A., and Beltran, J. R. "gpuRIR: A python library for room impulse response simulation with GPU acceleration." Multimed. Tools Appl., vol. 80, no. 4, pp. 5653–5671, 2021.
7. Treble SDK. https://www.treble.tech/software-development-kit
8. Hugging Face Jobs. https://huggingface.co/docs/huggingface_hub/en/guides/jobs
Programmatic submission
Use the Gradio client against this Space (see the Submit tab for the current API signature):
```python
from gradio_client import Client

client = Client("treble-technologies/FFASR_Leaderboard-storage")

# Illustrative: pass all fields the Submit tab exposes.
result = client.predict(
    "openai/whisper-tiny",  # model id
    "",                     # optional notes
    "you@example.com",      # contact email (required)
    "",                     # extra requirements
    "",                     # setup script
    "",                     # custom evaluator
    "",                     # recipe id
    False,                  # gated repo
    api_name="/submit_model",
)
print(result)
```