
10. NeMo ASR support

Goals

By the end of this guide, you will know how to:

  1. Export a NeMo ASR model to NNEF using t2n_export_nemo
  2. Run WAV inference from a minimal Rust binary
  3. Run inference from Python using tract
  4. Evaluate the exported model using Word Error Rate (WER)

Prerequisites

  • Basic Python knowledge
  • Basic Rust knowledge
  • Approximately 10 minutes to read this page

Overview

This page documents the end-to-end workflow for exporting an NVIDIA NeMo Automatic Speech Recognition (ASR) model to NNEF using torch-to-nnef, running inference with tract, and evaluating the exported model against standard ASR benchmarks.

Export a NeMo ASR model

The t2n_export_nemo command loads a pre-trained ASR model from the NeMo toolkit and exports it to the NNEF format.

If not already installed, install torch_to_nnef with the nemo-tract extra. This enables the NeMo-specific export command:

# -e   : export directory
# --tract-specific-path : optional path to a tract binary
# -tt  : numerical tolerance for the NeMo vs tract output checks
t2n_export_nemo \
    -e ./dump_parakeet_v3_06B \
    --tract-specific-path $HOME/SONOS/src/tract/target/release/tract \
    -tt very

# Other useful options:
# -s nvidia/parakeet-tdt-0.6b-v3      # optional explicit model slug
# -p ~/user/finetuned-parakeet.nemo   # optional explicit path to a .nemo file
# --compress-method min_max_q4_0_all  # optional model compression

Since no -s argument is provided in this example, the command lists the known NeMo-compatible models on the Hugging Face Hub and in the NeMo registries and lets you choose one interactively (we have mostly tested parakeet and nemotron).

After the command completes, the export directory (e.g. ./dump_parakeet_v3_06B) will contain:

  • The exported NNEF model files
  • A model_config.json file describing the exported pipeline
  • An export_config.json file with all export options used
  • A .log file with export details

Additional export options are available via:

t2n_export_nemo --help

Some NeMo preprocessing components are not yet fully supported by tract. In such cases, options such as --skip-preprocessor can be used to exclude those stages from the export.

CLI flags quick reference

  • -e, --export-dir: Output directory (must not pre-exist).
  • -s, --model-slug: Explicit NeMo model slug; omit to choose interactively.
  • -p, --model-path: Explicit local path to .nemo file.
  • --tract-specific-version / --tract-specific-path: Select tract version or binary.
  • --tract-reify-sdpa: Enable SDPA reification where supported by selected tract.
  • -tt, --tract-check-io-tolerance: IO check strictness (exact, approximate, loose, or skip).
  • --skip-preprocessor: Export only encoder/decoder/joint parts.
  • --split-joint-decoder: Split decoder and joint into separate subnets.
  • --compress-registry / --compress-method: Apply weight compression during export.

Run t2n_export_nemo --help for the full list of options.

Shape configuration (boundary remodeler)

See also: the dedicated remodeler tutorial for broader, provider-agnostic usage and API details

In many cases you will want to control the symbolic shapes and boundary transforms used during export (e.g., set a stable BATCH symbol, collapse size-1 dims, bind a scalar to a dynamic size, or keep only a subset of outputs). You can manage this via a YAML shape config file passed to the CLI.

Generate a starting template aligned to your model with:

t2n_export_nemo \
  --inspect-signatures \
  --dump-shape-config ./shapes.yaml \
  # ... your usual flags (model slug/path, etc.)

The generated shapes.yaml uses a nested layout per subnet:

  • inputs: mapping of input-name -> settings
  • outputs (optional): mapping of output-name -> settings
  • renamed_symbols (optional): { TARGET: [SOURCES...] } aliasing of dynamic symbols
  • outputs_keep (always present in the template): ordered list of output names to keep (default if omitted: keep all)
  • extensions (optional): list of custom extension strings (e.g., tract_assert constraints for pulsification). For known pretrained models, these are auto-populated from a built-in registry

Per-input settings under inputs:

  • original_shape: list of dims (ints or strings)
  • collapse_dims (optional): list of symbols to collapse at the boundary
  • bind_scalar_to_dim_size (optional): dynamic source as subnet.input.SYMBOL
  • eval_symbols (optional): { SYMBOL: int_value } -- pin dynamic symbols to concrete sizes in test inputs during export (e.g., {TARGETS__TIME: 1} for single-step decoding)

Per-output settings under outputs:

  • collapse_dims (optional): list of axis indices to squeeze from the output tensor (e.g., [0] to remove the batch axis)

Example (abbreviated):

encoder:
  inputs:
    audio_signal:
      original_shape: [AUDIO_SIGNAL__BATCH, 128, AUDIO_SIGNAL__TIME]
      collapse_dims: [AUDIO_SIGNAL__BATCH]
    length:
      original_shape: [LENGTH__BATCH]
      collapse_dims: [LENGTH__BATCH]
      bind_scalar_to_dim_size: encoder.audio_signal.AUDIO_SIGNAL__TIME
  outputs:
    outputs:
      collapse_dims: [0]

decoder_joint:
  inputs:
    encoder_outputs:
      original_shape: [ENCODER_OUTPUTS__BATCH, 1024, ENCODER_OUTPUTS__TIME]
      collapse_dims: [ENCODER_OUTPUTS__BATCH, ENCODER_OUTPUTS__TIME]

decoder:
  renamed_symbols: { BATCH: [TARGETS__BATCH, STATES_0__BATCH, STATES_1__BATCH] }
  # Typical RNNT decoder outputs include: outputs, prednet_lengths, states_out
  # Keep only the ones you need (e.g., drop prednet_lengths)
  outputs_keep: [outputs, states_out]
  inputs:
    targets:
      original_shape: [TARGETS__BATCH, TARGETS__TIME]
      collapse_dims: [BATCH]
    states_0:
      original_shape: [2, STATES_0__BATCH, 640]
      collapse_dims: [BATCH]
    states_1:
      original_shape: [2, STATES_1__BATCH, 640]
      collapse_dims: [BATCH]

Decoder: dropping prednet_lengths while keeping IO aligned

When you exclude prednet_lengths from decoder outputs via outputs_keep, also bind the target_length input to the TIME dimension of targets so it becomes an internal scalar (and is no longer exposed as an external input):

decoder:
  outputs_keep: [outputs, states_out]
  inputs:
    targets:
      original_shape: [TARGETS__BATCH, TARGETS__TIME]
      collapse_dims: []
    target_length:
      original_shape: [TARGET_LENGTH__BATCH]
      collapse_dims: []
      bind_scalar_to_dim_size: decoder.targets.TARGETS__TIME
    states_0:
      original_shape: [2, STATES_0__BATCH, 640]
      collapse_dims: [BATCH]
    states_1:
      original_shape: [2, STATES_1__BATCH, 640]
      collapse_dims: [BATCH]

This keeps the external input/output quantities consistent and makes the boundary contract explicit: target_length = size(targets, TIME).

Notes:

  • Use namespaced symbols: batch axes appear as INPUT__BATCH per input.
  • To expose a common tract-facing name (e.g., BATCH) across inputs, declare it via renamed_symbols.
  • Aliases listed in renamed_symbols are accepted anywhere a symbol is referenced (collapse/bind).
  • renamed_symbols targets cannot include themselves in sources.
  • collapse_dims (inputs) requires the symbol to be dynamic on that input at the selected stage.
  • collapse_dims (outputs) takes axis indices (integers), not symbols. Only axes of size 1 are squeezed.
  • bind_scalar_to_dim_size binds a dynamic size as an int64 scalar.
  • outputs_keep filters exported outputs; order follows the subnet’s original output_names. The template always includes it so you can easily trim.
  • When batch collapse is detected on inputs, the NeMo registry auto-populates outputs.collapse_dims: [0] for all outputs of that subnet. Explicit config takes precedence.

Boundary semantics

  • Inputs that are Python tuples in the module API are flattened at the boundary (e.g., RNNT states -> states_0, states_1).
  • collapse_dims removes listed dynamic axes externally and reinserts them internally so inner modules see their expected rank.
  • bind_scalar_to_dim_size removes the bound input from the external IO and injects shape(source)[axis] as a dynamic int64 tensor.
  • renamed_symbols only affects the tract-facing dynamic axes; inspector views remain namespaced by input (e.g., TARGETS__BATCH).
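The collapse and bind transforms above can be sketched conceptually with NumPy. This is only an illustration of the boundary contract (shapes and names are hypothetical), not the actual remodeler implementation:

```python
import numpy as np

# External caller provides a rank-reduced input: BATCH collapsed away.
audio_external = np.zeros((128, 240), dtype=np.float32)  # (features, TIME)

# collapse_dims: the boundary reinserts the collapsed axis so the inner
# module still sees its expected rank (BATCH, features, TIME).
audio_internal = np.expand_dims(audio_external, axis=0)
assert audio_internal.shape == (1, 128, 240)

# bind_scalar_to_dim_size: the bound input ("length") disappears from the
# external IO; the boundary injects shape(source)[axis] as an int64 scalar.
length_internal = np.int64(audio_internal.shape[2])
assert length_internal == 240

# Output-side collapse_dims: [0] squeezes the size-1 batch axis back out.
outputs_internal = np.zeros((1, 1024, 60), dtype=np.float32)
outputs_external = np.squeeze(outputs_internal, axis=0)
assert outputs_external.shape == (1024, 60)
```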

Quick commands

See also: the provider-agnostic remodeler tutorial for programmatic usage and richer inspection.

Inspect with config applied (human-rich):

t2n_export_nemo \
  --model-slug nvidia/parakeet-tdt-0.6b-v3 \
  --export-dir ./noop \
  --inspect-signatures \
  --inspect-stage final \
  --inspect-format human-rich \
  --shape-config shapes.yaml \
  --dry-run \
  --split-joint-decoder

Export with config:

t2n_export_nemo \
  --model-slug nvidia/parakeet-tdt-0.6b-v3 \
  --export-dir ./export_with_shapes \
  --shape-config shapes.yaml \
  --split-joint-decoder

Audio preprocessing requirements

All supported NeMo ASR models expect audio input with the following characteristics:

  • 16 kHz sample rate
  • Mono channel
  • WAV format

Ensure that all input audio conforms to these requirements before running inference.
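A quick conformance check can be written with Python's standard-library wave module (a minimal sketch; actual resampling or channel mixing requires an audio tool such as ffmpeg, which is outside this guide's scope):

```python
import wave

def check_wav_for_nemo(path: str) -> None:
    """Raise ValueError if the WAV file is not 16 kHz mono."""
    with wave.open(path, "rb") as w:
        if w.getframerate() != 16000:
            raise ValueError(f"{path}: expected 16000 Hz, got {w.getframerate()}")
        if w.getnchannels() != 1:
            raise ValueError(f"{path}: expected mono, got {w.getnchannels()} channels")

# Example: write a conforming silent WAV, then validate it.
with wave.open("probe.wav", "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)                    # 16-bit PCM
    w.setframerate(16000)
    w.writeframes(b"\x00\x00" * 16000)   # one second of silence

check_wav_for_nemo("probe.wav")  # passes silently on a conforming file
```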


Note: due to limited time and resources, the next sections are limited to RNNT and TDT models. Other model families are not guaranteed to work as is, but contributions are welcome!


Example: Running a NeMo ASR model with tract

This example uses a pre-trained ASR model from NVIDIA NeMo and shows how to perform inference using the exported NNEF artifacts. The full example code lives in the docs/examples/nemo_asr/ directory of the torch-to-nnef repository.

Run the exported model in Rust

To run the exported NeMo ASR model from Rust, add the tract-nemo crate to your Cargo.toml:

[dependencies]
# the tract-nemo crate lives under docs/examples/nemo_asr/ in the repository;
# Cargo locates it within the git workspace by package name
tract-nemo = { git = "https://github.com/sonos/torch-to-nnef.git", branch = "main" }

Rust inference example

use tract_nemo::nemo_asr::NemoAsrModel;

fn main() -> tract_nemo::TractResult<()> {
    // Load the exported NeMo ASR model
    let model_path = "./dump_parakeet_v3_06B";
    let mut asr_model = NemoAsrModel::load(model_path)?;

    let input_wavs = vec![
        // paths to input WAV files
    ];

    // Run inference
    let transcripts = asr_model.infer_from_wav_paths(&input_wavs)?;

    // Display results
    for (i, t) in transcripts.iter().enumerate() {
        println!("Transcription[{}]: '{}'", i, t.text);

        // Each transcript also contains detailed items:
        // - token
        // - logit
        // - emitted_at_encoder_timestep
        // - emitted_at_encoder_timestep_iteration
    }

    Ok(())
}

Run the exported model in Python

The exported NeMo ASR model can also be executed from Python using the tract-nemo Python bindings.

First, install the Python package:

pip install "git+https://github.com/sonos/torch-to-nnef.git@main#egg=nemo-asr-tract&subdirectory=docs/examples/nemo_asr/src/nemo_asr_py"

Python inference example

import nemo_asr_tract

def main():
    # Load the exported NeMo ASR model
    model_path = "./dump_parakeet_v3_06B"
    asr_model = nemo_asr_tract.nemo_asr.NemoAsrModel.load(model_path)

    input_wavs = [
        "path/to/your/input1.wav",
        "path/to/your/input2.wav",
    ]

    # Run inference
    transcripts = asr_model.infer_from_wav_paths(input_wavs)

    # Display results
    for i, t in enumerate(transcripts):
        print(f"Transcription[{i}]: '{t.text}'")
        print(f"Items[{i}]: {t.items}")

if __name__ == "__main__":
    main()

Evaluation

If not already installed, set up the same Python package as the one used for running the tract model, this time with the eval extra for evaluation:

pip install "git+https://github.com/sonos/torch-to-nnef.git@main#egg=nemo-asr-tract[eval]&subdirectory=docs/examples/nemo_asr/src/nemo_asr_py"

The Python tooling also supports evaluation of the exported model using standard ASR benchmarks and WER metrics.

Run an ASR Open Leaderboard evaluation

nemo_tract_eval \
    -e ./dump_parakeet_v3_06B \
    -r ~/SONOS/data/test_asr_export_parakeet \
    --device 0

This command runs an evaluation following the same protocol as the Hugging Face ASR Open Leaderboard.

It produces, for each dataset:

  • .jsonl manifest files containing predictions and references
  • Per-dataset WER scores
  • Aggregated summary metrics

Use --help to inspect all available evaluation options.
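If you want to sanity-check the scores yourself, WER can be recomputed from (reference, hypothesis) pairs with a word-level edit distance. This is a self-contained sketch; the "reference"/"prediction" field names are assumptions about the .jsonl manifest layout, so adjust them to what your manifests actually contain:

```python
import json

def edit_distance(a, b):
    """Word-level Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, wa in enumerate(a, 1):
        cur = [i]
        for j, wb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (wa != wb)))   # substitution
        prev = cur
    return prev[-1]

def wer(pairs):
    """Aggregate WER over (reference, hypothesis) string pairs."""
    errors = total = 0
    for ref, hyp in pairs:
        ref_words = ref.split()
        errors += edit_distance(ref_words, hyp.split())
        total += len(ref_words)
    return errors / total if total else 0.0

def wer_from_manifest(path):
    # assumed layout: one JSON object per line with
    # "reference" and "prediction" fields
    with open(path) as f:
        pairs = [(d["reference"], d["prediction"])
                 for d in map(json.loads, f)]
    return wer(pairs)

print(wer([("the cat sat", "the cat sat"),
           ("hello world", "hello word")]))  # 1 error / 5 ref words = 0.2
```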

Display sample-level differences between runners

nemo_tract_eval_compare_manifest \
    --results-dir ./../my-results-dir/ \
    --max-items 5

This command displays side-by-side comparisons (by default, NeMo vs tract) for a subset of samples, sorted by absolute WER difference.

Recompute scores and display a summary table

nemo_tract_eval_score_manifest ./../my-results-dir/

This recomputes WER scores from the generated manifest files and prints a summary table. This is useful when experimenting with alternative scoring logic.

Custom runner support

For more advanced use cases, the evaluation framework supports custom runners and datasets.

To define a new runner or model, inherit from the base class and implement the required methods:

from typing import List

import torch

# adjust these imports to your project layout
from nemo_asr_tract.eval.runner import AsrRunner, EvalConfig, clean_name

class MyCustomRunner(AsrRunner):
    def __init__(self, model: str, device: int = 0):
        super().__init__(model, device)

    def name(self) -> str:
        # unique identifier used in result manifests and summaries
        return clean_name("my-custom-runner")

    @classmethod
    def load_from_path(
        cls,
        *,
        cfg: EvalConfig,
        device: torch.device,
        dtype: torch.dtype,
    ) -> "AsrRunner":
        """Load the ASR runner from a model directory."""
        # construct the runner from the evaluation config
        # (EvalConfig field names here are illustrative)
        return cls(model=cfg.model, device=0)

    def transcribe_from_wav_paths(self, wav_paths: List[str]) -> List[str]:
        # return one transcript per input WAV path
        return ["" for _ in wav_paths]

The custom runner can then be selected via the --model_runner_class argument in the evaluation CLI.

Tracking runner issues

In the past we have observed issues with exported models, such as mismatches between NeMo and tract runner outputs, or unexpected WER scores. To help track and debug these issues, we maintain a script that logs any runner-related discrepancy observed on a specific batch and hardware target (kernel precisions differ across targets). Here is a sample usage (it needs the eval extra to run properly):

nemo_tract_eval_batch_align_checker \
    --results-dir ./../my-results-dir/ \
    --output-file ./runner_issues_log.jsonl \
    --model-dir ../../assets/model \
    --dataset librispeech \
    --split test.clean \
    --sample-idx 1000 \
    -o ~/SONOS/data/2026_02_05_debug_batched_metal \
    [--force-cpu]