
5. Large Language Models Support

Goals

At the end of this tutorial you will know:

  1. How to export causal Large Language Models
  2. Current status of this library with regard to LLM

Prerequisites

  • PyTorch and Python basics
  • 10 min to read this page

Since 2020, Large Language Models have gathered significant attention in the industry, to the point where every product starts to integrate them. tract has been polished for these specific networks since late 2023, and the inference engine is now competitive with the state of the art on Apple Silicon and, more recently, on Nvidia GPUs. In the industry, most players use the transformers library and a large part of the HuggingFace ecosystem to specify their models in PyTorch. This makes that library the most up-to-date source of model architectures and pre-trained weights. To ease the export of and experimentation with such models, torch_to_nnef (this library) has added a dedicated set of modules that we will now present to you.

In this part we will only present the ability to export to the tract inference engine.

Exporting a transformers pre-trained model

If you only want to export an already-trained model available on the HuggingFace hub and compatible with the transformers library, for example meta-llama/Llama-3.2-1B-Instruct for chat or text generation purposes, there is no need for you to learn the torch_to_nnef APIs: we have a nice, easy-to-use command line for you (once torch_to_nnef is installed):

torch_to_nnef LLM cli
t2n_export_llm_to_tract -e . --help

This should output something like:

usage: t2n_export_llm_to_tract [-h] -e EXPORT_DIRPATH [-s MODEL_SLUG] [-dt {f32,f16,bf16}] [-idt {f32,f16,bf16}] [-mp] [--compression-registry COMPRESSION_REGISTRY] [-d LOCAL_DIR]
                               [-f32-attn] [-f32-lin-acc] [-f32-norm] [--num-logits-to-keep NUM_LOGITS_TO_KEEP] [--device-map DEVICE_MAP] [-tt {exact,approximate,close,very,super,ultra}]
                               [-n {raw,natural_verbose,natural_verbose_camel,numeric}] [--tract-specific-path TRACT_SPECIFIC_PATH] [--tract-specific-version TRACT_SPECIFIC_VERSION] [-td]
                               [-dwtac] [-sgts SAMPLE_GENERATION_TOTAL_SIZE] [-iaed] [-nv] [-v]
                               [-c {min_max_q4_0,min_max_q4_0_with_embeddings,min_max_q4_0_with_embeddings_99,min_max_q4_0_all}]
...

OK, there are a lot of options here; instead, let's do a concrete export of the meta-llama/Llama-3.2-1B-Instruct model we mentioned earlier:

t2n_export_llm_to_tract \
    -s "meta-llama/Llama-3.2-1B-Instruct" \
    -dt f16 \
    -e $HOME/llama32_1B_f16 \
    --dump-with-tokenizer-and-conf \
    --tract-check-io-tolerance ultra

On a modern laptop with the HuggingFace model already cached locally, the export to NNEF should take around 50 seconds. Tip: if you have rich installed as a dependency, logs will be displayed in color and more elegantly.
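If rich is not already part of your environment, installing it is a one-liner (assuming you manage dependencies with pip):

pip install rich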

Here we export the referenced Llama 3.2 from PyTorch, where the model is mostly stored in float16 with temporary activations in bfloat16, to tract, where almost everything will be in float16 (given our -dt request, except for normalization, which is kept in f32). We also check conformance between tract and PyTorch on a generic text (in English) and observe in the log the line confirming that it matches:

IO bit match between tract and PyTorch for ...

Looking at what we just exported, we see in the newly created folder $HOME/llama32_1B_f16:

[2.3G]  $HOME/llama32_1B_f16
├── [2.3G]  model
│   ├── [2.2K]  config.json
│   └── [2.3G]  model.nnef.tgz
├── [  78]  modes.json
├── [4.0M]  tests
│   ├── [838K]  export_io.npz
│   ├── [902K]  prompt_io.npz
│   ├── [1.1M]  prompt_with_past_io.npz
│   └── [1.2M]  text_generation_io.npz
└── [ 16M]  tokenizer
    ├── [3.7K]  chat_template.jinja
    ├── [ 296]  special_tokens_map.json
    ├── [ 49K]  tokenizer_config.json
    └── [ 16M]  tokenizer.json

The most important file is the 2.3 GB NNEF dump of the model.
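If you are curious about what that archive contains, a plain tar listing is enough to peek inside (a generic shell sketch; adjust the path to your own export directory):

tar -tzf $HOME/llama32_1B_f16/model/model.nnef.tgz | head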

If we look at the signature of the generated model, we should see something like this:

graph network(
    input_ids,
    in_cache_key_0, in_cache_value_0,
    ...,
    in_cache_key_15, in_cache_value_15)

-> (
    outputs,
    out_cache_key_0, out_cache_value_0,
    ...,
    out_cache_key_15, out_cache_value_15
)

To run such a model you can, for example, use this crate from tract.
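To make the cache wiring concrete, here is a purely illustrative Python sketch of the autoregressive loop implied by this signature; run_network is a hypothetical stand-in for your actual runtime call into tract, not a real torch_to_nnef or tract API, and the greedy token pick is only one possible sampling strategy.

# illustrative only: each out_cache_* is fed back as the next step's in_cache_*
def run_network(input_ids, cache):
    # hypothetical stand-in for an actual tract model invocation
    raise NotImplementedError

def generate(prompt_ids, n_layers=16, max_new_tokens=32):
    cache = {i: (None, None) for i in range(n_layers)}  # empty KV-cache
    tokens = list(prompt_ids)
    next_input = tokens                        # first call: the whole prompt
    for _ in range(max_new_tokens):
        logits, cache = run_network(next_input, cache)
        next_token = int(logits[-1].argmax())  # greedy pick of the next token
        tokens.append(next_token)
        next_input = [next_token]              # later calls: only the new token
    return tokens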

Work in progress

This CLI is still at an early stage; we intend to support embedding & classification in the near future, as well as other model modalities like visual and audio LMs.

This same CLI allows you to export a model that you have fine-tuned yourself and saved with .save_pretrained, by replacing the -s {HUGGING_FACE_SLUG} with a -d {MY_LOCAL_DIR_PATH_TO_TRANSFORMERS_MODEL_WEIGHTS}. If you did your fine-tuning with PEFT, you can just add -mp to merge the PEFT weights before export (in case this is your wish: this allows faster inference but removes the ability to have multiple 'PEFT fine-tunings' sharing the same exported base model). A command along those lines is sketched below.
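For instance, a hedged sketch reusing the options from the earlier Llama export, where the local directory and output paths are placeholders to replace with your own:

t2n_export_llm_to_tract \
    -d $HOME/my_finetuned_model \
    -mp \
    -dt f16 \
    -e $HOME/my_finetuned_model_f16 \
    --dump-with-tokenizer-and-conf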

Quantize your model

Quantization of models is essential to get the best model on resource-limited devices. It is also very simple to opt in at export time with this command line, through two options:

  • --compression-registry controls the registry containing the available quantization methods. It can be any dict from installed modules, including modules from external packages (different from torch_to_nnef).
  • --compression-method selects the quantization method to apply. As a toy example, you can export the linear layers of a model in Q4_0 (that is: 4-bit symmetric quantization with a per-group granularity of 32 elements, totaling 4.5 bpw) with the simple min_max_q4_0 method, as shown below. If you wish to leverage the best quantization techniques, we recommend reading our tutorial on Quantization and export to implement your own (SONOS has a closed-source package doing just that).
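For example, a hedged sketch reusing the earlier Llama export and selecting the toy method through the -c flag shown in the CLI usage (the output directory is an arbitrary placeholder):

t2n_export_llm_to_tract \
    -s "meta-llama/Llama-3.2-1B-Instruct" \
    -dt f16 \
    -c min_max_q4_0 \
    -e $HOME/llama32_1B_q4_0 \
    --dump-with-tokenizer-and-conf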

Export a model that does not fit in RAM

You want to go big, but you find that renting an instance with hundreds of GB of RAM just to export a model is ridiculous? We agree! The CLI described above provides a convenient solution if you have a decent SSD disk: just add

--device-map t2n_offload_disk

to your prior command, for example:

t2n_export_llm_to_tract \
    --device-map t2n_offload_disk \
    -s "Qwen/Qwen3-8B" \
    -dt f16 \
    -f32-attn \
    -e $HOME/qwen3_8B \
    --dump-with-tokenizer-and-conf \
    --tract-check-io-tolerance ultra

And poof, done. It will be a bit slower because SSDs are slower than RAM, but exporting Qwen3 8B in f16 takes around 4 minutes for a 16 GB stored model (this trade-off is fine for most big models). See our offloaded tensor tutorial to learn more about how to leverage this further (even in your PyTorch-based apps).

Export a model from a different library

As long as your model can be serialized into the torch.jit internal intermediate representation (which is the case for almost all neural networks, whole or in parts), this library should be able to do the heavy lifting of the translation to NNEF for you.
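A quick way to sanity-check that prerequisite is to trace your module with torch.jit yourself; a minimal sketch, with a toy module standing in for your own model:

import torch

class TinyLM(torch.nn.Module):
    # toy stand-in for your own language model module
    def __init__(self):
        super().__init__()
        self.embed = torch.nn.Embedding(100, 16)
        self.head = torch.nn.Linear(16, 100)

    def forward(self, input_ids):
        return self.head(self.embed(input_ids))

model = TinyLM().eval()
example_ids = torch.randint(0, 100, (1, 8))
# if this tracing step succeeds, there is an IR for the export to work from
traced = torch.jit.trace(model, example_ids)
print(traced.graph)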

Here are a few key considerations before starting to support a non-transformers (here we are speaking of the package, not other architectures like Mamba, RWKV, ...) language model:

  • How are the past states of your neural network managed inside the library? Is it, like transformers, an external design that passes all states (like the KV-cache) as inputs and outputs of your neural network's main module?
  • If not, is it easy to transform the library's internal modeling to approach this architecture?

If you can answer yes to one of those two questions, congratulations: you should be able to easily adapt the transformers-specific torch_to_nnef modules.

Otherwise, if state management is internal to specific modules, you will likely need to write a custom operator exporter to express those IOs at export time, or add specific operators in tract to manage it.

In all cases, the prior tutorials should help you toward your goal, especially with regard to dynamic axes and the basic API.

Community

If you release a custom LLM NNEF export based on torch_to_nnef for a different library than transformers, please reach out to us, we would love to hear your feedback 😊

Demo: LLM Poetry generator

Using the knowledge you acquired during this tutorial, and a bit extra for WASM in Rust, we demo the use of a minimal Large Language Model named SmolLM 125M running in your browser (the total experiment is a <100 MB download).

Note

This model is not trained by SONOS, so generation accuracy is the responsibility of the original HuggingFace authors. Inference performance is decent, but little to no effort was made to make tract WASM efficient; this demo is for demonstration purposes.

Curious to read the code behind it? Just look at our example directory here and this raw page content.