
🧩 Why Use NNEF?

🤔 Wait, What Is NNEF?

NNEF stands for Neural Network Exchange Format.


Introduced in 2018—just a year after ONNX.

NNEF addresses the same core challenge as ONNX: providing a standardized way to exchange neural network models across different tools and frameworks.


It is specified by the Khronos Group, an open, non-profit consortium of around 170 member organizations, better known for defining major graphics and compute standards such as WebGL, OpenCL, and Vulkan.

🛠 Tools and Ecosystem

Beyond the specification itself, Khronos also provides several reference tools to enable partial model conversion (e.g., from TensorFlow or ONNX). However, these tools:

  • Do not support PyTorch directly,

  • Do not offer the extensive support provided by this package.

Note

We leverage these Khronos tools for final serialization within torch_to_nnef (thanks to Viktor Gyenes et al. for their continued support of NNEF-tools).

🧠 NNEF Inference Support

As of today, the only inference engine (excluding full training frameworks) that natively supports NNEF as a first-class format is tract — the open-source neural inference engine developed by Sonos.


✅ The Good: What Makes the NNEF Specification Appealing

  1. Leverages Existing, Widely-Supported Containers

    Stop reinventing the wheel—NNEF embraces common container systems. It's efficient, well-supported, and decouples data storage from model structure (think of video formats vs. codecs).

    • Example: tar is totally fine—and if you want compression, just layer it on (e.g., gzip); see the packing sketch after this list.
    • Prefer another container format? You're free to use it.
  2. Efficient Tensor Storage

    Each tensor is stored as a binary .dat blob.

    • While .npy might seem more standard, .dat offers better extensibility.
    • The format supports custom data types via a 4-byte code indicating the tensor's item-type (up to 4.2 billion possible custom types!); see the header-reading sketch after this list.

  3. Readable Graph Structure

    The main .nnef file represents the model graph in a simple, declarative, text-based format:

    • No control flow complexity
    • Easy to read and edit (e.g., jump to definitions in your favorite editor)
    • Flexible and extensible—it's just text (a parsed example follows this list).
  4. Separation of Quantization Logic

    Quantization metadata lives in a separate .quant file:

    • Associates each variable with a quantization function and its parameters
    • Supports advanced schemes (e.g., Q40 per-group) via custom data types
  5. Textual Composition with Pure Functions

    Neural networks are built from repeated blocks (groups of layers); the text format's fragment mechanism promotes reusability, avoids repetition, and enables a clean functional structure (illustrated in the parsed example after this list).
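To ground point 1, here is a minimal packing sketch using only the Python standard library. The "model.nnef/" directory (a graph.nnef file plus its .dat tensors) and the archive names are assumptions for the example:

```python
import tarfile

# Pack a hypothetical "model.nnef/" directory (graph.nnef + *.dat blobs)
# into a tar archive; gzip compression is simply layered on top.
with tarfile.open("model.nnef.tgz", "w:gz") as archive:
    archive.add("model.nnef", arcname=".")

# Unpacking is just as simple; any standard tar tool works as well.
with tarfile.open("model.nnef.tgz", "r:gz") as archive:
    archive.extractall("unpacked_model")
```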
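For point 2, a minimal sketch of pulling the item-type code out of a tensor file. The field offsets follow our reading of the NNEF 1.0 binary header (2-byte magic, 2-byte version, data length, rank, 8 extents, bits per item, then the item-type code, padded to 128 bytes); double-check them against the spec before relying on this:

```python
import struct

def read_item_type_code(path: str) -> int:
    """Read the 4-byte item-type code from an NNEF .dat tensor file.

    Assumed header layout (NNEF 1.0, padded to 128 bytes):
    magic 0x4E 0xEF (2B) | version (2B) | data length (4B) | rank (4B)
    | 8 x extent (32B) | bits per item (4B) | item-type code (4B) | ...
    """
    with open(path, "rb") as f:
        header = f.read(128)
    if header[:2] != b"\x4e\xef":
        raise ValueError("not an NNEF tensor file")
    # offset 44 = 2 + 2 + 4 + 4 + 32 bytes into the header
    bits_per_item, item_type_code = struct.unpack_from("<II", header, 44)
    return item_type_code
```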
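For points 3 and 5, here is a tiny graph that defines a reusable conv+relu fragment, applies it twice, and is then walked with the `nnef` Python module shipped with Khronos NNEF-Tools. The parser entry point (`nnef.parse_string`) and the extension declaration follow our reading of NNEF-Tools and the spec, so treat the snippet as a sketch:

```python
import nnef  # Python module from Khronos NNEF-Tools (assumed API)

GRAPH_TEXT = """
version 1.0;
extension KHR_enable_fragment_definitions;

# a reusable block: conv followed by relu
fragment conv_block( input: tensor<scalar>, filter: tensor<scalar> )
-> ( output: tensor<scalar> )
{
    output = relu(conv(input, filter));
}

graph net( input ) -> ( output )
{
    input = external(shape = [1, 3, 224, 224]);
    w1 = variable(shape = [16, 3, 3, 3], label = 'block1/w');
    w2 = variable(shape = [16, 16, 3, 3], label = 'block2/w');
    h = conv_block(input, w1);
    output = conv_block(h, w2);
}
"""

graph = nnef.parse_string(GRAPH_TEXT)  # returns an in-memory graph
for op in graph.operations:
    print(op.name, dict(op.attribs))
```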

❌ The Bad: Limitations of the NNEF Specification

  1. No Reference Implementation or Test Suite

    Only basic converters (TensorFlow/ONNX) and a rudimentary PyTorch interpreter exist—nothing production-grade.

  2. Image-Centric Design

    The spec was initially tailored for image inference tasks, limiting its general applicability.

  3. Static Tensor Shapes

    No support for dynamic dimensions.

  4. No Built-In Support for Recurrent Layers

  5. Undefined or Poorly-Specified Data Types for Activations

  6. Stagnant Development

    Last official update: v1.0.5, in February 2022

🚀 NNEF Extensions in Tract

  1. Supports Text and Signal Models

    Through an extended operator set.

  2. Dynamic Shape Support

    Enabled by symbolic dimensions.

  3. Advanced Data Type Handling

    Fine-grained, low-level types are natively supported.

  4. Modular Subgraph Assembly

    Enables flexible architecture composition.

These extensions are encapsulated under the concept of inference targets in torch_to_nnef, allowing each inference engine to define its own "NNEF flavor" while retaining a shared syntax, a common graph structure, and a common set of 'specified' operators.
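In practice, selecting a flavor looks roughly like the following; the entry point and class names (`export_model_to_nnef`, `TractNNEF`) follow the package's public documentation at the time of writing, but treat the exact signature as an assumption and check the current API reference:

```python
import torch
from torch_to_nnef import export_model_to_nnef  # assumed entry point
from torch_to_nnef.inference_target import TractNNEF  # assumed import path

model = torch.nn.Sequential(torch.nn.Linear(8, 4), torch.nn.ReLU())
sample_input = torch.rand(1, 8)

# The inference target picks the "NNEF flavor" to emit: here tract's
# dialect, which enables symbolic dims and extended data types.
export_model_to_nnef(
    model=model,
    args=sample_input,
    file_path_export="model.nnef.tgz",
    inference_target=TractNNEF(version="0.21.13"),  # version is illustrative
)
```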


🤔 Why Not ONNX or Other Protocol Buffer-Based Formats?

Abstract

Let's be clear: ONNX is a great standard. It's mature, widely adopted, and works well for many neural network applications.

However, ONNX is based on Protocol Buffers, which introduce real limitations—even acknowledged in their own docs:

  1. Not Suitable for Large Data Assets

    ... assume that entire messages can be loaded into memory at once and are not larger than an object graph. For data that exceeds a few megabytes, consider a different solution; when working with larger data, you may effectively end up with several copies of the data due to serialized copies, which can cause surprising spikes in memory usage.

  2. Inefficient for Large Float Arrays

    Protocol buffer messages are less than maximally efficient in both size and speed for many scientific and engineering uses that involve large, multi-dimensional arrays of floating point numbers ...

  3. No Built-In Compression

Opinionated Grievances (Specific to NN Use Cases)

  1. Tightly Coupled Graph & Tensors

    Want to patch a model with new PEFT weights or tweak a few parameters? Good luck—everything’s entangled.

  2. Unreadable Without Specialized Tools

    Tools like TensorBoard or Netron are needed for visualization, and even they become hard to read once more than ~10 I/O tensors are linked to an operator (e.g., long residual connections deform the graph visuals).

  3. No Direct Tensor Access

    Requires full graph parsing and multi-hop traversal (contrast the NNEF sketch after this list).

  4. Quantization Definition Is Not Very Flexible

    Especially for custom formats or precision below Q4.

  5. Extensibility Is Harder

    Adding new data formats requires changing the protocol buffer spec, and features like tract's symbol definitions have to be specified ad hoc. Plain-text extensions are easier to write and read (at the cost of losing protobuf's code-generated serialization/deserialization). Prior to PyTorch 2.0, adding custom ops (when no equivalent chain of supported ops exists) was also tedious and partly unspecified.
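To make the contrast with grievance 3 concrete: since NNEF stores each tensor as its own .dat file inside a plain container, a single weight can be read without parsing the graph at all. A sketch, with a hypothetical archive and member path:

```python
import tarfile

# Pull one tensor's raw bytes straight out of an NNEF archive; no graph
# parsing needed. "block1/w.dat" is a hypothetical member path.
with tarfile.open("model.nnef.tgz", "r:gz") as archive:
    raw = archive.extractfile("block1/w.dat").read()
print(f"read {len(raw)} bytes without touching the graph")
```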


🆚 Safetensors

Safetensors is essentially a secure, structured list of tensors stored in binary—plus minimal metadata (see the header-reading sketch below).

  1. Directly Loadable to Devices

  2. Avoids Pickle Security Issues

🔍 But: Its benefits are tied to loading efficiency—not the format itself. It could just as well have been implemented using tar.
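That simplicity is easy to verify: a .safetensors file is just an 8-byte little-endian length, a JSON header, then one flat byte buffer. A minimal sketch of reading the header:

```python
import json
import struct

def read_safetensors_header(path: str) -> dict:
    """Return the JSON header of a .safetensors file.

    Layout: u64 little-endian header size, then that many bytes of JSON
    mapping tensor names to dtype / shape / byte offsets into the flat
    data section that follows.
    """
    with open(path, "rb") as f:
        (header_size,) = struct.unpack("<Q", f.read(8))
        return json.loads(f.read(header_size))

# The per-tensor byte offsets are what make zero-copy, direct-to-device
# loading straightforward.
```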

Major Drawbacks

  • No Computation Graph

    Every model architecture must be re-implemented manually on top of the inference engine—error-prone and wasteful.

  • No Operator Fusion or Optimization Guidance

    That burden falls entirely on the implementer, per model.


🆚 GGUF

GGUF is similar to .safetensors, but adds a large set of quantization format definitions.

  1. Vast Choice of Quantization Formats

    Especially the Q40 format, which we've borrowed in tract/torch_to_nnef (a dequantization sketch follows below).

  2. Still No Graph Structure

    Just like .safetensors, GGUF lacks a way to express model computation graphs.
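
For a feel of what such a format involves, here is a sketch of dequantizing one GGML-style Q4_0 block: 32 weights stored as an fp16 scale plus 16 bytes of packed 4-bit values. The nibble layout follows our reading of ggml's reference code, so verify against it:

```python
import numpy as np

BLOCK_SIZE = 32   # Q4_0 groups weights in blocks of 32
BLOCK_BYTES = 18  # 2-byte fp16 scale + 16 bytes of packed nibbles

def dequantize_q4_0(block: bytes) -> np.ndarray:
    """Dequantize one 18-byte Q4_0 block back to float32."""
    assert len(block) == BLOCK_BYTES
    scale = np.frombuffer(block[:2], dtype=np.float16)[0]
    packed = np.frombuffer(block[2:], dtype=np.uint8)
    low = (packed & 0x0F).astype(np.int8) - 8   # elements 0..15
    high = (packed >> 4).astype(np.int8) - 8    # elements 16..31
    return float(scale) * np.concatenate([low, high]).astype(np.float32)
```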