2. NNEF archive composition

Goals

At the end of this tutorial you will know:

  1. The basics of .nnef format
  2. The reference Khronos specification

Prerequisite

  • Understanding what a neural network is
  • 5 min to read this page

The NNEF produced by our getting-started example (and by every torch_to_nnef export) follows most of the format defined by the Khronos specification.

In fact, this package uses the Khronos tools to perform the final serialization part of the export. That is why most of the information you may need to understand this format is available in their specification. This page provides a quick overview, with emphasis on key elements and differences, to help you navigate it.

Inside the archive

As shown in the first tutorial, the export step creates a ${MY_MODEL_NAME}.nnef.tgz archive. This is simply a classical tar archive with gzip compression. You can control the compression ratio with the compression_level: int parameter of the export function (0 meaning no compression). Compression trades disk space and network transfer size against the time needed to decompress the model the first time it is loaded.
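To see that the archive really is plain tar + gzip, here is a minimal stdlib sketch that builds an equivalent archive at two compression levels. The file names and payloads are toy stand-ins for illustration, not real export output:

```python
import io
import tarfile

def make_archive(payloads: dict, level: int) -> bytes:
    """Pack files into an in-memory tar.gz, like a .nnef.tgz archive."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz", compresslevel=level) as tf:
        for name, data in payloads.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tf.addfile(info, io.BytesIO(data))
    return buf.getvalue()

# toy stand-ins for graph.nnef and one parameter tensor file
files = {
    "graph.nnef": b"version 1.0;\n" * 10,
    "class_token.dat": b"\x00" * 4096,
}

raw = make_archive(files, level=0)     # no compression: fastest to load
packed = make_archive(files, level=9)  # max compression: smallest on disk
```

Any tool that understands gzip-compressed tar (including the `tar -xvzf` command below) can unpack the result.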

Info

tract itself is agnostic to the specific archive or compression format: as long as you provide a graph.nnef text file and the related .dat files, tract can run the model. In fact, tract is even able to read the model from a bit stream, given the proper interface call.

Since it is just an archive, you can simply extract it:

mkdir vit_b_16_nnef_dir
cd vit_b_16_nnef_dir
tar -xvzf ../vit_b_16.nnef.tgz
ls -l

You should see a list of .dat binary files, each corresponding to a specific 'parameter' tensor of the exported ViT neural network, and a graph.nnef file that contains the textual representation of the graph.

graph.nnef

Let's start by looking at the graph.nnef.

graph.nnef
version 1.0;

extension tract_registry tract_core;

fragment tract_core_properties(
) -> (properties: (string, tensor<scalar>)[])
{
  properties = [
    ("tract_target_version", "0.21.13"),
    ("torch_to_nnef_version", "0.18.6"),
//...
    ("export_cmd", "getting_started.py")
  ];
}

fragment tract_gelu( x: tensor<scalar> ) -> ( y: tensor<scalar> )
{
    y = 0.5 * x * ( 1.0 + tract_core_erf(x * 0.7071067811865475));
}

# ...

graph network(input) -> (output)
{
    input = tract_core_external(shape = [1, 3, 224, 224], datum_type = 'f32');
    class_token = variable<scalar>(label = 'class_token', shape = [1, 1, 768]);
# ...
    output = linear(
      select0,
      heads_head_weight,
      heads_head_bias_aligned_rank_expanded
    );

}

First we observe the extension declaration: some of the operators used later are specific to tract, so the tract registry named tract_core is needed. tract provides several registries for different purposes:

  • tract_core: all classical tract operators
  • tract_transformers: operators specific to transformers
  • tract_onnx: operators very specific to ONNX
  • tract_extra: peculiar operators such as exponential unit norm
  • tract_resource: loading custom variables inside your graph

These are added automatically by torch_to_nnef, unless you use custom operators.

After that we see a set of fragments. You can think of most of them as pure functions (there are a few exceptions, such as when a scan operator is involved, but it is a good approximation). These fragments allow the graph to be composed from smaller reusable blocks. tract_gelu is interesting because it is replaced on the fly by a dedicated gelu operator if one exists in the selected registries for your hardware.
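The tract_gelu fragment is just the exact erf-based GELU formula. A quick Python check of the same expression, independent of tract, using only math.erf:

```python
import math

SQRT1_2 = 0.7071067811865475  # 1/sqrt(2), the constant used in the fragment

def gelu(x: float) -> float:
    # mirrors tract_gelu: y = 0.5 * x * (1 + erf(x / sqrt(2)))
    return 0.5 * x * (1.0 + math.erf(x * SQRT1_2))

print(round(gelu(1.0), 6))  # → 0.841345
```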

Finally there is the network graph, the main entry point. It describes the inference computations to perform from inputs to outputs, by calling operators and fragments and assigning the results to temporary variables.

  • tract_core_external is like external from the original NNEF spec, but with a data type specification, allowing a fine-grained definition of what is expected.
  • variable<scalar> is the standard declaration to load a parameter tensor, as described in the NNEF spec. The label attribute value directly matches the ${label_value}.dat file that will be loaded.
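The label-to-file mapping can be illustrated with a small sketch. The graph snippet below is a toy excerpt (pos_embedding is a made-up label for illustration, not necessarily the name used in the real export):

```python
import re

# toy excerpt of a graph.nnef body
graph_src = """
class_token = variable<scalar>(label = 'class_token', shape = [1, 1, 768]);
pos_embedding = variable<scalar>(label = 'pos_embedding', shape = [1, 197, 768]);
"""

# each variable's label names the .dat file holding its tensor data
labels = re.findall(r"variable<scalar>\(label\s*=\s*'([^']+)'", graph_src)
dat_files = [f"{label}.dat" for label in labels]
print(dat_files)  # → ['class_token.dat', 'pos_embedding.dat']
```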

Info

As of today the graph is mostly 'flat', with the exception of a few fragments. The torch.nn.Module structure is not maintained in the final NNEF, so there are repetitions in the control flow expressed here.

Nothing prevents us from further factorizing the graph in the future, beyond keeping this package's codebase simple, since this graph representation is NOT the final optimized graph that tract will run, but rather its blueprint.

.dat files

Hexadecimal representation of a .dat file

We follow the .dat format specification defined in the Khronos spec, including support for q8 quantization. We also leverage the flexibility the spec leaves to define new formats: for example, Q4_0 .dat files use a layout close to an isolated GGML_TYPE_Q4_0, and can be exported to tract seamlessly with our package, as explained in the quantization tutorial.
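For intuition on the Q4_0 scheme, here is a rough Python sketch of GGML_TYPE_Q4_0-style block quantization: blocks of 32 values, each stored as one f16 scale plus 16 bytes of packed 4-bit codes. This illustrates the idea only; the exact on-disk layout used by torch_to_nnef may differ in details:

```python
import struct

BLOCK = 32  # Q4_0 quantizes values in blocks of 32

def q4_0_quantize(values):
    """Quantize one block into 18 bytes: f16 scale + 16 bytes of nibbles."""
    assert len(values) == BLOCK
    amax = max(values, key=abs)          # value with largest magnitude
    d = amax / -8 if amax else 1.0       # scale so amax maps to code 0
    codes = [max(0, min(15, round(v / d) + 8)) for v in values]
    # low nibble holds value i, high nibble holds value i + 16
    packed = bytes(codes[i] | (codes[i + 16] << 4) for i in range(16))
    return struct.pack("<e", d) + packed

def q4_0_dequantize(blob):
    """Recover 32 approximate floats from an 18-byte block."""
    (d,) = struct.unpack("<e", blob[:2])
    lo = [((b & 0x0F) - 8) * d for b in blob[2:]]
    hi = [((b >> 4) - 8) * d for b in blob[2:]]
    return lo + hi
```

At 18 bytes per 32 values, this is 4.5 bits per weight instead of 32 for f32, at the cost of a bounded rounding error per block.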