`tract` python bindings¶

tract is a library for neural network inference. While PyTorch and TensorFlow deal with the much harder training problem, tract focuses on what happens once the model in trained.

tract ultimate goal is to use the model on end-user data (aka “running the model”) as efficiently as possible, in a variety of possible deployments, including some which are no completely mainstream : a lot of energy have been invested in making tract an efficient engine to run models on ARM single board computers.

API Reference¶

ONNX — load ONNX models
NNEF — load and save NNEF models
Inference model — partially typed model from ONNX
Model — fully typed model, central to the cooking pipeline
Facts and Dimensions — shape, type, and symbolic dimension information
Runtime, Runnable and State — runtimes (CPU, Metal, CUDA), execution, and stateful models
Tensor — tensor data

Getting started¶

Install tract library¶

pip install tract. Prebuilt wheels are provided for x86-64 Linux and Windows, x86-64 and arm64 for MacOS.

Downloading the model¶

First we need to obtain the model. We will download an ONNX-converted MobileNET 2.7 from the ONNX model zoo.

wget https://github.com/onnx/models/raw/main/vision/classification/mobilenet/model/mobilenetv2-7.onnx.

Preprocessing an image¶

Then we need a sample image. You can use pretty much anything. If you lack inspiration, you can this picture of Grace Hopper.

wget https://s3.amazonaws.com/tract-ci-builds/tests/grace_hopper.jpg

We will be needing pillow to load the image and crop it.

pip install pillow

Now let’s start our python script. We will want to use tract, obviously, but we will also need PIL’s Image and numpy to put the data in the form MobileNet expects it.

#!/usr/bin/env python

import tract
import numpy
from PIL import Image

We want to load the image, crop it into its central square, then scale this square to be 224x224.

im = Image.open("grace_hopper.jpg")
if im.height > im.width:
    top_crop = int((im.height - im.width) / 2)
    im = im.crop((0, top_crop, im.width, top_crop + im.width))
else:
    left_crop = int((im.width - im.height) / 2)
    im = im.crop((left_crop, 0, left_crop + im_height, im.height))
im = im.resize((224, 224))
im = numpy.array(im)

At this stage, we obtain a 224x224x3 tensor of 8-bit positive integers. We need to transform these integers to floats and normalize them for MobileNet. At some point during this normalization, numpy decides to promote our tensor to double precision, but our model is single precison, so we are converting it again after the normalization.

im = (im.astype(float) / 255. - [0.485, 0.456, 0.406]) / [0.229, 0.224, 0.225]
im = im.astype(numpy.single)

Finally, ONNX variant of Mobilenet expects its input in NCHW convention, and our data is in HWC. We need to move the C axis before H and W, then insert the N at the left.

im = numpy.moveaxis(im, 2, 0)
im = numpy.expand_dims(im, 0)

Loading the model¶

Loading a model is relatively simple. We need to instantiate the ONNX loader first, the we use it to load the model. Then we ask tract to optimize the model and get it ready to run.

model = tract.onnx().load("./mobilenetv2-7.onnx").into_model().into_runnable()

If we wanted to process several images, this would only have to be done once out of our image loop.

Running the model¶

tract run methods take a list of inputs and returns a list of outputs. Each input can be a numpy array. The outputs are tract’s own Tensor data type, which should be converted to numpy array.

outputs = model.run([im])
output = outputs[0].to_numpy()

Interpreting the result¶

If we print the output, what we get is a array of 1000 values. Each value is the score of our image on one of the 1000 categoris of ImageNet. What we want is to find the category with the highest score.

print(numpy.argmax(output))

If all goes according to plan, this should output the number 652. There is a copy of ImageNet categories at the following URL, with helpful line numbering.

https://github.com/sonos/tract/blob/main/examples/nnef-mobilenet-v2/imagenet_slim_labels.txt

And… 652 is “microphone”. Which is wrong. The trick is, the lines are numbered from 1, while our results starts at 0, plus the label list includes a “dummy” label first that should be ignored. So the right value is at the line 654: “military uniform”. If you looked at the picture before you noticed that Grace Hopper is in uniform on the picture, so it does make sense.

Running on GPU¶

The getting started example above runs on the CPU, which is the default runtime. On systems with an NVIDIA GPU, tract can leverage CUDA for accelerated inference. The only change is in how the model is prepared: instead of calling into_runnable(), use a CUDA runtime to prepare the model.

import tract

# load and type the model as before
model = tract.onnx().load("./mobilenetv2-7.onnx").into_model()

# prepare it for the CUDA runtime
cuda = tract.runtime_for_name("cuda")
runnable = cuda.prepare(model)

# run exactly as before
outputs = runnable.run([im])
output = outputs[0].to_numpy()

The Metal runtime works the same way on Apple Silicon Macs: just replace "cuda" with "metal".

Model cooking with `tract`¶

Over the years of tract development, it became clear that beside “training” and “running”, there was a third time in the life-cycle of a model. One of our contributors nicknamed it “model cooking” and the term stuck. This extra stage is about all what happens after the training and before running.

If training and Runtime are relatively easy to define, the model cooking gets a bit less obvious. It comes from the realisation that the training form (an ONNX or TensorFlow file or ste of files) of a model may is usually not the most convenient form for running it. Every time a device loads a model in ONNX form and transform it into a suitable form for runtime, it goes through the same series or more or less complicated operations, that can amount to several seconds of high-CPU usage for current models. When running the model on a device, this can have several negative impact on experience: the device will take time to start-up, consume a lot of battery energy to get ready, maybe fight over CPU availability with other processes trying to get ready at the same instant on the device.

As this sequence of operations is generally the same, it becomes relevant to persist the model resulting of the transformation. It could be persisted at the first application start-up for instance. But it could also be “prepared”, or “cooked” before distribution to the devices.

Cooking to NNEF¶

tract supports NNEF. It can read a NNEF neural network and run it. But it can also dump its preferred representation of a model in NNEF.

At this stage, a possible path to production for a neural model becomes can be drawn:

model is trained, typically on big servers on the cloud, and exported to ONNX.
model is cooked, simplified, using tract command line or python bindings.
model is shipped to devices or servers in charge of running it.

Testing and benching models early¶

As soon as the model is in ONNX form, tract can load and run it. It gives opportunities to validate and test on the training system, asserting early on that tract will compute at runtime the same result than what the training model predicts, limiting the risk of late-minute surprise.

But tract command line can also be used to bench and profile an ONNX model on the target system answering very early the “will the device be fast enough” question. The nature of neural network is such that in many cases an untrained model, or a poorly trained one will perform the same computations than the final model, so it may be possible to bench the model for on-device efficiency before going through a costly and long model training.

tract-opl¶

NNEF is a pretty little standard. But we needed to go beyond it and we extended it in several ways. For instance, NNEF does not provide syntax for recurring neural network (LSTM and friends), which are an absolute must in signal and voice processing. tract also supports symbolic dimensions, which are useful to represent a late bound batch dimension (if you don’t know in advance how many inputs will have to be computed concurrently).

Pulsing¶

For interactive applications where time plays a role (voice, signal, …), tract can automatically transform batch models, to equivalent streaming models suitable for runtime. While batch models are presented at training time the whole signal in one go, a streaming model received the signal by “pulse” and produces step by step the same output that the batching model.

It does not work for every model, tract can obviously not generate a model where the output at a time depends on input not received yet. Of course, models have to be causal to be pulsable. For instance, a bi-directional LSTM is not pulsable. Most convolution nets can be made causal at designe time by padding, or at cooking time by adding fixed delays.

This cooking step is a recurring annoyance in the real-time voice and signal field : it can be done manually, but is very easy to get wrong. tract makes it automactic.

tract python bindings¶