Model quantization and compilation using TFLite and TVM

Let’s consider a simple scenario where we want to optimize the inference time and memory usage of a classification model executed on an x86 CPU.

To do this, we are going to use the PetDataset Dataset and the TensorFlowPetDatasetMobileNetV2 ModelWrapper.

Prepare environment

First of all, we need to install all necessary dependencies:

pip install "kenning[tensorflow,tflite,tvm,reports] @ git+https://github.com/antmicro/kenning.git"

Train the model (optional)

We will skip the training process and use the pretrained tensorflow_pet_dataset_mobilenetv2.h5 model. In Kenning, available models and resources can be downloaded using URIs with the kenning:// scheme, in this case kenning:///models/classification/tensorflow_pet_dataset_mobilenetv2.h5.

The training of the above model can be performed using the following command:

kenning train \
    --modelwrapper-cls kenning.modelwrappers.classification.tensorflow_pet_dataset.TensorFlowPetDatasetMobileNetV2 \
    --dataset-cls kenning.datasets.pet_dataset.PetDataset \
    --logdir build/logs \
    --dataset-root build/pet-dataset \
    --model-path build/trained-model.h5 \
    --batch-size 32 \
    --learning-rate 0.0001 \
    --num-epochs 50

Benchmarking a model using a native framework

First, we want to check how the trained model performs using the native framework on CPU.

For this, we will use the kenning test tool. The tool is configured with JSON files (scenarios).

In our case, the JSON file (named mobilenetv2-tensorflow-native.json) will look like this:

Listing 6 mobilenetv2-tensorflow-native.json
{
  "model_wrapper": {
    "type": "kenning.modelwrappers.classification.tensorflow_pet_dataset.TensorFlowPetDatasetMobileNetV2",
    "parameters": {
      "model_name": "native",
      "model_path": "kenning:///models/classification/tensorflow_pet_dataset_mobilenetv2.h5"
    }
  },
  "dataset": {
    "type": "kenning.datasets.pet_dataset.PetDataset",
    "parameters": {
      "dataset_root": "./build/PetDataset"
    }
  }
}

This JSON provides a configuration for running the model natively and evaluating it against a defined Dataset.

For every class in the JSON file above, two keys are required: type, which is the module path of the class, and parameters, which provides the arguments used to create an instance of that class.
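
In general, every block in a scenario follows the same shape (the names below are placeholders, not real classes or parameters):

"block_name": {
  "type": "module.path.of.the.Class",
  "parameters": {
    "parameter_name": "value"
  }
}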

In model_wrapper, we specify the model used for evaluation - here it is MobileNetV2 trained on PetDataset. The model_path is the path to the saved model. The TensorFlowPetDatasetMobileNetV2 model wrapper provides methods for loading the model, preprocessing the inputs, postprocessing the outputs and running inference using the native framework (TensorFlow in this case).

The dataset provided for evaluation is PetDataset - here we specify that we want to download the dataset to the ./build/PetDataset directory (dataset_root). The PetDataset class can download the dataset (if necessary), load it, read the inputs and outputs from files, process them, and provides evaluation methods for the model.

With the config above saved in the mobilenetv2-tensorflow-native.json file, run the kenning test scenario:

kenning test \
  --json-cfg mobilenetv2-tensorflow-native.json \
  --measurements build/native.json

This tool runs inference based on the given configuration, evaluates the model, and stores quality and performance metrics in JSON format in the build/native.json file. All other scenarios in this example use case can be run with the same command.
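
The measurements file is plain JSON, so it can also be inspected directly. Below is a minimal sketch; the exact keys and structure depend on the Kenning version and the collected report types:

import json

# Load the measurements produced by `kenning test`
with open("build/native.json") as f:
    measurements = json.load(f)

# List the top-level measurement keys (names depend on the Kenning version)
print(sorted(measurements.keys()))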

To visualize the evaluation and benchmark results, run the kenning report tool:

kenning report \
  --report-path build/benchmarks/native.md \
  --measurements build/native.json \
  --root-dir build/benchmarks \
  --img-dir build/benchmarks/img \
  --report-types performance classification \
  --report-name 'native'

The kenning report tool takes the output JSON file generated by the kenning test tool and creates a report titled native, which is saved to the build/benchmarks/native.md file. As specified with the --report-types flag, we create performance and classification metrics sections in the report (for example, there is also a detection report type for object detection tasks).

The build/benchmarks/img directory contains images with the native_* prefix visualizing the confusion matrix, CPU and memory usage, as well as inference time.

The build/benchmarks/native.md file is a Markdown document containing a full report for the model. The document links to the generated visualizations and provides aggregated information about CPU and memory usage, as well as classification quality metrics such as accuracy, sensitivity, precision and G-Mean. Such a file can be included in larger, Sphinx-based documentation, which allows for easy, automated report generation, e.g. in CI, as is done for the Kenning documentation.
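
For example, assuming the MyST parser is used to render Markdown sources in a Sphinx project, a minimal conf.py could look like this (the layout is hypothetical, and the generated report has to be copied into the documentation source tree):

# conf.py - minimal sketch for rendering Markdown reports alongside reStructuredText
extensions = [
    "myst_parser",  # enables Markdown (.md) sources
]
source_suffix = {
    ".rst": "restructuredtext",
    ".md": "markdown",
}

The report then only needs to be referenced from a toctree like any other document.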

We can also use a simplified command:

kenning report \
  --report-path build/benchmarks/native.md \
  --measurements build/native.json

The available --report-types will be derived from measurements in the build/native.json file.

Conveniently, the kenning test and kenning report commands can be combined into a single kenning call:

kenning test report \
  --json-cfg mobilenetv2-tensorflow-native.json \
  --measurements build/native.json \
  --report-path build/benchmarks/native.md

While native frameworks are great for model design and training - including training on GPUs and distributing it across many devices, e.g. in a cloud environment - there is a fairly large variety of inference-focused frameworks for production use that concentrate on getting the most out of the hardware in order to deliver results as fast as possible.

Optimizing a model using TensorFlow Lite

TensorFlow Lite is one such framework. It is a lightweight library for running neural network inference on edge devices - it has a small binary size (which can be reduced further by disabling unused operators) and a highly optimized input model format based on FlatBuffers.

Before the TensorFlow Lite Interpreter (runtime for the TensorFlow Lite library) can be used, the model first needs to be optimized and compiled to the .tflite format.
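
For reference, outside of Kenning a compiled .tflite model is executed with the TensorFlow Lite Interpreter roughly as follows (a sketch; the zero-filled input is a placeholder for a real preprocessed image):

import numpy as np
import tensorflow as tf

# Load the compiled FlatBuffer model and allocate input/output tensors
interpreter = tf.lite.Interpreter(model_path="./build/fp32.tflite")
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Placeholder input with the shape and dtype expected by the model
dummy_input = np.zeros(input_details[0]["shape"], dtype=input_details[0]["dtype"])
interpreter.set_tensor(input_details[0]["index"], dummy_input)

interpreter.invoke()
scores = interpreter.get_tensor(output_details[0]["index"])
print(scores.shape)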

Let’s add a TensorFlow Lite Optimizer that will convert our MobileNetV2 model to a FlatBuffer format, as well as a TensorFlow Lite Runtime that will execute the model:

Listing 7 mobilenetv2-tensorflow-tflite-f32.json
  "optimizers": [
    {
      "type": "kenning.optimizers.tflite.TFLiteCompiler",
      "parameters": {
        "target": "default",
        "compiled_model_path": "./build/fp32.tflite",
        "inference_input_type": "float32",
        "inference_output_type": "float32"
      }
    }

The configuration of the already existing blocks does not change, and the dataset will not be downloaded again since the files are already present.

The first new addition in comparison to the previous flow is the presence of the optimizers list, which allows us to add one or more objects inheriting from the kenning.core.optimizer.Optimizer class. Optimizers read the model from the input file, apply various optimizations, and then save the optimized model to a new file.

In our current scenario, we will use the TFLiteCompiler class - it reads the model in a Keras-specific format, optimizes the model and saves it to the ./build/fp32.tflite file. The parameters of this particular Optimizer are worth noting here (each Optimizer usually has a different set of parameters):

  • target - indicates the desired target device (or model type); default corresponds to a regular CPU. Another example here could be edgetpu, which compiles models for the Google Coral platform.

  • compiled_model_path - indicates where the model should be saved.

  • inference_input_type and inference_output_type - indicate what the input and output type of the model should be. Usually, all trained models use FP32 weights (32-bit floating point) and activations - using float32 here keeps the weights unchanged.

Another addition is the runtime block which provides a class inheriting from the kenning.core.runtime.Runtime class that is able to load the final model and run inference on target hardware. Usually, each Optimizer has a corresponding Runtime capable of running its results.
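
The runtime block is not shown in the fragment above; for this scenario it could look roughly like the following (the class path and the save_model_path parameter follow Kenning’s naming conventions - verify them against the Kenning documentation):

"runtime": {
  "type": "kenning.runtimes.tflite.TFLiteRuntime",
  "parameters": {
    "save_model_path": "./build/fp32.tflite"
  }
}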

To compile, evaluate and generate a report for this scenario (mobilenetv2-tensorflow-tflite-f32.json), run:

kenning optimize test report \
  --json-cfg mobilenetv2-tensorflow-tflite-f32.json \
  --measurements build/tflite-fp32.json \
  --report-path build/benchmarks/tflite-fp32.md

While it depends on the platform used, you should be able to see a significant improvement in both inference time (ca. 10-15x faster compared to the native model) and memory usage (output model ca. 2x smaller). What’s worth noting is that we get this significant improvement with no harm to the quality of the model - the outputs stay the same.

Figure: Confusion matrix

Quantizing a model using TensorFlow Lite

To further reduce memory usage, we can quantize the model - a process in which all weights and activations are calibrated to work with low-precision floating-point or integer representations (e.g. INT8) instead of the FP32 precision. While it may severely harm prediction quality, the quality reduction can be negligible with proper calibration.
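
As a rough illustration of the idea (independent of Kenning and TensorFlow Lite), affine INT8 quantization maps real values to 8-bit integers through a scale and zero point derived during calibration; the values below are made up:

import numpy as np

def quantize(x, scale, zero_point):
    # q = round(x / scale) + zero_point, clipped to the INT8 range
    q = np.round(x / scale) + zero_point
    return np.clip(q, -128, 127).astype(np.int8)

def dequantize(q, scale, zero_point):
    # x is approximately scale * (q - zero_point)
    return scale * (q.astype(np.float32) - zero_point)

# Example: a tensor calibrated to the range [-1, 1]
scale, zero_point = 2.0 / 255.0, 0
weights = np.array([-0.7, 0.0, 0.42], dtype=np.float32)
restored = dequantize(quantize(weights, scale, zero_point), scale, zero_point)
print(restored)  # close to the original values, up to a small rounding error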

The model can be quantized during the compilation process in TensorFlow Lite. With Kenning, it can be achieved with the following simple additions:

Listing 8 mobilenetv2-tensorflow-tflite-int8.json
  "optimizers": [
    {
      "type": "kenning.optimizers.tflite.TFLiteCompiler",
      "parameters": {
        "target": "int8",
        "compiled_model_path": "./build/int8.tflite",
        "inference_input_type": "int8",
        "inference_output_type": "int8"
      }
    }

The only changes in comparison to the previous configuration appear in the TFLiteCompiler parameters. We change target, inference_input_type and inference_output_type to int8. In the background, TFLiteCompiler fetches a subset of images from the PetDataset object to calibrate the model, and the entire calibration process happens automatically.
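
For context, this calibration corresponds roughly to TensorFlow Lite’s post-training quantization with a representative dataset. A sketch using plain TensorFlow is shown below; random data stands in for the PetDataset calibration samples, a 224x224 RGB input is assumed, and the paths are illustrative:

import numpy as np
import tensorflow as tf

def representative_dataset():
    # In Kenning, calibration samples come from the PetDataset object;
    # random data is used here only to keep the sketch self-contained
    for _ in range(100):
        yield [np.random.rand(1, 224, 224, 3).astype(np.float32)]

model = tf.keras.models.load_model("build/trained-model.h5")
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

with open("./build/int8.tflite", "wb") as f:
    f.write(converter.convert())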

Let’s run the scenario above (mobilenetv2-tensorflow-tflite-int8.json):

kenning optimize test report \
  --json-cfg mobilenetv2-tensorflow-tflite-int8.json \
  --measurements build/tflite-int8.json \
  --report-path build/benchmarks/tflite-int8.md

This results in a model over 7 times smaller than the native model without a significant loss of accuracy, but also without the speed improvement observed for the FP32 TensorFlow Lite model.

Speeding up inference with Apache TVM

To speed up inference of a quantized model, we can utilize vector extensions in x86 CPUs, more specifically AVX2. To do this, we can use the Apache TVM framework to compile efficient runtimes for various hardware platforms. The scenario looks like this:

Listing 9 mobilenetv2-tensorflow-tvm-avx-int8.json
  "optimizers": [
    {
      "type": "kenning.optimizers.tflite.TFLiteCompiler",
      "parameters": {
        "target": "int8",
        "compiled_model_path": "./build/int8.tflite",
        "inference_input_type": "int8",
        "inference_output_type": "int8"
      }
    },
    {
      "type": "kenning.optimizers.tvm.TVMCompiler",
      "parameters": {
        "target": "llvm -mcpu=core-avx2",
        "opt_level": 3,
        "conv2d_data_layout": "NCHW",
        "compiled_model_path": "./build/int8_tvm.tar"
      }
    }

As you can see, adding a new framework is just a matter of adding and configuring another Optimizer and using a Runtime corresponding to the final Optimizer.

The TVMCompiler, with llvm -mcpu=core-avx2 as the target, optimizes and compiles the model to use vector extensions. The final result is a .tar file containing a shared library that implements the entire model.
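
For context, this corresponds roughly to the following plain TVM flow (a sketch; the input tensor name and shape are assumptions, and the exact API may differ between TVM versions):

import tvm
import tflite  # flatbuffers-generated TFLite schema package
from tvm import relay

# Load the quantized TFLite model produced by the previous Optimizer
with open("./build/int8.tflite", "rb") as f:
    tflite_model = tflite.Model.GetRootAsModel(f.read(), 0)

# Convert to Relay IR; the input name and shape below are assumptions
mod, params = relay.frontend.from_tflite(
    tflite_model,
    shape_dict={"input_1": (1, 224, 224, 3)},
    dtype_dict={"input_1": "int8"},
)

# Compile for an x86 CPU with AVX2 vector extensions
target = tvm.target.Target("llvm -mcpu=core-avx2")
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target=target, params=params)

# Export the compiled model as a shared library packed into a .tar archive
lib.export_library("./build/int8_tvm.tar")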

Let’s compile the scenario (mobilenetv2-tensorflow-tvm-avx-int8.json):

kenning optimize test report \
  --json-cfg mobilenetv2-tensorflow-tvm-avx-int8.json \
  --measurements build/tvm-avx2-int8.json \
  --report-path build/benchmarks/tvm-avx2-int8.md

This results in a model over 40 times faster compared to the native implementation, with a 3x reduction in size.

This demonstrates how easily we can interconnect various frameworks and get the most out of hardware using Kenning, while performing just minor alterations to the configuration file.

The summary of passes looks as follows:

                Speed boost     Accuracy        Size reduction
native          1               0.9572730984    1
tflite-fp32     15.79405698     0.9572730984    1.965973551
tflite-int8     1.683232669     0.9519662539    7.02033412
tvm-avx2-int8   41.61514549     0.9487005035    3.229375069

Automated model comparison

The kenning report tool also allows us to compare evaluation results for multiple models. Apart from creating a model summary table, it also creates plots aggregating measurements collected during the evaluation process.

To create a comparison report for the above experiments, run:

kenning report \
    --report-path build/benchmarks/summary.md \
    --measurements \
        build/native.json \
        build/tflite-fp32.json \
        build/tflite-int8.json \
        build/tvm-avx2-int8.json

Some examples of comparisons between various models rendered by the kenning report tool:

  • Accuracy, inference time and model size comparison

  • Resource utilization distribution

  • Comparison of classification metrics

  • And more

