Model Formats#

As ML model applications increase, so too does the need for optimising the models for specific use-cases. To address performance-cost ratio and portability issues, there’s recently been a rise of competing model formats.

Table 9 Comparison of popular model formats#

| Feature | ONNX | GGML | TensorRT |
|---|---|---|---|
| Ease of Use | 🟢 good | 🟡 moderate | 🟡 moderate |
| Integration with Deep Learning Frameworks | 🟢 most | 🟡 growing | 🟡 growing |
| Deployment Tools | 🟢 yes | 🔴 no | 🟢 yes |
| | 🟢 yes | 🔴 no | 🔴 no |
| Inference Boost | 🟡 moderate | 🟢 good | 🟢 good |
| Quantisation Support | 🟡 good | 🟢 good | 🟡 moderate |
| Custom Layer Support | 🟢 yes | 🔴 limited | 🟢 yes |
| | LF AI & Data Foundation | | |

Is the table above outdated or missing an important model? Let us know in the comments below, or open a pull request!

Table 10 Model Formats Repository Statistics#

[Table: Commit Rate and Pull Requests per repository; the badge values are not preserved.]

Based on the above stats, ggml currently looks like the most popular library, followed by onnx. Note, however, that the onnx repositories are roughly 9x older than the ggml repositories.

ONNX feels truly OSS since it’s run by an OSS community, whereas GGML and friends and TensorRT are each run by an organisation (even though they are open source). Final decisions there are made by a single (sometimes closed) entity, whose preferences and biases can ultimately shape which features get built, even though all of these projects can have amazing communities at the same time.


ONNX (Open Neural Network Exchange) provides an open source format for AI models by defining an extensible computation graph model, as well as definitions of built-in operators and standard data types. It is widely supported and can be found in many frameworks, tools, and hardware enabling interoperability between different frameworks. ONNX is an intermediary representation of your model that lets you easily go from one environment to the next.

Features and Benefits#

Fig. 58

  • Model Interoperability: ONNX bridges AI frameworks, allowing seamless model transfer between them, eliminating the need for complex conversions.

  • Computation Graph Model: ONNX’s core is a graph model, representing AI models as directed graphs with nodes for operations, offering flexibility.

  • Standardised Data Types: ONNX establishes standard data types, ensuring consistency when exchanging models, reducing data type issues.

  • Built-in Operators: ONNX boasts a rich library of operators for common AI tasks, enabling consistent computation across frameworks.

  • ONNX Ecosystem:

    • microsoft/onnxruntime: A high-performance inference engine for cross-platform ONNX models.

    • onnx/onnxmltools: Tools for ONNX model conversion and compatibility with frameworks like TensorFlow and PyTorch.

    • onnx/models: A repository of pre-trained models converted to ONNX format for various tasks.

    • Hub: Helps with sharing and collaborating on ONNX models within the community.


Usability around ONNX is fairly well developed, with lots of community tooling support. Let’s see how we can export directly to ONNX and make use of it.

First, the model needs to be converted to ONNX format using a relevant converter. For example, a model created with PyTorch can be exported using torch.onnx.export.
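A minimal sketch of that export step, assuming PyTorch is installed and that the model is an ordinary `torch.nn.Module`. The helper name `export_to_onnx`, the tensor names, and the default opset of 17 are illustrative choices of ours, not something ONNX prescribes:

```python
def export_to_onnx(model, dummy_input, path="your_awesome_model.onnx", opset_version=17):
    """Trace `model` with `dummy_input` and write the resulting ONNX graph to `path`."""
    # Imported here so that merely defining the helper does not require PyTorch.
    import torch

    model.eval()  # switch off dropout / batch-norm updates before tracing
    torch.onnx.export(
        model,
        dummy_input,              # example input that fixes the traced shapes
        path,
        input_names=["input"],    # hypothetical tensor names, pick your own
        output_names=["output"],
        opset_version=opset_version,
    )
    return path


# Usage (assumes `my_model` takes a 1x3x224x224 float tensor):
# import torch
# export_to_onnx(my_model, torch.randn(1, 3, 224, 224))
```

The dummy input only needs the right shape and dtype; its values are irrelevant to the exported graph.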

Once exported we can load, manipulate, and run ONNX models. Let’s take a Python example:

To install the official onnx python package:

pip install onnx

To load, manipulate, and run ONNX models in your Python applications:

import onnx

# Load an ONNX model
model = onnx.load("your_awesome_model.onnx")

# Perform inference with the model
# (Specific inference code depends on your application and framework)
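For the inference step itself, one common option is ONNX Runtime. The sketch below assumes the `onnx` and `onnxruntime` packages are installed and that the model has a single input; the `run_onnx` helper name is ours:

```python
def run_onnx(model_path, input_array):
    """Validate an ONNX file, then run one forward pass on the CPU."""
    # Imported lazily so the helper can be defined without the packages present.
    import onnx
    import onnxruntime as ort

    # Structural validation: raises if the graph is malformed.
    onnx.checker.check_model(onnx.load(model_path))

    session = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])
    input_name = session.get_inputs()[0].name  # assumes exactly one model input
    # Passing None as the first argument returns every model output.
    return session.run(None, {input_name: input_array})


# Usage (shape and dtype must match what the model was exported with):
# import numpy as np
# outputs = run_onnx("your_awesome_model.onnx",
#                    np.zeros((1, 3, 224, 224), dtype=np.float32))
```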


Many frameworks/tools are supported, with many examples/tutorials at onnx/tutorials.

Inference runtime binding APIs are available in a few programming languages (Python, Rust, JS, Java, C#).

Where an ONNX model can run depends on the platforms the runtime library supports, through components called Execution Providers. Currently there are a few, ranging from CPU-based and GPU-based to IoT/edge-based ones and some others. A full list can be found here.
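As a rough illustration of choosing an Execution Provider, the sketch below filters a preference list against what the installed onnxruntime build reports. The preference order and both helper names are our assumptions; the provider name strings are the ones onnxruntime uses:

```python
# Preference order is our assumption: most specialised provider first, CPU as fallback.
PREFERRED_PROVIDERS = [
    "TensorrtExecutionProvider",
    "CUDAExecutionProvider",
    "CPUExecutionProvider",
]


def pick_providers(available, preferred=PREFERRED_PROVIDERS):
    """Return the preferred providers that this onnxruntime build actually offers."""
    chosen = [name for name in preferred if name in available]
    return chosen or ["CPUExecutionProvider"]


def make_session(model_path):
    """Create an InferenceSession using the best available providers."""
    import onnxruntime as ort  # lazy import: only needed when a session is built

    providers = pick_providers(ort.get_available_providers())
    return ort.InferenceSession(model_path, providers=providers)
```

onnxruntime tries the providers in the order given, so listing the CPU provider last keeps a working fallback.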

Onnxruntime has a few example tools that can be used to quantise select ONNX models. Support currently depends on the operators in the model. Read more here.

There is also visualisation tool support for models converted to ONNX format, such as lutzroeder/Netron and more, which is highly recommended for debugging purposes.


Currently ONNX is part of the LF AI & Data Foundation. It conducts regular steering committee meetings, and community meetups are held at least once a year. A few notable presentations from this year’s meetup:

Check out the full list here.


ONNX uses opset (operator set) numbers, which change with each minor/major release of the ONNX package; new opsets usually introduce new operators. The proper opset needs to be used when creating the ONNX model graph.
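To see which opsets a given model was built against, one can read its `opset_import` entries. A small sketch, where the helper name `opset_versions` is ours and the usage lines assume the `onnx` package is installed:

```python
def opset_versions(model):
    """Map each operator-set domain used by `model` to its version number."""
    # An empty domain string denotes the default "ai.onnx" operator set.
    return {entry.domain or "ai.onnx": entry.version for entry in model.opset_import}


# Usage (requires `pip install onnx`):
# import onnx
# model = onnx.load("your_awesome_model.onnx")
# print(opset_versions(model))            # opsets the model was built against
# print(onnx.defs.onnx_opset_version())   # newest opset the installed package knows
```

Comparing the two printed values is a quick way to spot a model that needs a newer opset than the installed package provides.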

Also it currently doesn’t support 4-bit quantisation (microsoft/onnxruntime#14997).

There are lots of open issues (microsoft/onnxruntime#12880, #10303, #7233, #17116) where users report slower inference after converting their models to ONNX than with the base model format, which shows that conversion might not be easy for all models. On similar grounds, a user commented 3 years ago here; though it’s old, a few points still seem relevant. The troubleshooting guide by the ONNX Runtime community can help with commonly faced issues.

The use of Protobuf for storing/reading ONNX models also seems to cause a few limitations, which are discussed here.

There’s a detailed failure analysis (video, ppt) done by James C. Davis and Purvish Jajal on ONNX converters.

Fig. 59 Analysis of Failures and Risks in Deep Learning Model Converters [143]#

The top findings were:

  • Crash (56%) and Wrong Model (33%) are the most common symptoms

  • The most common failure causes are Incompatibility and Type problems, each making up ∼25% of causes

  • The majority of failures are located in the Node Conversion stage (74%), with a further 10% in the Graph optimisation stage (mostly from tf2onnx).


ggerganov/ggml is a tensor library for machine learning to enable large models and high performance on commodity hardware – the “GG” refers to the initials of its originator Georgi Gerganov. In addition to defining low-level machine learning primitives like a tensor type, GGML defines a binary format for distributing large language models (LLMs). llama.cpp and whisper.cpp are based on it.

Features and Benefits#

  • Written in C

  • 16-bit float and integer quantisation support (e.g. 4-bit, 5-bit, 8-bit)

  • Automatic differentiation

  • Built-in optimisation algorithms (e.g. ADAM, L-BFGS)

  • Optimised for Apple Silicon, on x86 arch utilises AVX / AVX2 intrinsics

  • Web support via WebAssembly and WASM SIMD

  • No third-party dependencies

  • Zero memory allocations during runtime

To know more, see their manifesto here


Overall GGML is moderate in terms of usability given it’s a fairly new project and growing, but has lots of community support already.

Here’s an example inference of GPT-2 GGML:

git clone
cd ggml
mkdir build && cd build
cmake ..
make -j4 gpt-2

# Run the GPT-2 small 117M model
../examples/gpt-2/ 117M
./bin/gpt-2 -m models/gpt-2-117M/ggml-model.bin -p "This is an example"


For usage, the model should be saved in the GGML file format, which consists of binary-encoded data laid out in a particular way: the format specifies what kind of data is present in the file, how it is represented, and the order in which it appears.
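To make "binary-encoded data in a fixed order" concrete, here is a toy writer/reader for a header made of 32-bit integers. The magic value 0x67676d6c (ASCII "ggml") is, as far as we know, what the GPT-2 conversion script writes; the field list and its order are illustrative, and this is not a complete GGML file:

```python
import io
import struct

GGML_MAGIC = 0x67676D6C  # the bytes "ggml" read as a 32-bit integer (assumed value)
FIELDS = ("n_vocab", "n_ctx", "n_embd", "n_head", "n_layer")  # illustrative order


def write_header(stream, hparams):
    """Write the magic number followed by each hyperparameter as a little-endian int32."""
    stream.write(struct.pack("<i", GGML_MAGIC))
    for name in FIELDS:
        stream.write(struct.pack("<i", hparams[name]))


def read_header(stream):
    """Read the header back; the reader must know the field order and types in advance."""
    (magic,) = struct.unpack("<i", stream.read(4))
    if magic != GGML_MAGIC:
        raise ValueError(f"not a GGML file: bad magic {magic:#x}")
    values = struct.unpack(f"<{len(FIELDS)}i", stream.read(4 * len(FIELDS)))
    return dict(zip(FIELDS, values))


# Round-trip through an in-memory buffer so the sketch runs without a model file.
buffer = io.BytesIO()
write_header(buffer, {"n_vocab": 50257, "n_ctx": 1024, "n_embd": 768,
                      "n_head": 12, "n_layer": 12})
buffer.seek(0)
header = read_header(buffer)
print(header)
```

Nothing in the file names the fields; the meaning of each integer comes purely from its position, which is why the order matters so much.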

For a valid GGML file the following pieces of information should be present in order:

  1. GGML version number: To support rapid development without sacrificing backwards-compatibility, GGML uses versioning to introduce improvements that may change the format of the encoding. The first value present in a valid GGML file is a “magic number” that indicates the GGML version that was used to encode the model. Here’s a GPT-2 conversion example where it’s getting written.

  2. Components of LLMs:

    1. Hyperparameters: These are parameters which configure the behaviour of models. Valid GGML files list these values in the correct order, with each value represented using the correct data type. Here’s an example for GPT-2.

    2. Vocabulary: These are all supported tokens for a model. Here’s an example for GPT-2.

    3. Weights: These are also called parameters of the model. The total number of weights in a model is referred to as the “size” of that model. In GGML format a tensor consists of a few components:

      • Name

      • 4-element list representing the number of dimensions in the tensor and their lengths

      • List of weights in the tensor

      Let’s consider the following weights:

      weight_1 = [[0.334, 0.21], [0.0, 0.149]]
      weight_2 = [0.123, 0.21, 0.31]

      Then GGML representation would be:

      {"weight_1", [2, 2, 1, 1], [0.334, 0.21, 0.0, 0.149]}
      {"weight_2", [3, 1, 1, 1], [0.123, 0.21, 0.31]}

      For each weight representation the first list denotes dimensions and second list denotes weights. Dimensions list uses 1 as a placeholder for unused dimensions.
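The weight representation above can be sketched in a few lines of plain Python. The helper names are ours, and the dimension order simply follows the example above (outermost length first); how real GGML files order dimensions for non-square tensors is not covered here:

```python
def shape_of(nested):
    """Return the length of each nesting level of a regular nested list."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        if not nested:  # stop at an empty list instead of indexing into it
            break
        nested = nested[0]
    return dims


def flatten(nested):
    """Flatten a nested list of numbers into one flat list, row by row."""
    if not isinstance(nested, list):
        return [nested]
    flat = []
    for item in nested:
        flat.extend(flatten(item))
    return flat


def to_ggml_style(name, weights, max_dims=4):
    """Build the (name, padded dimensions, flat weights) triple described above."""
    dims = shape_of(weights)
    if len(dims) > max_dims:
        raise ValueError(f"at most {max_dims} dimensions are supported")
    padded = dims + [1] * (max_dims - len(dims))  # 1 marks an unused dimension
    return (name, padded, flatten(weights))


print(to_ggml_style("weight_1", [[0.334, 0.21], [0.0, 0.149]]))
print(to_ggml_style("weight_2", [0.123, 0.21, 0.31]))
```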


Quantisation is a process where high-precision floating point values are converted to low-precision values. This reduces the resources required to use the values in a tensor, making the model easier to run on low-resource hardware. GGML uses a hacky version of quantisation and supports a number of different quantisation strategies (e.g. 4-bit, 5-bit, and 8-bit quantisation), each of which offers different trade-offs between efficiency and performance. Check out this amazing article by Merve for a quick walkthrough.
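To give a feel for the trade-off, here is a toy symmetric quantiser in which one block of floats shares a single scale. It is a sketch of the general idea only; GGML's real formats differ in block size, bit packing, and rounding details, and the function names are ours:

```python
def quantise_block(values, bits=4):
    """Quantise one block of floats to signed integers sharing a single scale."""
    qmax = 2 ** (bits - 1) - 1                  # 7 for 4-bit
    largest = max(abs(v) for v in values)
    scale = largest / qmax if largest else 1.0  # avoid dividing by zero for an all-zero block
    quants = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return scale, quants


def dequantise_block(scale, quants):
    """Recover approximate floats from the stored scale and integers."""
    return [scale * q for q in quants]


weights = [0.334, 0.21, 0.0, 0.149, -0.5, 0.05, 0.27, -0.12]
scale, quants = quantise_block(weights)
restored = dequantise_block(scale, quants)
worst_error = max(abs(a - b) for a, b in zip(weights, restored))
print(quants, f"worst error = {worst_error:.4f}")
```

Each weight now needs only 4 bits plus a share of one float scale per block, at the cost of a rounding error of at most half the scale per value.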


Its most used projects include:

  • whisper.cpp

    High-performance inference of OpenAI’s Whisper automatic speech recognition model. The project provides a high-quality speech-to-text solution that runs on Mac, Windows, Linux, iOS, Android, Raspberry Pi, and Web.

    Optimised version for Apple Silicon is also available as a Swift package.

  • llama.cpp

    Inference of Meta’s LLaMA large language model.

    The project demonstrates efficient inference on Apple Silicon hardware and explores a variety of optimisation techniques and applications of LLMs.

Inference and training of many open sourced models (StarCoder, Falcon, Replit, Bert, etc.) are already supported in GGML. Track the full list of updates here.


TheBloke currently has lots of LLM variants already converted to GGML format.

Discussion of GPU-based inference support for GGML format models was initiated a few months back. Examples started with MNIST CNN support, followed by other examples showing full GPU inference on Apple Silicon using Metal, offloading layers to CPU, and making use of GPU and CPU together.

See the llamacpp part of LangChain’s docs for how to use a GPU or Metal for GGML model inference.
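A rough sketch of what such a setup can look like, assuming LangChain's `LlamaCpp` wrapper together with a GPU-enabled build of llama-cpp-python. The import path varies between LangChain releases, and the helper name, default values, and model path are placeholders of ours:

```python
def load_llama_on_gpu(model_path, n_gpu_layers=40, n_batch=512):
    """Build a LangChain LlamaCpp LLM that offloads some layers to the GPU."""
    # Import path is an assumption: older LangChain releases expose
    # `langchain.llms.LlamaCpp`, newer ones `langchain_community.llms.LlamaCpp`.
    from langchain_community.llms import LlamaCpp

    return LlamaCpp(
        model_path=model_path,      # local model file in the format your llama-cpp-python expects
        n_gpu_layers=n_gpu_layers,  # how many layers to move to the GPU
        n_batch=n_batch,            # tokens processed in parallel; bounded by GPU memory
        verbose=True,               # prints llama.cpp's load log, which reports the offload
    )


# Usage (the model path is a placeholder):
# llm = load_llama_on_gpu("models/your-model.bin")
# print(llm.invoke("This is an example"))
```

Lowering `n_gpu_layers` keeps more of the model on the CPU, which is the usual knob when the model does not fit in GPU memory.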

Currently Speculative Decoding for sampling tokens is being implemented (ggerganov/llama.cpp#2926) for Code LLaMA inference as a POC, which as an example promises full float16 precision 34B Code LLaMA at >20 tokens/sec on M2 Ultra.


GGUF format#

There’s a new successor format to GGML named GGUF, introduced by the llama.cpp team on August 21st 2023. It is an extensible, future-proof format which stores more information about the model as metadata. It also includes significantly improved tokenisation code, including for the first time full support for special tokens. It promises to improve performance, especially with models that use new special tokens and implement custom prompt templates.
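As a small illustration of the metadata-first design, the sketch below parses the fixed-size start of a GGUF file. The layout (4-byte magic, uint32 version, then two uint64 counts, little-endian) is our reading of the spec for version 2 and later, so treat it as an assumption and check the spec before relying on it:

```python
import io
import struct

GGUF_MAGIC = b"GGUF"


def read_gguf_header(stream):
    """Parse the fixed-size start of a GGUF file (layout assumed for version 2 and later)."""
    magic = stream.read(4)
    if magic != GGUF_MAGIC:
        raise ValueError(f"not a GGUF file: magic was {magic!r}")
    # uint32 version, then uint64 tensor count and uint64 metadata key/value count.
    version, tensor_count, kv_count = struct.unpack("<IQQ", stream.read(20))
    return {"version": version, "tensor_count": tensor_count, "metadata_kv_count": kv_count}


# Round-trip on an in-memory header so the sketch runs without a model file;
# the counts below are made-up values.
fake_file = io.BytesIO(GGUF_MAGIC + struct.pack("<IQQ", 3, 291, 19))
print(read_gguf_header(fake_file))
```

The metadata key/value pairs follow this header, which is where GGUF stores the extra model information mentioned above.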

Some clients & libraries supporting GGUF include:

  • ggerganov/llama.cpp

  • oobabooga/text-generation-webui – the most widely used web UI, with many features and powerful extensions

  • LostRuins/koboldcpp – a fully featured web UI, with full GPU accel across multiple platforms and GPU architectures. Especially good for story telling

  • ParisNeo/lollms-webui – a great web UI with many interesting and unique features, including a full model library for easy model selection

  • marella/ctransformers – a Python library with GPU accel, LangChain support, and OpenAI-compatible AI server

  • abetlen/llama-cpp-python – a Python library with GPU accel, LangChain support, and OpenAI-compatible API server

  • huggingface/candle – a Rust ML framework with a focus on performance, including GPU support, and ease of use

  • LM Studio – an easy-to-use and powerful local GUI with GPU acceleration on both Windows (NVidia and AMD), and macOS

See also

For more info on GGUF, see ggerganov/llama.cpp#2398 and its spec.


  • Models are mostly quantised versions of the actual models, so they take a slight hit in quality, if not much. Such cases have been reported and are entirely expected from a quantised model; some numbers can be found in this reddit discussion.

  • GGML is mostly focused on Large Language Models, but is surely looking to expand.



TensorRT is an SDK for deep learning inference by NVIDIA. It provides APIs and parsers to import trained models from all major deep learning frameworks, and then generates optimised runtime engines deployable in diverse systems.

Features and Benefits#

TensorRT’s main capability is producing high-performance inference engines. A few notable features include:

TensorRT can also act as a provider when using onnxruntime: by setting the proper Execution Provider, it delivers better inference performance on the same hardware compared to generic GPU acceleration.


Using NVIDIA’s TensorRT containers can ease setup, provided the required versions of TensorRT and the CUDA toolkit (if needed) are known.

Fig. 60 Path to convert and deploy with TensorRT.#


When creating a serialised TensorRT engine, besides using TF-TRT or ONNX, one can also manually construct a network using the TensorRT API (C++ or Python) for higher customisability.

TensorRT also includes a standalone runtime with C++ and Python bindings, apart from directly using NVIDIA’s Triton Inference server for deployment.

ONNX has a TensorRT backend that parses ONNX models for execution with TensorRT, with both Python and C++ support. The current full list of supported ONNX operators for TensorRT is maintained here. It supports the DOUBLE, FLOAT32, FLOAT16, INT8 and BOOL ONNX data types, with limited support for INT32, INT64 and DOUBLE types.

NVIDIA also maintains some tooling around TensorRT:

  • trtexec: For easy generation of TensorRT engines and benchmarking.

  • Polygraphy: A Deep Learning Inference Prototyping and Debugging Toolkit

  • trt-engine-explorer: It contains Python package trex to explore various aspects of a TensorRT engine plan and its associated inference profiling data.

  • onnx-graphsurgeon: It helps easily generate new ONNX graphs, or modify existing ones.

  • polygraphy-extension-trtexec: A Polygraphy extension which adds support for running inference with trtexec for multiple backends, including TensorRT and ONNX-Runtime, and comparing outputs.


Currently every model checkpoint one creates needs to be recompiled first to ONNX and then to TensorRT, so for using microsoft/LoRA it has to be added into the model at compile time. More issues can be found in this reddit post.

INT4 and INT16 quantisation are not supported by TensorRT currently. Current quantisation support can be found here.

Many ONNX operators are not yet supported by TensorRT, and a few of the supported ones have restrictions.

It offers no interoperability, since conversion to ONNX or TF-TRT format is a necessary step and has intricacies which need to be handled for custom requirements.





This chapter is still being written & reviewed. Please do post links & discussion in the comments below, or open a pull request!



Missing something important? Let us know in the comments below, or open a pull request!