State of Open Source AI Book - 2023 Edition#


Clarity in the current fast-paced mess of Open Source innovation [1]

As a data scientist/ML engineer/developer with a 9 to 5 job, it’s difficult to keep track of all the innovations. There’s been enormous progress in the field in the last year.

Cure your FOMO with this guide, covering all the most important categories in the Open Source AI space, from model evaluations to deployment. It includes a Glossary for you to quickly check definitions of new frameworks & tools.

A quick TL;DR overview is included at the top of each section. We outline the pros/cons and general context/background for each topic. Then we dive a bit deeper. Examples include the data that models were trained on, and deployment implementations.

Who is This Guide For?#

Prerequisites to Reading

You should already know the basic principles of MLOps [2, 3, 4], i.e. you should know that the traditional steps are:

  1. Data engineering (preprocessing, curation, labelling, sanitisation)

  2. Model engineering (training, architecture design)

  3. Automated testing (CI)

  4. Deployment/Automated Inference (CD)

  5. Monitoring (logging, feedback, drift detection)

You haven’t followed the most recent developments in open source AI over the last year, and want to catch up quickly. We go beyond just mentioning the models, and also cover things such as changing infrastructure, licence pitfalls, and novel applications.

Table of Contents#

We’ve divided the open-source tooling, models, & MLOps landscape into the following chapters:

  • Licences: Weights vs Data, Commercial use, Fair use, Pending lawsuits

  • Evaluation & Datasets: Leaderboards & Benchmarks for Text/Visual/Audio models

  • Models: LLaMA 1 vs 2, Stable Diffusion, DALL-E, Persimmon, …

  • Unaligned Models: FraudGPT, WormGPT, PoisonGPT, WizardLM, Falcon

  • Fine-tuning: LLMs, Visual, & Audio models

  • Model Formats: ONNX, GGML, TensorRT

  • MLOps Engines: vLLM, TGI, Triton, BentoML, …

  • Vector Databases: Weaviate, Qdrant, Milvus, Redis, Chroma, …

  • Software Development toolKits: LangChain, LLaMA Index, LiteLLM

  • Desktop Apps: LMStudio, GPT4All, Koboldcpp, …

  • Hardware: NVIDIA CUDA, AMD ROCm, Apple Silicon, Intel, TPUs, …

Contributing#

The source of this guide is available on GitHub at premAI-io/state-of-open-source-ai.

Feedback

The current open-source ecosystem is moving at light-speed. Spot something outdated or missing? Want to start a discussion? We welcome any of the following:

Editing the Book#

  • Using GitHub Codespaces, you can edit code & preview the site in your browser without installing anything (you may have to whitelist github.dev, visualstudio.com, github.com, & trafficmanager.net if you use an adblocker).

  • Alternatively, to run locally, open this repository in a Dev Container (most likely using VSCode).

  • Or instead, manually set up your own Python environment:

    pip install -r requirements.txt              # setup
    jupyter-book build --builder dirhtml --all . # build
    python -m http.server -d _build/dirhtml      # serve
    

Formatting#

Note

Don’t worry about making it perfect; it’s fine to open a (draft) PR and allow edits from maintainers to fix it ♥

Contributors#

Anyone who adds a few sentences to a chapter is automatically mentioned in the respective chapter as well as below.

  • Editor: Casper da Costa-Luis (casperdcl)

    With a strong academic background as well as industry expertise to back up his enthusiasm for all things open source, Casper is happy to help with all queries related to this book.

  • Maintainer: PremAI-io

    Our vision is to engineer a world where individuals, developers, and businesses can embrace the power of AI without compromising their privacy. We believe in a future where users retain ownership of their data, AND the models trained on it.

  • Citing this book: [1]

Conclusion#

All models are wrong, but some are useful

—G.E.P. Box [6]

Open Source AI represents the future of privacy and ownership of data. Making this happen, however, will require a lot of innovation. In the last year, the open-source community has already demonstrated how motivated it is to deliver quality models into the hands of consumers, producing a few big innovations across different AI fields. At the same time, this is just the beginning. Many improvements in multiple directions must be made before the results are comparable with centralised solutions.

At Prem we are on a journey to make this possible, with a focus on developer experience and deployment for all sorts of developers, from web developers with zero knowledge of AI to established data scientists who want to quickly deploy and try these new models and technologies in their existing infrastructure without compromising privacy.

Join our Community#

Glossary#

Alignment#

Aligned AI models must implement safeguards to be helpful, honest, and harmless [7]. This often involves supervised fine-tuning followed by RLHF. See Unaligned Models and Fine-tuning.

Auto-regressive language model#

Applies auto-regression (AR) to LLMs. Essentially a feed-forward model which predicts the next word given a context (set of words) [8].
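
As a rough illustration of next-word prediction, the sketch below swaps the neural network for a toy bigram counter (an assumption made purely for brevity; real LLMs are far larger learned models). The names `train_bigram` and `generate` are hypothetical. What it does show is the auto-regressive loop: each predicted word is appended to the context and used to predict the next one.

```python
from collections import Counter, defaultdict


def train_bigram(corpus):
    """Count which word follows which in a whitespace-split corpus."""
    counts = defaultdict(Counter)
    words = corpus.split()
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return counts


def generate(counts, start, max_words=5):
    """Greedy auto-regressive decoding: feed each output back in as context."""
    out = [start]
    for _ in range(max_words):
        followers = counts.get(out[-1])  # .get() avoids creating empty entries
        if not followers:
            break  # no known continuation for this word
        out.append(followers.most_common(1)[0][0])
    return out


# Toy corpus (made up for illustration)
counts = train_bigram("the cat sat on the mat and the cat slept")
generated = generate(counts, "on", max_words=3)
print(" ".join(generated))
```

A real model conditions on the whole context rather than only the previous word, and usually samples from the predicted distribution instead of always taking the most likely word.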

BEC#

Business Email Compromise.

Benchmark#

A curated dataset and corresponding tasks designed to evaluate models’ real-world performance metrics (so that models can be compared to each other).

Copyleft#

A type of open licence which insists that derivatives of the IP must have the same licence. Also called “protective” or “reciprocal” [9].

Embedding#

See vector embedding.

Evaluation#

Assessing a model’s abilities using quantitative and qualitative performance metrics (e.g. accuracy, effectiveness, etc.) on a given task. See Evaluation & Datasets.

Fair Dealing#

A doctrine in UK & Commonwealth law permitting use of IP without prior permission under certain conditions (typically research, criticism, reporting, or satire) [10]. See also fair use.

Fair Use#

A doctrine in US law permitting use of IP without prior permission (regardless of licence/copyright status) depending on 1) purpose of use, 2) nature of the IP, 3) amount of use, and 4) effect on value [11]. See also fair dealing.

Foundation model#

A model trained from scratch – likely on lots of data – to be used for general tasks or later fine-tuned for specific tasks.

GPU#

Graphics Processing Unit: hardware originally designed to accelerate computer image processing, but now often repurposed for embarrassingly parallel computational tasks in machine learning.

Hallucination#

A model generating output that cannot be explained by its training data.

IP#

Intellectual Property: intangible creations by humans (e.g. code, text, art), typically legally protected from use without permission of the author(s).

Leaderboard#

Ranking of models based on their performance metrics on the same benchmark(s), allowing fair task-specific comparison. See Comparison of Leaderboards.

LLM#

A Large Language Model is a neural network (often a transformer containing billions of parameters) designed to perform tasks in natural language via fine-tuning or prompt engineering.

MLOps#

Machine Learning Operations: best practices to run AI using software products & cloud services.

MoE#

Mixture-of-Experts is a technique which uses one or more specialist model(s) from a collection of models (“experts”) to solve general problems. Note that this is different from ensemble models (which combine results from all models).
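
A minimal sketch of the difference between routing and ensembling, assuming three toy “experts” and a hand-written gate. Both are hypothetical: in practice the experts are sub-networks and the gate is learned alongside them.

```python
# Three toy "experts" (hypothetical stand-ins for specialist models)
experts = [lambda x: 2 * x, lambda x: x * x, lambda x: -x]


def gate(x):
    """Hypothetical gate: one score per expert, depending on the input."""
    return [0.7, 0.2, 0.1] if x >= 0 else [0.1, 0.2, 0.7]


def moe(x, k=1):
    """Mixture-of-Experts: only the top-k scoring experts are evaluated."""
    scores = gate(x)
    top = sorted(range(len(experts)), key=lambda i: scores[i], reverse=True)[:k]
    total = sum(scores[i] for i in top)
    # Renormalise the selected scores so the weights sum to 1
    return sum(scores[i] / total * experts[i](x) for i in top)


def ensemble(x):
    """Ensemble: every expert is evaluated and the results are averaged."""
    return sum(expert(x) for expert in experts) / len(experts)


print(moe(3), moe(-3), ensemble(3))
```

With `k=1` only one expert runs per input, which is where the compute saving relative to an ensemble comes from.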

Open#

Ambiguous term that could mean “open source” or “open licence”. See Meaning of “Open”.

Permissive#

A type of open licence which allows reselling and closed-source modifications, and can often be used in larger projects alongside other licences. Usually, the only condition of use is citing the author by name.

Perplexity#

Perplexity is a metric based on entropy, and is a rough measure of the difficulty/uncertainty in a prediction problem.
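
One common formulation (stated here as an assumption, since the entry above gives no formula) is the exponential of the average negative log-probability that a model assigns to each observed token. The sketch below computes it for two made-up lists of probabilities.

```python
import math


def perplexity(token_probs):
    """exp of the mean negative log-probability of the observed tokens."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)


# Made-up probabilities, for illustration only
uniform = perplexity([0.25] * 4)          # guessing uniformly among 4 options
confident = perplexity([0.9, 0.8, 0.95])  # mostly sure of each token

print(uniform, confident)
```

A model that guesses uniformly among 4 options has a perplexity of about 4, i.e. it is as uncertain as a 4-way choice. Lower values indicate less uncertainty.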

Public Domain#

“Open” IP owned by nobody (often due to the author disclaiming all rights) and thus can be used by anyone without restrictions. Technically a disclaimer/non-licence. See Open licence subcategories.

RAG#

Retrieval Augmented Generation.

RLHF#

Reinforcement Learning from Human Feedback is often the second step in alignment (after supervised fine-tuning), where a model is rewarded or penalised for its outputs based on human evaluation. See Fine-tuning and Unaligned Models.

ROME#

The Rank-One Model Editing algorithm alters a trained model’s weights to directly modify “learned” information [12, 13].

SIMD#

Single Instruction, Multiple Data is a data-level parallel processing technique where one computational instruction is applied to multiple data simultaneously.

SotA#

State of the art: recent developments (under 1 year old).

Supervised fine-tuning#

SFT is often the first step in model alignment, and is usually followed by RLHF. See Fine-tuning and Unaligned Models.

Quantisation#

Sacrificing precision of model weights (e.g. uint8 instead of float32) in return for lower hardware memory requirements.
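
A minimal sketch of the trade-off, assuming simple affine (min/max) quantisation to 8-bit integers. Real schemes (e.g. per-block or per-channel) are more elaborate, and the helper names and weights here are hypothetical.

```python
def quantise(weights, levels=256):
    """Map floats onto integers 0..levels-1 using a shared scale and offset."""
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / (levels - 1) or 1.0  # avoid a zero scale if all equal
    q = [round((w - lo) / scale) for w in weights]
    return q, scale, lo


def dequantise(q, scale, lo):
    """Recover approximate floats from the stored integers."""
    return [v * scale + lo for v in q]


weights = [-1.0, -0.5, 0.0, 0.25, 1.0]  # made-up float32-style weights
q, scale, lo = quantise(weights)
restored = dequantise(q, scale, lo)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(q, max_error)
```

Each weight now fits in one byte instead of four, at the cost of a rounding error of at most half a quantisation step (`scale / 2`).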

Token#

A token is a “unit of text” for an LLM to process/generate. A single token could represent a few characters or words, depending on the tokenisation method chosen. Tokens are usually embedded.
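
To illustrate how the choice of tokenisation method changes what a “token” is, the sketch below compares two deliberately naive schemes: splitting on whitespace, and cutting the text into fixed-size character chunks. Both are assumptions made for illustration; production tokenisers typically learn sub-word vocabularies instead. All function names are hypothetical.

```python
def word_tokens(text):
    """One token per whitespace-separated word."""
    return text.split()


def char_chunks(text, size=4):
    """One token per fixed-size run of characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def to_ids(tokens):
    """Assign each distinct token an integer ID in order of first appearance."""
    vocab = {}
    ids = [vocab.setdefault(tok, len(vocab)) for tok in tokens]
    return ids, vocab


text = "open source open models"
ids, vocab = to_ids(word_tokens(text))

print(word_tokens(text), ids)
print(char_chunks(text))
```

The integer IDs are what a model actually consumes; each ID is then mapped to an embedding.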

Transformer#

A transformer is a neural network using a parallel multi-head attention mechanism. The resulting reduced training time makes it well-suited for use in LLMs.
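
To make “attention” concrete, here is a sketch of single-head scaled dot-product attention in plain Python. It is a simplification: real transformers run several such heads in parallel on learned projections of the input, using optimised tensor libraries. The function names and inputs are hypothetical.

```python
import math


def softmax(xs):
    """Turn raw scores into weights that sum to 1."""
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """For each query, return a weighted average of the rows of V."""
    d = len(K[0])
    out = []
    for q in Q:
        # Similarity of this query to every key, scaled by sqrt(d)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out


# Two identical keys, so the query attends to both values equally
result = attention(Q=[[1, 1]], K=[[1, 0], [1, 0]], V=[[0, 2], [4, 6]])
print(result)
```

Every query is processed independently of the others, which is what allows the computation to be parallelised.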

Vector Database#

Vector databases provide efficient storage & search/retrieval for vector embeddings. See Vector Databases.
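
Conceptually, retrieval boils down to nearest-neighbour search over embeddings. The brute-force sketch below uses made-up 3-dimensional vectors and cosine similarity. Real vector databases add indexing (typically approximate nearest-neighbour search) so that this scales, and the names used here are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity: 1.0 means the vectors point in the same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up embeddings; real ones have hundreds or thousands of dimensions
store = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.2, 0.1],
    "car": [0.0, 0.1, 0.9],
}


def search(query, k=2):
    """Return the k stored keys most similar to the query vector."""
    ranked = sorted(store, key=lambda name: cosine(query, store[name]),
                    reverse=True)
    return ranked[:k]


nearest = search([1.0, 0.0, 0.0], k=2)
print(nearest)
```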

Vector Embedding#

Embedding means encoding tokens into a numeric vector (i.e. array/list). This can be thought of as an intermediary between machine and human language, and thus helps LLMs understand human language. See LLM Embeddings.

Vector Store#

See vector database.