AI Engineering - Building Applications with Foundation Models by Chip Huyen

Tags: books ai llm

Written by: Tushar Sharma
These are my rough notes while reading this book.


Overview

This book focuses on the engineering aspects of building applications using foundation models (LLMs). It covers the full lifecycle from understanding models to deployment and monitoring.


Chapter 1: Introduction to Building AI Applications with Foundation Models

Previously, Software as a Service (SaaS) was the dominant way to deliver software, e.g. Cloudflare, Okta, etc. Now we also have Model as a Service: companies like Google (Gemini), Anthropic (Claude), and OpenAI (ChatGPT) develop models that others build applications on top of.

Language models date back to the 1950s and are precursors to modern large language models. A language model predicts how likely a word (or token) is to appear in a given context, e.g. "My favourite color is ___": "blue" is more likely than "car".

The basic unit of a language model is called a token. A token can be a character, a word, or a part of a word. The process of breaking text into tokens is called tokenization.

The model's vocabulary is a fixed lookup table mapping every known token to a unique number (ID). The model never sees raw text — only sequences of these IDs. For example, "what's the capital of north carolina?" might become [3493, 596, 279, 6864, 315, 4892, 15696, 30].

Vocabulary size is how many entries are in that table. GPT-4 has 100,256 entries; Mixtral 8x7B has 32,000.
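To see this in practice, here is a small sketch using OpenAI's tiktoken library and its cl100k_base encoding (the tokenizer associated with GPT-4); the exact IDs shown earlier are illustrative and depend on the encoding used.

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")   # GPT-4's tokenizer

ids = enc.encode("what's the capital of north carolina?")
print(ids)              # a list of integer token IDs
print(enc.decode(ids))  # round-trips back to the original text
print(enc.n_vocab)      # vocabulary size (roughly 100k entries for this encoding)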

This matters in two ways:


Two types of language models:

  1. Masked language model — trained to predict a missing token anywhere in a sequence, using context from both before and after the gap. Example: BERT (Bidirectional Encoder Representations from Transformers).

  2. Autoregressive language model — trained to predict the next token in a sequence, using only the preceding tokens.

The key difference: masked models are bidirectional (they see both sides of a gap), while autoregressive models are unidirectional (they only look backwards). This makes autoregressive models natural text generators — they produce output left to right, just like we write.

Masked LM:     "Why does the [?] cross the road?"
               ← context ←         → context →
               Predicts: "chicken" (using both sides)

Autoregressive LM:  "Why does the chicken" → predicts "cross"
                     ← only previous tokens used
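A minimal sketch of what "predict the next token, then feed it back in" looks like in code, using GPT-2 via the Hugging Face transformers library (assuming torch and transformers are installed; this uses greedy decoding purely for illustration).

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "Why does the chicken"
ids = tokenizer(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    for _ in range(5):                       # generate 5 tokens, one at a time
        logits = model(ids).logits           # shape [1, seq_len, vocab_size]
        next_id = logits[0, -1].argmax()     # greedy: most likely next token
        ids = torch.cat([ids, next_id.view(1, 1)], dim=1)

print(tokenizer.decode(ids[0]))

Each step conditions only on the tokens generated so far, which is exactly the unidirectional behaviour described above.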

Models can generate open-ended outputs — that's why they're called generative. The completions are predictions based on probabilities, not guaranteed to be correct.


Self-supervision is how language models are trained. Unlike supervised learning (e.g. a weather forecasting model trained on labelled historical data), self-supervision doesn't require human-labelled data. Instead, the model infers its own labels from the input: given a sequence of tokens, the model uses some tokens as context and treats the remaining token(s) as the label to predict. This sidesteps the expensive and slow process of manual data labelling.
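A toy illustration of how (context, label) training pairs fall out of the text itself; the token IDs below are just the hypothetical ones from the earlier tokenization example.

# Self-supervision sketch: the labels come from the text itself.
# Each prefix of the token sequence becomes a training example whose
# label is simply the next token.
tokens = [3493, 596, 279, 6864, 315, 4892, 15696, 30]

examples = []
for i in range(1, len(tokens)):
    context = tokens[:i]   # what the model sees
    label = tokens[i]      # what it must predict
    examples.append((context, label))

for context, label in examples[:3]:
    print(context, "->", label)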


A model's size is measured by its number of parameters, the values inside the model that are adjusted during training. More parameters generally means more capacity to learn patterns.
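If the model's weights are available, you can count the parameters directly; a quick sketch using GPT-2 from Hugging Face transformers (assumed installed).

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params:,} parameters")   # roughly 124 million for GPT-2 small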

GPT stands for Generative Pre-trained Transformer.

A foundation model is a large model pre-trained on broad data that can be adapted (fine-tuned) for many downstream tasks.

A multimodal model is one that can work with more than one type of data (modality), e.g. both text and images. A generative multimodal model is also called a large multimodal model (LMM); "large language model" strictly refers to text-focused models.

You can also fine-tune a foundation model on a specific dataset to specialise its behaviour.

Using an external database to supplement the model's knowledge at inference time is called Retrieval-Augmented Generation (RAG).
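A toy end-to-end sketch of the RAG pattern: retrieve the most relevant snippet from a small "database", then prepend it to the prompt. The retriever here is a naive word-overlap scorer and the documents are made up for illustration; real systems typically use embeddings and a vector store.

documents = [
    "Raleigh is the capital of North Carolina.",
    "The Eiffel Tower is in Paris.",
    "Python was created by Guido van Rossum.",
]

def retrieve(query: str, docs: list[str], k: int = 1) -> list[str]:
    # Score each document by how many query words it shares, highest first.
    q = set(query.lower().split())
    ranked = sorted(docs, key=lambda d: len(q & set(d.lower().split())), reverse=True)
    return ranked[:k]

question = "What's the capital of North Carolina?"
context = "\n".join(retrieve(question, documents))
prompt = f"Answer using the context below.\n\nContext:\n{context}\n\nQuestion: {question}"
print(prompt)   # this augmented prompt would then be sent to the model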
