%% Cell type:markdown id: tags:
# My First Generative AI Model
***(a simple Swiss first name generator)***
%% Cell type:markdown id: tags:
The notebook introduces key concepts of probability-based generation in a simple and manageable way 😀
In particular, we will build a character-level **bigram** model as a first generative model. The model generates new names by probabilistically selecting the next character in the sequence based on learned patterns. We will count bigram occurrences and convert these counts into probability distributions.
Probabilities and sampling are a key part of many generative models, where probabilities guide the model's generation process, and sampling allows it to create varied outputs.
- **Probabilities**:
In generative models, the likelihood of different outcomes is typically modeled using probabilities. For example, in our bigram model, each possible next character has a probability associated with it based on the preceding character.
- **Sampling**:
Once you have the probabilities, you can sample from the probability distribution to generate new outcomes. The simplest way is to randomly pick an outcome based on the probabilities (often referred to as sampling from a distribution).
%% Cell type:markdown id: tags:
## First Steps
%% Cell type:code id: tags:
``` python
import torch
from torch import Tensor, Generator
from collections import Counter
from typing import List, Tuple
```
%% Cell type:markdown id: tags:
`swiss_names.txt` is a file containing some of the most popular first names in Switzerland.
FYI - the most popular first names for newborns in Switzerland in 2023 are **Mia** and **Noah** *[(Source)](https://www.bfs.admin.ch/bfs/en/home/statistics/population/births-deaths/first-names.html)* 🐣
%% Cell type:code id: tags:
``` python
with open("./swiss_names.txt", "r") as file:
    names = [line for line in file.read().splitlines() if line.strip()]
```
%% Cell type:markdown id: tags:
The following cells perform some simple data analysis.
%% Cell type:code id: tags:
``` python
names[:10] # print first 10 names
```
%% Cell type:code id: tags:
``` python
# Number of names
# Should print: 410
len(names)
```
%% Cell type:code id: tags:
``` python
# First occurring shortest name
# Should print: 'Ali'
min(sorted(names), key=len)
```
%% Cell type:code id: tags:
``` python
# First occurring longest name
# Should print: 'Alessandro'
max(sorted(names), key=len)
```
%% Cell type:code id: tags:
``` python
# Number of unique characters
# Should print: 51
len({char for name in names for char in name})
```
%% Cell type:code id: tags:
``` python
# 20 Most frequently occurring characters
# Should print: ('a', 353), ...
Counter(char for name in names for char in name).most_common()[:20]
```
%% Cell type:markdown id: tags:
## Create the Bigrams
%% Cell type:markdown id: tags:
To keep things simple, we will build a bigram model at the character level, which involves counting occurrences of one character following another.
For example, let us consider the word **Emma**
- **Start of sequence** is followed by **E** once,
- **E** is followed by **m** once,
- **m** is followed by **m** once,
- **m** is followed by **a** once, and
- **a** is followed by the **End of sequence**.
Steps to Create Counts for Character Bigrams:
1. **Parse the Text**:
First, you need to iterate through the names and create pairs of consecutive characters (bigrams).
2. **Count Occurrences**:
For each pair, count how many times each character is followed by another specific character.
Example for Emma with the special tokens for start `<S>` and end `<E>` of sequence:
```python
[(('<S>', 'E'), 1),
(('E', 'm'), 1),
(('m', 'm'), 1),
(('m', 'a'), 1),
(('a', '<E>'), 1)]
```
%% Cell type:code id: tags:
``` python
b = {}
for n in names:
    # We now add two extra tokens, a start token <S> and an end token <E> to the names.
    chs = ["<S>"] + list(n) + ["<E>"]
    for pre, post in zip(chs[:-1], chs[1:]):
        bigram = (pre, post)
        b[bigram] = b.get(bigram, 0) + 1
```
%% Cell type:markdown id: tags:
Let us look at the 20 most frequently occurring bigrams.
%% Cell type:code id: tags:
``` python
# The bigram ('a', '<E>') should occur 152 times.
sorted(b.items(), key=lambda kv: -kv[1])[:20]
```
%% Cell type:markdown id: tags:
Let us keep the bigram information in a 2-dimensional tensor `N` instead of a Python dictionary, as it will be easier to work with.
Before we can fill up `N`, we need to map every character (or more precisely, string) to an integer (`str -> int`) and map the integers back to characters (`int -> str`), basically creating look-up dictionaries.
Note that we manually add the two special characters, `<S>` and `<E>`, to the dictionaries so that we can reference them later.
%% Cell type:code id: tags:
``` python
chars = sorted(list(set("".join(names))))
print(f"Without special characters: {len(chars)}")
stoi = {s: i for i, s in enumerate(chars)}
# Add the special characters and assign each the next highest number
for token in ["<S>", "<E>"]:
    stoi[token] = len(stoi)
# Create the reverse look-up
itos = {i: s for s, i in stoi.items()}
print(f"With special characters: {len(stoi)}")
print(f"<S>: {stoi['<S>']}")
print(f"<E>: {stoi['<E>']}")
print(stoi)
print(itos)
```
%% Cell type:code id: tags:
``` python
# Create a placeholder tensor to store the bigram counts
N = torch.zeros((len(stoi), len(stoi)), dtype=torch.int32)
# Add the bigrams
for n in names:
    # Add the special tokens before and after each name
    chs = ["<S>"] + list(n) + ["<E>"]
    # Get all the bigrams and count how often they occur
    for pre, post in zip(chs[:-1], chs[1:]):
        ix1 = stoi[pre]
        ix2 = stoi[post]
        N[ix1, ix2] += 1
```
%% Cell type:code id: tags:
``` python
N[stoi["A"]] # A tensor indicating counts for all bigrams starting in 'A'.
```
%% Cell type:markdown id: tags:
Let us now visualize the counts to get a sense of the bigrams.
%% Cell type:code id: tags:
``` python
%matplotlib inline
import matplotlib.pyplot as plt
plt.figure(figsize=(24, 24))
plt.imshow(N, cmap="Blues")
for i in range(len(stoi)):
    for j in range(len(stoi)):
        chstr = itos[i] + itos[j]
        plt.text(j, i, chstr, ha="center", va="bottom", color="gray")
        plt.text(j, i, str(N[i, j].item()), ha="center", va="top", color="gray")
plt.axis("off");
```
%% Cell type:markdown id: tags:
Making sense of the above matrix:
Row 0 shows the counts for all bigrams that start with 'A', i.e., ['A', *]. Observe that there are very few bigrams of the form ['q', *] or [*, 'q'], which is also generally true for most names.
More frequently occurring bigrams have a darker coloring.
❓Can you find the five most frequent bigrams? They should correspond to the output of `b` that you printed earlier.
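One way to check this programmatically is to take the top 5 entries of the flattened count matrix (a small sketch using `torch.topk`, reusing `N` and `itos` from above):
%% Cell type:code id: tags:
``` python
# Find the five largest counts in N and map the flat indices back to bigrams.
top_counts, flat_idx = torch.topk(N.flatten(), k=5)
for count, flat in zip(top_counts, flat_idx):
    i, j = divmod(int(flat.item()), N.shape[1])
    print(f"('{itos[i]}', '{itos[j]}'): {int(count.item())}")
```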
%% Cell type:markdown id: tags:
### Little Inefficiency
We explicitly added separate start and end tokens. The last row of the matrix (for `<E>`) is all zeros because no character ever follows the end token, and the second-to-last column (for `<S>`) is all zeros because no character ever precedes the start token. We could optimize this by combining the start and end tokens into a single token that does not occur in the names.
We will live with this for now, as the main aim is to create a name generator; a possible combined-token variant is sketched below.
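For reference, such a combined-token variant could look like the following sketch, which uses a single `'.'` token (assuming it does not appear in any name); we will not use it in the rest of the notebook:
%% Cell type:code id: tags:
``` python
# Sketch: a single '.' token marks both the start and the end of a name,
# yielding a (len(chars) + 1) x (len(chars) + 1) count matrix with no all-zero row or column.
stoi_dot = {s: i for i, s in enumerate(chars)}
stoi_dot["."] = len(stoi_dot)
N_dot = torch.zeros((len(stoi_dot), len(stoi_dot)), dtype=torch.int32)
for n in names:
    chs = ["."] + list(n) + ["."]
    for pre, post in zip(chs[:-1], chs[1:]):
        N_dot[stoi_dot[pre], stoi_dot[post]] += 1
```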
%% Cell type:markdown id: tags:
## Calculating the Probabilities
%% Cell type:markdown id: tags:
### Counts to probabilities
We can easily convert the raw counts to probabilities. Row 0 gives the counts for bigrams like AA, AB, AC, ..., ```A<S>``` and ```A<E>```.
Simply speaking, for a character 'A',
$$P(\text{next character is } B \mid \text{current character is } A) = \frac{\text{count}(AB)}{\text{count}(A*)}$$
%% Cell type:code id: tags:
``` python
# We expect a zero probability for B following A
prob = N[0, 1] / N[0].sum()
prob
```
%% Cell type:code id: tags:
``` python
# We expect a non-zero probability for m following A
prob = N[stoi["A"], stoi["m"]] / N[0].sum()
prob
```
%% Cell type:markdown id: tags:
Let us now do this for the entire row in one go.
%% Cell type:code id: tags:
``` python
p = N[0].float()
p = p / p.sum()
print(p)
print(p.sum())  # sum of all probabilities is 1
print(N[0].sum())
```
%% Cell type:markdown id: tags:
Let us verify if the highest probability corresponds to the bigram `Al` in row 0.
%% Cell type:code id: tags:
``` python
N[stoi["A"], stoi["l"]] / N[0].sum() == p.max()
```
%% Cell type:markdown id: tags:
### ✅ Task 1
Calculate the probabilities for each row and store the result in tensor `P`, such that we can access each row as a probability distribution for a character.
Hint: The tensor N is of size `[53, 53]` and contains all the counts (verify this by using `N.shape`). Ensure that `P` is of the same shape.
%% Cell type:code id: tags:
``` python
# '''TODO: Calculate the probabilities and store them in tensor `P`.'''
...
# assert P.shape == N.shape
```
%% Cell type:code id: tags:
``` python
# Solution
P = N / N.sum(dim=-1, keepdim=True)
assert P.shape == N.shape
```
%% Cell type:markdown id: tags:
## Sampling
We would now like to sample from the distribution to generate the next character. You will implement your own sampling method later.
The code cell below shows how to sample 3 random numbers between 0 and 1 and normalize them so that their sum is 1.
We use `torch.Generator()` with `manual_seed` to make the code deterministic and ensure reproducibility, which is particularly useful for debugging and for consistency between runs.
%% Cell type:code id: tags:
``` python
seed = 42 # Try different seeds to see the effect.
g = torch.Generator().manual_seed(seed)
r = torch.rand(3, generator=g)
r = r / r.sum()
print(r)
```
%% Cell type:markdown id: tags:
The above code generates 3 random numbers between 0 and 1 and normalizes them into probabilities (sum = 1). We can now use `torch.multinomial` to draw samples from this probability distribution.
`torch.multinomial` samples indices from a multinomial distribution whose probabilities are given as a tensor. Let us draw some samples from our distribution. Note that the generator is reset to its initial state every time you run the cell below.
%% Cell type:code id: tags:
``` python
g = torch.Generator().manual_seed(42) # Comment this line to see the effect.
idx = torch.multinomial(r, num_samples=20, replacement=True, generator=g)
print(idx)
```
%% Cell type:markdown id: tags:
### ✅ Task 2
Based on the above discussion and using the tensor `P`, write your own sampler using `torch.multinomial` discussed above to sample the next character index from a distribution given the first or initial character.
%% Cell type:code id: tags:
``` python
def char_sampler(
    initial_char: str, num_samples: int, probability_distr: Tensor, generator: Generator
) -> Tensor:
    # '''TODO: Implement the sampler.'''
    ...
```
%% Cell type:code id: tags:
``` python
# Solution
def char_sampler(
    initial_char: str, num_samples: int, probability_distr: Tensor, generator: Generator
) -> Tensor:
    if not initial_char:
        initial_char = "<S>"
    if not num_samples:
        num_samples = 1
    P = probability_distr
    g = generator
    idx = stoi[initial_char]
    return torch.multinomial(
        P[idx], num_samples=num_samples, replacement=True, generator=g
    )
```
%% Cell type:markdown id: tags:
## Visualizing the Distribution
The distribution should follow the probabilities if enough samples are drawn.
%% Cell type:code id: tags:
``` python
g = torch.Generator().manual_seed(42)
initial_character = "A"
samples = char_sampler(
    initial_character,
    10000,
    P,
    g,
)
# Convert each index in samples to a character
result = [itos[int(x.item())] for x in samples]
print(result[:10])
# Create the histogram
data = samples.numpy()
plt.cla()
counts, bins, patches = plt.hist(data, bins=28, alpha=0.7, rwidth=0.85)
bin_centers = 0.5 * (bins[:-1] + bins[1:])
x_labels = [
    itos.get(round(center), "") if count > 0 else ""
    for center, count in zip(bin_centers, counts)
]
plt.xticks(bin_centers, x_labels)
plt.xlabel("Value")
plt.ylabel("Frequency")
plt.title(f"Histogram of characters following {repr(initial_character)}")
plt.show()
```
%% Cell type:markdown id: tags:
Since 10'000 samples are drawn, you should see that the samples follow the given probability distribution.
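To make this concrete, we can compare the empirical frequencies of the drawn samples with the corresponding row of `P` (a small check using `torch.bincount`, reusing the `samples` tensor from the cell above):
%% Cell type:code id: tags:
``` python
# Empirical frequency of each sampled index vs. the model probability for that character.
emp = torch.bincount(samples, minlength=len(stoi)).float() / len(samples)
for idx in emp.argsort(descending=True)[:5]:
    print(
        f"{itos[int(idx)]}: empirical {float(emp[idx]):.3f} "
        f"vs. P {float(P[stoi[initial_character], idx]):.3f}"
    )
```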
%% Cell type:markdown id: tags:
### ✅ Task 3
Use the sampler in a loop to keep generating new names. The first token is the special start token `<S>`.
Using the probabilities calculated in `P`, sample the next character from the distribution. Then, conditioned on the newly sampled character, sample the following one, and continue until you reach the special end token `<E>`.
%% Cell type:code id: tags:
``` python
def generate_names(
    start_char: str, num_of_names: int, probability_distr: Tensor, generator: Generator
) -> List[str]:
    names = []
    # '''TODO: Implement the logic to sample new characters using the `char_sampler(...)` function until the end token is reached for as many names as passed as an argument.'''
    ...
    return names
```
%% Cell type:code id: tags:
``` python
# Solution
def generate_names(
start_char: str, num_of_names: int, probability_distr: Tensor, generator: Generator
) -> List[str]:
names = []
for _ in range(num_of_names):
idx = stoi[start_char]
name = []
while True:
idx = char_sampler(itos[idx], 1, probability_distr, generator)
idx = int(idx.item()) # Convert tensor to integer.
name.append(itos[idx])
if idx == stoi["<E>"]:
names.append("".join(name))
break
return names
```
%% Cell type:markdown id: tags:
We have now created a function that generates new names. Let us test it by generating 20 names. Happy with the results?
%% Cell type:code id: tags:
``` python
g = torch.Generator().manual_seed(42)
new_names = generate_names("<S>", 20, P, g)
_ = [print(name) for name in new_names]
```
%% Cell type:markdown id: tags:
Yes! 👏👏👏 We successfully trained our first simple generative model by counting bigrams (how frequently character pairs occur) and normalizing these counts to create probabilities. We iteratively sample the next character from this distribution and feed it back each time to generate the subsequent character.
Our model stores its parameters (the probabilities) in a tensor, which we use for sampling. It is a very explicit and explainable model.
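For instance, the row of `P` for the start token directly shows which first letters the model considers most likely (a quick inspection, reusing the `stoi`/`itos` mappings defined above):
%% Cell type:code id: tags:
``` python
# The 5 most probable first characters according to the model.
probs_first, idx_first = torch.topk(P[stoi["<S>"]], k=5)
for p_val, i in zip(probs_first, idx_first):
    print(f"{itos[int(i)]}: {float(p_val):.3f}")
```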
%% Cell type:markdown id: tags:
## Evaluation
%% Cell type:markdown id: tags:
Are you happy with the quality of the generated names? Can we even evaluate the quality of the model? Does this sampler really work? How can you find out?
As a baseline, let us give every next character the same probability by forcing `P` to have uniform probabilities across every row.
%% Cell type:code id: tags:
``` python
# Set probabilities equal
P_equal = torch.ones_like(P, dtype=torch.float32) / len(P)
print(P_equal[0]) # First row probabilities (character 'A')
```
%% Cell type:markdown id: tags:
How true is our model to the training data? If every character has the same likelihood of coming next, each bigram gets a probability of `1/(len(chars) + 2)` or `1/len(stoi)`, i.e., $\frac{1}{53} \approx 0.0189$.
Hence, there is only a ~2% chance that `<E>` occurs at each step, so you may see very long "names".
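A quick back-of-the-envelope check of this (using the `P_equal` tensor just defined): the number of sampled characters until `<E>` appears follows a geometric distribution, so its expected value is $1/p$.
%% Cell type:code id: tags:
``` python
# Under the uniform model, each step ends the name with probability 1/53,
# so the expected number of sampled characters is 1/p = 53.
p_end = P_equal[stoi["<S>"], stoi["<E>"]]
print(f"P(<E>) per step: {float(p_end):.4f}")
print(f"Expected name length: {float(1 / p_end):.0f} characters")
```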
%% Cell type:code id: tags:
``` python
g = torch.Generator().manual_seed(42)
new_names = generate_names("<S>", 20, P_equal, g)
_ = [print(name) for name in new_names]
```
%% Cell type:markdown id: tags:
Let us look at the probabilities our model assigned to the training data for the first two names.
%% Cell type:code id: tags:
``` python
for n in names[:2]:
    print(f"{n}:")
    chs = ["<S>"] + list(n) + ["<E>"]
    for pre, post in zip(chs, chs[1:]):
        ix1 = stoi[pre]
        ix2 = stoi[post]
        prob = P[ix1, ix2]
        print(f"{pre}{post}: {prob:.4f}")
```
%% Cell type:markdown id: tags:
Many of the probabilities above are greater than `1/(len(chars) + 2)`, e.g., those for No, oa, or Li. Even the end-of-sequence token occurs with probabilities of 18% and 15%, respectively.
%% Cell type:markdown id: tags:
### Likelihood
How can we have a single number that quantifies the quality of the model? Remember **maximum likelihood**? 🤔 The likelihood is the *product* of the probabilities the model assigns to each bigram in the training data, i.e., the probability of the entire dataset under the model.
However, when dealing with probabilities, especially for large datasets, the product of many small probabilities can become extremely small, leading to numerical underflow. To avoid this, we use the **log-likelihood**.
We will calculate both:
- the log-likelihood: **The higher the number the better**
- the average negative log-likelihood: **The lower the number the better**
as measures of the goodness of the model.
#### Why Log-Likelihood?
1. **Numerical Stability**: Multiplying many small probabilities can result in very small numbers that are difficult to represent accurately with floating-point arithmetic. By taking the logarithm of these probabilities, we convert the product into a sum, which is much more stable numerically (see the small numeric sketch after this list).
2. **Simplification**: The logarithm of a product is the sum of the logarithms. This property simplifies the calculations:
$$
\log(\prod_{i=1}^{n} p_i) = \sum_{i=1}^{n} \log(p_i)
$$
Instead of multiplying probabilities, we sum their logarithms.
3. **Interpretability**: The log-likelihood provides a single number that quantifies the quality of the model. A higher log-likelihood indicates a better fit to the data.
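As mentioned under point 1, here is a tiny numeric illustration of the underflow issue (a standalone sketch, independent of our name model):
%% Cell type:code id: tags:
``` python
# 1000 bigram probabilities of 0.05 each: the direct product underflows to 0.0,
# while the sum of logs stays perfectly representable.
probs = torch.full((1000,), 0.05)
print(probs.prod())       # tensor(0.) -- numerical underflow
print(probs.log().sum())  # around -2995.7 -- stable log-likelihood
```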
%% Cell type:markdown id: tags:
### ✅ Task 4
Based on the above brief repetition of the log-likelihood, add the missing calculation steps.
%% Cell type:code id: tags:
``` python
# '''TODO: Add the missing calculations.'''
def calculate_likelihoods(names: List[str], P: torch.Tensor) -> Tuple[float, float]:
    log_lhood = 0.0
    nums = 0
    for n in names:
        chs = ["<S>"] + list(n) + ["<E>"]
        for pre, post in zip(chs[:-1], chs[1:]):
            ix1 = stoi[pre]
            ix2 = stoi[post]
            prob = P[ix1, ix2]
            lprob = ...  # '''TODO'''
            log_lhood = ...  # '''TODO'''
            nums += 1
    avg_neg_llh = ...  # '''TODO'''
    return log_lhood, avg_neg_llh
```
%% Cell type:code id: tags:
``` python
# Solution
def calculate_likelihoods(names: List[str], P: torch.Tensor) -> Tuple[float, float]:
log_lhood = 0.0
nums = 0
for n in names:
chs = ["<S>"] + list(n) + ["<E>"]
for pre, post in zip(chs[:-1], chs[1:]):
ix1 = stoi[pre]
ix2 = stoi[post]
prob = P[ix1, ix2]
lprob = float(torch.log(prob))
log_lhood += lprob
nums += 1
avg_neg_llh = -log_lhood / nums
return log_lhood, avg_neg_llh
```
%% Cell type:markdown id: tags:
### Log-Likelihood Comparison
%% Cell type:code id: tags:
``` python
log_lhood_P_equal, avg_neg_llh_P_equal = calculate_likelihoods(names, P_equal)
print(f"Log-likelihood: {log_lhood_P_equal}")
print(f"Average negative log-likelihood: {avg_neg_llh_P_equal:.4f}")
```
%% Cell type:markdown id: tags:
In general, a good model should minimize the average negative log-likelihood loss.
When we compare the "baseline" with equal probabilities to our model `P`, we should observe an improvement in both likelihood metrics.
%% Cell type:code id: tags:
``` python
log_lhood_P, avg_neg_llh_P = calculate_likelihoods(names, P)
print(f"Log-likelihood: {log_lhood_P}")
print(f"Average negative log-likelihood: {avg_neg_llh_P:.4f}")
```
%% Cell type:markdown id: tags:
It looks like our model does indeed perform better than the baseline 🥳
%% Cell type:markdown id: tags:
## Limitations
A character-level model using explicit probabilities to predict the next likely character is a useful first step in understanding text generation. However, even though it can generate plausible-sounding names, it has its limitations.
%% Cell type:markdown id: tags:
### ✅ Task 5
Discuss the shortcomings of a character-level model like the one we just implemented.
%% Cell type:markdown id: tags:
### Solution
Limitations of a character-level model using explicit probabilities:
1. **Inefficiency**: Character-level models can be computationally expensive and slow, as they need to process each character individually. This can lead to longer training times and slower inference speeds.
2. **Non-scalable**: As the context length grows (e.g., moving from bigrams to longer n-grams), the number of counts to store grows exponentially with the vocabulary size. This makes it difficult to scale to longer contexts or larger vocabularies.
3. **Lack of semantic understanding**: Character-level models do not inherently understand the meaning of words or sentences. They operate purely on the level of individual characters, which limits their ability to capture the semantic context of the text.
4. **Inability to capture long-range dependencies**: These models often struggle with capturing dependencies that span over long distances in the text. This is because they focus on immediate next-character predictions without considering the broader context, leading to less coherent and meaningful text generation.
%% Cell type:markdown id: tags:
Next week, as a refresher on deep neural networks, we will look at implicit, more abstract ways of modeling and learning these probabilities.