%% Cell type:markdown id: tags:
# My First Generative AI Model
***(a simple Swiss first name generator)***
%% Cell type:markdown id: tags:
The notebook introduces key concepts of probability-based generation in a simple and manageable way 😀
In particular, we will build a character-level **bigram** model as a first generative model. The model generates new names by probabilistically selecting the next character in the sequence based on learned patterns. We will count bigram occurrences and convert these counts into probability distributions.
Probabilities and sampling are a key part of many generative models, where probabilities guide the model's generation process, and sampling allows it to create varied outputs.
- **Probabilities**:
In generative models, the likelihood of different outcomes is typically modeled using probabilities. For example, in our bigram model, each possible next character has a probability associated with it based on the preceding character.
- **Sampling**:
Once you have the probabilities, you can sample from the probability distribution to generate new outcomes. The simplest way is to randomly pick an outcome based on the probabilities (often referred to as sampling from a distribution).
%% Cell type:markdown id: tags:
## First Steps
%% Cell type:code id: tags:
``` python
import torch
from torch import Tensor, Generator
from collections import Counter
from typing import List, Tuple
```
%% Cell type:markdown id: tags:
`swiss_names.txt` is a file containing some of the most popular first names in Switzerland.
FYI - the most popular first names for newborns in Switzerland in 2023 are **Mia** and **Noah** *[(Source)](https://www.bfs.admin.ch/bfs/en/home/statistics/population/births-deaths/first-names.html)* 🐣
%% Cell type:code id: tags:
``` python
with open("./swiss_names.txt", "r") as file:
    names = [line for line in file.read().splitlines() if line.strip()]
```
%% Cell type:markdown id: tags:
The following cells perform some simple data analysis.
%% Cell type:code id: tags:
``` python
names[:10] # print first 10 names
```
%% Cell type:code id: tags:
``` python
# Number of names
# Should print: 410
len(names)
```
%% Cell type:code id: tags:
``` python
# First occurring shortest name
# Should print: 'Ali'
min(sorted(names), key=len)
```
%% Cell type:code id: tags:
``` python
# First occurring longest name
# Should print: 'Alessandro'
max(sorted(names), key=len)
```
%% Cell type:code id: tags:
``` python
# Number of unique characters
# Should print: 51
len({char for name in names for char in name})
```
%% Cell type:code id: tags:
``` python
# 20 Most frequently occurring characters
# Should print: ('a', 353), ...
Counter(char for name in names for char in name).most_common()[:20]
```
%% Cell type:markdown id: tags:
## Create the Bigrams
%% Cell type:markdown id: tags:
To keep things simple, we will build a bigram model at the character level, which involves counting occurrences of one character following another.
For example, let us consider the word **Emma**
- **Start of sequence** is followed by **E** once,
- **E** is followed by **m** once,
- **m** is followed by **m** once,
- **m** is followed by **a** once, and
- **a** is followed by the **End of sequence**.
Steps to Create Counts for Character Bigrams:
1. **Parse the Text**:
First, you need to iterate through the names and create pairs of consecutive characters (bigrams).
2. **Count Occurrences**:
For each pair, count how many times each character is followed by another specific character.
Example for Emma with the special tokens for start `<S>` and end `<E>` of sequence:
```python
[(('<S>', 'E'), 1),
(('E', 'm'), 1),
(('m', 'm'), 1),
(('m', 'a'), 1),
(('a', '<E>'), 1)]
```
%% Cell type:code id: tags:
``` python
b = {}
for n in names:
    # We now add two extra tokens, a start token <S> and an end token <E> to the names.
    chs = ["<S>"] + list(n) + ["<E>"]
    for pre, post in zip(chs[:-1], chs[1:]):
        bigram = (pre, post)
        b[bigram] = b.get(bigram, 0) + 1
```
%% Cell type:markdown id: tags:
Let us look at the 20 most frequently occurring bigrams.
%% Cell type:code id: tags:
``` python
# The bigram ('a', '<E>') should occur 152 times.
sorted(b.items(), key=lambda kv: -kv[1])[:20]
```
%% Cell type:markdown id: tags:
Let us keep the bigram information in a 2-dimensional tensor `N` instead of a Python dictionary, as it will be easier to work with.
Before we can fill up `N`, we need to map every character (or more precisely, string) to an integer (`str -> int`) and map the integers back to characters (`int -> str`), basically creating look-up dictionaries.
Note that we manually add the two special characters, `<S>` and `<E>`, to the dictionaries so that we can reference them later.
%% Cell type:code id: tags:
``` python
chars = sorted(list(set("".join(names))))
print(f"Without special characters: {len(chars)}")
stoi = {s: i for i, s in enumerate(chars)}
# Add the special characters and assign each the next highest number
for token in ["<S>", "<E>"]:
    stoi[token] = len(stoi)
# Create the reverse look-up
itos = {i: s for s, i in stoi.items()}
print(f"With special characters: {len(stoi)}")
print(f"<S>: {stoi['<S>']}")
print(f"<E>: {stoi['<E>']}")
print(stoi)
print(itos)
```
%% Cell type:code id: tags:
``` python
# Create a placeholder tensor to store the bigram counts
N = torch.zeros((len(stoi), len(stoi)), dtype=torch.int32)
# Add the bigrams
for n in names:
    # Add the special tokens before and after each name
    chs = ["<S>"] + list(n) + ["<E>"]
    # Get all the bigrams and count how often they occur
    for pre, post in zip(chs[:-1], chs[1:]):
        ix1 = stoi[pre]
        ix2 = stoi[post]
        N[ix1, ix2] += 1
```
%% Cell type:code id: tags:
``` python
N[stoi["A"]] # A tensor indicating counts for all bigrams starting in 'A'.
```
%% Cell type:markdown id: tags:
Let us now visualize the counts to get a sense of the bigrams.
%% Cell type:code id: tags:
``` python
%matplotlib inline
import matplotlib.pyplot as plt
plt.figure(figsize=(24, 24))
plt.imshow(N, cmap="Blues")
for i in range(len(stoi)):
    for j in range(len(stoi)):
        chstr = itos[i] + itos[j]
        plt.text(j, i, chstr, ha="center", va="bottom", color="gray")
        plt.text(j, i, str(N[i, j].item()), ha="center", va="top", color="gray")
plt.axis("off");
```
%% Cell type:markdown id: tags:
Making sense of the above matrix:
Row 0 shows the counts for all bigrams that start with 'A', i.e., ['A', *]. Observe that there are very few bigrams of the form ['q', *] or [*, 'q'], which is also generally true for most names.
More frequently occurring bigrams have a darker coloring.
❓Can you find the five most frequent bigrams? They should correspond to the output of `b` that you printed earlier.
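One way to check this programmatically is to take the top 5 entries of the flattened count matrix (a small sketch using `torch.topk`, reusing `N` and `itos` from above):
%% Cell type:code id: tags:
``` python
# Find the five largest counts in N and map the flat indices back to bigrams.
top_counts, flat_idx = torch.topk(N.flatten(), k=5)
for count, flat in zip(top_counts, flat_idx):
    i, j = divmod(int(flat.item()), N.shape[1])
    print(f"('{itos[i]}', '{itos[j]}'): {int(count.item())}")
```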
%% Cell type:markdown id: tags:
### Little Inefficiency
We explicitly added separate start and end tokens. The last row of the matrix (for `<E>`) is all zeros because no character ever follows the end token, and the second-to-last column (for `<S>`) is all zeros because no character ever precedes the start token. We could optimize this by combining the start and end tokens into a single token that does not occur in the names.
We will live with this for now, as the main aim is to create a name generator; a possible combined-token variant is sketched below.
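For reference, such a combined-token variant could look like the following sketch, which uses a single `'.'` token (assuming it does not appear in any name); we will not use it in the rest of the notebook:
%% Cell type:code id: tags:
``` python
# Sketch: a single '.' token marks both the start and the end of a name,
# yielding a (len(chars) + 1) x (len(chars) + 1) count matrix with no all-zero row or column.
stoi_dot = {s: i for i, s in enumerate(chars)}
stoi_dot["."] = len(stoi_dot)
N_dot = torch.zeros((len(stoi_dot), len(stoi_dot)), dtype=torch.int32)
for n in names:
    chs = ["."] + list(n) + ["."]
    for pre, post in zip(chs[:-1], chs[1:]):
        N_dot[stoi_dot[pre], stoi_dot[post]] += 1
```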
%% Cell type:markdown id: tags:
## Calculating the Probabilities
%% Cell type:markdown id: tags:
### Counts to probabilities
We can easily convert the raw counts to probabilities. Row 0 gives the counts for bigrams like AA, AB, AC, ..., ```A<S>``` and ```A<E>```.
Simply speaking, for a character 'A',
$$P(\text{next character is } B \mid \text{current character is } A) = \frac{\text{count}(AB)}{\text{count}(A*)}$$
%% Cell type:code id: tags:
``` python
# We expect a zero probability for B following A
prob = N[0, 1] / N[0].sum()
prob
```
%% Cell type:code id: tags:
``` python
# We expect a non-zero probability for m following A
prob = N[stoi["A"], stoi["m"]] / N[0].sum()
prob
```
%% Cell type:markdown id: tags:
Let us now do this for the entire row in one go.
%% Cell type:code id: tags:
``` python
p = N[0].float()
p = p / p.sum()
print(p)
print(p.sum())  # sum of all probabilities is 1
print(N[0].sum())
```
%% Cell type:markdown id: tags:
Let us verify if the highest probability corresponds to the bigram `Al` in row 0.
%% Cell type:code id: tags:
``` python
N[stoi["A"], stoi["l"]] / N[0].sum() == p.max()
```
%% Cell type:markdown id: tags:
### ✅ Task 1
Calculate the probabilities for each row and store the result in tensor `P`, such that we can access each row as a probability distribution for a character.
Hint: The tensor N is of size `[53, 53]` and contains all the counts (verify this by using `N.shape`). Ensure that `P` is of the same shape.
%% Cell type:code id: tags:
``` python
# '''TODO: Calculate the probabilities and store them in tensor `P`.'''
...
# assert P.shape == N.shape
```
%% Cell type:code id: tags:
``` python
# Solution
P = N / N.sum(dim=-1, keepdim=True)
assert P.shape == N.shape
```
%% Cell type:markdown id: tags:
## Sampling
We would now like to sample from the distribution to generate the next character. You will implement your own sampling method later.
The code cell below shows how to sample 3 random numbers between 0 and 1 and normalize them so that their sum is 1.
We use `torch.Generator()` with `manual_seed` to make the code deterministic and ensure reproducibility, which is particularly useful for debugging and for consistency between runs.
%% Cell type:code id: tags:
``` python
seed = 42 # Try different seeds to see the effect.
g = torch.Generator().manual_seed(seed)
r = torch.rand(3, generator=g)
r = r / r.sum()
print(r)
```
%% Cell type:markdown id: tags:
The above code generates 3 random numbers between 0 and 1 and normalizes them into probabilities (sum = 1). We can now use `torch.multinomial` to draw samples from this probability distribution.
`torch.multinomial` samples indices from a multinomial distribution whose probabilities are given as a tensor. Let us draw some samples from our distribution. Note that the generator is reset to its initial state every time you run the cell below.
%% Cell type:code id: tags:
``` python
g = torch.Generator().manual_seed(42) # Comment this line to see the effect.
idx = torch.multinomial(r, num_samples=20, replacement=True, generator=g)
print(idx)
```
%% Cell type:markdown id: tags:
### ✅ Task 2
Based on the above discussion and using the tensor `P`, write your own sampler using `torch.multinomial` discussed above to sample the next character index from a distribution given the first or initial character.
%% Cell type:code id: tags:
``` python
def char_sampler(
    initial_char: str, num_samples: int, probability_distr: Tensor, generator: Generator
) -> Tensor:
    # '''TODO: Implement the sampler.'''
    ...
```
%% Cell type:code id: tags:
``` python
# Solution
def char_sampler(
    initial_char: str, num_samples: int, probability_distr: Tensor, generator: Generator
) -> Tensor:
    if not initial_char:
        initial_char = "<S>"
    if not num_samples:
        num_samples = 1
    P = probability_distr
    g = generator
    idx = stoi[initial_char]
    return torch.multinomial(
        P[idx], num_samples=num_samples, replacement=True, generator=g
    )
```
%% Cell type:markdown id: tags:
## Visualizing the Distribution
The distribution should follow the probabilities if enough samples are drawn.
%% Cell type:code id: tags:
``` python
g = torch.Generator().manual_seed(42)
initial_character = "A"
samples = char_sampler(
    initial_character,
    10000,
    P,
    g,
)
# Convert each index in samples to a character
result = [itos[int(x.item())] for x in samples]
print(result[:10])
# Create the histogram
data = samples.numpy()
plt.cla()
counts, bins, patches = plt.hist(data, bins=28, alpha=0.7, rwidth=0.85)
bin_centers = 0.5 * (bins[:-1] + bins[1:])
x_labels = [
    itos.get(round(center), "") if count > 0 else ""
    for center, count in zip(bin_centers, counts)
]
plt.xticks(bin_centers, x_labels)
plt.xlabel("Value")
plt.ylabel("Frequency")
plt.title(f"Histogram of characters following {repr(initial_character)}")
plt.show()
```
%% Cell type:markdown id: tags:
Since 10'000 samples are drawn, you should see that the samples follow the given probability distribution.
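To make this concrete, we can compare the empirical frequencies of the drawn samples with the corresponding row of `P` (a small check using `torch.bincount`, reusing the `samples` tensor from the cell above):
%% Cell type:code id: tags:
``` python
# Empirical frequency of each sampled index vs. the model probability for that character.
emp = torch.bincount(samples, minlength=len(stoi)).float() / len(samples)
for idx in emp.argsort(descending=True)[:5]:
    print(
        f"{itos[int(idx)]}: empirical {float(emp[idx]):.3f} "
        f"vs. P {float(P[stoi[initial_character], idx]):.3f}"
    )
```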
%% Cell type:markdown id: tags:
### ✅ Task 3
Use the sampler in a loop to keep generating new names. The first token is the special start token `<S>`.
Using the probabilities calculated in `P`, sample the next character from the distribution. Then, conditioned on the newly sampled character, sample the following one, and continue until you reach the special end token `<E>`.
%% Cell type:code id: tags:
``` python
def generate_names(
    start_char: str, num_of_names: int, probability_distr: Tensor, generator: Generator
) -> List[str]:
    names = []
    # '''TODO: Implement the logic to sample new characters using the `char_sampler(...)` function until the end token is reached for as many names as passed as an argument.'''
    ...
    return names
```
%% Cell type:code id: tags:
``` python
# Solution
def generate_names(
start_char: str, num_of_names: int, probability_distr: Tensor, generator: Generator
) -> List[str]:
names = []
for _ in range(num_of_names):
idx = stoi[start_char]
name = []
while True:
idx = char_sampler(itos[idx], 1, probability_distr, generator)
idx = int(idx.item()) # Convert tensor to integer.
name.append(itos[idx])
if idx == stoi["<E>"]:
names.append("".join(name))
break
return names
```
%% Cell type:markdown id: tags:
We have now created a function that generates new names. Let us test it by generating 20 names. Happy with the results?
%% Cell type:code id: tags:
``` python
g = torch.Generator().manual_seed(42)
new_names = generate_names("<S>", 20, P, g)
_ = [print(name) for name in new_names]
```
%% Cell type:markdown id: tags:
Yes! 👏👏👏 We successfully trained our first simple generative model by counting bigrams (how frequently character pairs occur) and normalizing these counts to create probabilities. We iteratively sample the next character from this distribution and feed it back each time to generate the subsequent character.
Our model stores its parameters (the probabilities) in a tensor, which we use for sampling. It is a very explicit and explainable model.
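For instance, the row of `P` for the start token directly shows which first letters the model considers most likely (a quick inspection, reusing the `stoi`/`itos` mappings defined above):
%% Cell type:code id: tags:
``` python
# The 5 most probable first characters according to the model.
probs_first, idx_first = torch.topk(P[stoi["<S>"]], k=5)
for p_val, i in zip(probs_first, idx_first):
    print(f"{itos[int(i)]}: {float(p_val):.3f}")
```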
%% Cell type:markdown id: tags:
## Evaluation
%% Cell type:markdown id: tags:
Are you happy with the quality of the generated names? Can we even evaluate the quality of the model? Does this sampler really work? How can you find out?
As a baseline, let us give every next character the same probability by forcing `P` to have uniform probabilities across every row.
%% Cell type:code id: tags:
``` python
# Set probabilities equal
P_equal = torch.ones_like(P, dtype=torch.float32) / len(P)
print(P_equal[0]) # First row probabilities (character 'A')
```
%% Cell type:markdown id: tags:
How true is our model to the training data? If every character has the same likelihood of coming next, each bigram gets a probability of `1/(len(chars) + 2)` or `1/len(stoi)`, i.e., $\frac{1}{53} \approx 0.0189$.
Hence, there is only a ~2% chance that `<E>` occurs at each step, so you may see very long "names".
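A quick back-of-the-envelope check of this (using the `P_equal` tensor just defined): the number of sampled characters until `<E>` appears follows a geometric distribution, so its expected value is $1/p$.
%% Cell type:code id: tags:
``` python
# Under the uniform model, each step ends the name with probability 1/53,
# so the expected number of sampled characters is 1/p = 53.
p_end = P_equal[stoi["<S>"], stoi["<E>"]]
print(f"P(<E>) per step: {float(p_end):.4f}")
print(f"Expected name length: {float(1 / p_end):.0f} characters")
```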
%% Cell type:code id: tags:
``` python
g = torch.Generator().manual_seed(42)
new_names = generate_names("<S>", 20, P_equal, g)
_ = [print(name) for name in new_names]
```
%% Cell type:markdown id: tags:
Let us look at the probabilities our model assigned to the training data for the first two names.
%% Cell type:code id: tags:
``` python
for n in names[:2]:
    print(f"{n}:")
    chs = ["<S>"] + list(n) + ["<E>"]
    for pre, post in zip(chs, chs[1:]):
        ix1 = stoi[pre]
        ix2 = stoi[post]
        prob = P[ix1, ix2]
        print(f"{pre}{post}: {prob:.4f}")
```
%% Cell type:markdown id: tags:
Many of the probabilities above are greater than `1/(len(chars) + 2)`, e.g., those for No, oa, or Li. Even the end-of-sequence token occurs with probabilities of 18% and 15%, respectively.
%% Cell type:markdown id: tags:
### Likelihood
How can we have a single number that quantifies the quality of the model? Remember **maximum likelihood**? 🤔 The likelihood is the *product* of the probabilities the model assigns to each bigram in the training data, i.e., the probability of the entire dataset under the model.
However, when dealing with probabilities, especially for large datasets, the product of many small probabilities can become extremely small, leading to numerical underflow. To avoid this, we use the **log-likelihood**.
We will calculate both:
- the log-likelihood: **The higher the number the better**
- the average negative log-likelihood: **The lower the number the better**
as measures of the goodness of the model.
#### Why Log-Likelihood?
1. **Numerical Stability**: Multiplying many small probabilities can result in very small numbers that are difficult to represent accurately with floating-point arithmetic. By taking the logarithm of these probabilities, we convert the product into a sum, which is much more stable numerically (see the small numeric sketch after this list).
2. **Simplification**: The logarithm of a product is the sum of the logarithms. This property simplifies the calculations:
$$
\log(\prod_{i=1}^{n} p_i) = \sum_{i=1}^{n} \log(p_i)
$$
Instead of multiplying probabilities, we sum their logarithms.
3. **Interpretability**: The log-likelihood provides a single number that quantifies the quality of the model. A higher log-likelihood indicates a better fit to the data.
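As mentioned under point 1, here is a tiny numeric illustration of the underflow issue (a standalone sketch, independent of our name model):
%% Cell type:code id: tags:
``` python
# 1000 bigram probabilities of 0.05 each: the direct product underflows to 0.0,
# while the sum of logs stays perfectly representable.
probs = torch.full((1000,), 0.05)
print(probs.prod())       # tensor(0.) -- numerical underflow
print(probs.log().sum())  # around -2995.7 -- stable log-likelihood
```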
%% Cell type:markdown id: tags:
### ✅ Task 4
Based on the above brief repetition of the log-likelihood, add the missing calculation steps.
%% Cell type:code id: tags:
``` python
# '''TODO: Add the missing calculations.'''
def calculate_likelihoods(names: List[str], P: torch.Tensor) -> Tuple[float, float]:
    log_lhood = 0.0
    nums = 0
    for n in names:
        chs = ["<S>"] + list(n) + ["<E>"]
        for pre, post in zip(chs[:-1], chs[1:]):
            ix1 = stoi[pre]
            ix2 = stoi[post]
            prob = P[ix1, ix2]
            lprob = ...  # '''TODO'''
            log_lhood = ...  # '''TODO'''
            nums += 1
    avg_neg_llh = ...  # '''TODO'''
    return log_lhood, avg_neg_llh
```
%% Cell type:code id: tags:
``` python
# Solution
def calculate_likelihoods(names: List[str], P: torch.Tensor) -> Tuple[float, float]:
log_lhood = 0.0
nums = 0
for n in names:
chs = ["<S>"] + list(n) + ["<E>"]
for pre, post in zip(chs[:-1], chs[1:]):
ix1 = stoi[pre]
ix2 = stoi[post]
prob = P[ix1, ix2]
lprob = float(torch.log(prob))
log_lhood += lprob
nums += 1
avg_neg_llh = -log_lhood / nums
return log_lhood, avg_neg_llh
```
%% Cell type:markdown id: tags:
### Log-Likelihood Comparison
%% Cell type:code id: tags:
``` python
log_lhood_P_equal, avg_neg_llh_P_equal = calculate_likelihoods(names, P_equal)
print(f"Log-likelihood: {log_lhood_P_equal}")
print(f"Average negative log-likelihood: {avg_neg_llh_P_equal:.4f}")
```
%% Cell type:markdown id: tags:
In general, a good model should minimize the average negative log-likelihood loss.
When we compare the "baseline" with equal probabilities to our model `P`, we should observe an improvement in both likelihood metrics.
%% Cell type:code id: tags:
``` python
log_lhood_P, avg_neg_llh_P = calculate_likelihoods(names, P)
print(f"Log-likelihood: {log_lhood_P}")
print(f"Average negative log-likelihood: {avg_neg_llh_P:.4f}")
```
%% Cell type:markdown id: tags:
It looks like our model does indeed perform better than the baseline 🥳
%% Cell type:markdown id: tags:
## Limitations
A character-level model using explicit probabilities to predict the next likely character is a useful first step in understanding text generation. However, even though it can generate plausible-sounding names, it has its limitations.
%% Cell type:markdown id: tags:
### ✅ Task 5
Discuss the shortcomings of a character-level model like the one we just implemented.
%% Cell type:markdown id: tags:
### Solution
Limitations of a character-level model using explicit probabilities:
1. **Inefficiency**: Character-level models can be computationally expensive and slow, as they need to process each character individually. This can lead to longer training times and slower inference speeds.
2. **Non-scalable**: As the context length grows (e.g., moving from bigrams to longer n-grams), the number of counts to store grows exponentially with the vocabulary size. This makes it difficult to scale to longer contexts or larger vocabularies.
3. **Lack of semantic understanding**: Character-level models do not inherently understand the meaning of words or sentences. They operate purely on the level of individual characters, which limits their ability to capture the semantic context of the text.
4. **Inability to capture long-range dependencies**: These models often struggle with capturing dependencies that span over long distances in the text. This is because they focus on immediate next-character predictions without considering the broader context, leading to less coherent and meaningful text generation.
%% Cell type:markdown id: tags:
Next week, as a refresher on deep neural networks, we will look at implicit, more abstract ways of modeling and learning these probabilities.