Stages

Stage 1

1. Data preparation and sampling

2.


3. Input / Target pairs

Context window - the number of tokens the LLM can process at once before it predicts the next token. If the context window is 9, there are 9 input-target prediction tasks inside that window, each using one more token of context (see the snippet below).

  • Gemini 1.5 supported an input size of around 1 million tokens. Such a huge context window requires a huge amount of memory; Google overcame that challenge with the model's architecture.
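
A minimal sketch of those per-window prediction tasks, using made-up token IDs: with a context window of 9, one stretch of text yields 9 next-token predictions, each with a progressively longer context.

token_ids = [40, 367, 2885, 1464, 1807, 3619, 402, 271, 10899, 2138]  # made-up IDs
context_size = 9  # the context window

# Each step uses one more token of context to predict the token that follows it
for i in range(1, context_size + 1):
    context = token_ids[:i]
    target = token_ids[i]
    print(context, "---->", target)
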
Input-target pairs in the example - sliding window

Input-target matrix
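
To make the sliding window concrete, here is a small sketch (the token IDs, max_length, and stride values are illustrative) that stacks the overlapping chunks into an input matrix and a target matrix shifted by one token:

import torch

token_ids = list(range(20))  # placeholder token IDs
max_length = 4               # tokens per row of the matrix
stride = 4                   # how far the window slides between rows

inputs, targets = [], []
for i in range(0, len(token_ids) - max_length, stride):
    inputs.append(token_ids[i:i + max_length])            # one row of the input matrix
    targets.append(token_ids[i + 1:i + max_length + 1])   # same row shifted by one token

input_matrix = torch.tensor(inputs)    # shape: (num_rows, max_length)
target_matrix = torch.tensor(targets)  # shape: (num_rows, max_length)
print(input_matrix)
print(target_matrix)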

Data loader

import torch
from torch.utils.data import Dataset, DataLoader


class GPTDatasetV1(Dataset):
    def __init__(self, txt, tokenizer, max_length, stride):
        self.input_ids = []
        self.target_ids = []

        # Tokenize the entire text
        token_ids = tokenizer.encode(txt, allowed_special={"<|endoftext|>"})

        # Use a sliding window to chunk the book into overlapping sequences of max_length
        for i in range(0, len(token_ids) - max_length, stride):
            input_chunk = token_ids[i:i + max_length]
            target_chunk = token_ids[i + 1: i + max_length + 1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx]
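
A sketch of how GPTDatasetV1 might be wrapped in a DataLoader; the function name, the GPT-2 BPE tokenizer from tiktoken, and the batch_size/max_length/stride defaults below are illustrative choices, not fixed requirements.

import tiktoken


def create_dataloader(txt, batch_size=4, max_length=256, stride=128, shuffle=True):
    tokenizer = tiktoken.get_encoding("gpt2")  # byte-pair-encoding tokenizer used by GPT-2
    dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)
    return DataLoader(dataset, batch_size=batch_size, shuffle=shuffle, drop_last=True)


# Example usage on raw text:
# dataloader = create_dataloader(raw_text, batch_size=8, max_length=4, stride=4)
# inputs, targets = next(iter(dataloader))  # each has shape (batch_size, max_length)
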
Batches

The entire dataset is divided into batches. After one batch is processed, the model's parameters are updated, and then training moves on to the next batch (a sketch follows below).
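
A minimal sketch of that loop; the random token IDs, the tiny embedding + linear model, and the AdamW settings are placeholders standing in for the real data and LLM. The point is only that parameters are updated once per batch.

import torch
from torch.utils.data import DataLoader, TensorDataset

vocab_size = 50257  # GPT-2 BPE vocabulary size
# Random token IDs stand in for a real tokenized dataset
dataset = TensorDataset(
    torch.randint(0, vocab_size, (32, 4)),  # inputs
    torch.randint(0, vocab_size, (32, 4)),  # targets
)
dataloader = DataLoader(dataset, batch_size=8)

model = torch.nn.Sequential(               # tiny stand-in for a real LLM
    torch.nn.Embedding(vocab_size, 32),
    torch.nn.Linear(32, vocab_size),
)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

for input_batch, target_batch in dataloader:   # process one batch at a time
    optimizer.zero_grad()
    logits = model(input_batch)                # (batch_size, seq_len, vocab_size)
    loss = torch.nn.functional.cross_entropy(
        logits.flatten(0, 1), target_batch.flatten()
    )
    loss.backward()                            # gradients from this batch only
    optimizer.step()                           # update parameters, then move to the next batch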