llms-from-scratch-cn/Translated_Book/ch02/2.6使用滑动窗口进行数据采样.ipynb
2026-03-26 12:16:45 +08:00

688 lines
20 KiB
Plaintext
Raw Permalink Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

{
"cells": [
{
"cell_type": "markdown",
"id": "bda8a8f9",
"metadata": {},
"source": [
"# 2.6 使用滑动窗口进行数据采样"
]
},
{
"cell_type": "markdown",
"id": "be50c3e5",
"metadata": {},
"source": [
"上一节我们详细介绍了 token 化步骤以及将字符串 token 转换为整数 token ID 的过程。在创建 LLM 的 Embedding 之前,我们需要生成训练 LLM 所需的输入-目标input-target对。"
]
},
{
"cell_type": "markdown",
"id": "c244c8c8",
"metadata": {},
"source": [
"这些输入-目标对是什么样的呢?正如我们在第 1 章中学到的LLM 是通过预测文本中的下一个单词来进行预训练的,如图 2.1 所示。"
]
},
{
"cell_type": "markdown",
"id": "3e13d6b8",
"metadata": {},
"source": [
"**图 2.12 给定一个文本样本,提取输入块作为 LLM 的输入子样本LLM 在训练期间的任务是预测输入块之后的下一个单词。在训练过程中,我们会屏蔽掉目标词之后的所有单词。请注意,在 LLM 处理文本之前,该文本已经进行 token 化,为了便于说明,此图省略了 token 化步骤。**"
]
},
{
"cell_type": "markdown",
"id": "e843aae2",
"metadata": {},
"source": [
"![fig2.12](../img/fig-2-12.jpg)"
]
},
{
"cell_type": "markdown",
"id": "a4b01652",
"metadata": {},
"source": [
"本章,我们实现了一个数据加载器,它使用滑动窗口方法从训练数据集中获取图 2.12 中描述的输入-目标对。"
]
},
{
"cell_type": "markdown",
"id": "4687ffa8",
"metadata": {},
"source": [
"首先,我们将使用前一节介绍的 BPE 分词器对整个《The Verdict》短篇故事进行分词。"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "b90a181f",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"5145\n"
]
}
],
"source": [
"import re\n",
"import tiktoken\n",
"\n",
"tokenizer = tiktoken.get_encoding(\"gpt2\")\n",
"with open(\"the-verdict.txt\", \"r\", encoding=\"utf-8\") as f:\n",
" raw_text = f.read()\n",
"enc_text = tokenizer.encode(raw_text)\n",
"print(len(enc_text))"
]
},
{
"cell_type": "markdown",
"id": "0a6c385a",
"metadata": {},
"source": [
"对训练集应用 BPE 分词器后获得 5145 个 tokens。"
]
},
{
"cell_type": "markdown",
"id": "c65d24dc",
"metadata": {},
"source": [
"接下来,我们将从数据集中剔除前 50 个 toekns以便在后续步骤中展示更吸引人的文本段落"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "fd0174a6",
"metadata": {},
"outputs": [],
"source": [
"enc_sample = enc_text[50:]"
]
},
{
"cell_type": "markdown",
"id": "8ef4320b",
"metadata": {},
"source": [
"在创建下一个单词预测任务的输入-目标对时,一种简单直观的方法是创建两个变量 x 和 y。其中x 用于存储输入的 token 序列,而 y 则用于存放目标 token 序列。目标序列由输入序列中的每个 token 向右移动一个位置构成。从而形成了输入-目标对。"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "6aba6df3",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"x: [290, 4920, 2241, 287]\n",
"y: [4920, 2241, 287, 257]\n"
]
}
],
"source": [
"context_size = 4 #A\n",
"x = enc_sample[:context_size]\n",
"y = enc_sample[1:context_size+1]\n",
"print(f\"x: {x}\")\n",
"print(f\"y: {y}\")"
]
},
{
"cell_type": "markdown",
"id": "d2d58328",
"metadata": {},
"source": [
"运行上述代码会打印出以下输出结果:"
]
},
{
"cell_type": "markdown",
"id": "58311589",
"metadata": {},
"source": [
"```\n",
"x: [290, 4920, 2241, 287]\n",
"y: [4920, 2241, 287, 257]\n",
"```"
]
},
{
"cell_type": "markdown",
"id": "92f6bdd3",
"metadata": {},
"source": [
"我们通过将输入数据向右移动一个位置来生成对应的目标数据后。可以参照图 2.12,按照以下步骤创建下一个单词的预测任务:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "cf4f75b9",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[290] ----> 4920\n",
"[290, 4920] ----> 2241\n",
"[290, 4920, 2241] ----> 287\n",
"[290, 4920, 2241, 287] ----> 257\n"
]
}
],
"source": [
"for i in range(1, context_size+1):\n",
" context = enc_sample[:i]\n",
" desired = enc_sample[i]\n",
" print(context, \"---->\", desired)"
]
},
{
"cell_type": "markdown",
"id": "dc6a1a42",
"metadata": {},
"source": [
"上面的代码会打印出以下内容:"
]
},
{
"cell_type": "markdown",
"id": "ade7beab",
"metadata": {},
"source": [
"```\n",
"[290] ----> 4920\n",
"[290, 4920] ----> 2241\n",
"[290, 4920, 2241] ----> 287\n",
"[290, 4920, 2241, 287] ----> 257\n",
"```"
]
},
{
"cell_type": "markdown",
"id": "5e1a7575",
"metadata": {},
"source": [
"箭头(---->)左边的内容指的是 LLM 接收到的输入,箭头右边的 token ID 代表 LLM 应该预测的目标 token ID。"
]
},
{
"cell_type": "markdown",
"id": "37865c93",
"metadata": {},
"source": [
"为了便于更好地理解,我们重复之前的代码,但这次我们将 token ID 转换回文本:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "def4b071",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" and ----> established\n",
" and established ----> himself\n",
" and established himself ----> in\n",
" and established himself in ----> a\n"
]
}
],
"source": [
"for i in range(1, context_size+1):\n",
" context = enc_sample[:i]\n",
" desired = enc_sample[i]\n",
" print(tokenizer.decode(context), \"---->\", tokenizer.decode([desired]))"
]
},
{
"cell_type": "markdown",
"id": "c5326f0b",
"metadata": {},
"source": [
"下面的输出显示了输入和输出的文本格式:"
]
},
{
"cell_type": "markdown",
"id": "18fb7f17",
"metadata": {},
"source": [
"```\n",
"and ----> established\n",
"and established ----> himself\n",
"and established himself ----> in\n",
"and established himself in ----> a\n",
"```"
]
},
{
"cell_type": "markdown",
"id": "6962cc5f",
"metadata": {},
"source": [
"现在,我们已经创建了输入-目标对,我们可以在接下来的章节中其进行 LLM 训练了"
]
},
{
"cell_type": "markdown",
"id": "f0031bc0",
"metadata": {},
"source": [
"在我们将 token 转换为 embedding 向量之前,还有最后一个任务要完成。正如我们在本章开头提到的,还需要实现一个高效的数据加载器,它遍历输入数据集并返回输入-目标对。这些输入和目标都是 PyTorch 张量,可以理解为多维数组。"
]
},
{
"cell_type": "markdown",
"id": "fe61ab29",
"metadata": {},
"source": [
"具体来说,我们希望返回两个张量:一个是输入张量,包含 LLM 看到的文本;另一个是目标张量,包含 LLM 要预测的目标,如图 2.13 所示。"
]
},
{
"cell_type": "markdown",
"id": "0e70acbd",
"metadata": {},
"source": [
"**图 2.13 为了实现高效的数据加载器,我们将所有输入存存储到一个名为 x 的张量中,其中每一行都代表一个输入上下文。同时,我们创建了另一个名为 y 的张量,用于存储对应的预测目标(即下一个单词),这些目标是通过将输入内容向右移动一个位置得到的。**"
]
},
{
"cell_type": "markdown",
"id": "70af4d55",
"metadata": {},
"source": [
"![fig2.13](../img/fig-2-13.jpg)"
]
},
{
"cell_type": "markdown",
"id": "f6ee3a31",
"metadata": {},
"source": [
"为了便于理解,图 2.13 显示的是字符串格式的 token但在代码实现中我们将直接操作 token ID。这是因为 BPE 分词器的 `encode` 方法将分词和转换为 token ID 两个步骤合并为一步。"
]
},
{
"cell_type": "markdown",
"id": "e70a9223",
"metadata": {},
"source": [
"为了实现高效的数据加载器,我们将使用 PyTorch 内置的 Dataset 和 DataLoader 类。有关安装 PyTorch 的其他信息和指导,请参阅附录 A 中的第 A.1.3 节 \"安装 PyTorch\"。"
]
},
{
"cell_type": "markdown",
"id": "86aa6eb9",
"metadata": {},
"source": [
"数据集类的代码如代码示例 2.5 所示:"
]
},
{
"cell_type": "markdown",
"id": "38b4407e",
"metadata": {},
"source": [
"### 代码示例 2.5 一个用于批处理输入和目标的数据集"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "fd2603c6",
"metadata": {},
"outputs": [],
"source": [
"import torch\n",
"from torch.utils.data import Dataset, DataLoader\n",
"class GPTDatasetV1(Dataset):\n",
" def __init__(self, txt, tokenizer, max_length, stride):\n",
" self.tokenizer = tokenizer\n",
" self.input_ids = []\n",
" self.target_ids = []\n",
" token_ids = tokenizer.encode(txt) #A\n",
" for i in range(0, len(token_ids) - max_length, stride): #B\n",
" input_chunk = token_ids[i:i + max_length]\n",
" target_chunk = token_ids[i + 1: i + max_length + 1]\n",
" self.input_ids.append(torch.tensor(input_chunk))\n",
" self.target_ids.append(torch.tensor(target_chunk))\n",
"\n",
" def __len__(self): #C\n",
" return len(self.input_ids)\n",
" def __getitem__(self, idx): #D\n",
" return self.input_ids[idx], self.target_ids[idx]"
]
},
{
"cell_type": "markdown",
"id": "c7be19f1",
"metadata": {},
"source": [
"在代码清单 2.5 中,我们构建了一个名为`GPTDatasetV1`的类,它是 PyTorch 的`Dataset`类的子类。此类规定了如何从数据集中抽取单个样本,每个样本包含一定数量的 token ID这些 ID 存储在`input_chunk`张量中(数量由`max_length`参数决定)。并用`target_chunk`张量保存与输入相对应的目标。为了更深入地理解这个过程,建议你继续阅读,看看当数据集与 PyTorch 的 DataLoader 结合使用时,返回的数据具体是什么样的。"
]
},
{
"cell_type": "markdown",
"id": "661df5f2",
"metadata": {},
"source": [
"如果你对 PyTorch 的 Dataset 类的结构不太熟悉,如代码清单 2.5 所示。我建议你阅读附录 A 的第 A.6 节 \"构建高效的数据加载器\"。在那里,你将找到关于 PyTorch 的 Dataset 和 DataLoader 类的基本结构和使用方法的详细解释。"
]
},
{
"cell_type": "markdown",
"id": "784a3f78",
"metadata": {},
"source": [
"以下代码将使用`GPTDatasetV1`通过 PyTorch 的`DataLoader`来批量加载输入数据:"
]
},
{
"cell_type": "markdown",
"id": "c8b11e56",
"metadata": {},
"source": [
"### 代码清单 2.6,用于生成输入-目标对的批次数据加载器:"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "642ff507",
"metadata": {},
"outputs": [],
"source": [
"def create_dataloader_v1(txt, batch_size=4,\n",
" max_length=256, stride=128, shuffle=True, drop_last=True):\n",
" tokenizer = tiktoken.get_encoding(\"gpt2\") #A\n",
" dataset = GPTDatasetV1(txt, tokenizer, max_length, stride) #B\n",
" dataloader = DataLoader(\n",
" dataset, batch_size=batch_size, shuffle=shuffle, drop_last=drop_last) #C\n",
" return dataloader"
]
},
{
"cell_type": "markdown",
"id": "519d898d",
"metadata": {},
"source": [
"为了更直观地理解代码清单 2.5 中的`GPTDatasetV1`类和代码清单 2.6 中的`create_dataloader_v1`函数如何协同工作我们将在上下文大小context size为 4 的 LLM 中测试批量大小batch size为 1 的数据加载器:"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "3197531a",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[tensor([[ 40, 367, 2885, 1464]]), tensor([[ 367, 2885, 1464, 1807]])]\n"
]
}
],
"source": [
"with open(\"the-verdict.txt\", \"r\", encoding=\"utf-8\") as f:\n",
" raw_text = f.read()\n",
" dataloader = create_dataloader_v1(\n",
" raw_text, batch_size=1, max_length=4, stride=1, shuffle=False)\n",
" data_iter = iter(dataloader) #A\n",
" first_batch = next(data_iter)\n",
" print(first_batch)"
]
},
{
"cell_type": "markdown",
"id": "9b3335ce",
"metadata": {},
"source": [
"执行上面的代码会打印出以下内容:"
]
},
{
"cell_type": "markdown",
"id": "9ef651c1",
"metadata": {},
"source": [
"```\n",
"[tensor([[ 40, 367, 2885, 1464]]), tensor([[ 367, 2885, 1464, 1807]])]\n",
"```"
]
},
{
"cell_type": "markdown",
"id": "1e401b36",
"metadata": {},
"source": [
"`first_batch`变量包含两个张量:第一个张量存储输入的 token ID第二个张量存储目标的 token ID。由于`max_length`为4因此这两个张量都只包含 4 个 toekn ID。需要注意的是这里的输入大小 4 是相对较小的,仅用于演示。在实际训练语言模型时,输入大小通常至少为 256。\n",
"\n",
"为了说明`stride=1`的含义,让我们从这个数据集中获取另一个批次的数据:"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[tensor([[ 367, 2885, 1464, 1807]]), tensor([[2885, 1464, 1807, 3619]])]\n"
]
}
],
"source": [
"second_batch = next(data_iter)\n",
"print(second_batch)"
]
},
{
"cell_type": "markdown",
"id": "59f6144c",
"metadata": {},
"source": [
"第二批数据如下"
]
},
{
"cell_type": "markdown",
"id": "62d0bc04",
"metadata": {},
"source": [
"```\n",
"[tensor([[ 367, 2885, 1464, 1807]]), tensor([[2885, 1464, 1807, 3619]])]\n",
"```"
]
},
{
"cell_type": "markdown",
"id": "3e2daa0c",
"metadata": {},
"source": [
"通过比较第一批和第二批数据,我们可以发现第二批的 token ID 相较于第一批向右移动了一个位置(例如,第一批输入的第二个 ID 是 367这正是第二批输入的第一个 ID。`stride`参数决定了输入在各批次之间移动的位置数,这模拟了滑动窗口的概念,如图 2.14 所示。"
]
},
{
"cell_type": "markdown",
"id": "5a866b46",
"metadata": {},
"source": [
"**图 2.14 在从输入数据集创建多个批次的过程中,我们会在文本上滑动一个输入窗口。如果步长设定为 1那么在生成下一个批次时我们会将输入窗口向右移动 1 个位置。如果步长设定为输入窗口的大小,那么就可以避免批次之间的重叠。**"
]
},
{
"cell_type": "markdown",
"id": "3b942805",
"metadata": {},
"source": [
"![fig2.14](../img/fig-2-14.jpg)"
]
},
{
"cell_type": "markdown",
"id": "2b88ecbe",
"metadata": {},
"source": [
"**练习 2.2 用不同步长和上下文大小的数据加载器加载数据**"
]
},
{
"cell_type": "markdown",
"id": "93820358",
"metadata": {},
"source": [
"为了进一步理解数据加载器的工作原理,你可以尝试使用不同的设置运行它,例如设置`max_length=2`和`stride=2`,或者设置`max_length=8`和`stride=2`。"
]
},
{
"cell_type": "markdown",
"id": "d6104e53",
"metadata": {},
"source": [
"到目前为止,我们从数据加载器中采样的批次大小都是 1这对演示来说非常有帮助。如果你有深度学习的经验你可能知道较小的批次大小在训练过程中需要较少的内存但可能会导致模型更新时受到更多噪声的影响。就像在常规的深度学习中一样批次大小是一个需要在训练 LLM 时进行权衡和实验的超参数。"
]
},
{
"cell_type": "markdown",
"id": "5c392de9",
"metadata": {},
"source": [
"在我们深入探讨如何从 token ID 创建 Embedding 向量的最后两节内容之前,让我们先简要了解一下如何使用数据加载器进行批大小大于 1 的采样。"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "344db195",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Inputs:\n",
" tensor([[ 306, 11, 475, 465],\n",
" [ 338, 523, 14348, 0],\n",
" [ 423, 284, 1394, 26148],\n",
" [ 465, 1182, 284, 804],\n",
" [ 355, 339, 3830, 612],\n",
" [22665, 4252, 10899, 13],\n",
" [ 13, 383, 18098, 373],\n",
" [ 683, 1207, 8344, 803]])\n",
"\n",
"Targets:\n",
" tensor([[ 11, 475, 465, 2951],\n",
" [ 523, 14348, 0, 632],\n",
" [ 284, 1394, 26148, 526],\n",
" [ 1182, 284, 804, 510],\n",
" [ 339, 3830, 612, 290],\n",
" [ 4252, 10899, 13, 198],\n",
" [ 383, 18098, 373, 3940],\n",
" [ 1207, 8344, 803, 306]])\n"
]
}
],
"source": [
"dataloader = create_dataloader_v1(raw_text, batch_size=8, max_length=4, stride=4)\n",
"data_iter = iter(dataloader)\n",
"inputs, targets = next(data_iter)\n",
"print(\"Inputs:\\n\", inputs)\n",
"print(\"\\nTargets:\\n\", targets)"
]
},
{
"cell_type": "markdown",
"id": "4cf19cea",
"metadata": {},
"source": [
"输出如下内容:"
]
},
{
"cell_type": "markdown",
"id": "3e9b66ba",
"metadata": {},
"source": [
"```\n",
"Inputs:\n",
" tensor([[ 306, 11, 475, 465],\n",
" [ 338, 523, 14348, 0],\n",
" [ 423, 284, 1394, 26148],\n",
" [ 465, 1182, 284, 804],\n",
" [ 355, 339, 3830, 612],\n",
" [22665, 4252, 10899, 13],\n",
" [ 13, 383, 18098, 373],\n",
" [ 683, 1207, 8344, 803]])\n",
"\n",
"Targets:\n",
" tensor([[ 11, 475, 465, 2951],\n",
" [ 523, 14348, 0, 632],\n",
" [ 284, 1394, 26148, 526],\n",
" [ 1182, 284, 804, 510],\n",
" [ 339, 3830, 612, 290],\n",
" [ 4252, 10899, 13, 198],\n",
" [ 383, 18098, 373, 3940],\n",
" [ 1207, 8344, 803, 306]])\n",
"```"
]
},
{
"cell_type": "markdown",
"id": "3896585c",
"metadata": {},
"source": [
"请注意,我们将步长增加到了 4。这是为了充分利用数据集不会跳过任何一个词同时也避免了批次之间的重叠。过多的重叠可能会增加过拟合的风险。"
]
},
{
"cell_type": "markdown",
"id": "c3438e06",
"metadata": {},
"source": [
"在本章的接下来两节中,我们将实现 Embedding 层。其作用是将 token ID 转换为连续的向量表示,这将用作 LLM 的输入。"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.14"
}
},
"nbformat": 4,
"nbformat_minor": 5
}