mirror of
https://github.com/datawhalechina/llms-from-scratch-cn.git
synced 2026-05-01 11:58:17 +08:00
{
"cells": [
{
"cell_type": "markdown",
"id": "bae559a1",
"metadata": {},
"source": [
"# Chapter 4: Implementing a GPT Model from Scratch to Generate Text"
]
},
{
"cell_type": "markdown",
"id": "e3b02c54",
"metadata": {},
"source": [
"**This chapter covers**:"
]
},
{
"cell_type": "markdown",
"id": "65964c57",
"metadata": {},
"source": [
"- Coding a GPT-like large language model (LLM) that can be trained to generate human-like text\n",
"- Normalizing layer activations to stabilize neural network training\n",
"- Adding shortcut connections in deep neural networks to train models more effectively\n",
"- Implementing transformer blocks to create GPT models of various sizes\n",
"- Computing the number of parameters and storage requirements of GPT models"
]
},
{
"cell_type": "markdown",
"id": "73c209fb",
"metadata": {},
"source": [
"\t\tIn the previous chapter, you learned about and coded the multi-head attention mechanism, one of the core components of LLMs. In this chapter, we will now code the other building blocks of an LLM and assemble them into a GPT-like model that we will train in the next chapter to generate human-like text, as illustrated in figure 4.1."
]
},
{
"cell_type": "markdown",
"id": "3c5efc3f",
"metadata": {},
"source": [
"Figure 4.1 A mental model of the three main stages of coding an LLM: pretraining the LLM on a general text dataset and fine-tuning it on a labeled dataset. This chapter focuses on implementing the LLM architecture, which we will train in the next chapter."
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "5f4fdd61",
"metadata": {},
"source": [
""
]
},
{
"cell_type": "markdown",
"id": "783519e0",
"metadata": {},
"source": [
"\t\tThe LLM architecture referenced in figure 4.1 consists of several building blocks that we will implement throughout this chapter. In the next section, we will start with a top-down view of the model architecture before covering the individual components in more detail."
]
},
{
"cell_type": "markdown",
"id": "708fa8b4",
"metadata": {},
"source": [
"## 4.1 Coding an LLM architecture"
]
},
{
"cell_type": "markdown",
"id": "1beee80f",
"metadata": {},
"source": [
"\t\tLLMs, such as GPT (which stands for Generative Pretrained Transformer), are large deep neural network architectures designed to generate new text one word (or token) at a time. However, despite their size, the model architecture is less complicated than you might think, since many of its components are repeated, as we will see later. Figure 4.2 provides a top-down view of a GPT-like LLM, with its main components highlighted."
]
},
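{
"cell_type": "markdown",
"id": "a1f09c3e",
"metadata": {},
"source": [
"\t\tTo make the idea of generating text one token at a time more concrete, the following cell sketches a minimal greedy decoding loop. This is an illustrative sketch rather than the implementation developed in this book; it only assumes a hypothetical model that maps token IDs of shape (batch, num_tokens) to logits of shape (batch, num_tokens, vocab_size), which is exactly the interface of the GPT model we build in this chapter."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a1f09c3f",
"metadata": {},
"outputs": [],
"source": [
"# Illustrative sketch only (a proper generation function is developed later in the book):\n",
"# repeatedly append the most likely next token to the current context.\n",
"import torch\n",
"\n",
"def greedy_generate_sketch(model, idx, max_new_tokens):\n",
"    # idx is a (batch, num_tokens) tensor of token IDs in the current context\n",
"    for _ in range(max_new_tokens):\n",
"        with torch.no_grad():\n",
"            logits = model(idx)                 # (batch, num_tokens, vocab_size)\n",
"        last_logits = logits[:, -1, :]          # only the last position predicts the next token\n",
"        next_id = torch.argmax(last_logits, dim=-1, keepdim=True)  # greedy choice\n",
"        idx = torch.cat([idx, next_id], dim=1)  # append the new token to the context\n",
"    return idx"
]
},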
{
"cell_type": "markdown",
"id": "f58ad472",
"metadata": {},
"source": [
"Figure 4.2 A mental model of a GPT model. Next to the embedding layers, it consists of one or more transformer blocks containing the masked multi-head attention module we implemented in the previous chapter."
]
},
{
"cell_type": "markdown",
"id": "ee7b74da",
"metadata": {},
"source": [
""
]
},
{
"cell_type": "markdown",
"id": "f7237970",
"metadata": {},
"source": [
"\t\tAs shown in figure 4.2, we have already covered several aspects, such as input tokenization and embedding, as well as the masked multi-head attention module. The focus of this chapter will be on implementing the core structure of the GPT model, including its transformer blocks, which we will then train in the next chapter to generate human-like text."
]
},
{
"cell_type": "markdown",
"id": "542d3ae9",
"metadata": {},
"source": [
"\t\tIn the previous chapters, we used small embedding dimensions for simplicity, ensuring that the concepts and examples could comfortably fit on a single page. Now, in this chapter, we are scaling up to the size of a small GPT-2 model, specifically the smallest version with 124 million parameters, as described in the paper \"Language Models are Unsupervised Multitask Learners\" by Radford et al. Note that while the original report states 117 million parameters, this was later corrected."
]
},
{
"cell_type": "markdown",
"id": "401bb3ab",
"metadata": {},
"source": [
"\t\tChapter 6 will focus on loading pretrained weights into our implementation and adapting it to the larger GPT-2 models with 345, 762, and 1,542 million parameters. In the context of deep learning and LLMs like GPT, the term \"parameters\" refers to the trainable weights of the model. These weights are essentially the internal variables of the model that are adjusted and optimized during training to minimize a specific loss function. This optimization allows the model to learn from the training data."
]
},
{
"cell_type": "markdown",
"id": "b075ff94",
"metadata": {},
"source": [
"\t\tFor example, in a neural network layer represented by a 2,048 x 2,048-dimensional weight matrix (or tensor), each element of this matrix is a parameter. Since there are 2,048 rows and 2,048 columns, the total number of parameters in this layer is 2,048 multiplied by 2,048, which equals 4,194,304 parameters."
]
},
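{
"cell_type": "markdown",
"id": "b075ff95",
"metadata": {},
"source": [
"\t\tThe following cell is a quick, illustrative check of this arithmetic (it is not part of the original text): a PyTorch linear layer without bias stores exactly one weight per input-output connection, so a 2,048 x 2,048 layer should report 4,194,304 trainable parameters."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b075ff96",
"metadata": {},
"outputs": [],
"source": [
"# Illustrative check: count the parameters of a 2,048 x 2,048 linear layer without bias\n",
"import torch.nn as nn\n",
"\n",
"layer = nn.Linear(2048, 2048, bias=False)\n",
"num_params = sum(p.numel() for p in layer.parameters())\n",
"print(num_params)       # 4194304\n",
"print(2048 * 2048)      # 4194304"
]
},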
{
"cell_type": "markdown",
"id": "7a9a4bae",
"metadata": {},
"source": [
"**GPT-2 vs. GPT-3**"
]
},
{
"cell_type": "markdown",
"id": "3f0ffc50",
"metadata": {},
"source": [
"\t\tNote that we are focusing on GPT-2 because OpenAI has made the weights of the pretrained model publicly available, which we will load into our implementation in chapter 6. GPT-3 is fundamentally the same in terms of model architecture, except that it is scaled up from 1.5 billion parameters in GPT-2 to 175 billion parameters in GPT-3, and it is trained on more data. As of this writing, the weights for GPT-3 are not publicly available. GPT-2 is also a better choice for learning how to implement LLMs, as it can be run on a single laptop computer, whereas GPT-3 requires a GPU cluster for training and inference. According to Lambda Labs, it would take 355 years to train GPT-3 on a single V100 datacenter GPU and 665 years on a consumer RTX 8000 GPU."
]
},
{
"cell_type": "markdown",
"id": "47fa26bc",
"metadata": {},
"source": [
"\t\tWe specify the configuration of the small GPT-2 model via the following Python dictionary, which we will use in the code examples later:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "eb54f9ff",
"metadata": {},
"outputs": [],
"source": [
"GPT_CONFIG_124M = {\n",
"    \"vocab_size\": 50257,     # Vocabulary size\n",
"    \"context_length\": 1024,  # Context length\n",
"    \"emb_dim\": 768,          # Embedding dimension\n",
"    \"n_heads\": 12,           # Number of attention heads\n",
"    \"n_layers\": 12,          # Number of layers\n",
"    \"drop_rate\": 0.1,        # Dropout rate\n",
"    \"qkv_bias\": False        # Query-Key-Value bias\n",
"}"
]
},
{
"cell_type": "markdown",
"id": "9c5bdac9",
"metadata": {},
"source": [
"\t\tIn the GPT_CONFIG_124M dictionary, we use concise variable names for clarity and to prevent long lines of code:"
]
},
{
"cell_type": "markdown",
"id": "5578f3e3",
"metadata": {},
"source": [
"- \"vocab_size\" refers to a vocabulary of 50,257 words, as used by the BPE tokenizer from chapter 2.\n",
"- \"context_length\" denotes the maximum number of input tokens the model can handle via the positional embeddings discussed in chapter 2.\n",
"- \"emb_dim\" represents the embedding size, transforming each token into a 768-dimensional vector.\n",
"- \"n_heads\" indicates the count of attention heads in the multi-head attention mechanism implemented in chapter 3 (a small consistency check between \"emb_dim\" and \"n_heads\" follows after this list).\n",
"- \"n_layers\" specifies the number of transformer blocks in the model, which will be elaborated on in upcoming sections.\n",
"- \"drop_rate\" indicates the intensity of the dropout mechanism (0.1 implies a 10% drop of hidden units) to prevent overfitting, as covered in chapter 3.\n",
"- \"qkv_bias\" determines whether to include a bias vector in the Linear layers of the multi-head attention for query, key, and value computations. We will initially disable this, following the norms of modern LLMs, but we will revisit it in chapter 6 when we load pretrained GPT-2 weights from OpenAI into our model."
]
},
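{
"cell_type": "markdown",
"id": "5578f3e4",
"metadata": {},
"source": [
"\t\tThe following cell is a small, illustrative sanity check on these values (it is not part of the original text; the name head_dim is introduced here only for clarity): multi-head attention splits each 768-dimensional embedding across the attention heads, so \"emb_dim\" must be divisible by \"n_heads\"."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5578f3e5",
"metadata": {},
"outputs": [],
"source": [
"# Illustrative sanity check on the configuration values\n",
"assert GPT_CONFIG_124M[\"emb_dim\"] % GPT_CONFIG_124M[\"n_heads\"] == 0\n",
"head_dim = GPT_CONFIG_124M[\"emb_dim\"] // GPT_CONFIG_124M[\"n_heads\"]\n",
"print(\"Dimension per attention head:\", head_dim)   # 768 / 12 = 64"
]
},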
{
"cell_type": "markdown",
"id": "fc5fd3bf",
"metadata": {},
"source": [
"\t\tUsing the configuration above, we will start this chapter by implementing a GPT placeholder architecture (DummyGPTModel) in this section, as shown in figure 4.3. This will provide us with a big-picture view of how everything fits together and which other components we need to code in the upcoming sections to assemble the full GPT model architecture."
]
},
{
"cell_type": "markdown",
"id": "67566077",
"metadata": {},
"source": [
"Figure 4.3 A mental model outlining the order in which we code the GPT architecture. In this chapter, we will start with the GPT backbone, a placeholder architecture, before we get to the individual core pieces and eventually assemble them in a transformer block for the final GPT architecture."
]
},
{
"cell_type": "markdown",
"id": "64a4b7c6",
"metadata": {},
"source": [
""
]
},
{
"cell_type": "markdown",
"id": "dc685535",
"metadata": {},
"source": [
"\t\tThe numbered boxes shown in figure 4.3 illustrate the order in which we tackle the individual concepts required to code the final GPT architecture. We will start with step 1, a placeholder GPT backbone we call DummyGPTModel:"
]
},
{
"cell_type": "markdown",
"id": "f1f01dad",
"metadata": {},
"source": [
"**Listing 4.1 A placeholder GPT model architecture class**"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1406a604",
"metadata": {},
"outputs": [],
"source": [
"import torch\n",
"import torch.nn as nn\n",
"\n",
"class DummyGPTModel(nn.Module):\n",
"    def __init__(self, cfg):\n",
"        super().__init__()\n",
"        self.tok_emb = nn.Embedding(cfg[\"vocab_size\"], cfg[\"emb_dim\"])\n",
"        self.pos_emb = nn.Embedding(cfg[\"context_length\"], cfg[\"emb_dim\"])\n",
"        self.drop_emb = nn.Dropout(cfg[\"drop_rate\"])\n",
"        self.trf_blocks = nn.Sequential(\n",
"            *[DummyTransformerBlock(cfg) for _ in range(cfg[\"n_layers\"])])  #A Placeholder for the real TransformerBlock implemented later\n",
"        self.final_norm = DummyLayerNorm(cfg[\"emb_dim\"])  #B Placeholder for the real LayerNorm implemented later\n",
"        self.out_head = nn.Linear(\n",
"            cfg[\"emb_dim\"], cfg[\"vocab_size\"], bias=False\n",
"        )\n",
"\n",
"    def forward(self, in_idx):\n",
"        batch_size, seq_len = in_idx.shape\n",
"        tok_embeds = self.tok_emb(in_idx)\n",
"        pos_embeds = self.pos_emb(torch.arange(seq_len, device=in_idx.device))\n",
"        x = tok_embeds + pos_embeds\n",
"        x = self.drop_emb(x)\n",
"        x = self.trf_blocks(x)\n",
"        x = self.final_norm(x)\n",
"        logits = self.out_head(x)\n",
"        return logits\n",
"\n",
"class DummyTransformerBlock(nn.Module):  #C A simple placeholder class that will be replaced by a real TransformerBlock later\n",
"    def __init__(self, cfg):\n",
"        super().__init__()\n",
"\n",
"    def forward(self, x):  #D This block does nothing and just returns its input\n",
"        return x\n",
"\n",
"class DummyLayerNorm(nn.Module):  #E A simple placeholder class that will be replaced by a real LayerNorm later\n",
"    def __init__(self, normalized_shape, eps=1e-5):  #F The parameters here only mimic the LayerNorm interface\n",
"        super().__init__()\n",
"\n",
"    def forward(self, x):\n",
"        return x"
]
},
{
"cell_type": "markdown",
"id": "69969d27",
"metadata": {},
"source": [
"\t\tThe DummyGPTModel class in this code defines a simplified version of a GPT-like model using PyTorch's neural network module (nn.Module). The model architecture in the DummyGPTModel class consists of token and positional embeddings, dropout, a series of transformer blocks (DummyTransformerBlock), a final layer normalization (DummyLayerNorm), and a linear output layer (out_head). The configuration is passed in via a Python dictionary, for instance, the GPT_CONFIG_124M dictionary we created earlier."
]
},
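{
"cell_type": "markdown",
"id": "69969d28",
"metadata": {},
"source": [
"\t\tAs an illustrative aside (not part of the original text), we can already count the trainable parameters of this placeholder model. Note that the dummy transformer blocks and the dummy layer normalization contain no parameters yet, so the count below covers only the token embedding, the positional embedding, and the output head; the full GPT model implemented later in this chapter will have considerably more parameters."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "69969d29",
"metadata": {},
"outputs": [],
"source": [
"# Illustrative check: parameters of the placeholder model (embeddings and output head only,\n",
"# since the dummy transformer blocks and dummy layer norm have no parameters yet)\n",
"dummy_model = DummyGPTModel(GPT_CONFIG_124M)\n",
"total_params = sum(p.numel() for p in dummy_model.parameters())\n",
"print(f\"Total trainable parameters in the placeholder model: {total_params:,}\")"
]
},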
{
"cell_type": "markdown",
"id": "fcc70ece",
"metadata": {},
"source": [
"\t\tThe forward method describes the data flow through the model: it computes token and positional embeddings for the input indices, applies dropout, processes the data through the transformer blocks, applies normalization, and finally produces logits with the linear output layer."
]
},
{
"cell_type": "markdown",
"id": "1a13ef4b",
"metadata": {},
"source": [
"\t\tThe code above is already functional, as we will see later in this section after we prepare the input data. However, for now, note that in the code above we have used placeholders (DummyLayerNorm and DummyTransformerBlock) for the transformer block and layer normalization, which we will develop in later sections."
]
},
{
"cell_type": "markdown",
"id": "98fee0c5",
"metadata": {},
"source": [
"\t\tNext, we will prepare the input data and initialize a new GPT model to illustrate its usage. Building on the figures we saw in chapter 2, where we coded the tokenizer, figure 4.4 provides a high-level overview of how data flows in and out of a GPT model."
]
},
{
"cell_type": "markdown",
"id": "099417b4",
"metadata": {},
"source": [
"Figure 4.4 A big-picture overview showing how the input data is tokenized, embedded, and fed to the GPT model. Note that in the DummyGPTModel we coded earlier, the token embedding is handled inside the GPT model. In LLMs, the embedded input token dimension typically matches the output dimension. The output embeddings here represent the context vectors we discussed in chapter 3."
]
},
{
"cell_type": "markdown",
"id": "32ed7f95",
"metadata": {},
"source": [
""
]
},
{
"cell_type": "markdown",
"id": "c79dcc6a",
"metadata": {},
"source": [
"\t\tTo implement the steps shown in figure 4.4, we tokenize a batch consisting of two text inputs for the GPT model using the tiktoken tokenizer introduced in chapter 2:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7ea35069",
"metadata": {},
"outputs": [],
"source": [
"import tiktoken\n",
"tokenizer = tiktoken.get_encoding(\"gpt2\")\n",
"batch = []\n",
"txt1 = \"Every effort moves you\"\n",
"txt2 = \"Every day holds a\"\n",
"\n",
"batch.append(torch.tensor(tokenizer.encode(txt1)))\n",
"batch.append(torch.tensor(tokenizer.encode(txt2)))\n",
"batch = torch.stack(batch, dim=0)\n",
"print(batch)"
]
},
{
"cell_type": "markdown",
"id": "7b93c468",
"metadata": {},
"source": [
"\t\tThe resulting token IDs for the two texts are as follows:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2633edec",
"metadata": {},
"outputs": [],
"source": [
"tensor([[ 6109,  3626,  6100,   345],  #A The first row corresponds to txt1, the second to txt2\n",
"        [ 6109,  1110,  6622,   257]])"
]
},
{
"cell_type": "markdown",
"id": "272f3aaf",
"metadata": {},
"source": [
"\t\tNext, we initialize a new 124 million parameter DummyGPTModel instance and feed it the tokenized batch:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a33ee9db",
"metadata": {},
"outputs": [],
"source": [
"torch.manual_seed(123)\n",
"model = DummyGPTModel(GPT_CONFIG_124M)\n",
"logits = model(batch)\n",
"print(\"Output shape:\", logits.shape)\n",
"print(logits)"
]
},
{
"cell_type": "markdown",
"id": "416827b6",
"metadata": {},
"source": [
"\t\tThe model outputs, which are commonly referred to as logits, are as follows:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "94a253a1",
"metadata": {},
"outputs": [],
"source": [
"Output shape: torch.Size([2, 4, 50257])\n",
"tensor([[[-1.2034, 0.3201, -0.7130, ..., -1.5548, -0.2390, -0.4667],\n",
"         [-0.1192, 0.4539, -0.4432, ..., 0.2392, 1.3469, 1.2430],\n",
"         [ 0.5307, 1.6720, -0.4695, ..., 1.1966, 0.0111, 0.5835],\n",
"         [ 0.0139, 1.6755, -0.3388, ..., 1.1586, -0.0435, -1.0400]],\n",
"        [[-1.0908, 0.1798, -0.9484, ..., -1.6047, 0.2439, -0.4530],\n",
"         [-0.7860, 0.5581, -0.0610, ..., 0.4835, -0.0077, 1.6621],\n",
"         [ 0.3567, 1.2698, -0.6398, ..., -0.0162, -0.1296, 0.3717],\n",
"         [-0.2407, -0.7349, -0.5102, ..., 2.0057, -0.3694, 0.1814]]],\n",
"       grad_fn=<UnsafeViewBackward0>)"
]
},
{
"cell_type": "markdown",
"id": "0038eead",
"metadata": {},
"source": [
"\t\tThe output tensor has two rows corresponding to the two text samples. Each text sample consists of 4 tokens; each token is a 50,257-dimensional vector, which matches the size of the tokenizer's vocabulary."
]
},
{
"cell_type": "markdown",
"id": "c96a3546",
"metadata": {},
"source": [
"\t\tThe embedding has 50,257 dimensions because each of these dimensions refers to a unique token in the vocabulary. At the end of this chapter, when we implement the postprocessing code, we will convert these 50,257-dimensional vectors back into token IDs, which we can then decode into words."
]
},
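{
"cell_type": "markdown",
"id": "c96a3547",
"metadata": {},
"source": [
"\t\tAs a small, illustrative preview of that postprocessing step (not part of the original text), the following cell picks the most likely token ID along the vocabulary dimension of the logits and decodes it with the tiktoken tokenizer. Since the model weights are randomly initialized and untrained, the decoded text is expected to be gibberish at this point."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c96a3548",
"metadata": {},
"outputs": [],
"source": [
"# Illustrative preview: convert logits back into token IDs and decode them\n",
"token_ids = torch.argmax(logits, dim=-1)   # shape: (batch, num_tokens)\n",
"print(token_ids)\n",
"print(tokenizer.decode(token_ids[0].tolist()))"
]
},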
{
"cell_type": "markdown",
"id": "1ff3c403",
"metadata": {},
"source": [
"\t\tNow that we have taken a top-down look at the GPT architecture and its inputs and outputs, we will code the individual placeholders in the upcoming sections, starting with the real layer normalization class that will replace the DummyLayerNorm in the previous code."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.5"
}
},
"nbformat": 4,
"nbformat_minor": 5
}