mirror of
https://github.com/datawhalechina/llms-from-scratch-cn.git
synced 2026-01-14 01:07:34 +08:00
1513 lines
82 KiB
Plaintext
{
|
||
"cells": [
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "ce9295b2-182b-490b-8325-83a67c4a001d",
|
||
"metadata": {},
|
||
"source": [
"# Chapter 4: Implementing a GPT Model from Scratch"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "e7da97ed-e02f-4d7f-b68e-a0eba3716e02",
|
||
"metadata": {},
|
||
"source": [
"- In this chapter, we design a GPT-like large language model (LLM) architecture; the next chapter focuses on training this model."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "7d4f11e0-4434-4979-9dee-e1207df0eb01",
|
||
"metadata": {},
|
||
"source": [
|
||
"<img src=\"figures/mental-model.webp\" width=450px>"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "53fe99ab-0bcf-4778-a6b5-6db81fb826ef",
|
||
"metadata": {},
|
||
"source": [
"## 4.1 Coding an LLM architecture"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "ad72d1ff-d82d-4e33-a88e-3c1a8831797b",
|
||
"metadata": {},
|
||
"source": [
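"- Chapter 1 discussed models such as GPT and Llama, which are based on the decoder part of the transformer architecture and generate text sequentially.\n",
"- These LLMs are therefore often called decoder-only LLMs.\n",
"- Compared with conventional deep learning models, LLMs are larger because they have many more parameters, not because they require more code.\n",
"- As we will see, many elements of an LLM's architecture are repeated."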
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "5c5213e9-bd1c-437e-aee8-f5e8fb717251",
|
||
"metadata": {},
|
||
"source": [
|
||
"<img src=\"figures/mental-model-2.webp\" width=350px>"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "0d43f5e2-fb51-434a-b9be-abeef6b98d99",
|
||
"metadata": {},
|
||
"source": [
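"- In previous chapters, we used small embedding dimensions for token inputs and outputs so that the examples were easy to illustrate.\n",
"- In this chapter, we consider embedding and model sizes similar to those of the small GPT-2 model.\n",
"- Specifically, we implement the architecture of the smallest GPT-2 model (124M parameters), as outlined by Radford et al. in [Language Models are Unsupervised Multitask Learners](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf) (note that the parameter count of this model was initially reported as 117M and later corrected to 124M).\n",
"- Chapter 6 will show how to load pretrained weights into our GPT-2 implementation, which will also be compatible with the 345M, 762M, and 1542M parameter model sizes.\n",
"\n",
"> Translator's note: the GPT-2 paper, Language Models are Unsupervised Multitask Learners, misreported the parameter counts of the GPT-2 model family; the figures were later quietly corrected in the model repository.\n",
"> \n",
"> Reported (incorrect) parameter counts: Small (117M)\tMedium (345M)\tLarge (762M)\tXL (1542M)\n",
">\n",
"> Corrected parameter counts: Small (124M)\tMedium (355M)\tLarge (774M)\tXL (1558M)"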
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "21baa14d-24b8-4820-8191-a2808f7fbabc",
|
||
"metadata": {},
|
||
"source": [
"- The configuration details of the 124M-parameter GPT-2 model are as follows:"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 1,
|
||
"id": "5ed66875-1f24-445d-add6-006aae3c5707",
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
"GPT_CONFIG_124M = {\n",
"    \"vocab_size\": 50257,  # Vocabulary size\n",
"    \"ctx_len\": 1024,      # Context length\n",
"    \"emb_dim\": 768,       # Embedding dimension\n",
"    \"n_heads\": 12,        # Number of attention heads\n",
"    \"n_layers\": 12,       # Number of layers\n",
"    \"drop_rate\": 0.1,     # Dropout rate\n",
"    \"qkv_bias\": False     # Query-Key-Value bias\n",
"}"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "c12fcd28-d210-4c57-8be6-06cfcd5d73a4",
|
||
"metadata": {},
|
||
"source": [
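"- We use short variable names to avoid long lines of code later on.\n",
"- \"vocab_size\" is the vocabulary size of 50,257 tokens used by the BPE tokenizer introduced in chapter 2.\n",
"- \"ctx_len\" is the maximum number of input tokens the model supports; it is set by the positional embeddings introduced in chapter 2.\n",
"- \"emb_dim\" is the embedding size for the input tokens; here, each input token is converted into a 768-dimensional vector.\n",
"- \"n_heads\" is the number of attention heads in the multi-head attention mechanism implemented in chapter 3.\n",
"- \"n_layers\" is the number of transformer blocks in the model, which we implement in the upcoming sections.\n",
"- \"drop_rate\" is the intensity of the dropout mechanism discussed in chapter 3; 0.1 means that 10% of the hidden units are dropped during training to mitigate overfitting.\n",
"- \"qkv_bias\" determines whether the Linear layers in the multi-head attention mechanism (from chapter 3) include a bias vector when computing the query (Q), key (K), and value (V) tensors; modern LLMs usually disable this option, and so do we, but we revisit it in chapter 6 when loading the pretrained GPT-2 weights from OpenAI into our implementation."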
|
||
]
|
||
},
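The settings above imply a few derived quantities. The following is a minimal sketch in plain Python, assuming the usual multi-head attention convention that the embedding dimension is split evenly across the heads. The `config` dict only mirrors `GPT_CONFIG_124M`, and `head_dim` is an illustrative name that does not appear in the notebook.

```python
# Copy of the relevant values from GPT_CONFIG_124M above
config = {
    "vocab_size": 50257,
    "ctx_len": 1024,
    "emb_dim": 768,
    "n_heads": 12,
    "n_layers": 12,
}

# Assumption: each attention head gets an equal slice of the embedding
assert config["emb_dim"] % config["n_heads"] == 0
head_dim = config["emb_dim"] // config["n_heads"]

# Embedding layers are lookup tables: one row per token ID or position
tok_emb_params = config["vocab_size"] * config["emb_dim"]
pos_emb_params = config["ctx_len"] * config["emb_dim"]

print("Dimensions per head:", head_dim)
print(f"Token embedding parameters: {tok_emb_params:,}")
print(f"Positional embedding parameters: {pos_emb_params:,}")
```

The token embedding table alone accounts for tens of millions of parameters, which becomes relevant when the total model size is checked later in the chapter.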
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "4adce779-857b-4418-9501-12a7f3818d88",
|
||
"metadata": {},
|
||
"source": [
|
||
"<img src=\"figures/chapter-steps.webp\" width=350px>"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 60,
|
||
"id": "619c2eed-f8ea-4ff5-92c3-feda0f29b227",
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
"import torch\n",
"import torch.nn as nn\n",
"\n",
"\n",
"class DummyGPTModel(nn.Module):\n",
"    def __init__(self, cfg):\n",
"        super().__init__()\n",
"        self.tok_emb = nn.Embedding(cfg[\"vocab_size\"], cfg[\"emb_dim\"])\n",
"        self.pos_emb = nn.Embedding(cfg[\"ctx_len\"], cfg[\"emb_dim\"])\n",
"        self.drop_emb = nn.Dropout(cfg[\"drop_rate\"])\n",
"\n",
"        # Use a placeholder for TransformerBlock for now\n",
"        self.trf_blocks = nn.Sequential(\n",
"            *[DummyTransformerBlock(cfg) for _ in range(cfg[\"n_layers\"])])\n",
"\n",
"        # Use a placeholder for LayerNorm for now\n",
"        self.final_norm = DummyLayerNorm(cfg[\"emb_dim\"])\n",
"        self.out_head = nn.Linear(\n",
"            cfg[\"emb_dim\"], cfg[\"vocab_size\"], bias=False\n",
"        )\n",
"\n",
"    def forward(self, in_idx):\n",
"        batch_size, seq_len = in_idx.shape\n",
"        tok_embeds = self.tok_emb(in_idx)\n",
"        pos_embeds = self.pos_emb(torch.arange(seq_len, device=in_idx.device))\n",
"        x = tok_embeds + pos_embeds\n",
"        x = self.drop_emb(x)\n",
"        x = self.trf_blocks(x)\n",
"        x = self.final_norm(x)\n",
"        logits = self.out_head(x)\n",
"        return logits\n",
"\n",
"\n",
"class DummyTransformerBlock(nn.Module):\n",
"    def __init__(self, cfg):\n",
"        super().__init__()\n",
"        # A simple placeholder\n",
"\n",
"    def forward(self, x):\n",
"        # This block does nothing and returns its input unchanged\n",
"        return x\n",
"\n",
"\n",
"class DummyLayerNorm(nn.Module):\n",
"    def __init__(self, normalized_shape, eps=1e-5):\n",
"        super().__init__()\n",
"        # The parameters here only mimic the LayerNorm interface\n",
"\n",
"    def forward(self, x):\n",
"        # This layer does nothing and returns its input unchanged\n",
"        return x"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "9665e8ab-20ca-4100-b9b9-50d9bdee33be",
|
||
"metadata": {},
|
||
"source": [
|
||
"<img src=\"figures/gpt-in-out.webp\" width=350px>"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 61,
|
||
"id": "794b6b6c-d36f-411e-a7db-8ac566a87fee",
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"tensor([[6109, 3626, 6100, 345],\n",
|
||
" [6109, 1110, 6622, 257]])\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"import tiktoken\n",
|
||
"import torch\n",
|
||
"\n",
|
||
"tokenizer = tiktoken.get_encoding(\"gpt2\")\n",
|
||
"\n",
|
||
"batch = []\n",
|
||
"\n",
|
||
"txt1 = \"Every effort moves you\"\n",
|
||
"txt2 = \"Every day holds a\"\n",
|
||
"\n",
|
||
"batch.append(torch.tensor(tokenizer.encode(txt1)))\n",
|
||
"batch.append(torch.tensor(tokenizer.encode(txt2)))\n",
|
||
"batch = torch.stack(batch, dim=0)\n",
|
||
"print(batch)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 62,
|
||
"id": "009238cd-0160-4834-979c-309710986bb0",
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"Output shape: torch.Size([2, 4, 50257])\n",
|
||
"tensor([[[-1.2034, 0.3201, -0.7130, ..., -1.5548, -0.2390, -0.4667],\n",
|
||
" [-0.1192, 0.4539, -0.4432, ..., 0.2392, 1.3469, 1.2430],\n",
|
||
" [ 0.5307, 1.6720, -0.4695, ..., 1.1966, 0.0111, 0.5835],\n",
|
||
" [ 0.0139, 1.6755, -0.3388, ..., 1.1586, -0.0435, -1.0400]],\n",
|
||
"\n",
|
||
" [[-1.0908, 0.1798, -0.9484, ..., -1.6047, 0.2439, -0.4530],\n",
|
||
" [-0.7860, 0.5581, -0.0610, ..., 0.4835, -0.0077, 1.6621],\n",
|
||
" [ 0.3567, 1.2698, -0.6398, ..., -0.0162, -0.1296, 0.3717],\n",
|
||
" [-0.2407, -0.7349, -0.5102, ..., 2.0057, -0.3694, 0.1814]]],\n",
|
||
" grad_fn=<UnsafeViewBackward0>)\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"torch.manual_seed(123)\n",
|
||
"model = DummyGPTModel(GPT_CONFIG_124M)\n",
|
||
"\n",
|
||
"logits = model(batch)\n",
|
||
"print(\"Output shape:\", logits.shape)\n",
|
||
"print(logits)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "f8332a00-98da-4eb4-b882-922776a89917",
|
||
"metadata": {},
|
||
"source": [
"## 4.2 Normalizing activations with layer normalization"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "066cfb81-d59b-4d95-afe3-e43cf095f292",
|
||
"metadata": {},
|
||
"source": [
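"- Layer normalization, also known as LayerNorm ([Ba et al. 2016](https://arxiv.org/abs/1607.06450)), centers the activations of a neural network layer around a mean of 0 and normalizes their variance to 1.\n",
"- This stabilizes training and speeds up convergence.\n",
"- Layer normalization is applied both before and after the multi-head attention module within the transformer block, which we implement later; it is also applied before the final output layer."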
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "314ac47a-69cc-4597-beeb-65bed3b5910f",
|
||
"metadata": {},
|
||
"source": [
|
||
"<img src=\"figures/layernorm.webp\" width=350px>"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "5ab49940-6b35-4397-a80e-df8d092770a7",
|
||
"metadata": {},
|
||
"source": [
"- Let's see how layer normalization works by passing a small input sample through a simple neural network layer:"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 25,
|
||
"id": "79e1b463-dc3f-44ac-9cdb-9d5b6f64eb9d",
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"tensor([[0.2260, 0.3470, 0.0000, 0.2216, 0.0000, 0.0000],\n",
|
||
" [0.2133, 0.2394, 0.0000, 0.5198, 0.3297, 0.0000]],\n",
|
||
" grad_fn=<ReluBackward0>)\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"torch.manual_seed(123)\n",
|
||
"\n",
"# Create 2 training examples with 5 dimensions (features) each\n",
|
||
"batch_example = torch.randn(2, 5) \n",
|
||
"\n",
|
||
"layer = nn.Sequential(nn.Linear(5, 6), nn.ReLU())\n",
|
||
"out = layer(batch_example)\n",
|
||
"print(out)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "8fccc29e-71fc-4c16-898c-6137c6ea5d2e",
|
||
"metadata": {},
|
||
"source": [
"- Let's compute the mean and variance for each of the two inputs above:"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 26,
|
||
"id": "9888f79e-8e69-44aa-8a19-cd34292adbf5",
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"Mean:\n",
|
||
" tensor([[0.1324],\n",
|
||
" [0.2170]], grad_fn=<MeanBackward1>)\n",
|
||
"Variance:\n",
|
||
" tensor([[0.0231],\n",
|
||
" [0.0398]], grad_fn=<VarBackward0>)\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"mean = out.mean(dim=-1, keepdim=True)\n",
|
||
"var = out.var(dim=-1, keepdim=True)\n",
|
||
"\n",
|
||
"print(\"Mean:\\n\", mean)\n",
|
||
"print(\"Variance:\\n\", var)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "052eda3e-b395-48c4-acd4-eb8083bab958",
|
||
"metadata": {},
|
||
"source": [
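"- Layer normalization is applied to each input sample independently (the rows in the figure below); using `dim=-1` applies the calculation across the last dimension (the feature dimension) rather than the row dimension (the samples)."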
|
||
]
|
||
},
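The role of `dim=-1` can be illustrated without PyTorch. The following sketch uses plain Python lists on a hypothetical 2x3 batch; `row_means` plays the role of `mean(dim=-1)` and `col_means` the role of `mean(dim=0)`. Both names and the data are made up for illustration.

```python
# Two samples (rows) with three features (columns) each
batch = [
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0],
]

# Analogue of mean(dim=-1): average over the features of each sample
row_means = [sum(row) / len(row) for row in batch]

# Analogue of mean(dim=0): average each feature over the samples
col_means = [sum(col) / len(col) for col in zip(*batch)]

print("Per-sample means (dim=-1):", row_means)
print("Per-feature means (dim=0):", col_means)
```

Layer normalization uses the per-sample statistics, so each row is normalized independently of the other samples in the batch.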
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "570db83a-205c-4f6f-b219-1f6195dde1a7",
|
||
"metadata": {},
|
||
"source": [
|
||
"<img src=\"figures/layernorm2.webp\" width=350px>"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "9f8ecbc7-eb14-4fa1-b5d0-7e1ff9694f99",
|
||
"metadata": {},
|
||
"source": [
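"- Subtracting the mean and dividing by the square root of the variance (the standard deviation) centers each input so that it has a mean of 0 and a variance of 1 across the column (feature) dimension:"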
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 27,
|
||
"id": "9a1d1bb9-3341-4c9a-bc2a-d2489bf89cda",
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"Normalized layer outputs:\n",
|
||
" tensor([[ 0.6159, 1.4126, -0.8719, 0.5872, -0.8719, -0.8719],\n",
|
||
" [-0.0189, 0.1121, -1.0876, 1.5173, 0.5647, -1.0876]],\n",
|
||
" grad_fn=<DivBackward0>)\n",
|
||
"Mean:\n",
|
||
" tensor([[ 0.0000],\n",
|
||
" [ 0.0000]], grad_fn=<MeanBackward1>)\n",
|
||
"Variance:\n",
|
||
" tensor([[1.],\n",
|
||
" [1.]], grad_fn=<VarBackward0>)\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"out_norm = (out - mean) / torch.sqrt(var)\n",
|
||
"print(\"Normalized layer outputs:\\n\", out_norm)\n",
|
||
"\n",
|
||
"mean = out_norm.mean(dim=-1, keepdim=True)\n",
|
||
"var = out_norm.var(dim=-1, keepdim=True)\n",
|
||
"print(\"Mean:\\n\", mean)\n",
|
||
"print(\"Variance:\\n\", var)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "ac62b90c-7156-4979-9a79-ce1fb92969c1",
|
||
"metadata": {},
|
||
"source": [
"- Each input is now centered at 0 and has unit variance; to improve readability, we can turn off PyTorch's scientific notation:"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 28,
|
||
"id": "3e06c34b-c68a-4b36-afbe-b30eda4eca39",
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"Mean:\n",
|
||
" tensor([[ 0.0000],\n",
|
||
" [ 0.0000]], grad_fn=<MeanBackward1>)\n",
|
||
"Variance:\n",
|
||
" tensor([[1.],\n",
|
||
" [1.]], grad_fn=<VarBackward0>)\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"torch.set_printoptions(sci_mode=False)\n",
|
||
"print(\"Mean:\\n\", mean)\n",
|
||
"print(\"Variance:\\n\", var)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "944fb958-d4ed-43cc-858d-00052bb6b31a",
|
||
"metadata": {},
|
||
"source": [
"- Above, we normalized the features of each input.\n",
"- Using the same idea, we can now implement a `LayerNorm` class:"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 29,
|
||
"id": "3333a305-aa3d-460a-bcce-b80662d464d9",
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
"class LayerNorm(nn.Module):\n",
"    def __init__(self, emb_dim):\n",
"        super().__init__()\n",
"        self.eps = 1e-5\n",
"        self.scale = nn.Parameter(torch.ones(emb_dim))\n",
"        self.shift = nn.Parameter(torch.zeros(emb_dim))\n",
"\n",
"    def forward(self, x):\n",
"        mean = x.mean(dim=-1, keepdim=True)\n",
"        var = x.var(dim=-1, keepdim=True, unbiased=False)\n",
"        norm_x = (x - mean) / torch.sqrt(var + self.eps)\n",
"        return self.scale * norm_x + self.shift"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "e56c3908-7544-4808-b8cb-5d0a55bcca72",
|
||
"metadata": {},
|
||
"source": [
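"**Scale and shift**\n",
"- Note that in addition to normalizing by subtracting the mean and dividing by the standard deviation (the square root of the variance), we added two trainable parameters, `scale` and `shift`.\n",
"- The initial `scale` (multiply by 1) and `shift` (add 0) values have no effect; however, both are trainable parameters that the LLM adjusts automatically during training if doing so improves its performance on the training task.\n",
"- This allows the model to learn the scaling and shifting that best suit the data it processes.\n",
"- Note that we also add a small value (`eps`) to the variance before taking its square root; this avoids a division by zero when the variance is 0.\n",
"\n",
"**Biased variance**\n",
"- In the variance calculation above, setting `unbiased=False` means the variance is computed as $\\frac{\\sum_i (x_i - \\bar{x})^2}{n}$, where n is the sample size (here, the number of features or columns); this formula omits Bessel's correction (which uses n-1 in the denominator) and therefore yields a biased estimate of the variance.\n",
"- Because LLM embedding dimensions are large, the difference between using n and n-1 (biased or unbiased) is negligible.\n",
"- However, GPT-2 was trained with a biased variance in its normalization layers, so we set `unbiased=False` to stay compatible with the pretrained weights that we load in later chapters.\n",
"\n",
"- Let's now try out `LayerNorm` in practice:"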
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 30,
|
||
"id": "23b1000a-e613-4b43-bd90-e54deed8d292",
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"ln = LayerNorm(emb_dim=5)\n",
|
||
"out_ln = ln(batch_example)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 31,
|
||
"id": "94c12de2-1cab-46e0-a099-e2e470353bff",
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"Mean:\n",
|
||
" tensor([[ -0.0000],\n",
|
||
" [ 0.0000]], grad_fn=<MeanBackward1>)\n",
|
||
"Variance:\n",
|
||
" tensor([[1.0000],\n",
|
||
" [1.0000]], grad_fn=<VarBackward0>)\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"mean = out_ln.mean(dim=-1, keepdim=True)\n",
|
||
"var = out_ln.var(dim=-1, unbiased=False, keepdim=True)\n",
|
||
"\n",
|
||
"print(\"Mean:\\n\", mean)\n",
|
||
"print(\"Variance:\\n\", var)"
|
||
]
|
||
},
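The normalization and the biased-variance note can also be checked by hand. The following is a minimal sketch in plain Python rather than PyTorch, assuming a single hypothetical 5-feature sample and leaving out the trainable `scale` and `shift` parameters; the function name `layer_norm` and the sample values are illustrative only.

```python
import math

def layer_norm(values, eps=1e-5, unbiased=False):
    """Normalize one sample (a list of features) to mean 0 and variance ~1."""
    n = len(values)
    mean = sum(values) / n
    squared_dev = sum((v - mean) ** 2 for v in values)
    # Biased variance divides by n; Bessel's correction divides by n - 1
    var = squared_dev / (n - 1 if unbiased else n)
    return [(v - mean) / math.sqrt(var + eps) for v in values]

sample = [0.2, -1.0, 0.5, 1.3, -0.4]  # hypothetical 5-feature input
normed = layer_norm(sample)

mean_after = sum(normed) / len(normed)
var_after = sum((v - mean_after) ** 2 for v in normed) / len(normed)
print(f"Mean after normalization: {mean_after:.4f}")
print(f"Biased variance after normalization: {var_after:.4f}")

# The two variance estimates differ by a factor of n / (n - 1)
for n in (5, 768):
    print(f"n = {n}: unbiased / biased = {n / (n - 1):.4f}")
```

The ratio n / (n - 1) is noticeable for 5 features but very close to 1 for 768, which is why the choice matters little at GPT-2's embedding size apart from compatibility with the pretrained weights.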
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "e136cfc4-7c89-492e-b120-758c272bca8c",
|
||
"metadata": {},
|
||
"source": [
|
||
"<img src=\"figures/overview-after-ln.webp\" width=350px>"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "11190e7d-8c29-4115-824a-e03702f9dd54",
|
||
"metadata": {},
|
||
"source": [
"## 4.3 Implementing a feed forward network with GELU activations"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "b0585dfb-f21e-40e5-973f-2f63ad5cb169",
|
||
"metadata": {},
|
||
"source": [
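"- In this section, we implement a small neural network submodule that is used as part of the transformer block in LLMs.\n",
"- We start with the activation function.\n",
"- In deep learning, ReLU (Rectified Linear Unit) activation functions are widely used because they are simple and effective across many neural network architectures.\n",
"- In LLMs, several other activation functions are used besides ReLU; two notable examples are GELU (Gaussian Error Linear Unit) and SwiGLU (Swish-Gated Linear Unit).\n",
"- GELU and SwiGLU are more complex, smooth activation functions that incorporate Gaussian and sigmoid-gated linear units, respectively, and often perform better in deep learning models than the simple piecewise linear ReLU."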
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "7d482ce7-e493-4bfc-a820-3ea99f564ebc",
|
||
"metadata": {},
|
||
"source": [
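"- GELU ([Hendrycks and Gimpel 2016](https://arxiv.org/abs/1606.08415)) can be implemented in several ways; the exact version is defined as $\\text{GELU}(x)=x\\cdot \\Phi(x)$, where $\\Phi(x)$ is the cumulative distribution function of the standard Gaussian distribution.\n",
"- In practice, a computationally cheaper approximation is common: $\\text{GELU}(x) \\approx 0.5 \\cdot x \\cdot \\left(1 + \\tanh\\left[\\sqrt{\\frac{2}{\\pi}} \\cdot \\left(x + 0.044715 \\cdot x^3\\right)\\right]\\right)$ (the original GPT-2 model was also trained with this approximation)."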
|
||
]
|
||
},
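To get a feel for how close the approximation is, the two GELU forms above can be compared numerically. The sketch below uses only Python's `math` module and assumes the standard identity that the Gaussian CDF equals 0.5 * (1 + erf(x / sqrt(2))); the function names `gelu_exact` and `gelu_tanh` and the evaluation grid are illustrative choices, not part of the notebook.

```python
import math

def gelu_exact(x):
    # x * Phi(x), with the standard normal CDF written via the error function
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    # Tanh-based approximation quoted above
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))

# Compare the two forms on an evenly spaced grid over [-3, 3]
grid = [-3.0 + 0.01 * i for i in range(601)]
max_gap = max(abs(gelu_exact(x) - gelu_tanh(x)) for x in grid)
print(f"Largest absolute gap on [-3, 3]: {max_gap:.6f}")

# Unlike ReLU, GELU returns a small negative value for negative inputs
print("GELU(-1.0) =", round(gelu_tanh(-1.0), 4), "vs ReLU(-1.0) =", max(0.0, -1.0))
```

On this grid the two curves stay very close to each other, so the cheaper form is a reasonable stand-in for the exact one.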
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 32,
|
||
"id": "f84694b7-95f3-4323-b6d6-0a73df278e82",
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
"class GELU(nn.Module):\n",
"    def __init__(self):\n",
"        super().__init__()\n",
"\n",
"    def forward(self, x):\n",
"        return 0.5 * x * (1 + torch.tanh(\n",
"            torch.sqrt(torch.tensor(2.0 / torch.pi)) *\n",
"            (x + 0.044715 * torch.pow(x, 3))\n",
"        ))"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 33,
|
||
"id": "fc5487d2-2576-4118-80a7-56c4caac2e71",
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"image/png": "",
|
||
"text/plain": [
|
||
"<Figure size 800x300 with 2 Axes>"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
}
|
||
],
|
||
"source": [
|
||
"import matplotlib.pyplot as plt\n",
|
||
"\n",
|
||
"gelu, relu = GELU(), nn.ReLU()\n",
|
||
"\n",
|
||
"# Some sample data\n",
|
||
"x = torch.linspace(-3, 3, 100)\n",
|
||
"y_gelu, y_relu = gelu(x), relu(x)\n",
|
||
"\n",
|
||
"plt.figure(figsize=(8, 3))\n",
"for i, (y, label) in enumerate(zip([y_gelu, y_relu], [\"GELU\", \"ReLU\"]), 1):\n",
"    plt.subplot(1, 2, i)\n",
"    plt.plot(x, y)\n",
"    plt.title(f\"{label} activation function\")\n",
"    plt.xlabel(\"x\")\n",
"    plt.ylabel(f\"{label}(x)\")\n",
"    plt.grid(True)\n",
|
||
"\n",
|
||
"plt.tight_layout()\n",
|
||
"plt.show()"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "1cd01662-14cb-43fd-bffd-2d702813de2d",
|
||
"metadata": {},
|
||
"source": [
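"- As we can see, ReLU is a piecewise linear function that outputs the input directly if it is positive; otherwise, it outputs zero.\n",
"- GELU is a smooth, nonlinear function that approximates ReLU but has a nonzero gradient for almost all negative inputs.\n",
"- Next, let's implement the small neural network module `FeedForward`, which we use later in the LLM's transformer block:"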
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 34,
|
||
"id": "9275c879-b148-4579-a107-86827ca14d4d",
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
"class FeedForward(nn.Module):\n",
"    def __init__(self, cfg):\n",
"        super().__init__()\n",
"        self.layers = nn.Sequential(\n",
"            nn.Linear(cfg[\"emb_dim\"], 4 * cfg[\"emb_dim\"]),\n",
"            GELU(),\n",
"            nn.Linear(4 * cfg[\"emb_dim\"], cfg[\"emb_dim\"]),\n",
"            nn.Dropout(cfg[\"drop_rate\"])\n",
"        )\n",
"\n",
"    def forward(self, x):\n",
"        return self.layers(x)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 35,
|
||
"id": "7c4976e2-0261-418e-b042-c5be98c2ccaf",
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"768\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"print(GPT_CONFIG_124M[\"emb_dim\"])"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "fdcaacfa-3cfc-4c9e-b668-b71a2753145a",
|
||
"metadata": {},
|
||
"source": [
|
||
"<img src=\"figures/ffn.webp\" width=350px>"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 36,
|
||
"id": "928e7f7c-d0b1-499f-8d07-4cadb428a6f9",
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"torch.Size([2, 3, 768])\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"ffn = FeedForward(GPT_CONFIG_124M)\n",
|
||
"\n",
|
||
"# input shape: [batch_size, num_token, emb_size]\n",
|
||
"x = torch.rand(2, 3, 768) \n",
|
||
"out = ffn(x)\n",
|
||
"print(out.shape)"
|
||
]
|
||
},
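The `FeedForward` module expands each token from 768 to 4 x 768 dimensions and then contracts it back, which is why the output shape matches the input shape. As a rough sketch in plain Python, its parameter count can be worked out by hand, relying on the fact that `nn.Linear` includes a bias vector by default; the variable names below are illustrative only.

```python
emb_dim = 768              # GPT_CONFIG_124M["emb_dim"]
hidden_dim = 4 * emb_dim   # expansion factor used in FeedForward

# A linear layer with bias holds in_features * out_features weights
# plus out_features bias terms
expand_params = emb_dim * hidden_dim + hidden_dim
contract_params = hidden_dim * emb_dim + emb_dim
ffn_params = expand_params + contract_params

print("Hidden dimension:", hidden_dim)
print(f"FeedForward parameters: {ffn_params:,}")
```

The GELU activation and the dropout layer add no trainable parameters, so the two linear layers account for the whole count.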
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "8f8756c5-6b04-443b-93d0-e555a316c377",
|
||
"metadata": {},
|
||
"source": [
|
||
"<img src=\"figures/mental-model-3.webp\" width=350px>"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "4ffcb905-53c7-4886-87d2-4464c5fecf89",
|
||
"metadata": {},
|
||
"source": [
"## 4.4 Adding shortcut connections"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "5161bf8c",
|
||
"metadata": {},
|
||
"source": [
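"- Next, let's look at shortcut connections, also called skip or residual connections.\n",
"- Shortcut connections were originally proposed for deep networks in computer vision (residual networks) to mitigate the vanishing gradient problem.\n",
"- A shortcut connection creates an alternative, shorter path for the gradient to flow through the network.\n",
"- This is achieved by adding the output of one layer to the output of a later layer, usually skipping one or more layers in between.\n",
"- Let's illustrate this idea with a small example network:\n",
"\n",
"<img src=\"figures/shortcut-example.webp\" width=350px>"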
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "14cfd241-a32e-4601-8790-784b82f2f23e",
|
||
"metadata": {},
|
||
"source": [
"- The example code is as follows:"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"id": "05473938-799c-49fd-86d4-8ed65f94fee6",
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
"class ExampleDeepNeuralNetwork(nn.Module):\n",
"    def __init__(self, layer_sizes, use_shortcut):\n",
"        super().__init__()\n",
"        self.use_shortcut = use_shortcut\n",
"        self.layers = nn.ModuleList([\n",
"            nn.Sequential(nn.Linear(layer_sizes[0], layer_sizes[1]), GELU()),\n",
"            nn.Sequential(nn.Linear(layer_sizes[1], layer_sizes[2]), GELU()),\n",
"            nn.Sequential(nn.Linear(layer_sizes[2], layer_sizes[3]), GELU()),\n",
"            nn.Sequential(nn.Linear(layer_sizes[3], layer_sizes[4]), GELU()),\n",
"            nn.Sequential(nn.Linear(layer_sizes[4], layer_sizes[5]), GELU())\n",
"        ])\n",
"\n",
"    def forward(self, x):\n",
"        for layer in self.layers:\n",
"            # Compute the output of the current layer\n",
"            layer_output = layer(x)\n",
"            # Apply the shortcut only if the shapes match\n",
"            if self.use_shortcut and x.size() == layer_output.size():\n",
"                x = x + layer_output\n",
"            else:\n",
"                x = layer_output\n",
"        return x\n",
"\n",
"\n",
"def print_gradients(model, x):\n",
"    # Forward pass\n",
"    output = model(x)\n",
"    target = torch.tensor([[0.]])\n",
"\n",
"    # Compute the loss from the gap between the output and the target\n",
"    loss_fn = nn.MSELoss()\n",
"    loss = loss_fn(output, target)\n",
"\n",
"    # Backward pass to compute the gradients\n",
"    loss.backward()\n",
"\n",
"    for name, param in model.named_parameters():\n",
"        if 'weight' in name:\n",
"            # Print the mean absolute gradient of the weights\n",
"            print(f\"{name} has gradient mean of {param.grad.abs().mean().item()}\")"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "b39bf277-b3db-4bb1-84ce-7a20caff1011",
|
||
"metadata": {},
|
||
"source": [
"- Let's first print the gradient values **without** shortcut connections:"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"id": "c75f43cc-6923-4018-b980-26023086572c",
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"layers.0.0.weight has gradient mean of 0.00020173587836325169\n",
|
||
"layers.1.0.weight has gradient mean of 0.0001201116101583466\n",
|
||
"layers.2.0.weight has gradient mean of 0.0007152041653171182\n",
|
||
"layers.3.0.weight has gradient mean of 0.001398873864673078\n",
|
||
"layers.4.0.weight has gradient mean of 0.005049646366387606\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"layer_sizes = [3, 3, 3, 3, 3, 1] \n",
|
||
"\n",
|
||
"sample_input = torch.tensor([[1., 0., -1.]])\n",
|
||
"\n",
|
||
"torch.manual_seed(123)\n",
|
||
"model_without_shortcut = ExampleDeepNeuralNetwork(\n",
"    layer_sizes, use_shortcut=False\n",
|
||
")\n",
|
||
"print_gradients(model_without_shortcut, sample_input)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "837fd5d4-7345-4663-97f5-38f19dfde621",
|
||
"metadata": {},
|
||
"source": [
"- Next, let's print the gradient values **with** shortcut connections:"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"id": "11b7c0c2-f9dd-4dd5-b096-a05c48c5f6d6",
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"layers.0.0.weight has gradient mean of 0.22169792652130127\n",
|
||
"layers.1.0.weight has gradient mean of 0.20694105327129364\n",
|
||
"layers.2.0.weight has gradient mean of 0.32896995544433594\n",
|
||
"layers.3.0.weight has gradient mean of 0.2665732502937317\n",
|
||
"layers.4.0.weight has gradient mean of 1.3258541822433472\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"torch.manual_seed(123)\n",
|
||
"model_with_shortcut = ExampleDeepNeuralNetwork(\n",
"    layer_sizes, use_shortcut=True\n",
|
||
")\n",
|
||
"print_gradients(model_with_shortcut, sample_input)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "b385c50b",
|
||
"metadata": {},
|
||
"source": [
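"- As the output above shows, shortcut connections prevent the gradients from vanishing in the early layers (towards `layers.0`).\n",
"- Next, we use shortcut connections when implementing the transformer block."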
|
||
]
|
||
},
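One way to see why the shortcut helps is to look at the chain rule on a deliberately simplified model. The sketch below is a toy, not the network above: it assumes every layer is the scalar map f(x) = w * x with a hypothetical weight `w = 0.1`, so each layer contributes a factor of w to the gradient without a shortcut and a factor of 1 + w with one.

```python
# Toy assumption: every layer is the scalar map f(x) = w * x,
# so its local derivative is simply w
w = 0.1
num_layers = 5

grad_plain = 1.0
grad_shortcut = 1.0
for _ in range(num_layers):
    # Without a shortcut, the chain rule multiplies by f'(x) = w
    grad_plain *= w
    # With a shortcut, the layer computes x + f(x), so the factor is 1 + w
    grad_shortcut *= 1.0 + w

print(f"Gradient factor without shortcuts: {grad_plain:.6f}")
print(f"Gradient factor with shortcuts: {grad_shortcut:.6f}")
```

The identity term keeps each factor near 1 instead of near 0, which matches the qualitative pattern in the printed gradients: tiny values in the early layers without shortcuts and much larger values with them.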
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "fd8a2072",
|
||
"metadata": {},
|
||
"source": [
"## 4.5 Connecting attention and linear layers in a transformer block"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "bc571b76",
|
||
"metadata": {},
|
||
"source": [
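"- In this section, we combine the previous concepts into a transformer block.\n",
"- A transformer block combines the causal multi-head attention module from the previous chapter with linear layers, namely the feed forward network we implemented in an earlier section.\n",
"- In addition, the transformer block uses dropout and shortcut connections."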
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 40,
|
||
"id": "0e1e8176-e5e3-4152-b1aa-0bbd7891dfd9",
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
"from previous_chapters import MultiHeadAttention\n",
"\n",
"\n",
"class TransformerBlock(nn.Module):\n",
"    def __init__(self, cfg):\n",
"        super().__init__()\n",
"        self.att = MultiHeadAttention(\n",
"            d_in=cfg[\"emb_dim\"],\n",
"            d_out=cfg[\"emb_dim\"],\n",
"            block_size=cfg[\"ctx_len\"],\n",
"            num_heads=cfg[\"n_heads\"],\n",
"            dropout=cfg[\"drop_rate\"],\n",
"            qkv_bias=cfg[\"qkv_bias\"])\n",
"        self.ff = FeedForward(cfg)\n",
"        self.norm1 = LayerNorm(cfg[\"emb_dim\"])\n",
"        self.norm2 = LayerNorm(cfg[\"emb_dim\"])\n",
"        self.drop_resid = nn.Dropout(cfg[\"drop_rate\"])\n",
"\n",
"    def forward(self, x):\n",
"        # Shortcut connection for the attention block\n",
"        shortcut = x\n",
"        x = self.norm1(x)\n",
"        x = self.att(x)   # Shape [batch_size, num_tokens, emb_size]\n",
"        x = self.drop_resid(x)\n",
"        x = x + shortcut  # Add the original input back\n",
"\n",
"        # Shortcut connection for the feed forward block\n",
"        shortcut = x\n",
"        x = self.norm2(x)\n",
"        x = self.ff(x)\n",
"        x = self.drop_resid(x)\n",
"        x = x + shortcut  # Add the original input back\n",
"\n",
"        return x"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "36b64d16-94a6-4d13-8c85-9494c50478a9",
|
||
"metadata": {},
|
||
"source": [
|
||
"<img src=\"figures/transformer-block.webp\" width=350px>"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "31d3dd26",
|
||
"metadata": {},
|
||
"source": [
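"- Suppose we have 2 input samples with 4 tokens each, where each token is a 768-dimensional embedding vector; the transformer block applies self-attention, followed by the feed forward network, and produces an output with the same shape as the input.\n",
"- We can think of this output as an augmented version of the context vectors discussed in the previous chapter."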
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 64,
|
||
"id": "3fb45a63-b1f3-4b08-b525-dafbc8228405",
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"Input shape: torch.Size([2, 4, 768])\n",
|
||
"Output shape: torch.Size([2, 4, 768])\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"torch.manual_seed(123)\n",
|
||
"\n",
|
||
"x = torch.rand(2, 4, 768) # Shape: [batch_size, num_tokens, emb_dim]\n",
|
||
"block = TransformerBlock(GPT_CONFIG_124M)\n",
|
||
"output = block(x)\n",
|
||
"\n",
|
||
"print(\"Input shape:\", x.shape)\n",
|
||
"print(\"Output shape:\", output.shape)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "91f502e4-f3e4-40cb-8268-179eec002394",
|
||
"metadata": {},
|
||
"source": [
|
||
"<img src=\"figures/mental-model-final.webp\" width=350px>"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "46618527-15ac-4c32-ad85-6cfea83e006e",
|
||
"metadata": {},
|
||
"source": [
"## 4.6 Coding the GPT model"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "b8a75745",
|
||
"metadata": {},
|
||
"source": [
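"- We are almost there: let's now plug the transformer block into the architecture we coded at the beginning of this chapter to obtain a usable GPT architecture.\n",
"- Note that the transformer block is repeated multiple times; in the smallest 124M GPT-2 model, it is repeated 12 times:"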
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "9b7b362d-f8c5-48d2-8ebd-722480ac5073",
|
||
"metadata": {},
|
||
"source": [
|
||
"<img src=\"figures/gpt.webp\" width=350px>"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "324e4b5d-ed89-4fdf-9a52-67deee0593bc",
|
||
"metadata": {},
|
||
"source": [
"- The corresponding code implementation, where `cfg[\"n_layers\"] = 12`:"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 43,
|
||
"id": "c61de39c-d03c-4a32-8b57-f49ac3834857",
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
"class GPTModel(nn.Module):\n",
"    def __init__(self, cfg):\n",
"        super().__init__()\n",
"        self.tok_emb = nn.Embedding(cfg[\"vocab_size\"], cfg[\"emb_dim\"])\n",
"        self.pos_emb = nn.Embedding(cfg[\"ctx_len\"], cfg[\"emb_dim\"])\n",
"        self.drop_emb = nn.Dropout(cfg[\"drop_rate\"])\n",
"\n",
"        self.trf_blocks = nn.Sequential(\n",
"            *[TransformerBlock(cfg) for _ in range(cfg[\"n_layers\"])])\n",
"\n",
"        self.final_norm = LayerNorm(cfg[\"emb_dim\"])\n",
"        self.out_head = nn.Linear(\n",
"            cfg[\"emb_dim\"], cfg[\"vocab_size\"], bias=False\n",
"        )\n",
"\n",
"    def forward(self, in_idx):\n",
"        batch_size, seq_len = in_idx.shape\n",
"        tok_embeds = self.tok_emb(in_idx)\n",
"        pos_embeds = self.pos_emb(torch.arange(seq_len, device=in_idx.device))\n",
"        x = tok_embeds + pos_embeds  # Shape [batch_size, num_tokens, emb_size]\n",
"        x = self.trf_blocks(x)\n",
"        x = self.final_norm(x)\n",
"        logits = self.out_head(x)\n",
"        return logits"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "86571328",
|
||
"metadata": {},
|
||
"source": [
"- Using the configuration of the 124M-parameter model, we can now instantiate this GPT model with randomly initialized weights as follows:"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 44,
|
||
"id": "252b78c2-4404-483b-84fe-a412e55c16fc",
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"Input batch:\n",
|
||
" tensor([[6109, 3626, 6100, 345],\n",
|
||
" [6109, 1110, 6622, 257]])\n",
|
||
"\n",
|
||
"Output shape: torch.Size([2, 4, 50257])\n",
|
||
"tensor([[[-0.0055, 0.3224, 0.2185, ..., 0.2539, 0.4578, -0.4747],\n",
|
||
" [ 0.2663, -0.2975, -0.5040, ..., -0.3903, 0.5328, -0.4224],\n",
|
||
" [ 1.1146, -0.0923, 0.1303, ..., 0.1521, -0.4494, 0.0276],\n",
|
||
" [-0.8239, 0.1174, -0.2566, ..., 1.1197, 0.1036, -0.3993]],\n",
|
||
"\n",
|
||
" [[-0.1027, 0.1752, -0.1048, ..., 0.2258, 0.1559, -0.8747],\n",
|
||
" [ 0.2230, 0.1246, 0.0492, ..., 0.8573, -0.2933, 0.3036],\n",
|
||
" [ 0.9409, 1.3068, -0.1610, ..., 0.8244, 0.1763, 0.0811],\n",
|
||
" [ 0.4395, 0.2753, 0.1540, ..., 1.3410, -0.3709, 0.1643]]],\n",
|
||
" grad_fn=<UnsafeViewBackward0>)\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"torch.manual_seed(123)\n",
|
||
"model = GPTModel(GPT_CONFIG_124M)\n",
|
||
"\n",
|
||
"out = model(batch)\n",
|
||
"print(\"Input batch:\\n\", batch)\n",
|
||
"print(\"\\nOutput shape:\", out.shape)\n",
|
||
"print(out)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "af09a24f",
|
||
"metadata": {},
|
||
"source": [
"- We will train this model in the next chapter.\n",
"- First, a quick note on model size: we referred to this as a 124M-parameter model, and we can check that number as follows:"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 45,
|
||
"id": "84fb8be4-9d3b-402b-b3da-86b663aac33a",
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"Total number of parameters: 163,009,536\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"total_params = sum(p.numel() for p in model.parameters())\n",
|
||
"print(f\"Total number of parameters: {total_params:,}\")"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "1160952b",
|
||
"metadata": {},
|
||
"source": [
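"- As we can see, this model has 163M parameters, not 124M. Why?\n",
"- The original GPT-2 paper uses weight tying: the token embedding layer (`tok_emb`) is reused as the output layer, i.e., `self.out_head.weight = self.tok_emb.weight`\n",
"- The token embedding layer projects the 50,257-dimensional one-hot encoding of each input token to a 768-dimensional embedding\n",
"- The output layer projects the 768-dimensional embedding back to a 50,257-dimensional representation so that we can convert it back into words (more on this in the next section)\n",
"- The embedding layer and the output layer therefore have the same number of weight parameters, as the shapes of their weight matrices show"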
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 46,
|
||
"id": "e3b43233-e9b8-4f5a-b72b-a263ec686982",
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"Token embedding layer shape: torch.Size([50257, 768])\n",
|
||
"Output layer shape: torch.Size([50257, 768])\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"print(\"Token embedding layer shape:\", model.tok_emb.weight.shape)\n",
|
||
"print(\"Output layer shape:\", model.out_head.weight.shape)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "029a0dc9",
|
||
"metadata": {},
|
||
"source": [
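"- In the original GPT-2 paper, the researchers reused the token embedding matrix as the output matrix\n",
"- So if we subtract the number of parameters in the output layer, we arrive at a 124M-parameter model:"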
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 47,
|
||
"id": "95a22e02-50d3-48b3-a4e0-d9863343c164",
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"Number of trainable parameters considering weight tying: 124,412,160\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"total_params_gpt2 = total_params - sum(p.numel() for p in model.out_head.parameters())\n",
|
||
"print(f\"Number of trainable parameters considering weight tying: {total_params_gpt2:,}\")"
|
||
]
|
||
},
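The two totals above can also be reproduced by hand from the layer shapes. The sketch below is a back-of-the-envelope count, not the notebook's own code: `count_gpt_params` is a hypothetical helper, and it assumes a context length of 1024, bias-free query/key/value projections, an output projection with bias, a feed-forward network that expands to 4 × `emb_dim`, and LayerNorm layers with one scale and one shift vector each. These details come from earlier sections of the chapter, so adjust them if your `TransformerBlock` differs.

```python
def count_gpt_params(vocab_size, ctx_len, emb_dim, n_layers, tie_weights=False):
    """Count the parameters of a GPT-2-style model from its layer shapes.

    Assumptions (not taken from the code shown above): bias-free Q/K/V
    projections, an attention output projection with bias, a feed-forward
    hidden size of 4 * emb_dim, and LayerNorm with scale + shift.
    """
    tok_emb = vocab_size * emb_dim                  # token embedding matrix
    pos_emb = ctx_len * emb_dim                     # positional embedding matrix

    attn = 3 * emb_dim * emb_dim                    # W_query, W_key, W_value (no bias)
    attn += emb_dim * emb_dim + emb_dim             # output projection + bias

    hidden = 4 * emb_dim
    ffn = (emb_dim * hidden + hidden) + (hidden * emb_dim + emb_dim)

    norms = 2 * (2 * emb_dim)                       # two LayerNorms per block
    block = attn + ffn + norms

    final_norm = 2 * emb_dim
    # With weight tying the output head shares the token embedding matrix
    out_head = 0 if tie_weights else emb_dim * vocab_size

    return tok_emb + pos_emb + n_layers * block + final_norm + out_head


untied = count_gpt_params(50257, 1024, 768, 12)
tied = count_gpt_params(50257, 1024, 768, 12, tie_weights=True)
print(f"Without weight tying: {untied:,}")  # 163,009,536
print(f"With weight tying:    {tied:,}")    # 124,412,160
```

Under these assumptions, the difference between the two numbers is exactly the 50,257 × 768 output matrix.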
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "db1e245d",
|
||
"metadata": {},
|
||
"source": [
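"- In practice, I found it easier to train the model without weight tying, which is why we did not implement it here.\n",
"- However, we will revisit and apply weight tying in chapter 6, when we load the pretrained weights.\n",
"- Lastly, we can compute the memory requirements of the model as follows, which is a useful reference point:"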
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 48,
|
||
"id": "5131a752-fab8-4d70-a600-e29870b33528",
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"Total size of the model: 621.83 MB\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
"# Calculate the total size in bytes (assuming float32, 4 bytes per parameter)\n",
"total_size_bytes = total_params * 4\n",
"\n",
"# Convert to megabytes (MB)\n",
"total_size_mb = total_size_bytes / (1024 * 1024)\n",
"\n",
"print(f\"Total size of the model: {total_size_mb:.2f} MB\")"
|
||
]
|
||
},
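The same arithmetic extends to other data types and to the weight-tied parameter count. The following is a minimal sketch in plain Python; `model_size_mb` and the `BYTES_PER_PARAM` table are hypothetical names introduced for illustration, and the two parameter counts are copied from the outputs above. It covers the parameters only, not gradients, optimizer state, or activations.

```python
# Bytes needed to store one parameter in each data type
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2}


def model_size_mb(n_params, dtype="float32"):
    """Return the storage needed for n_params parameters, in MB (1 MB = 1024 * 1024 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / (1024 * 1024)


# Parameter counts printed earlier in this section
param_counts = {"untied": 163_009_536, "tied": 124_412_160}

for name, n_params in param_counts.items():
    for dtype in BYTES_PER_PARAM:
        print(f"{name:>6} {dtype:>8}: {model_size_mb(n_params, dtype):.2f} MB")
```

For the untied float32 case this reproduces the 621.83 MB reported above; halving the bytes per parameter halves the size.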
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "309a3be4-c20a-4657-b4e0-77c97510b47c",
|
||
"metadata": {},
|
||
"source": [
"- Exercise: try implementing the following configurations, which are also described in the [GPT-2 paper](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf).\n",
"\n",
"    - **GPT2-small** (the 124M-parameter configuration we have already implemented):\n",
"        - \"emb_dim\" = 768\n",
"        - \"n_layers\" = 12\n",
"        - \"n_heads\" = 12\n",
"\n",
"    - **GPT2-medium:**\n",
"        - \"emb_dim\" = 1024\n",
"        - \"n_layers\" = 24\n",
"        - \"n_heads\" = 16\n",
"    \n",
"    - **GPT2-large:**\n",
"        - \"emb_dim\" = 1280\n",
"        - \"n_layers\" = 36\n",
"        - \"n_heads\" = 20\n",
"    \n",
"    - **GPT2-XL:**\n",
"        - \"emb_dim\" = 1600\n",
"        - \"n_layers\" = 48\n",
"        - \"n_heads\" = 25"
|
||
]
|
||
},
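One possible starting point for the exercise is to keep the shared settings in a base dictionary and override only the three values that differ per model. This is a sketch under assumptions: `BASE_CONFIG`, `MODEL_SIZES`, and `make_config` are hypothetical names, and the `ctx_len`, `drop_rate`, and `qkv_bias` values are assumed rather than taken from this section, so replace them with the values in your own `GPT_CONFIG_124M`.

```python
# Settings assumed to be shared by all model sizes
BASE_CONFIG = {
    "vocab_size": 50257,  # matches the output shape [2, 4, 50257] shown earlier
    "ctx_len": 1024,      # assumption: GPT-2 context length
    "drop_rate": 0.1,     # assumption
    "qkv_bias": False,    # assumption
}

# Per-model values from the exercise list above
MODEL_SIZES = {
    "gpt2-small":  {"emb_dim": 768,  "n_layers": 12, "n_heads": 12},
    "gpt2-medium": {"emb_dim": 1024, "n_layers": 24, "n_heads": 16},
    "gpt2-large":  {"emb_dim": 1280, "n_layers": 36, "n_heads": 20},
    "gpt2-xl":     {"emb_dim": 1600, "n_layers": 48, "n_heads": 25},
}


def make_config(name):
    """Merge the shared settings with the size-specific ones."""
    cfg = dict(BASE_CONFIG)
    cfg.update(MODEL_SIZES[name])
    # Multi-head attention splits emb_dim evenly across the heads
    if cfg["emb_dim"] % cfg["n_heads"] != 0:
        raise ValueError(f"emb_dim must be divisible by n_heads for {name}")
    return cfg


head_dims = {}
for name in MODEL_SIZES:
    cfg = make_config(name)
    head_dims[name] = cfg["emb_dim"] // cfg["n_heads"]
    print(f"{name}: {cfg['n_layers']} layers, head dimension {head_dims[name]}")
```

Each resulting dictionary can then be passed to `GPTModel` in place of `GPT_CONFIG_124M`. All four configurations use the same per-head dimension of 64.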
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "da5d9bc0-95ab-45d4-9378-417628d86e35",
|
||
"metadata": {},
|
||
"source": [
"## 4.7 Generating text"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "48da5deb-6ee0-4b9b-8dd2-abed7ed65172",
|
||
"metadata": {},
|
||
"source": [
"- LLMs such as the GPT model we implemented above generate text one word (token) at a time."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "caade12a-fe97-480f-939c-87d24044edff",
|
||
"metadata": {},
|
||
"source": [
|
||
"<img src=\"figures/iterative-gen.webp\" width=350px>"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "4d933457",
|
||
"metadata": {},
|
||
"source": [
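"- The `generate_text_simple` function below implements greedy decoding, a simple and fast way to generate text\n",
"- In greedy decoding, the model picks the word (or token) with the highest probability as its next output at every step (the largest logit corresponds to the highest probability, so we would not even need to compute the softmax function explicitly)\n",
"- In the next chapter, we will implement a more advanced `generate_text` function\n",
"- The figure below shows how the GPT model generates the next word token given an input context"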
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "7ee0f32c-c18c-445e-b294-a879de2aa187",
|
||
"metadata": {},
|
||
"source": [
|
||
"<img src=\"figures/generate-text.webp\" width=350px>"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 49,
|
||
"id": "c9b428a9-8764-4b36-80cd-7d4e00595ba6",
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
"def generate_text_simple(model, idx, max_new_tokens, context_size):\n",
"    # idx is a (batch, n_tokens) array of token indices in the current context\n",
"    for _ in range(max_new_tokens):\n",
"\n",
"        # Crop the current context if it exceeds the supported context size\n",
"        # E.g., if the LLM supports only 5 tokens and the context has 10,\n",
"        # then only the last 5 tokens are used as context\n",
"        idx_cond = idx[:, -context_size:]\n",
"        \n",
"        # Get the predictions\n",
"        with torch.no_grad():\n",
"            logits = model(idx_cond)\n",
"        \n",
"        # Focus only on the last time step\n",
"        # (batch, n_tokens, vocab_size) becomes (batch, vocab_size)\n",
"        logits = logits[:, -1, :]\n",
"\n",
"        # Apply softmax to get probabilities\n",
"        probas = torch.softmax(logits, dim=-1)  # (batch, vocab_size)\n",
"\n",
"        # Get the index of the vocabulary entry with the highest probability\n",
"        idx_next = torch.argmax(probas, dim=-1, keepdim=True)  # (batch, 1)\n",
"\n",
"        # Append the selected index to the running sequence\n",
"        idx = torch.cat((idx, idx_next), dim=1)  # (batch, n_tokens+1)\n",
"\n",
"    return idx"
|
||
]
|
||
},
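The function above applies softmax before taking the argmax, although the text notes that this step is not strictly needed for greedy decoding. The reason is that softmax is monotonic: it preserves the ordering of the logits, so the position of the maximum does not change. The sketch below checks this in plain Python; the `softmax` and `argmax` helpers are written out by hand for illustration and only mimic what `torch.softmax` and `torch.argmax` do for a single vector.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    max_logit = max(logits)                      # subtract the max to avoid overflow
    exps = [math.exp(x - max_logit) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    """Index of the largest element."""
    return max(range(len(values)), key=lambda i: values[i])


logits = [-1.4929, 4.4812, -1.6093]
probas = softmax(logits)

print("argmax of logits:", argmax(logits))
print("argmax of probas:", argmax(probas))
```

Both calls return index 1, so skipping softmax would select the same token. Computing the probabilities explicitly is still useful for inspection, and for sampling-based decoding.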
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "6515f2c1-3cc7-421c-8d58-cc2f563b7030",
|
||
"metadata": {},
|
||
"source": [
"- The `generate_text_simple` function above implements an iterative process that generates one token at a time.\n",
|
||
"\n",
|
||
"<img src=\"figures/iterative-generate.webp\" width=350px>"
|
||
]
|
||
},
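To see the iteration without a trained network, the loop can be mirrored in plain Python with a stand-in model. This is a toy sketch, not the notebook's implementation: `toy_model` is a hypothetical function that always favors the token after the last one in its context, and `generate_greedy` follows the same steps as `generate_text_simple` (crop the context, get logits, take the argmax, append) on a single unbatched list.

```python
def toy_model(context, vocab_size=5):
    """Hypothetical stand-in for an LLM: returns one logit per vocabulary entry,
    with the highest logit at (last token + 1) % vocab_size."""
    favored = (context[-1] + 1) % vocab_size
    return [1.0 if i == favored else 0.0 for i in range(vocab_size)]


def generate_greedy(model, idx, max_new_tokens, context_size):
    idx = list(idx)                              # copy so the caller's list is untouched
    context_lengths = []
    for _ in range(max_new_tokens):
        idx_cond = idx[-context_size:]           # keep only the last context_size tokens
        context_lengths.append(len(idx_cond))
        logits = model(idx_cond)                 # logits for the next token
        idx_next = max(range(len(logits)), key=lambda i: logits[i])  # greedy choice
        idx.append(idx_next)                     # extend the running sequence
    return idx, context_lengths


out, context_lengths = generate_greedy(toy_model, [0, 1], max_new_tokens=6, context_size=3)
print("Output:", out)                            # [0, 1, 2, 3, 4, 0, 1, 2]
print("Context lengths seen by the model:", context_lengths)
```

The sequence grows by one token per step, while the context passed to the model never exceeds `context_size`.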
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "f682eac4-f9bd-438b-9dec-6b1cc7bc05ce",
|
||
"metadata": {},
|
||
"source": [
"- Let's prepare an input example:"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 50,
|
||
"id": "bb3ffc8e-f95f-4a24-a978-939b8953ea3e",
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"tensor([-1.4929, 4.4812, -1.6093], grad_fn=<SliceBackward0>)\n"
|
||
]
|
||
},
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"tensor([ 0.0000, 0.0012, 0.0000, ..., 0.0000, 0.0000,\n",
|
||
" 0.0000], grad_fn=<SoftmaxBackward0>)"
|
||
]
|
||
},
|
||
"execution_count": 50,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"b = logits[0, -1, :]\n",
|
||
"b[0] = -1.4929\n",
|
||
"b[1] = 4.4812\n",
|
||
"b[2] = -1.6093\n",
|
||
"\n",
|
||
"print(b[:3])\n",
|
||
"torch.softmax(b, dim=0)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 51,
|
||
"id": "3d7e3e94-df0f-4c0f-a6a1-423f500ac1d3",
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"encoded: [15496, 11, 314, 716]\n",
|
||
"encoded_tensor.shape: torch.Size([1, 4])\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"start_context = \"Hello, I am\"\n",
|
||
"\n",
|
||
"encoded = tokenizer.encode(start_context)\n",
|
||
"print(\"encoded:\", encoded)\n",
|
||
"\n",
|
||
"encoded_tensor = torch.tensor(encoded).unsqueeze(0)\n",
|
||
"print(\"encoded_tensor.shape:\", encoded_tensor.shape)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 52,
|
||
"id": "a72a9b60-de66-44cf-b2f9-1e638934ada4",
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"Output: tensor([[15496, 11, 314, 716, 27018, 24086, 47843, 30961, 42348, 7267]])\n",
|
||
"Output length: 10\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
"model.eval()  # disable dropout\n",
"\n",
"out = generate_text_simple(\n",
"    model=model,\n",
"    idx=encoded_tensor,\n",
"    max_new_tokens=6,\n",
"    context_size=GPT_CONFIG_124M[\"ctx_len\"]\n",
")\n",
|
||
"\n",
|
||
"print(\"Output:\", out)\n",
|
||
"print(\"Output length:\", len(out[0]))"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "1d131c00-1787-44ba-bec3-7c145497b2c3",
|
||
"metadata": {},
|
||
"source": [
"- Remove the batch dimension and convert back to text:"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 53,
|
||
"id": "053d99f6-5710-4446-8d52-117fb34ea9f6",
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"Hello, I am Featureiman Byeswickattribute argue\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"decoded_text = tokenizer.decode(out.squeeze(0).tolist())\n",
|
||
"print(decoded_text)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "31806429",
|
||
"metadata": {},
|
||
"source": [
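"- Note that the model has not been trained yet; its weights are random, so the generated text above is gibberish\n",
"- We will train the model in the next chapter"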
|
||
]
|
||
}
|
||
],
|
||
"metadata": {
|
||
"kernelspec": {
|
||
"display_name": "Python 3 (ipykernel)",
|
||
"language": "python",
|
||
"name": "python3"
|
||
},
|
||
"language_info": {
|
||
"codemirror_mode": {
|
||
"name": "ipython",
|
||
"version": 3
|
||
},
|
||
"file_extension": ".py",
|
||
"mimetype": "text/x-python",
|
||
"name": "python",
|
||
"nbconvert_exporter": "python",
|
||
"pygments_lexer": "ipython3",
|
||
"version": "3.10.12"
|
||
}
|
||
},
|
||
"nbformat": 4,
|
||
"nbformat_minor": 5
|
||
}
|