mirror of
https://github.com/datawhalechina/llms-from-scratch-cn.git
synced 2026-05-01 11:58:17 +08:00
462 lines
15 KiB
Plaintext
462 lines
15 KiB
Plaintext
{
|
||
"cells": [
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "f1baec45",
|
||
"metadata": {},
|
||
"source": [
|
||
"## 4.2 使用层归一化对激活进行归一化"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "0fa08967",
|
||
"metadata": {},
|
||
"source": [
|
||
"\t\t由于梯度消失或爆炸等问题,训练具有多层的深度神经网络有时可能具有挑战性。这些问题导致训练动态不稳定,使网络难以有效调整其权重,这意味着学习过程很难为神经网络找到一组参数(权重),以最小化损失函数。换句话说,网络很难在一定程度上学习数据中的基本模式,从而无法做出准确的预测或决策。(如果您不熟悉神经网络训练和梯度概念,可以在附录 A 中的第 A.4 节 “轻松实现自动区分:PyTorch 简介”中找到这些概念的简要介绍。但是,不需要对梯度有深入的数学理解才能遵循本书的内容。"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "f53ecf2b",
|
||
"metadata": {},
|
||
"source": [
|
||
"\t\t在本节中,我们将实现层归一化,以提高神经网络训练的稳定性和效率。"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "5fe24ad6",
|
||
"metadata": {},
|
||
"source": [
|
||
"\t\t层归一化背后的主要思想是调整神经网络层的激活(输出),使其平均值为 0,方差为 1,也称为单位方差。这种调整加快了向有效重量的收敛速度,并确保了一致、可靠的训练。正如我们在上一节中看到的,基于 DummyLayerNorm 占位符,在 GPT-2 和现代 transformer 架构中,层归一化通常在多头注意力模块之前和之后以及最终输出层之前应用。"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "9459b0b0",
|
||
"metadata": {},
|
||
"source": [
|
||
"\t\t在代码中实现层归一化之前,图 4.5 直观地概述了层归一化的工作原理。"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "1101131d",
|
||
"metadata": {},
|
||
"source": [
|
||
"图 4.5 层归一化的图示,其中 5 层输出(也称为激活)被归一化,使其均值为零,方差为 1。"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "f4338794",
|
||
"metadata": {},
|
||
"source": [
|
||
""
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "43a1ba1c",
|
||
"metadata": {},
|
||
"source": [
|
||
"\t\t我们可以通过以下代码重新创建图 4.5 中所示的示例,其中我们实现了一个具有 5 个输入和 6 个输出的神经网络层,我们将其应用于两个输入示例:"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"id": "b69a76c2",
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"torch.manual_seed(123)\n",
|
||
"batch_example = torch.randn(2, 5) #A\n",
|
||
"layer = nn.Sequential(nn.Linear(5, 6), nn.ReLU())\n",
|
||
"out = layer(batch_example)\n",
|
||
"print(out)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "d7901a3c",
|
||
"metadata": {},
|
||
"source": [
|
||
"\t\t这将打印以下张量,其中第一行列出了第一个输入的层输出,第二行列出了第二行的层输出:"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"id": "8170493d",
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"tensor([[0.2260, 0.3470, 0.0000, 0.2216, 0.0000, 0.0000],\n",
|
||
" [0.2133, 0.2394, 0.0000, 0.5198, 0.3297, 0.0000]],\n",
|
||
" grad_fn=<ReluBackward0>)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "ae2b8b15",
|
||
"metadata": {},
|
||
"source": [
|
||
"\t\t我们编码的神经网络层由一个线性层和一个非线性激活函数 ReLU(Rectified Linear Unit 的缩写)组成,它是神经网络中的标准激活函数。如果您不熟悉 ReLU,它只需将负输入阈值设置为 0,确保图层仅输出正值,这就解释了为什么生成的图层输出不包含任何负值。(请注意,我们将在 GPT 中使用另一个更复杂的激活函数,我们将在下一节中介绍)。"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "7f4219cc",
|
||
"metadata": {},
|
||
"source": [
|
||
"\t\t在对这些输出应用层归一化之前,让我们检查均值和方差:"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"id": "e14c73d6",
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"mean = out.mean(dim=-1, keepdim=True)\n",
|
||
"var = out.var(dim=-1, keepdim=True)\n",
|
||
"print(\"Mean:\\n\", mean)\n",
|
||
"print(\"Variance:\\n\", var)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "fd30826f",
|
||
"metadata": {},
|
||
"source": [
|
||
"\t\t输出如下:"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"id": "4a6c93ac",
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"Mean:\n",
|
||
" tensor([[0.1324],\n",
|
||
" [0.2170]], grad_fn=<MeanBackward1>)\n",
|
||
"Variance:\n",
|
||
" tensor([[0.0231],\n",
|
||
" [0.0398]], grad_fn=<VarBackward0>)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "cb269d67",
|
||
"metadata": {},
|
||
"source": [
|
||
"\t\t上面平均张量中的第一行包含第一输入行的平均值,第二输出行包含第二输入行的平均值。"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "1a2aad91",
|
||
"metadata": {},
|
||
"source": [
|
||
"\t\t在均值或方差计算等操作中使用 keepdim=True 可确保输出张量保持与输入张量相同的形状,即使该操作沿 dim 指定的维度减少张量也是如此。例如,如果没有 keepdim=True,则返回的平均张量将是二维向量 [0.1324, 0.2170],而不是二维矩阵 [[0.1324], [0.2170]]。"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "da5c4a90",
|
||
"metadata": {},
|
||
"source": [
|
||
"\t\tdim 参数指定在张量中计算统计数据(此处为均值或方差)的维度,如图 4.6 所示。"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "fd68fe9b",
|
||
"metadata": {},
|
||
"source": [
|
||
"图 4.6 计算张量均值时的 dim 参数图示。例如,如果我们有一个维度为 [行、列] 的 2D 张量(矩阵),则使用 dim=0 将跨行(垂直,如底部所示)执行操作,从而产生聚合每列数据的输出。使用 dim=1 或 dim=-1 将跨列执行操作(水平,如顶部所示),从而生成聚合每行数据的输出。"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "bb7a5e20",
|
||
"metadata": {},
|
||
"source": [
|
||
""
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "eeb12e38",
|
||
"metadata": {},
|
||
"source": [
|
||
"\t\t如图 4.6 所示,对于二维张量(如矩阵),使用 dim=-1 进行均值或方差计算等运算与使用 dim=1 相同。这是因为 -1 指的是张量的最后一个维度,它对应于 2D 张量中的列。之后,当向 GPT 模型添加层归一化时,该模型生成形状为 [batch_size、num_tokens、embedding_size] 的 3D 张量,我们仍然可以使用 dim=-1 进行最后一个维度的归一化,避免从 dim=1 更改为 dim=2。"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "5325ca63",
|
||
"metadata": {},
|
||
"source": [
|
||
"\t\t接下来,让我们将层归一化应用于我们之前获得的层输出。该操作包括减去均值并除以方差的平方根(也称为标准差):"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"id": "66e13187",
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"out_norm = (out - mean) / torch.sqrt(var)\n",
|
||
"mean = out_norm.mean(dim=-1, keepdim=True)\n",
|
||
"var = out_norm.var(dim=-1, keepdim=True)\n",
|
||
"print(\"Normalized layer outputs:\\n\", out_norm)\n",
|
||
"print(\"Mean:\\n\", mean)\n",
|
||
"print(\"Variance:\\n\", var)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "637af25d",
|
||
"metadata": {},
|
||
"source": [
|
||
"\t\t根据结果,我们可以看到,归一化层输出(现在也包含负值)的平均值为零,方差为 1:"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"id": "7ea812b6",
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"Normalized layer outputs:\n",
|
||
" tensor([[ 0.6159, 1.4126, -0.8719, 0.5872, -0.8719, -0.8719],\n",
|
||
" [-0.0189, 0.1121, -1.0876, 1.5173, 0.5647, -1.0876]],\n",
|
||
" grad_fn=<DivBackward0>)\n",
|
||
"Mean:\n",
|
||
" tensor([[2.9802e-08],\n",
|
||
" [3.9736e-08]], grad_fn=<MeanBackward1>)\n",
|
||
"Variance:\n",
|
||
" tensor([[1.],\n",
|
||
" [1.]], grad_fn=<VarBackward0>)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "3dfd9a15",
|
||
"metadata": {},
|
||
"source": [
|
||
"\t\t请注意,输出张量中的值 2.9802e-08 是 2.9802 × 10-8 的科学记数法,即十进制形式的 0.00000000298。该值非常接近 0,但由于计算机表示数字的精度有限,可能会累积较小的数值误差,因此它并不完全是 0。"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "0c577985",
|
||
"metadata": {},
|
||
"source": [
|
||
"\t\t为了提高可读性,我们还可以通过将 sci_mode 设置为 False 来关闭打印张量值时的科学记数法:"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"id": "9e739ce9",
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"torch.set_printoptions(sci_mode=False)\n",
|
||
"print(\"Mean:\\n\", mean)\n",
|
||
"print(\"Variance:\\n\", var)\n",
|
||
"Mean:\n",
|
||
" tensor([[ 0.0000],\n",
|
||
" [ 0.0000]], grad_fn=<MeanBackward1>)\n",
|
||
"Variance:\n",
|
||
" tensor([[1.],\n",
|
||
" [1.]], grad_fn=<VarBackward0>)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "8810265c",
|
||
"metadata": {},
|
||
"source": [
|
||
"\t\t到目前为止,在本节中,我们已经分步编码和应用了层归一化。现在让我们将这个过程封装在一个 PyTorch 模块中,稍后可以在 GPT 模型中使用:"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "773faacb",
|
||
"metadata": {},
|
||
"source": [
|
||
"**Listing 4.2 A 层归一化类**"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"id": "98c3e02f",
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"class LayerNorm(nn.Module):\n",
|
||
"\tdef __init__(self, emb_dim):\n",
|
||
"\t\tsuper().__init__()\n",
|
||
"\t\tself.eps = 1e-5\n",
|
||
"\t\tself.scale = nn.Parameter(torch.ones(emb_dim))\n",
|
||
"\t\tself.shift = nn.Parameter(torch.zeros(emb_dim))\n",
|
||
"\tdef forward(self, x):\n",
|
||
"\t\tmean = x.mean(dim=-1, keepdim=True)\n",
|
||
"\t\tvar = x.var(dim=-1, keepdim=True, unbiased=False)\n",
|
||
"\t\tnorm_x = (x - mean) / torch.sqrt(var + self.eps)\n",
|
||
"\t\treturn self.scale * norm_x + self.shift"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "20827f45",
|
||
"metadata": {},
|
||
"source": [
|
||
"\t\t层归一化的这种特定实现在输入张量 x 的最后一个维度上运行,该维度表示嵌入维度 (emb_dim)。变量 eps 是添加到方差中的一个小常数 (epsilon),以防止在归一化过程中除以零。scale 和 shift 是两个可训练的参数(与输入的维度相同),如果确定这样做会提高模型在其训练任务中的性能,则 LLM 会在训练期间自动调整这些参数。这使模型能够学习最适合其正在处理的数据的适当缩放和移位。"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "d8ea7351",
|
||
"metadata": {},
|
||
"source": [
|
||
"**偏差方差**"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "c8de0ee3",
|
||
"metadata": {},
|
||
"source": [
|
||
"\t\t在我们的方差计算方法中,我们通过设置 unbiased=False 来选择实现细节。对于那些对这意味着什么感到好奇的人,在方差计算中,我们除以方差公式中的输入数 n。这种方法不应用贝塞尔校正,贝塞尔校正通常使用分母中的 n-1 而不是 n 来调整样本方差估计中的偏差。这一决定导致了所谓的偏差估计。对于大规模语言模型 (LLM),其中嵌入维度 n 非常大,使用 n 和 n-1 之间的差异几乎可以忽略不计。我们选择这种方法是为了确保与 GPT-2 模型的归一化层兼容,并且因为它反映了 TensorFlow 的默认行为,该行为用于实现原始 GPT-2 模型。使用类似的设置可确保我们的方法与我们将在第 6 章中加载的预训练权重兼容。"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "4336e78c",
|
||
"metadata": {},
|
||
"source": [
|
||
"\t\t现在让我们在实践中尝试 LayerNorm 模块并将其应用于批处理输入:"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"id": "836af1ff",
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"ln = LayerNorm(emb_dim=5)\n",
|
||
"out_ln = ln(batch_example)\n",
|
||
"mean = out_ln.mean(dim=-1, keepdim=True)\n",
|
||
"var = out_ln.var(dim=-1, unbiased=False, keepdim=True)\n",
|
||
"print(\"Mean:\\n\", mean)\n",
|
||
"print(\"Variance:\\n\", var)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "82a8d756",
|
||
"metadata": {},
|
||
"source": [
|
||
"\t\t根据结果,我们可以看到,层归一化代码按预期工作,并归一化两个输入中每个输入的值,使它们的均值为 0,方差为 1:"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"id": "65693732",
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"Mean:\n",
|
||
" tensor([[ -0.0000],\n",
|
||
" [ 0.0000]], grad_fn=<MeanBackward1>)\n",
|
||
"Variance:\n",
|
||
" tensor([[1.0000],\n",
|
||
" [1.0000]], grad_fn=<VarBackward0>)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "0c085356",
|
||
"metadata": {},
|
||
"source": [
|
||
"\t\t在本节中,我们介绍了实现 GPT 架构所需的构建块之一,如图 4.7 中的心智模型所示。"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "3d2a88bb",
|
||
"metadata": {},
|
||
"source": [
|
||
"图 4.7 一个心智模型,列出了我们在本章中实现的不同构建块,用于组装 GPT 架构。"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "27dcaf81",
|
||
"metadata": {},
|
||
"source": [
|
||
""
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "8ea0e0ee",
|
||
"metadata": {},
|
||
"source": [
|
||
"\t\t在下一节中,我们将研究 GELU 激活函数,它是 LLM 中使用的激活函数之一,而不是我们在本节中使用的传统 ReLU 函数。"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "ca7881c5",
|
||
"metadata": {},
|
||
"source": [
|
||
"**层归一化与批量归一化**"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "77c57101",
|
||
"metadata": {},
|
||
"source": [
|
||
"\t\t如果你熟悉批量归一化(一种常见且传统的神经网络归一化方法),您可能想知道它与层归一化相比如何。与跨批次维度归一化的批量归一化不同,图层归一化将跨要素维度归一化。LLM 通常需要大量的计算资源,可用的硬件或特定用例可以决定训练或推理期间的批处理大小。由于层归一化独立于批处理大小对每个输入进行归一化,因此在这些场景中提供了更大的灵活性和稳定性。这对于分布式训练或在资源受限的环境中部署模型时特别有用。"
|
||
]
|
||
}
|
||
],
|
||
"metadata": {
|
||
"kernelspec": {
|
||
"display_name": "Python 3 (ipykernel)",
|
||
"language": "python",
|
||
"name": "python3"
|
||
},
|
||
"language_info": {
|
||
"codemirror_mode": {
|
||
"name": "ipython",
|
||
"version": 3
|
||
},
|
||
"file_extension": ".py",
|
||
"mimetype": "text/x-python",
|
||
"name": "python",
|
||
"nbconvert_exporter": "python",
|
||
"pygments_lexer": "ipython3",
|
||
"version": "3.10.9"
|
||
}
|
||
},
|
||
"nbformat": 4,
|
||
"nbformat_minor": 5
|
||
}
|