diff --git a/README.md b/README.md index 256574e..46c9bf4 100644 --- a/README.md +++ b/README.md @@ -1,6 +1,19 @@ +
+ # 动手实现LLM中文版 -GitHub上的"rasbt/LLMs-from-scratch"项目是一个关于如何从头开始实现类似ChatGPT的大语言模型(LLM)的教程。这个项目包含了编码、预训练和微调GPT-like LLM的代码,并且是《Build a Large Language Model (From Scratch)》这本书的官方代码库。书中详细介绍了LLM的内部工作原理,并逐步指导读者创建自己的LLM,包括每个阶段的清晰文本、图表和示例。这种方法用于训练和开发自己的小型但功能性的模型,用于教育目的,与创建大型基础模型(如ChatGPT背后的模型)的方法相似,翻译后的版本可以服务于国内的开发者。 +# LLMs From Scratch: Hands-on Building Your Own Large Language Models + +
+ + +[![GitHub stars](https://img.shields.io/github/stars/datawhalechina/llms-from-scratch-cn.svg?style=social)](https://github.com/datawhalechina/llms-from-scratch-cn) +[![GitHub forks](https://img.shields.io/github/forks/datawhalechina/llms-from-scratch-cn.svg?style=social)](https://github.com/datawhalechina/llms-from-scratch-cn) +[![GitHub issues](https://img.shields.io/github/issues/datawhalechina/llms-from-scratch-cn.svg)](https://github.com/datawhalechina/llms-from-scratch-cn/issues) +[![Code License](https://img.shields.io/badge/Code%20License-Apache_2.0-brightgreen.svg)](https://github.com/datawhalechina/llms-from-scratch-cn/blob/main/LICENSE.txt) + + +🤗GitHub上的"rasbt/LLMs-from-scratch"项目是一个关于如何从头开始实现类似ChatGPT的大语言模型(LLM)的教程。这个项目包含了编码、预训练和微调GPT-like LLM的代码,并且是《Build a Large Language Model (From Scratch)》这本书的官方代码库。书中详细介绍了LLM的内部工作原理,并逐步指导读者创建自己的LLM,包括每个阶段的清晰文本、图表和示例。这种方法用于训练和开发自己的小型但功能性的模型,用于教育目的,与创建大型基础模型(如ChatGPT背后的模型)的方法相似,翻译后的版本可以服务于国内的开发者。🎉 | 章节标题 | 主要代码(快速访问) | 所有代码 + 补充 | |------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------|-------------------------------| diff --git a/ch03/01_main-chapter-code/ch03.ipynb b/ch03/01_main-chapter-code/ch03.ipynb index 9782019..a08fba0 100644 --- a/ch03/01_main-chapter-code/ch03.ipynb +++ b/ch03/01_main-chapter-code/ch03.ipynb @@ -2,23 +2,23 @@ "cells": [ { "cell_type": "markdown", - "source": [ - "# 第三章:编写注意力机制" - ], + "id": "27d5425deb10849c", "metadata": { "collapsed": false }, - "id": "27d5425deb10849c" + "source": [ + "# 第三章:编写注意力机制" + ] }, { "cell_type": "markdown", - "source": [ - " 在这个notebook中使用的包有:" - ], + "id": "755ce6dff684c41", "metadata": { "collapsed": false }, - "id": "755ce6dff684c41" + "source": [ + " 在这个notebook中使用的包有:" + ] }, { "cell_type": "code", @@ -35,7 +35,7 @@ "name": "stdout", "output_type": "stream", "text": [ - "torch version: 2.1.0\n" + "torch version: 2.1.1\n" ] } ], @@ -48,33 +48,33 @@ }, { "cell_type": "markdown", + "id": "1d0475ea32ec926b", + "metadata": { + "collapsed": false + }, "source": [ " ## 3.1 长序列建模的问题" - ], - "metadata": { - "collapsed": false - }, - "id": "1d0475ea32ec926b" + ] }, { "cell_type": "markdown", + "id": "929f224b96fb1a27", + "metadata": { + "collapsed": false + }, "source": [ "- 这个部分没有代码。" - ], - "metadata": { - "collapsed": false - }, - "id": "929f224b96fb1a27" + ] }, { "cell_type": "markdown", - "source": [ - " ## 3.2 使用注意力机制捕获数据依赖性" - ], + "id": "81f7a179e9cf96c7", "metadata": { "collapsed": false }, - "id": "81f7a179e9cf96c7" + "source": [ + " ## 3.2 使用注意力机制捕获数据依赖性" + ] }, { "cell_type": "markdown", @@ -86,26 +86,30 @@ }, { "cell_type": "markdown", + "id": "4a6fc49dd41e1c19", + "metadata": { + "collapsed": false + }, "source": [ " ## 3.3 使用自注意力关注输入的不同部分" - ], - "metadata": { - "collapsed": false - }, - "id": "4a6fc49dd41e1c19" + ] }, { "cell_type": "markdown", + "id": "c2bf1532f595d316", + "metadata": { + "collapsed": false + }, "source": [ " ### 3.3.1 一个简单的自注意力机制,不包含可训练权重" - ], - "metadata": { - "collapsed": false - }, - "id": "c2bf1532f595d316" + ] }, { "cell_type": "markdown", + "id": "66cdd1ec5345c45e", + "metadata": { + "collapsed": false + }, "source": [ " - 本部分介绍了一个极其简化的自注意力机制版本,它不包含任何可训练的权重。这只是为了说明概念,并不是实际在 transformers 模型中使用的注意力机制。接下来的3.3.2节将扩展这个简单的注意力机制,介绍真正的自注意力机制。\n", "- 假设我们有一个输入序列,从 $x^{(1)}$ 到 $x^{(T)}$。\n", @@ -118,23 +122,19 @@ " - 为了具体说明,我们不是用占位符 $z^{(i)}$,而是考虑第二个输出的上下文向量,$z^{(2)}$。\n", " - 第二个上下文向量 $z^{(2)}$ 是对所有输入 $x^{(1)}$ 到 $x^{(T)}$ 
的加权平均,权重是根据第二个输入元素 $x^{(2)}$ 来确定的。这些注意力权重决定了在计算 $z^{(2)}$ 时,每个输入元素对最终加权平均的贡献程度。\n", " - 简而言之,可以把 $z^{(2)}$ 看作是 $x^{(2)}$ 的一个变体,它不仅包含了 $x^{(2)}$ 的信息,还融合了与当前任务相关的所有其他输入元素的信息。" - ], - "metadata": { - "collapsed": false - }, - "id": "66cdd1ec5345c45e" + ] }, { "cell_type": "markdown", + "id": "e89766c8b4d562a1", + "metadata": { + "collapsed": false + }, "source": [ "- 按照惯例,未经归一化的注意力权重被称为**“注意力分数”**,而归一化后的注意力分数(它们的和为1)被称为**“注意力权重”**。\n", "\n", "- 注意力权重和上下文向量的计算总结在下面的图表中:" - ], - "metadata": { - "collapsed": false - }, - "id": "e89766c8b4d562a1" + ] }, { "cell_type": "markdown", @@ -146,6 +146,10 @@ }, { "cell_type": "markdown", + "id": "12dfcc8a4890c11f", + "metadata": { + "collapsed": false + }, "source": [ " - 下面的代码逐步展示了上面图表的内容。\n", "\n", @@ -160,11 +164,7 @@ " - $\\omega_{2T} = x^{(T)} \\cdot q^{(2)\\top}$\n", " - 在这里,$\\omega$ 是希腊字母 \"omega\",用来表示未归一化的注意力分数。\n", " - $\\omega_{21}$ 中的下标 \"21\" 表示输入序列的第2个元素被用作查询,与输入序列的第1个元素进行比较。" - ], - "metadata": { - "collapsed": false - }, - "id": "12dfcc8a4890c11f" + ] }, { "cell_type": "markdown", @@ -176,13 +176,13 @@ }, { "cell_type": "markdown", - "source": [ - "- 假设我们有以下已经转换成3维向量的输入句子,如第三章所述(为了说明方便,这里使用了一个非常小的嵌入维度,以便在不换行的情况下适应页面):" - ], + "id": "f0c28811e45031fd", "metadata": { "collapsed": false }, - "id": "f0c28811e45031fd" + "source": [ + "- 假设我们有以下已经转换成3维向量的输入句子,如第三章所述(为了说明方便,这里使用了一个非常小的嵌入维度,以便在不换行的情况下适应页面):" + ] }, { "cell_type": "code", @@ -210,22 +210,22 @@ }, { "cell_type": "markdown", + "id": "9c98cfa901b290ae", + "metadata": { + "collapsed": false + }, "source": [ "- 我们以输入序列中的第二个元素 $x^{(2)}$ 为例,来计算上下文向量 $z^{(2)}$;在后面的部分,我们将推广这个方法来计算所有的上下文向量。\n", "- 第一步是计算未归一化的注意力分数,通过计算查询 $x^{(2)}$ 与所有其他输入标记之间的点积来实现:" - ], - "metadata": { - "collapsed": false - }, - "id": "9c98cfa901b290ae" + ] }, { "cell_type": "markdown", - "source": [], + "id": "1540227deed6d1da", "metadata": { "collapsed": false }, - "id": "1540227deed6d1da" + "source": [] }, { "cell_type": "code", @@ -265,21 +265,21 @@ }, { "cell_type": "markdown", - "source": [ - "- 注:点积实际上是一种简写,它表示将两个向量的对应元素相乘,然后将这些乘积相加求和:" - ], + "id": "ebbb77f59671f0aa", "metadata": { "collapsed": false }, - "id": "ebbb77f59671f0aa" + "source": [ + "- 注:点积实际上是一种简写,它表示将两个向量的对应元素相乘,然后将这些乘积相加求和:" + ] }, { "cell_type": "markdown", - "source": [], + "id": "5ef00c65e10fddb4", "metadata": { "collapsed": false }, - "id": "5ef00c65e10fddb4" + "source": [] }, { "cell_type": "code", @@ -319,22 +319,22 @@ }, { "cell_type": "markdown", + "id": "389d1ba5c3db582b", + "metadata": { + "collapsed": false + }, "source": [ "- **第二步:** 对未归一化的注意力分数(称为“omegas”,用希腊字母 $\\omega$ 表示)进行归一化处理,使得它们的总和等于1。\n", "- 这里有一个简单的方法来归一化这些未归一化的注意力分数,以确保它们的总和为1(这是一个常用的做法,有助于理解,并且对训练过程的稳定性至关重要):" - ], - "metadata": { - "collapsed": false - }, - "id": "389d1ba5c3db582b" + ] }, { "cell_type": "markdown", - "source": [], + "id": "ffd31c47c7645e04", "metadata": { "collapsed": false }, - "id": "ffd31c47c7645e04" + "source": [] }, { "cell_type": "code", @@ -368,14 +368,14 @@ }, { "cell_type": "markdown", - "source": [ - "- 然而,在实际应用中,使用softmax函数进行归一化更为常见且推荐,因为它更擅长处理极端值,在训练过程中具有更理想的梯度特性。\n", - "- 以下是一个简单的softmax函数实现,它用于缩放,同时归一化向量元素,使得它们的总和为1:" - ], + "id": "f085c4759c607872", "metadata": { "collapsed": false }, - "id": "f085c4759c607872" + "source": [ + "- 然而,在实际应用中,使用softmax函数进行归一化更为常见且推荐,因为它更擅长处理极端值,在训练过程中具有更理想的梯度特性。\n", + "- 以下是一个简单的softmax函数实现,它用于缩放,同时归一化向量元素,使得它们的总和为1:" + ] }, { "cell_type": "code", @@ -418,14 +418,14 @@ }, { "cell_type": "markdown", - "source": [ - "- 上面的简单实现可能会因为输入值过大或过小而导致数值不稳定问题,这主要是因为数值溢出和下溢的问题。\n", - "- 因此,在实际应用中,建议使用 PyTorch 提供的 
`softmax` 函数实现,它经过高度优化,性能更优:" - ], + "id": "10bbc97e55cdbd1c", "metadata": { "collapsed": false }, - "id": "10bbc97e55cdbd1c" + "source": [ + "- 上面的简单实现可能会因为输入值过大或过小而导致数值不稳定问题,这主要是因为数值溢出和下溢的问题。\n", + "- 因此,在实际应用中,建议使用 PyTorch 提供的 `softmax` 函数实现,它经过高度优化,性能更优:" + ] }, { "cell_type": "code", @@ -460,13 +460,13 @@ }, { "cell_type": "markdown", - "source": [ - " - **第三步**:通过将嵌入的输入标记 $x^{(i)}$ 与注意力权重相乘,然后将得到的结果向量相加,来计算上下文向量 $z^{(2)}$:" - ], + "id": "26834a0afec960d6", "metadata": { "collapsed": false }, - "id": "26834a0afec960d6" + "source": [ + " - **第三步**:通过将嵌入的输入标记 $x^{(i)}$ 与注意力权重相乘,然后将得到的结果向量相加,来计算上下文向量 $z^{(2)}$:" + ] }, { "cell_type": "code", @@ -505,26 +505,26 @@ }, { "cell_type": "markdown", - "source": [ - "### 3.3.2 计算所有输入标记的注意力权重" - ], + "id": "16b7a0e40c6d8d08", "metadata": { "collapsed": false }, - "id": "16b7a0e40c6d8d08" + "source": [ + "### 3.3.2 计算所有输入标记的注意力权重" + ] }, { "cell_type": "markdown", + "id": "5bfcbe08825a085b", + "metadata": { + "collapsed": false + }, "source": [ "#### 推广到所有输入序列标记:\n", "\n", "- 在上面的内容中,我们计算了输入2的注意力权重和上下文向量(如下面图表中高亮行所示)。\n", "- 接下来,我们将推广这个计算过程,以计算所有输入标记的注意力权重和上下文向量。" - ], - "metadata": { - "collapsed": false - }, - "id": "5bfcbe08825a085b" + ] }, { "cell_type": "markdown", @@ -536,13 +536,13 @@ }, { "cell_type": "markdown", - "source": [ - " - 应用之前的**第一步**,对所有成对的元素进行计算,以得到未归一化的注意力分数矩阵:" - ], + "id": "a8dcd2e858df4af2", "metadata": { "collapsed": false }, - "id": "a8dcd2e858df4af2" + "source": [ + " - 应用之前的**第一步**,对所有成对的元素进行计算,以得到未归一化的注意力分数矩阵:" + ] }, { "cell_type": "code", @@ -585,13 +585,13 @@ }, { "cell_type": "markdown", - "source": [ - "- 我们可以通过矩阵乘法更有效地实现上面的计算:" - ], + "id": "a4a64a8236579473", "metadata": { "collapsed": false }, - "id": "a4a64a8236579473" + "source": [ + "- 我们可以通过矩阵乘法更有效地实现上面的计算:" + ] }, { "cell_type": "code", @@ -628,13 +628,13 @@ }, { "cell_type": "markdown", - "source": [ - "- 与之前的**第二步**类似,我们对每一行进行归一化,以使每一行的值之和为1:" - ], + "id": "277f7ce6c43bf3af", "metadata": { "collapsed": false }, - "id": "277f7ce6c43bf3af" + "source": [ + "- 与之前的**第二步**类似,我们对每一行进行归一化,以使每一行的值之和为1:" + ] }, { "cell_type": "code", @@ -667,13 +667,13 @@ }, { "cell_type": "markdown", - "source": [ - "- 快速验证每一行的值确实之和为1:" - ], + "id": "cd1207b1f9b38e9c", "metadata": { "collapsed": false }, - "id": "cd1207b1f9b38e9c" + "source": [ + "- 快速验证每一行的值确实之和为1:" + ] }, { "cell_type": "code", @@ -708,13 +708,13 @@ }, { "cell_type": "markdown", - "source": [ - "- 应用之前的**第三步**,计算所有上下文向量:" - ], + "id": "2e9e3585324e487a", "metadata": { "collapsed": false }, - "id": "2e9e3585324e487a" + "source": [ + "- 应用之前的**第三步**,计算所有上下文向量:" + ] }, { "cell_type": "code", @@ -747,13 +747,13 @@ }, { "cell_type": "markdown", - "source": [ - "- 作为合理性检查,之前计算的上下文向量 $z^{(2)} = [0.4419, 0.6515, 0.5683]$ 可以在上面的第二行找到:" - ], + "id": "298b13b5bb3d62d1", "metadata": { "collapsed": false }, - "id": "298b13b5bb3d62d1" + "source": [ + "- 作为合理性检查,之前计算的上下文向量 $z^{(2)} = [0.4419, 0.6515, 0.5683]$ 可以在上面的第二行找到:" + ] }, { "cell_type": "code", @@ -1315,7 +1315,7 @@ "id": "c5025b37-0f2c-4a67-a7cb-1286af7026ab", "metadata": {}, "source": [ - "## 3.5 Hiding future words with causal attention" + "## 3.5 遮蔽下文信息的注意力机制" ] }, { @@ -1323,7 +1323,7 @@ "id": "82f405de-cd86-4e72-8f3c-9ea0354946ba", "metadata": {}, "source": [ - "### 3.5.1 Applying a causal attention mask" + "### 3.5.1 使用因果注意力掩码" ] }, { @@ -1331,10 +1331,11 @@ "id": "014f28d0-8218-48e4-8b9c-bdc5ce489218", "metadata": {}, "source": [ - "- In this section, we are converting the previous self-attention mechanism into a causal self-attention mechanism.\n", - "- 
Causal self-attention ensures that the model's prediction for a certain position in a sequence is only dependent on the known outputs at previous positions, not on future positions.\n", - "- In simpler words, this ensures that each next word prediction should only depend on the preceding words.\n", - "- To achieve this, for each given token, we mask out the future tokens (the ones that come after the current token in the input text):" + "- 在本节中,我们将前面的自注意力机制转换为因果自注意力机制。\n", + "\n", + "- 因果自注意力机制的核心目标是,确保模型对序列中某个位置的预测只依赖于前面位置的已知输出(也就是上文),而不依赖于未来位置(也就是下文)。也就是说,确保每一个词的预测只应该依赖于前面的词。\n", + "\n", + "- 为了实现这一点,对于每个给定的词,我们屏蔽掉未来的词(即输入文本中在当前词之后的词)。" ] }, { @@ -1350,7 +1351,7 @@ "id": "cbfaec7a-68f2-4157-a4b5-2aeceed199d9", "metadata": {}, "source": [ - "- To illustrate and implement causal self-attention, let's work with the attention scores and weights from the previous section: " + "- 为了说明和实现因果自注意力机制,让我们使用上一节的注意力分数和权重进行操作。" ] }, { @@ -1374,12 +1375,11 @@ } ], "source": [ - "# Reuse the query and key weight matrices of the\n", - "# SelfAttention_v2 object from the previous section for convenience\n", + "# 使用上一节中 SelfAttention_V2 的 query 和 key 的权重矩阵\n", "queries = sa_v2.W_query(inputs)\n", "keys = sa_v2.W_key(inputs) \n", "attn_scores = queries @ keys.T\n", - "\n", + "# 此处的注意力权重和上一节中的一致\n", "attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=1)\n", "print(attn_weights)" ] @@ -1389,7 +1389,7 @@ "id": "89020a96-b34d-41f8-9349-98c3e23fd5d6", "metadata": {}, "source": [ - "- The simplest way to mask out future attention weights is by creating a mask via PyTorch's tril function with elements below the main diagonal (including the diagonal itself) set to 1 and above the main diagonal set to 0:" + "- 屏蔽未来的注意力权重最简单的方法是通过 PyTorch 的 tril 函数创建一个掩码,主对角线(包括对角线本身)以下的元素设置为 1,主对角线以上的元素设置为 0:" ] }, { @@ -1412,7 +1412,9 @@ } ], "source": [ + "# 我们创建的掩码形状应该和注意力权重矩阵的形状一致,以一一对应\n", "block_size = attn_scores.shape[0]\n", + "# tril 方法会创建一个下三角矩阵\n", "mask_simple = torch.tril(torch.ones(block_size, block_size))\n", "print(mask_simple)" ] @@ -1422,7 +1424,7 @@ "id": "efce2b08-3583-44da-b3fc-cabdd38761f6", "metadata": {}, "source": [ - "- Then, we can multiply the attention weights with this mask to zero out the attention scores above the diagonal:" + "- 然后,我们可以将注意力权重与这个掩码相乘,从而将对角线以上的注意力分数归零:" ] }, { @@ -1455,15 +1457,11 @@ "id": "3eb35787-cf12-4024-b66d-e7215e175500", "metadata": {}, "source": [ - "- However, if the mask were applied after softmax, like above, it would disrupt the probability distribution created by softmax. Softmax ensures that all output values sum to 1. Masking after softmax would require re-normalizing the outputs to sum to 1 again, which complicates the process and might lead to unintended effects." 
- ] - }, - { - "cell_type": "markdown", - "id": "94db92d7-c397-4e42-bd8a-6a2b3e237e0f", - "metadata": {}, - "source": [ - "- To make sure that the rows sum to 1, we can normalize the attention weights as follows:" + "- 然而,如果像上文一样,在 softmax 之后再应用掩码,它会破坏 softmax 创建的概率分布。Softmax将确保所有输出值的总和为 1,但由于我们将部分输出值置为了 0,这将导致输出值总和发生变化。\n", + "\n", + "- 因此,在 softmax 之后进行掩码处理将需要重新对输出进行归一化,使其总和再次为 1。但是,这使得过程变得复杂,并可能导致意想不到的效果。\n", + "\n", + "- 为了确保输出值的总和为 1,我们可以将权重矩阵进行如下的归一化:" ] }, { @@ -1487,6 +1485,7 @@ } ], "source": [ + "# dim = 1 表示按行求和\n", "row_sums = masked_simple.sum(dim=1, keepdim=True)\n", "masked_simple_norm = masked_simple / row_sums\n", "print(masked_simple_norm)" @@ -1497,8 +1496,9 @@ "id": "512e7cf4-dc0e-4cec-948e-c7a3c4eb6877", "metadata": {}, "source": [ - "- While we are technically done with coding the causal attention mechanism now, let's briefly look at a more efficient approach to achieve the same as above.\n", - "- So, instead of zeroing out attention weights above the diagonal and renormalizing the results, we can mask the unnormalized attention scores above the diagonal with negative infinity before they enter the softmax function:" + "- 尽管我们现在在技术上已经完成了因果注意力机制,但还有一些实现上述相同效果的更有效的方法。\n", + "\n", + "- 例如,我们可以在未归一化的注意力分数进入 softmax 函数之前,用负无穷大掩盖对角线以上的部分,而不是将对角线以上的注意力权重归零并重新归一化结果。" ] }, { @@ -1522,6 +1522,7 @@ } ], "source": [ + "# 也就是说,通过将掩码从 0 修改为 -inf,可以将遮蔽操作提到 softmax 之前\n", "mask = torch.triu(torch.ones(block_size, block_size), diagonal=1)\n", "masked = attn_scores.masked_fill(mask.bool(), -torch.inf)\n", "print(masked)" @@ -1532,7 +1533,7 @@ "id": "91d5f803-d735-4543-b9da-00ac10fb9c50", "metadata": {}, "source": [ - "- As we can see below, now the attention weights in each row correctly sum to 1 again:" + "- 正如我们所见,接下来我们再让注意力矩阵通过 softmax,就可以将每行之和都重新变回 1:" ] }, { @@ -1565,7 +1566,7 @@ "id": "7636fc5f-6bc6-461e-ac6a-99ec8e3c0912", "metadata": {}, "source": [ - "### 3.5.2 Masking additional attention weights with dropout" + "### 3.5.2 通过 dropout 来实现额外注意力权重的掩码" ] }, { @@ -1573,13 +1574,15 @@ "id": "ec3dc7ee-6539-4fab-804a-8f31a890c85a", "metadata": {}, "source": [ - "- In addition, we also apply dropout to reduce overfitting during training.\n", - "- Dropout can be applied in several places:\n", - " - for example, after computing the attention weights;\n", - " - or after multiplying the attention weights with the value vectors.\n", - "- Here, we will apply the dropout mask after computing the attention weights because it's more common.\n", + "- 此外,我们还可以在训练过程中应用 dropout 来减少过拟合。\n", "\n", - "- Furthermore, in this specific example, we use a dropout rate of 50%, which means randomly masking out half of the attention weights. (When we train the GPT model later, we will use a lower dropout rate, such as 0.1 or 0.2.)" + "- dropout 可以应用于例如下列例子的多个地方:\n", + " - 计算注意力权重后;\n", + " - 将注意力权重与值向量相乘后。\n", + "\n", + "- 在这里,我们将在计算注意力权重后应用 dropout 掩码,因为这种情况更为常见。\n", + "\n", + "- 此外,在这个特定的例子中,我们使用了 50% 的 dropout 率,这意味着随机屏蔽一半的注意力权重。(当我们稍后训练 GPT 模型时,我们将使用较低的 dropout 率,例如 0.1 或 0.2。)" ] }, { @@ -1595,12 +1598,12 @@ "id": "5a575458-a6da-4e54-8688-83e155f2de06", "metadata": {}, "source": [ - "- If we apply a dropout rate of 0.5 (50%), the non-dropped values will be scaled accordingly by a factor of 1/0.5 = 2." 
+ "- 注意,如果我们应用 0.5 的 dropout 率,那么未被屏蔽的值将按照 1/0.5 = 2 的比例进行相应缩放。" ] }, { "cell_type": "code", - "execution_count": 31, + "execution_count": 30, "id": "0de578db-8289-41d6-b377-ef645751e33f", "metadata": {}, "outputs": [ @@ -1608,26 +1611,27 @@ "name": "stdout", "output_type": "stream", "text": [ - "tensor([[2., 2., 0., 2., 2., 0.],\n", - " [0., 0., 0., 2., 0., 2.],\n", - " [2., 2., 2., 2., 0., 2.],\n", - " [0., 2., 2., 0., 0., 2.],\n", - " [0., 2., 0., 2., 0., 2.],\n", - " [0., 2., 2., 2., 2., 0.]])\n" + "tensor([[2., 2., 2., 2., 2., 2.],\n", + " [0., 2., 0., 0., 0., 0.],\n", + " [0., 0., 2., 0., 2., 0.],\n", + " [2., 2., 0., 0., 0., 2.],\n", + " [2., 0., 0., 0., 0., 2.],\n", + " [0., 2., 0., 0., 0., 0.]])\n" ] } ], "source": [ + "# 随便设置一个随机数种子\n", "torch.manual_seed(123)\n", - "dropout = torch.nn.Dropout(0.5) # dropout rate of 50%\n", - "example = torch.ones(6, 6) # create a matrix of ones\n", + "dropout = torch.nn.Dropout(0.5) # 设置 50% 的 Dropout 比例\n", + "example = torch.ones(6, 6) # 创建一个全 1 矩阵作为示例\n", "\n", "print(dropout(example))" ] }, { "cell_type": "code", - "execution_count": 32, + "execution_count": 31, "id": "b16c5edb-942b-458c-8e95-25e4e355381e", "metadata": {}, "outputs": [ @@ -1635,18 +1639,19 @@ "name": "stdout", "output_type": "stream", "text": [ - "tensor([[2.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],\n", - " [0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],\n", - " [0.7599, 0.6194, 0.6206, 0.0000, 0.0000, 0.0000],\n", - " [0.0000, 0.4921, 0.4925, 0.0000, 0.0000, 0.0000],\n", - " [0.0000, 0.3966, 0.0000, 0.3775, 0.0000, 0.0000],\n", - " [0.0000, 0.3327, 0.3331, 0.3084, 0.3331, 0.0000]],\n", + "tensor([[0.3843, 0.3293, 0.3303, 0.3100, 0.3442, 0.3019],\n", + " [0.0000, 0.3318, 0.0000, 0.0000, 0.0000, 0.0000],\n", + " [0.0000, 0.0000, 0.3325, 0.0000, 0.3328, 0.0000],\n", + " [0.3738, 0.3334, 0.0000, 0.0000, 0.0000, 0.3128],\n", + " [0.3661, 0.0000, 0.0000, 0.0000, 0.0000, 0.3169],\n", + " [0.0000, 0.3327, 0.0000, 0.0000, 0.0000, 0.0000]],\n", " grad_fn=)\n" ] } ], "source": [ "torch.manual_seed(123)\n", + "# 对注意力权重进行 dropout\n", "print(dropout(attn_weights))" ] }, @@ -1655,7 +1660,7 @@ "id": "cdc14639-5f0f-4840-aa9d-8eb36ea90fb7", "metadata": {}, "source": [ - "### 3.5.3 Implementing a compact causal self-attention class" + "## 3.5.3 实现一个因果自注意类" ] }, { @@ -1663,14 +1668,16 @@ "id": "09c41d29-1933-43dc-ada6-2dbb56287204", "metadata": {}, "source": [ - "- Now, we are ready to implement a working implementation of self-attention, including the causal and dropout masks. 
\n", - "- One more thing is to implement the code to handle batches consisting of more than one input so that our `CausalAttention` class supports the batch outputs produced by the data loader we implemented in chapter 2.\n", - "- For simplicity, to simulate such batch input, we duplicate the input text example:" + "- 现在,我们已经准备好实现一个包含 dropout 的因果自注意力类。\n", + "\n", + "- 我们还需要实现处理由多个输入组成的一批样本的代码,以便我们的 CausalAttention 类支持我们在第2章中实现的 dataloader 产生的批量输出。\n", + "\n", + "- 为了简化,为了模拟这样的批量输入,我们复制输入文本示例:" ] }, { "cell_type": "code", - "execution_count": 33, + "execution_count": 32, "id": "977a5fa7-a9d5-4e2e-8a32-8e0331ccfe28", "metadata": {}, "outputs": [ @@ -1684,12 +1691,12 @@ ], "source": [ "batch = torch.stack((inputs, inputs), dim=0)\n", - "print(batch.shape) # 2 inputs with 6 tokens each, and each token has embedding dimension 3" + "print(batch.shape) # 2个输入,每个输入有 6个 token,每个 token 的维度为 3" ] }, { "cell_type": "code", - "execution_count": 34, + "execution_count": 33, "id": "60d8c2eb-2d8e-4d2c-99bc-9eef8cc53ca0", "metadata": {}, "outputs": [ @@ -1715,32 +1722,51 @@ } ], "source": [ + "# 定义一个带 dropout 的因果自注意力层\n", "class CausalAttention(nn.Module):\n", "\n", " def __init__(self, d_in, d_out, block_size, dropout, qkv_bias=False):\n", + " '''\n", + " 构造函数,输入参数如下:\n", + " d_in: 输入的维度\n", + " d_out: 输出的维度\n", + " block_size: 注意力权重矩阵的大小\n", + " dropout: dropout 比例\n", + " qkv_bias: 是否对 query、key 和 value 加偏置\n", + " '''\n", " super().__init__()\n", " self.d_out = d_out\n", + " # 根据前文,每一个权重矩阵都是 d_in x d_out 的线性层\n", " self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)\n", " self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)\n", " self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)\n", - " self.dropout = nn.Dropout(dropout) # New\n", + " # 一个 dropout 层\n", + " self.dropout = nn.Dropout(dropout) \n", + " # 一个掩码矩阵,下三角为 1,其余为 0\n", " self.register_buffer('mask', torch.triu(torch.ones(block_size, block_size), diagonal=1)) # New\n", "\n", " def forward(self, x):\n", - " b, num_tokens, d_in = x.shape # New batch dimension b\n", + " '''\n", + " 前向传播函数,输入参数为 x,维度为 b x num_tokens x d_in,输出维度为 b x num_tokens x d_out\n", + " '''\n", + " b, num_tokens, d_in = x.shape\n", " keys = self.W_key(x)\n", " queries = self.W_query(x)\n", " values = self.W_value(x)\n", - "\n", - " attn_scores = queries @ keys.transpose(1, 2) # Changed transpose\n", + " # transpose 是为了实现矩阵乘法\n", + " attn_scores = queries @ keys.transpose(1, 2)\n", + " # 即上文说过的,将掩码从 0 修改为 -inf,再进行遮蔽操作\n", " attn_scores.masked_fill_( # New, _ ops are in-place\n", - " self.mask.bool()[:num_tokens, :num_tokens], -torch.inf) \n", + " self.mask.bool()[:num_tokens, :num_tokens], -torch.inf)\n", + " # 经过 softmax \n", " attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=1)\n", + " # 进行 dropout\n", " attn_weights = self.dropout(attn_weights) # New\n", - "\n", + " # 得到最后结果\n", " context_vec = attn_weights @ values\n", " return context_vec\n", "\n", + "# 实验一下\n", "torch.manual_seed(123)\n", "\n", "block_size = batch.shape[1]\n", @@ -1757,7 +1783,7 @@ "id": "c4333d12-17e4-4bb5-9d83-54b3a32618cd", "metadata": {}, "source": [ - "- Note that dropout is only applied during training, not during inference." 
+ "- 注意 dropout 只在训练阶段被使用,在推理阶段是不使用的" ] }, { @@ -1765,7 +1791,7 @@ "id": "c8bef90f-cfd4-4289-b0e8-6a00dc9be44c", "metadata": {}, "source": [ - "## 3.6 Extending single-head attention to multi-head attention" + "## 3.6 将单头注意力扩展到多头" ] }, { @@ -1773,7 +1799,7 @@ "id": "11697757-9198-4a1c-9cee-f450d8bbd3b9", "metadata": {}, "source": [ - "### 3.6.1 Stacking multiple single-head attention layers" + "### 3.6.1 直接将多个单头注意力层堆积起来" ] }, { @@ -1781,22 +1807,22 @@ "id": "70766faf-cd53-41d9-8a17-f1b229756a5a", "metadata": {}, "source": [ - "- Below is a summary of the self-attention implemented previously (causal and dropout masks not shown for simplicity).\n", + "- 下图是之前提到过的自注意力的总结(为了简便起见,因果注意力掩码和 dropout 并没有展示) \n", "\n", - "- This is also called single-head attention:\n", + "- 也被称之为单头注意力:\n", "\n", "\n", "\n", - "- We simply stack multiple single-head attention modules to obtain a multi-head attention module:\n", + "- 我们可以简单地将多个单头注意力层堆积在一起实现多头注意力层:\n", "\n", "\n", "\n", - "- The main idea behind multi-head attention is to run the attention mechanism multiple times (in parallel) with different, learned linear projections. This allows the model to jointly attend to information from different representation subspaces at different positions." + "- 多头注意力机制的主要思想是使用不同的、已学习的权重矩阵,多次(并行)运行注意力机制。这使得模型能够在不同位置的不同表示子空间中联合关注信息。" ] }, { "cell_type": "code", - "execution_count": 35, + "execution_count": 34, "id": "b9a66e11-7105-4bb4-be84-041f1a1f3bd2", "metadata": {}, "outputs": [ @@ -1822,22 +1848,26 @@ } ], "source": [ + "# 定义一个多头注意力层\n", "class MultiHeadAttentionWrapper(nn.Module):\n", "\n", " def __init__(self, d_in, d_out, block_size, dropout, num_heads, qkv_bias=False):\n", " super().__init__()\n", + " # 将 num_heads 个单头注意力层组合在一起来实现多头\n", " self.heads = nn.ModuleList(\n", " [CausalAttention(d_in, d_out, block_size, dropout, qkv_bias) \n", " for _ in range(num_heads)]\n", " )\n", "\n", " def forward(self, x):\n", + " # 前向计算时将多个头的输出拼接在一起\n", " return torch.cat([head(x) for head in self.heads], dim=-1)\n", "\n", "\n", + "# 实验一下\n", "torch.manual_seed(123)\n", "\n", - "block_size = batch.shape[1] # This is the number of tokens\n", + "block_size = batch.shape[1] # token 数量\n", "d_in, d_out = 3, 2\n", "mha = MultiHeadAttentionWrapper(d_in, d_out, block_size, 0.0, num_heads=2)\n", "\n", @@ -1852,9 +1882,9 @@ "id": "193d3d2b-2578-40ba-b791-ea2d49328e48", "metadata": {}, "source": [ - "- In the implementation above, the embedding dimension is 4, because we `d_out=2` as the embedding dimension for the key, query, and value vectors as well as the context vector. 
And since we have 2 attention heads, we have the output embedding dimension 2*2=4.\n", + "- 在上面的实现中,嵌入维度是4,因为我们为 key、query、value 都设置了 d_out=2 作为嵌入维度。由于我们有2个注意力头,因此输出嵌入维度为 2*2=4。\n", "\n", - "- If we want to have an output dimension of 2, as earlier in single-head attention, we can have to change the projection dimension `d_out` to 1:" + "- 如果我们想要输出维度为2,就像早期的单头注意力那样,我们可以将投影维度 d_out 更改为1:" ] }, { @@ -1901,7 +1931,7 @@ "id": "6836b5da-ef82-4b4c-bda1-72a462e48d4e", "metadata": {}, "source": [ - "### 3.6.2 Implementing multi-head attention with weight splits" + "### 3.6.2 通过权重分割实现多头注意力" ] }, { @@ -1909,14 +1939,14 @@ "id": "f4b48d0d-71ba-4fa0-b714-ca80cabcb6f7", "metadata": {}, "source": [ - "- While the above is an intuitive and fully functional implementation of multi-head attention (wrapping the single-head attention `CausalAttention` implementation from earlier), we can write a stand-alone class called `MultiHeadAttention` to achieve the same.\n", + "- 尽管上述是多头注意力最直观且功能完整的实现(将早期的单头注意力 CausalAttention 实现封装在内),但我们也可以编写一个名为MultiHeadAttention 的独立类来实现相同的功能。\n", "\n", - "- We don't concatenate single attention heads for this stand-alone `MultiHeadAttention` class. Instead, we create single W_query, W_key, and W_value weight matrices and then split those into individual matrices for each attention head:" + "- 对于这个独立的 MultiHeadAttention 类,我们不会将单个注意力头连接在一起。相反,我们创建单个的 W_query、W_key 和 W_value 权重矩阵,然后将它们拆分为每个注意力头的独立矩阵:" ] }, { "cell_type": "code", - "execution_count": 37, + "execution_count": 35, "id": "110b0188-6e9e-4e56-a988-10523c6c8538", "metadata": {}, "outputs": [ @@ -1945,11 +1975,13 @@ "class MultiHeadAttention(nn.Module):\n", " def __init__(self, d_in, d_out, block_size, dropout, num_heads, qkv_bias=False):\n", " super().__init__()\n", + " # 因为要对权重矩阵按注意力头数进行拆分,所有输出维度必须是头数的整数倍\n", " assert d_out % num_heads == 0, \"d_out must be divisible by n_heads\"\n", "\n", " self.d_out = d_out\n", " self.num_heads = num_heads\n", - " self.head_dim = d_out // num_heads # Reduce the projection dim to match desired output dim\n", + " # head_dim 就是拆分之后每个头应该输出的维度\n", + " self.head_dim = d_out // num_heads \n", "\n", " self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)\n", " self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)\n", @@ -1961,42 +1993,45 @@ " def forward(self, x):\n", " b, num_tokens, d_in = x.shape\n", "\n", - " keys = self.W_key(x) # Shape: (b, num_tokens, d_out)\n", + " # 形状为 (b, num_tokens, d_out)\n", + " keys = self.W_key(x)\n", " queries = self.W_query(x)\n", " values = self.W_value(x)\n", "\n", - " # We implicitly split the matrix by adding a `num_heads` dimension\n", - " # Unroll last dim: (b, num_tokens, d_out) -> (b, num_tokens, num_heads, head_dim)\n", + " # 我们可以通过增加一个 num_heads 的维度来将矩阵分割到每个头\n", + " # 维度变化: (b, num_tokens, d_out) -> (b, num_tokens, num_heads, head_dim)\n", " keys = keys.view(b, num_tokens, self.num_heads, self.head_dim) \n", " values = values.view(b, num_tokens, self.num_heads, self.head_dim)\n", " queries = queries.view(b, num_tokens, self.num_heads, self.head_dim)\n", "\n", - " # Transpose: (b, num_tokens, num_heads, head_dim) -> (b, num_heads, num_tokens, head_dim)\n", + " # 转置一下: (b, num_tokens, num_heads, head_dim) -> (b, num_heads, num_tokens, head_dim)\n", " keys = keys.transpose(1, 2)\n", " queries = queries.transpose(1, 2)\n", " values = values.transpose(1, 2)\n", "\n", - " # Compute scaled dot-product attention (aka self-attention) with a causal mask\n", - " attn_scores = queries @ keys.transpose(2, 3) # Dot product for each head\n", - " # Original mask 
truncated to the number of tokens and converted to boolean\n", + " # 计算注意力权重\n", + " # 基于矩阵乘法,简单地实现各个头的并行计算\n", + " attn_scores = queries @ keys.transpose(2, 3) \n", + " # 一般来说我们会将掩码矩阵转化为 bool 值并基于序列的长度进行截断\n", " mask_bool = self.mask.bool()[:num_tokens, :num_tokens]\n", - " # Unsqueeze the mask twice to match dimensions\n", + " # 需要将掩码矩阵 unsqueeze 两次,也就是增加两个维度,才能让掩码矩阵的维度和注意力权重对应上\n", " mask_unsqueezed = mask_bool.unsqueeze(0).unsqueeze(0)\n", - " # Use the unsqueezed mask to fill attention scores\n", + " # 使用掩码矩阵来进行遮蔽\n", " attn_scores.masked_fill_(mask_unsqueezed, -torch.inf)\n", " \n", " attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)\n", " attn_weights = self.dropout(attn_weights)\n", "\n", - " # Shape: (b, num_tokens, num_heads, head_dim)\n", + " # 形状: (b, num_tokens, num_heads, head_dim)\n", " context_vec = (attn_weights @ values).transpose(1, 2) \n", " \n", - " # Combine heads, where self.d_out = self.num_heads * self.head_dim\n", + " # 将多个头的输出重新组合回去 self.d_out = self.num_heads * self.head_dim\n", " context_vec = context_vec.contiguous().view(b, num_tokens, self.d_out)\n", " context_vec = self.out_proj(context_vec) # optional projection\n", "\n", " return context_vec\n", "\n", + "# 试验一下\n", "torch.manual_seed(123)\n", "\n", "batch_size, block_size, d_in = batch.shape\n", @@ -2014,9 +2049,9 @@ "id": "d334dfb5-2b6c-4c33-82d5-b4e9db5867bb", "metadata": {}, "source": [ - "- Note that the above is essentially a rewritten version of `MultiHeadAttentionWrapper` that is more efficient.\n", - "- The resulting output looks a bit different since the random weight initializations differ, but both are fully functional implementations that can be used in the GPT class we will implement in the upcoming chapters.\n", - "- Note that in addition, we added a linear projection layer (`self.out_proj `) to the `MultiHeadAttention` class above. This is simply a linear transformation that doesn't change the dimensions. It's a standard convention to use such a projection layer in LLM implementation, but it's not strictly necessary (recent research has shown that it can be removed without affecting the modeling performance; see the further reading section at the end of this chapter)\n" + "- 请注意,以上内容实际上是 MultiHeadAttentionWrapper 的一个更高效的改写版。\n", + "- 由于随机权重初始化的差异,最终的输出结果看起来有些不同,但两者都是完全可以使用的实现,将在后续章节中实现的GPT类中使用。\n", + "- 此外,我们在上述 MultiHeadAttention 类中添加了一个线性投影层(self.out_proj)。这只是一个不会改变维度的线性变换。在LLM实现中使用这样的投影层是一种标准惯例,但并非严格必要(最近的研究表明,它可以被移除而不会影响建模性能;见本章末尾的进一步阅读部分)" ] }, { @@ -2024,7 +2059,7 @@ "id": "8b0ed78c-e8ac-4f8f-a479-a98242ae8f65", "metadata": {}, "source": [ - "- Note that if you are interested in a compact and efficient implementation of the above, you can also consider the [`torch.nn.MultiheadAttention`](https://pytorch.org/docs/stable/generated/torch.nn.MultiheadAttention.html) class in PyTorch." 
+ "- 如果你对更复杂、高效的多头注意力实现感兴趣,你可以考虑使用 PyTorch 的 [`torch.nn.MultiheadAttention`](https://pytorch.org/docs/stable/generated/torch.nn.MultiheadAttention.html) 类。" ] }, { @@ -2032,7 +2067,7 @@ "id": "363701ad-2022-46c8-9972-390d2a2b9911", "metadata": {}, "source": [ - "- Since the above implementation may look a bit complex at first glance, let's look at what happens when executing `attn_scores = queries @ keys.transpose(2, 3)`:" + "- 上述实现可能看上去有一点复杂,让我们来看一下,当运行 `attn_scores = queries @ keys.transpose(2, 3)` 时会发生什么:" ] }, { @@ -2073,9 +2108,9 @@ "id": "0587b946-c8f2-4888-adbf-5a5032fbfd7b", "metadata": {}, "source": [ - "- In this case, the matrix multiplication implementation in PyTorch will handle the 4-dimensional input tensor so that the matrix multiplication is carried out between the 2 last dimensions (num_tokens, head_dim) and then repeated for the individual heads. \n", + "- 在这种情况下,PyTorch 中的矩阵乘法实现将处理 4 维输入张量,以便在最后的两个维度(num_tokens,head_dim)之间进行矩阵乘法,然后针对各个头重复进行。\n", "\n", - "- For instance, the above becomes a more compact way to compute the matrix multiplication for each head separately:" + "- 例如,上述内容成为了一种单独计算每个头的更紧凑的矩阵乘法:" ] }, { @@ -2145,7 +2180,7 @@ "id": "dec671bf-7938-4304-ad1e-75d9920e7f43", "metadata": {}, "source": [ - "# Summary and takeaways" + "# 总结与收获" ] }, { @@ -2153,16 +2188,8 @@ "id": "fa3e4113-ffca-432c-b3ec-7a50bd15da25", "metadata": {}, "source": [ - "- See the [./multihead-attention.ipynb](./multihead-attention.ipynb) code notebook, which is a concise version of the data loader (chapter 2) plus the multi-head attention class that we implemented in this chapter and will need for training the GPT model in upcoming chapters." + "- 你可以查看 [./multihead-attention.ipynb](./multihead-attention.ipynb) 代码 Notebook,这是 DataLoader(第2章)的简洁版本,加上我们在本章中实现的多头注意力类,我们将在后续章节中训练 GPT 模型时使用它。" ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "9f5b7a94-78d0-49d5-896f-21696cb331b7", - "metadata": {}, - "outputs": [], - "source": [] } ], "metadata": { @@ -2181,7 +2208,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.10.12" + "version": "3.9.18" } }, "nbformat": 4, diff --git a/ch03/01_main-chapter-code/exercise-solutions.ipynb b/ch03/01_main-chapter-code/exercise-solutions.ipynb index 161239f..cc47754 100644 --- a/ch03/01_main-chapter-code/exercise-solutions.ipynb +++ b/ch03/01_main-chapter-code/exercise-solutions.ipynb @@ -5,7 +5,7 @@ "id": "51c9672d-8d0c-470d-ac2d-1271f8ec3f14", "metadata": {}, "source": [ - "# Chapter 3 Exercise solutions" + "# Chapter 3 习题解答" ] }, { @@ -13,12 +13,12 @@ "id": "33dfa199-9aee-41d4-a64b-7e3811b9a616", "metadata": {}, "source": [ - "# Exercise 3.1" + "# 3.1" ] }, { "cell_type": "code", - "execution_count": 5, + "execution_count": 1, "id": "5fee2cf5-61c3-4167-81b5-44ea155bbaf2", "metadata": {}, "outputs": [], @@ -39,7 +39,7 @@ }, { "cell_type": "code", - "execution_count": 58, + "execution_count": 2, "id": "62ea289c-41cd-4416-89dd-dde6383a6f70", "metadata": {}, "outputs": [], @@ -72,7 +72,7 @@ }, { "cell_type": "code", - "execution_count": 59, + "execution_count": 3, "id": "7b035143-f4e8-45fb-b398-dec1bd5153d4", "metadata": {}, "outputs": [], @@ -103,7 +103,7 @@ }, { "cell_type": "code", - "execution_count": 60, + "execution_count": 4, "id": "7591d79c-c30e-406d-adfd-20c12eb448f6", "metadata": {}, "outputs": [], @@ -115,7 +115,7 @@ }, { "cell_type": "code", - "execution_count": 61, + "execution_count": 5, "id": "ddd0f54f-6bce-46cc-a428-17c2a56557d0", "metadata": {}, "outputs": [ @@ -130,7 +130,7 
@@ " [-0.5299, -0.1081]], grad_fn=)" ] }, - "execution_count": 61, + "execution_count": 5, "metadata": {}, "output_type": "execute_result" } @@ -141,7 +141,7 @@ }, { "cell_type": "code", - "execution_count": 62, + "execution_count": 6, "id": "340908f8-1144-4ddd-a9e1-a1c5c3d592f5", "metadata": {}, "outputs": [ @@ -156,7 +156,7 @@ " [-0.5299, -0.1081]], grad_fn=)" ] }, - "execution_count": 62, + "execution_count": 6, "metadata": {}, "output_type": "execute_result" } @@ -170,15 +170,15 @@ "id": "33543edb-46b5-4b01-8704-f7f101230544", "metadata": {}, "source": [ - "# Exercise 3.2" + "# 3.2" ] }, { "cell_type": "markdown", - "id": "0588e209-1644-496a-8dae-7630b4ef9083", + "id": "1fc1a301", "metadata": {}, "source": [ - "If we want to have an output dimension of 2, as earlier in single-head attention, we can have to change the projection dimension `d_out` to 1:" + "如果我们想要多头注意力机制的输出和之前单头注意力机制一样为 2,我们可以将输出维度 `d_out` 设置为 1:" ] }, { @@ -227,7 +227,7 @@ "id": "92bdabcb-06cf-4576-b810-d883bbd313ba", "metadata": {}, "source": [ - "# Exercise 3.3" + "# 3.3" ] }, { @@ -249,7 +249,7 @@ "id": "375d5290-8e8b-4149-958e-1efb58a69191", "metadata": {}, "source": [ - "Optionally, the number of parameters is as follows:" + "上述实现的参数量为:" ] }, { @@ -280,7 +280,9 @@ "id": "a56c1d47-9b95-4bd1-a517-580a6f779c52", "metadata": {}, "source": [ - "The GPT-2 model has 117M parameters in total, but as we can see, most of its parameters are not in the multi-head attention module itself." + "\n", + "\n", + "GPT-2 模型有 117M 的参数,但正如我们所见,绝大部分参数其实都不是来源于多头注意力机制(而是线性层)。" ] } ], @@ -300,7 +302,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.11.5" + "version": "3.9.18" } }, "nbformat": 4, diff --git a/ch04/01_main-chapter-code/ch04.ipynb b/ch04/01_main-chapter-code/ch04.ipynb index 3f69b6c..cb24802 100644 --- a/ch04/01_main-chapter-code/ch04.ipynb +++ b/ch04/01_main-chapter-code/ch04.ipynb @@ -5,7 +5,7 @@ "id": "ce9295b2-182b-490b-8325-83a67c4a001d", "metadata": {}, "source": [ - "# Chapter 4: Implementing a GPT model from Scratch To Generate Text " + "# 章节 4:从零开始实现 GPT 模型" ] }, { @@ -13,7 +13,7 @@ "id": "e7da97ed-e02f-4d7f-b68e-a0eba3716e02", "metadata": {}, "source": [ - "- In this chapter, we implement a GPT-like LLM architecture; the next chapter will focus on training this LLM" + "- 在本章中,我们将设计一个类似 GPT 的大型语言模型(LLM)架构;下一章则将聚焦于该模型的训练。" ] }, { @@ -29,7 +29,7 @@ "id": "53fe99ab-0bcf-4778-a6b5-6db81fb826ef", "metadata": {}, "source": [ - "## 4.1 Coding an LLM architecture" + "## 4.1 设计LLM的架构" ] }, { @@ -37,10 +37,10 @@ "id": "ad72d1ff-d82d-4e33-a88e-3c1a8831797b", "metadata": {}, "source": [ - "- Chapter 1 discussed models like GPT and Llama, which generate words sequentially and are based on the decoder part of the original transformer architecture\n", - "- Therefore, these LLMs are often referred to as \"decoder-like\" LLMs\n", - "- Compared to conventional deep learning models, LLMs are larger, mainly due to their vast number of parameters, not the amount of code\n", - "- We'll see that many elements are repeated in an LLM's architecture" + "- 第1章探讨了如GPT与Llama等模型,这些模型基于transformer架构的decoder部分,并按顺序生成文本。\n", + "- 因此,这些LLM经常被称为decoder-only LLM。\n", + "- 与传统的深度学习模型相比,LLM更大,这是因为它们有更多的参数,而不是代码量。\n", + "- 而在LLM的架构中,有许多元素是重复的。" ] }, { @@ -56,10 +56,16 @@ "id": "0d43f5e2-fb51-434a-b9be-abeef6b98d99", "metadata": {}, "source": [ - "- In previous chapters, we used small embedding dimensions for token inputs and outputs for ease of illustration, ensuring they fit on a single page\n", - "- In 
this chapter, we consider embedding and model sizes akin to a small GPT-2 model\n", - "- We'll specifically code the architecture of the smallest GPT-2 model (124 million parameters), as outlined in Radford et al.'s [Language Models are Unsupervised Multitask Learners](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf) (note that the initial report lists it as 117M parameters, but this was later corrected in the model weight repository)\n", - "- Chapter 6 will show how to load pretrained weights into our implementation, which will be compatible with model sizes of 345, 762, and 1542 million parameters" + "- 在前几章中,为了方便展示,我们使用了较小的嵌入(embedding)维度来处理token的输入和输出。\n", + "- 在本章中,我们将考虑与GPT2-small模型类似的嵌入和模型大小。\n", + "- 我们将具体实现最小的GPT2-small模型(124M参数)的架构,如Radford等人在[《Language Models are Unsupervised Multitask Learners》](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf)中概述的那样(注意,GPT2-small的参数量曾被错误的统计为117M参数,后被更正为124M)。\n", + "- 第6章将展示如何将预训练权重加载到我们实现的GPT2中,并兼容345、762和1542M参数的模型大小。\n", + "\n", + "> 译者注:GPT2的论文《Language Models are Unsupervised Multitask Learners》中错误统计了GPT2系列模型的参数量,这一错误后续在模型仓库中被偷偷修正了。\n", + "> \n", + "> 错误的参数量:Small (117M)\tMedium (345M)\tLarge (762M)\tXL (1542M)\n", + ">\n", + "> 正确的参数量:Small (124M)\tMedium (355M)\tLarge (774M)\tXL (1558M)" ] }, { @@ -67,7 +73,7 @@ "id": "21baa14d-24b8-4820-8191-a2808f7fbabc", "metadata": {}, "source": [ - "- Configuration details for the 124 million parameter GPT-2 model include:" + "- 124M参数GPT-2模型的配置细节包括:" ] }, { @@ -78,11 +84,11 @@ "outputs": [], "source": [ "GPT_CONFIG_124M = {\n", - " \"vocab_size\": 50257, # Vocabulary size\n", - " \"ctx_len\": 1024, # Context length\n", - " \"emb_dim\": 768, # Embedding dimension\n", - " \"n_heads\": 12, # Number of attention heads\n", - " \"n_layers\": 12, # Number of layers\n", + " \"vocab_size\": 50257, # 词表大小\n", + " \"ctx_len\": 1024, # 上下文长度\n", + " \"emb_dim\": 768, # 嵌入维度\n", + " \"n_heads\": 12, # 注意力头(attention heads)的数量\n", + " \"n_layers\": 12, # 模型层数\n", " \"drop_rate\": 0.1, # Dropout rate\n", " \"qkv_bias\": False # Query-Key-Value bias\n", "}" @@ -93,14 +99,14 @@ "id": "c12fcd28-d210-4c57-8be6-06cfcd5d73a4", "metadata": {}, "source": [ - "- We use short variable names to avoid long lines of code later\n", - "- `\"vocab_size\"` indicates a vocabulary size of 50,257 words, supported by the BPE tokenizer discussed in Chapter 2\n", - "- `\"ctx_len\"` represents the model's maximum input token count, as enabled by positional embeddings covered in Chapter 2\n", - "- `\"emb_dim\"` is the embedding size for token inputs, converting each input token into a 768-dimensional vector\n", - "- `\"n_heads\"` is the number of attention heads in the multi-head attention mechanism implemented in Chapter 3\n", - "- `\"n_layers\"` is the number of transformer blocks within the model, which we'll implement in upcoming sections\n", - "- `\"drop_rate\"` is the dropout mechanism's intensity, discussed in Chapter 3; 0.1 means dropping 10% of hidden units during training to mitigate overfitting\n", - "- `\"qkv_bias\"` decides if the `Linear` layers in the multi-head attention mechanism (from Chapter 3) should include a bias vector when computing query (Q), key (K), and value (V) tensors; we'll disable this option, which is standard practice in modern LLMs; however, we'll revisit this later when loading pretrained GPT-2 weights from OpenAI into our reimplementation in Chapter 6" + "- 
我们使用简短的变量名以避免后续代码行的过长\n", + "- \"vocab_size\" 是一个BPE tokenizer(分词器),词表大小为50257个词,这在第二章介绍过\n", + "- \"ctx_len\" 表示模型支持输入的最大token数量,这数值由第二章中介绍的位置编码决定\n", + "- \"emb_dim\" 是对输入token的嵌入维度,这里会将输入的每个token都嵌入成768维的向量\n", + "- \"n_heads\" 是多头注意力机制中的注意力头数,这在第三章中实现过\n", + "- \"n_layers\" 是模型中transformer blocks的数量,我们将在接下来的部分中实现它。\n", + "- \"drop_rate\" 是第三章中讨论的dropout机制的强度;0.1表示在训练期间丢弃10%的隐藏神经元以缓解过拟合\n", + "- \"qkv_bias\" 决定第三章中的多头注意力机制中的Linear层在计算Query(Q),Key(K)和Value(V)张量时是否应包含偏置向量(bias);当代LLM通常不会启用这个选项,我们也不会;但在第六章中将OpenAI预训练的GPT-2权重加载到我们的实现的模型时,会再次讨论此选项。" ] }, { @@ -128,11 +134,11 @@ " self.pos_emb = nn.Embedding(cfg[\"ctx_len\"], cfg[\"emb_dim\"])\n", " self.drop_emb = nn.Dropout(cfg[\"drop_rate\"])\n", " \n", - " # Use a placeholder for TransformerBlock\n", + " # 先用空白实现顶替下 TransformerBlock\n", " self.trf_blocks = nn.Sequential(\n", " *[DummyTransformerBlock(cfg) for _ in range(cfg[\"n_layers\"])])\n", " \n", - " # Use a placeholder for LayerNorm\n", + " # 先用空白实现顶替下 LayerNorm\n", " self.final_norm = DummyLayerNorm(cfg[\"emb_dim\"])\n", " self.out_head = nn.Linear(\n", " cfg[\"emb_dim\"], cfg[\"vocab_size\"], bias=False\n", @@ -153,20 +159,20 @@ "class DummyTransformerBlock(nn.Module):\n", " def __init__(self, cfg):\n", " super().__init__()\n", - " # A simple placeholder\n", + " # 略\n", "\n", " def forward(self, x):\n", - " # This block does nothing and just returns its input.\n", + " # 先啥也别干,原样返回\n", " return x\n", "\n", "\n", "class DummyLayerNorm(nn.Module):\n", " def __init__(self, normalized_shape, eps=1e-5):\n", " super().__init__()\n", - " # The parameters here are just to mimic the LayerNorm interface.\n", + " # 这里的参数只是为了模拟 LayerNorm 接口。\n", "\n", " def forward(self, x):\n", - " # This layer does nothing and just returns its input.\n", + " # 先啥也别干,原样返回\n", " return x" ] }, @@ -248,7 +254,7 @@ "id": "f8332a00-98da-4eb4-b882-922776a89917", "metadata": {}, "source": [ - "## 4.2 Normalizing activations with layer normalization" + "## 4.2 对激活进行层归一化" ] }, { @@ -256,9 +262,9 @@ "id": "066cfb81-d59b-4d95-afe3-e43cf095f292", "metadata": {}, "source": [ - "- Layer normalization, also known as LayerNorm ([Ba et al. 2016](https://arxiv.org/abs/1607.06450)), centers the activations of a neural network layer around a mean of 0 and normalizes their variance to 1\n", - "- This stabilizes training and enables faster convergence to effective weights\n", - "- Layer normalization is applied both before and after the multi-head attention module within the transformer block, which we will implement later; it's also applied before the final output layer" + "- 层归一化(Layer normalization),也叫 LayerNorm ([Ba et al. 
2016](https://arxiv.org/abs/1607.06450)),会将神经网络层的激活值规范到均值为0,并将其方差归一化为1。\n", + "- 这稳定了训练过程,并提高了模型的收敛速度。。\n", + "- Transformer block中多头注意力模块的输入和输出都会应用LayerNorm,一会会实现它;同时,在最终输出层之前也会应用LayerNorm。" ] }, { @@ -274,7 +280,7 @@ "id": "5ab49940-6b35-4397-a80e-df8d092770a7", "metadata": {}, "source": [ - "- Let's see how layer normalization works by passing a small input sample through a simple neural network layer:" + "- 咱们用一个简单的网络,输入一个样本看看LayerNorm是怎么工作的。" ] }, { @@ -296,7 +302,7 @@ "source": [ "torch.manual_seed(123)\n", "\n", - "# create 2 training examples with 5 dimensions (features) each\n", + "# 创建两个训练样例,每个样例有5个维度(特征)\n", "batch_example = torch.randn(2, 5) \n", "\n", "layer = nn.Sequential(nn.Linear(5, 6), nn.ReLU())\n", @@ -309,7 +315,7 @@ "id": "8fccc29e-71fc-4c16-898c-6137c6ea5d2e", "metadata": {}, "source": [ - "- Let's compute the mean and variance for each of the 2 inputs above:" + "- 计算上面两个输入的均值和方差:" ] }, { @@ -344,7 +350,7 @@ "id": "052eda3e-b395-48c4-acd4-eb8083bab958", "metadata": {}, "source": [ - "- The normalization is applied to each of the two inputs (rows) independently; using dim=-1 applies the calculation across the last dimension (in this case, the feature dimension) instead of the row dimension" + "- LayerNorm 会对输入样本分别归一化(下图中的行); 使用`dim=-1`是在最后一个维度(特征维度)而不是行维度(样本数)上进行计算" ] }, { @@ -360,7 +366,7 @@ "id": "9f8ecbc7-eb14-4fa1-b5d0-7e1ff9694f99", "metadata": {}, "source": [ - "- Subtracting the mean and dividing by the square-root of the variance (standard deviation) centers the inputs to have a mean of 0 and a variance of 1 across the column (feature) dimension:" + "- 减去均值并除以方差的平方根(标准差)会使输入在列(特征)维度上的均值为0,方差为1:" ] }, { @@ -401,7 +407,7 @@ "id": "ac62b90c-7156-4979-9a79-ce1fb92969c1", "metadata": {}, "source": [ - "- Each input is centered at 0 and has a unit variance of 1; to improve readability, we can disable PyTorch's scientific notation:" + "- 每个输入的均值都为0,方差都为1;为了提高可读性,我们可以关闭PyTorch的科学计数法:" ] }, { @@ -434,8 +440,8 @@ "id": "944fb958-d4ed-43cc-858d-00052bb6b31a", "metadata": {}, "source": [ - "- Above, we normalized the features of each input\n", - "- Now, using the same idea, we can implement a `LayerNorm` class:" + "- 在上面,我们对每个输入的特征进行了归一化\n", + "- 现在,用相同的思路,我们可以实现一个`LayerNorm`类:" ] }, { @@ -464,20 +470,18 @@ "id": "e56c3908-7544-4808-b8cb-5d0a55bcca72", "metadata": {}, "source": [ - "**Scale and shift**\n", + "**缩放和偏移**\n", + "- 注意,除了通过减去均值并除以方差执行归一化之外,我们还添加了两个可训练参数,一个是 `scale`,另一个是 `shift`。\n", + "- 初始的 scale(乘以1)和 shift(加0)值没有任何效果;然而,scale 和 shift 是可训练的参数,如果确定这样做可以改善模型在训练任务上的性能,LLM 在训练过程中会自动调整它们。\n", + "- 这使得模型能够学习适合其处理数据的适当缩放和偏移。\n", + "- 注意,在计算方差的平方根之前,我们还添加了一个较小的值(eps);这是为了避免在方差为0时发生分母为0的问题。\n", "\n", - "- Note that in addition to performing the normalization by subtracting the mean and dividing by the variance, we added two trainable parameters, a `scale` and a `shift` parameter\n", - "- The initial `scale` (multiplying by 1) and `shift` (adding 0) values don't have any effect; however, `scale` and `shift` are trainable parameters that the LLM automatically adjusts during training if it is determined that doing so would improve the model's performance on its training task\n", - "- This allows the model to learn appropriate scaling and shifting that best suit the data it is processing\n", - "- Note that we also add a smaller value (`eps`) before computing the square root of the variance; this is to avoid division-by-zero errors if the variance is 0\n", + "**有偏方差**\n", + "- 在上面的方差计算中,设置 `unbiased=False` 意味着用 $\\frac{\\sum_i (x_i - \\bar{x})^2}{n}$ 来计算方差,其中 n 
是样本大小(在这里是特征或列数);这个公式不包括 Bessel 修正(分母是 n-1),因此得到的方差是有偏估计。\n", + "- 因为LLM的嵌入维度很高,所以使用 n 或 n-1 (有偏或无偏)的区别不大。\n", + "- 但 GPT-2 在LayerNorm中使用了有偏方差进行训练,为了在后续章节能加载现有的预训练权重,咱需要`unbiased`这个变量做兼容。\n", "\n", - "**Biased variance**\n", - "- In the variance calculation above, setting `unbiased=False` means using the formula $\\frac{\\sum_i (x_i - \\bar{x})^2}{n}$ to compute the variance where n is the sample size (here, the number of features or columns); this formula does not include Bessel's correction (which uses `n-1` in the denominator), thus providing a biased estimate of the variance \n", - "- For LLMs, where the embedding dimension `n` is very large, the difference between using n and `n-1`\n", - " is negligible\n", - "- However, GPT-2 was trained with a biased variance in the normalization layers, which is why we also adopted this setting for compatibility reasons with the pretrained weights that we will load in later chapters\n", - "\n", - "- Let's now try out `LayerNorm` in practice:" + "- 下面手动实现下 LayerNorm:" ] }, { @@ -531,7 +535,7 @@ "id": "11190e7d-8c29-4115-824a-e03702f9dd54", "metadata": {}, "source": [ - "## 4.3 Implementing a feed forward network with GELU activations" + "## 4.3 使用GELU激活函数实现前馈神经网络" ] }, { @@ -539,11 +543,11 @@ "id": "b0585dfb-f21e-40e5-973f-2f63ad5cb169", "metadata": {}, "source": [ - "- In this section, we implement a small neural network submodule that is used as part of the transformer block in LLMs\n", - "- We start with the activation function\n", - "- In deep learning, ReLU (Rectified Linear Unit) activation functions are commonly used due to their simplicity and effectiveness in various neural network architectures\n", - "- In LLMs, various other types of activation functions are used beyond the traditional ReLU; two notable examples are GELU (Gaussian Error Linear Unit) and SwiGLU (Sigmoid-Weighted Linear Unit)\n", - "- GELU and SwiGLU are more complex, smooth activation functions incorporating Gaussian and sigmoid-gated linear units, respectively, offering better performance for deep learning models, unlike the simpler, piecewise linear function of ReLU" + "- 在这一节中,我们将实现一个网络子模块,该模块将作为LLM中Transformer block的一部分\n", + "- 我们从激活函数开始\n", + "- 在深度学习中,由于ReLU(Rectified Linear Unit)激活函数在各种神经网络架构中的简单性和有效性,它们经常被使用\n", + "- 在LLM中,除了ReLU之外,还使用了其他类型的激活函数;其中两个值得注意的例子是GELU(Gaussian Error Linear Unit)和SwiGLU(Sigmoid-Weighted Linear Unit)\n", + "- GELU和SwiGLU是更复杂的、平滑的激活函数,它们分别结合了高斯和Sigmoid门控线性单元,为深度学习模型提供了更好的性能,与ReLU的简单分段线性函数不同" ] }, { @@ -551,9 +555,8 @@ "id": "7d482ce7-e493-4bfc-a820-3ea99f564ebc", "metadata": {}, "source": [ - "- GELU ([Hendrycks and Gimpel 2016](https://arxiv.org/abs/1606.08415)) can be implemented in several ways; the exact version is defined as GELU(x)=x⋅Φ(x), where Φ(x) is the cumulative distribution function of the standard Gaussian distribution.\n", - "- In practice, it's common to implement a computationally cheaper approximation: $\\text{GELU}(x) \\approx 0.5 \\cdot x \\cdot \\left(1 + \\tanh\\left[\\sqrt{\\frac{2}{\\pi}} \\cdot \\left(x + 0.044715 \\cdot x^3\\right)\\right]\\right)\n", - "$ (the original GPT-2 model was also trained with this approximation)" + "- GELU ([Hendrycks and Gimpel 2016](https://arxiv.org/abs/1606.08415))用多种实现;其精确版本定义为$GELU(x)=x\\cdot \\phi(x)$,其中$\\phi(x)$是标准高斯分布的累积分布函数。\n", + "- 在实际应用中,常常采用计算成本较低的近似形式:$\\text{GELU}(x) \\approx 0.5 \\cdot x \\cdot \\left(1 + \\tanh\\left[\\sqrt{\\frac{2}{\\pi}} \\cdot \\left(x + 0.044715 \\cdot x^3\\right)\\right]\\right)$(原始的GPT-2模型也是使用这个近似形式进行训练的)。" ] }, { @@ -618,10 +621,9 @@ "id": 
"1cd01662-14cb-43fd-bffd-2d702813de2d", "metadata": {}, "source": [ - "- As we can see, ReLU is a piecewise linear function that outputs the input directly if it is positive; otherwise, it outputs zero\n", - "- GELU is a smooth, non-linear function that approximates ReLU but with a non-zero gradient for negative values\n", - "\n", - "- Next, let's implement the small neural network module, `FeedForward`, that we will be using in the LLM's transformer block later:" + "- 显然,ReLU是一个分段线性函数,如果输入是正值,它直接原样输出;否则,输出为零。\n", + "- GELU是一个平滑的非线性函数,近似于ReLU,但输入为负值时,梯度不为0。\n", + "- 接下来,让我们实现小型神经网络模块 FeedForward,稍后我们将在LLM的Transformer block中使用它:" ] }, {