diff --git a/ch04/01_main-chapter-code/ch04.ipynb b/ch04/01_main-chapter-code/ch04.ipynb index 3f69b6c..cb24802 100644 --- a/ch04/01_main-chapter-code/ch04.ipynb +++ b/ch04/01_main-chapter-code/ch04.ipynb @@ -5,7 +5,7 @@ "id": "ce9295b2-182b-490b-8325-83a67c4a001d", "metadata": {}, "source": [ - "# Chapter 4: Implementing a GPT model from Scratch To Generate Text " + "# 第 4 章:从零开始实现 GPT 模型以生成文本" ] }, { @@ -13,7 +13,7 @@ "id": "e7da97ed-e02f-4d7f-b68e-a0eba3716e02", "metadata": {}, "source": [ - "- In this chapter, we implement a GPT-like LLM architecture; the next chapter will focus on training this LLM" + "- 在本章中,我们将实现一个类似 GPT 的大型语言模型(LLM)架构;下一章则将聚焦于该模型的训练。" ] }, { @@ -29,7 +29,7 @@ "id": "53fe99ab-0bcf-4778-a6b5-6db81fb826ef", "metadata": {}, "source": [ - "## 4.1 Coding an LLM architecture" + "## 4.1 实现LLM架构" ] }, { @@ -37,10 +37,10 @@ "id": "ad72d1ff-d82d-4e33-a88e-3c1a8831797b", "metadata": {}, "source": [ - "- Chapter 1 discussed models like GPT and Llama, which generate words sequentially and are based on the decoder part of the original transformer architecture\n", - "- Therefore, these LLMs are often referred to as \"decoder-like\" LLMs\n", - "- Compared to conventional deep learning models, LLMs are larger, mainly due to their vast number of parameters, not the amount of code\n", - "- We'll see that many elements are repeated in an LLM's architecture" + "- 第1章讨论了GPT与Llama等模型,这些模型基于transformer架构的decoder部分,按顺序逐词生成文本。\n", + "- 因此,这些LLM经常被称为类decoder(decoder-like)LLM。\n", + "- 与传统的深度学习模型相比,LLM的规模更大,这主要是因为它们拥有庞大的参数量,而不是代码量。\n", + "- 我们将看到,LLM的架构中有许多元素是重复的。" ] }, { @@ -56,10 +56,16 @@ "id": "0d43f5e2-fb51-434a-b9be-abeef6b98d99", "metadata": {}, "source": [ - "- In previous chapters, we used small embedding dimensions for token inputs and outputs for ease of illustration, ensuring they fit on a single page\n", - "- In this chapter, we consider embedding and model sizes akin to a small GPT-2 model\n", - "- We'll specifically code the architecture of the smallest GPT-2 model (124 million parameters), as outlined in Radford et al.'s [Language Models are Unsupervised Multitask Learners](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf) (note that the initial report lists it as 117M parameters, but this was later corrected in the model weight repository)\n", - "- Chapter 6 will show how to load pretrained weights into our implementation, which will be compatible with model sizes of 345, 762, and 1542 million parameters" + "- 在前几章中,为了方便展示,我们使用了较小的嵌入(embedding)维度来处理token的输入和输出。\n", + "- 在本章中,我们将考虑与GPT2-small模型类似的嵌入和模型大小。\n", + "- 我们将具体实现最小的GPT-2模型(即GPT2-small,124M参数)的架构,如Radford等人在[《Language Models are Unsupervised Multitask Learners》](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf)中概述的那样(注意,GPT2-small的参数量曾被错误地统计为117M,后被更正为124M)。\n", + "- 第6章将展示如何将预训练权重加载到我们实现的GPT2中,并兼容345、762和1542M参数的模型大小。\n", + "\n", + "> 译者注:GPT2的论文《Language Models are Unsupervised Multitask Learners》中错误统计了GPT2系列模型的参数量,这一错误后续在模型权重仓库中被修正了。\n", + "> \n", + "> 错误的参数量:Small (117M)  Medium (345M)  Large (762M)  XL (1542M)\n", + ">\n", + "> 正确的参数量:Small (124M)  Medium (355M)  Large (774M)  XL (1558M)" ] }, { @@ -67,7 +73,7 @@ "id": "21baa14d-24b8-4820-8191-a2808f7fbabc", "metadata": {}, "source": [ - "- Configuration details for the 124 million parameter GPT-2 model include:" + "- 124M参数GPT-2模型的配置细节包括:" ] }, { @@ -78,11 +84,11 @@ "outputs": [], "source": [ "GPT_CONFIG_124M = {\n",
- " \"vocab_size\": 50257, # Vocabulary size\n", - " \"ctx_len\": 1024, # Context length\n", - " \"emb_dim\": 768, # Embedding dimension\n", - " \"n_heads\": 12, # Number of attention heads\n", - " \"n_layers\": 12, # Number of layers\n", + " \"vocab_size\": 50257, # 词表大小\n", + " \"ctx_len\": 1024, # 上下文长度\n", + " \"emb_dim\": 768, # 嵌入维度\n", + " \"n_heads\": 12, # 注意力头(attention heads)的数量\n", + " \"n_layers\": 12, # 模型层数\n", " \"drop_rate\": 0.1, # Dropout rate\n", " \"qkv_bias\": False # Query-Key-Value bias\n", "}" ] }, { @@ -93,14 +99,14 @@ "id": "c12fcd28-d210-4c57-8be6-06cfcd5d73a4", "metadata": {}, "source": [ - "- We use short variable names to avoid long lines of code later\n", - "- `\"vocab_size\"` indicates a vocabulary size of 50,257 words, supported by the BPE tokenizer discussed in Chapter 2\n", - "- `\"ctx_len\"` represents the model's maximum input token count, as enabled by positional embeddings covered in Chapter 2\n", - "- `\"emb_dim\"` is the embedding size for token inputs, converting each input token into a 768-dimensional vector\n", - "- `\"n_heads\"` is the number of attention heads in the multi-head attention mechanism implemented in Chapter 3\n", - "- `\"n_layers\"` is the number of transformer blocks within the model, which we'll implement in upcoming sections\n", - "- `\"drop_rate\"` is the dropout mechanism's intensity, discussed in Chapter 3; 0.1 means dropping 10% of hidden units during training to mitigate overfitting\n", - "- `\"qkv_bias\"` decides if the `Linear` layers in the multi-head attention mechanism (from Chapter 3) should include a bias vector when computing query (Q), key (K), and value (V) tensors; we'll disable this option, which is standard practice in modern LLMs; however, we'll revisit this later when loading pretrained GPT-2 weights from OpenAI into our reimplementation in Chapter 6" + "- 我们使用简短的变量名,以避免后续代码行过长\n", + "- \"vocab_size\" 表示词表大小为50257个词,由第二章中介绍的BPE tokenizer(分词器)支持\n", + "- \"ctx_len\" 表示模型支持输入的最大token数量,该数值由第二章中介绍的位置嵌入(positional embeddings)决定\n", + "- \"emb_dim\" 是对输入token的嵌入维度,这里会将输入的每个token都嵌入成768维的向量\n", + "- \"n_heads\" 是多头注意力机制中的注意力头数,这在第三章中实现过\n", + "- \"n_layers\" 是模型中transformer blocks的数量,我们将在接下来的部分中实现它。\n", + "- \"drop_rate\" 是第三章中讨论的dropout机制的强度;0.1表示在训练期间丢弃10%的隐藏神经元以缓解过拟合\n", + "- \"qkv_bias\" 决定第三章中的多头注意力机制中的Linear层在计算Query(Q),Key(K)和Value(V)张量时是否应包含偏置向量(bias);当代LLM通常不会启用这个选项,我们也不会;但在第六章中将OpenAI预训练的GPT-2权重加载到我们实现的模型时,会再次讨论此选项。" ] }, { @@ -128,11 +134,11 @@ " self.pos_emb = nn.Embedding(cfg[\"ctx_len\"], cfg[\"emb_dim\"])\n", " self.drop_emb = nn.Dropout(cfg[\"drop_rate\"])\n", " \n", - " # Use a placeholder for TransformerBlock\n", + " # 先用占位符(placeholder)代替 TransformerBlock\n", " self.trf_blocks = nn.Sequential(\n", " *[DummyTransformerBlock(cfg) for _ in range(cfg[\"n_layers\"])])\n", " \n", - " # Use a placeholder for LayerNorm\n", + " # 先用占位符(placeholder)代替 LayerNorm\n", " self.final_norm = DummyLayerNorm(cfg[\"emb_dim\"])\n", " self.out_head = nn.Linear(\n", " cfg[\"emb_dim\"], cfg[\"vocab_size\"], bias=False\n", @@ -153,20 +159,20 @@ "class DummyTransformerBlock(nn.Module):\n", " def __init__(self, cfg):\n", " super().__init__()\n", - " # A simple placeholder\n", + " # 一个简单的占位符\n", "\n", " def forward(self, x):\n", - " # This block does nothing and just returns its input.\n", + " # 此模块不做任何处理,仅原样返回输入\n", " return x\n", "\n", "\n", "class DummyLayerNorm(nn.Module):\n", " def __init__(self, normalized_shape, eps=1e-5):\n", " super().__init__()\n", - " # The parameters here are just to mimic the LayerNorm interface.\n", + " # 这里的参数只是为了模拟 LayerNorm 接口。\n", "\n", " def forward(self, x):\n",
- " # This layer does nothing and just returns its input.\n", + " # 此层不做任何处理,仅原样返回输入\n", " return x" ] }, { @@ -248,7 +254,7 @@ "id": "f8332a00-98da-4eb4-b882-922776a89917", "metadata": {}, "source": [ - "## 4.2 Normalizing activations with layer normalization" + "## 4.2 对激活值进行层归一化" ] }, { @@ -256,9 +262,9 @@ "id": "066cfb81-d59b-4d95-afe3-e43cf095f292", "metadata": {}, "source": [ - "- Layer normalization, also known as LayerNorm ([Ba et al. 2016](https://arxiv.org/abs/1607.06450)), centers the activations of a neural network layer around a mean of 0 and normalizes their variance to 1\n", - "- This stabilizes training and enables faster convergence to effective weights\n", - "- Layer normalization is applied both before and after the multi-head attention module within the transformer block, which we will implement later; it's also applied before the final output layer" + "- 层归一化(Layer normalization),也叫 LayerNorm ([Ba et al. 2016](https://arxiv.org/abs/1607.06450)),会将神经网络层的激活值规范到均值为0,并将其方差归一化为1。\n", + "- 这可以稳定训练过程,并使模型更快地收敛到有效的权重。\n", + "- 在Transformer block中,多头注意力模块的前后都会应用LayerNorm(我们稍后会实现该模块);在最终输出层之前也会应用LayerNorm。" ] }, { @@ -274,7 +280,7 @@ "id": "5ab49940-6b35-4397-a80e-df8d092770a7", "metadata": {}, "source": [ - "- Let's see how layer normalization works by passing a small input sample through a simple neural network layer:" + "- 让我们将一个小的输入样本传入一个简单的神经网络层,看看层归一化是如何工作的:" ] }, { @@ -296,7 +302,7 @@ "source": [ "torch.manual_seed(123)\n", "\n", - "# create 2 training examples with 5 dimensions (features) each\n", + "# 创建两个训练样例,每个样例有5个维度(特征)\n", "batch_example = torch.randn(2, 5) \n", "\n", "layer = nn.Sequential(nn.Linear(5, 6), nn.ReLU())\n", "out = layer(batch_example)\n", "print(out)" ] }, { @@ -309,7 +315,7 @@ "id": "8fccc29e-71fc-4c16-898c-6137c6ea5d2e", "metadata": {}, "source": [ - "- Let's compute the mean and variance for each of the 2 inputs above:" + "- 计算上面两个输入的均值和方差:" ] }, { @@ -344,7 +350,7 @@ "id": "052eda3e-b395-48c4-acd4-eb8083bab958", "metadata": {}, "source": [ - "- The normalization is applied to each of the two inputs (rows) independently; using dim=-1 applies the calculation across the last dimension (in this case, the feature dimension) instead of the row dimension" + "- 归一化是对两个输入(即每一行)分别独立进行的;使用`dim=-1`表示在最后一个维度(此处为特征维度)而不是行维度上进行计算" ] }, { @@ -360,7 +366,7 @@ "id": "9f8ecbc7-eb14-4fa1-b5d0-7e1ff9694f99", "metadata": {}, "source": [ - "- Subtracting the mean and dividing by the square-root of the variance (standard deviation) centers the inputs to have a mean of 0 and a variance of 1 across the column (feature) dimension:" + "- 减去均值并除以方差的平方根(标准差)会使输入在列(特征)维度上的均值为0,方差为1:" ] }, { @@ -401,7 +407,7 @@ "id": "ac62b90c-7156-4979-9a79-ce1fb92969c1", "metadata": {}, "source": [ - "- Each input is centered at 0 and has a unit variance of 1; to improve readability, we can disable PyTorch's scientific notation:" + "- 每个输入的均值都为0,方差都为1;为了提高可读性,我们可以关闭PyTorch的科学计数法:" ] }, { @@ -434,8 +440,8 @@ "id": "944fb958-d4ed-43cc-858d-00052bb6b31a", "metadata": {}, "source": [ - "- Above, we normalized the features of each input\n", - "- Now, using the same idea, we can implement a `LayerNorm` class:" + "- 在上面,我们对每个输入的特征进行了归一化\n", + "- 现在,用相同的思路,我们可以实现一个`LayerNorm`类:" ] }, { @@ -464,20 +470,18 @@ "id": "e56c3908-7544-4808-b8cb-5d0a55bcca72", "metadata": {}, "source": [ - "**Scale and shift**\n", + "**缩放和偏移**\n", + "- 注意,除了通过减去均值并除以方差执行归一化之外,我们还添加了两个可训练参数,一个是 `scale`,另一个是 `shift`。\n", + "- 初始的 scale(乘以1)和 shift(加0)值没有任何效果;然而,scale 和 shift 是可训练的参数,如果确定这样做可以改善模型在训练任务上的性能,LLM 在训练过程中会自动调整它们。\n", + "- 这使得模型能够学习适合其处理数据的适当缩放和偏移。\n",
+ "- 注意,在计算方差的平方根之前,我们还添加了一个较小的值(eps);这是为了避免在方差为0时发生分母为0的问题。\n", "\n", + "**有偏方差**\n", + "- 在上面的方差计算中,设置 `unbiased=False` 意味着用 $\frac{\sum_i (x_i - \bar{x})^2}{n}$ 来计算方差,其中 n 是样本大小(在这里是特征或列数);这个公式不包括 Bessel 修正(分母是 n-1),因此得到的方差是有偏估计。\n", + "- 因为LLM的嵌入维度很高,所以使用 n 或 n-1 (有偏或无偏)的区别不大。\n", + "- 但 GPT-2 的LayerNorm是用有偏方差训练的,因此我们也采用这一设置,以便与后续章节中将要加载的预训练权重保持兼容。\n", "\n", - "- Note that in addition to performing the normalization by subtracting the mean and dividing by the variance, we added two trainable parameters, a `scale` and a `shift` parameter\n", - "- The initial `scale` (multiplying by 1) and `shift` (adding 0) values don't have any effect; however, `scale` and `shift` are trainable parameters that the LLM automatically adjusts during training if it is determined that doing so would improve the model's performance on its training task\n", - "- This allows the model to learn appropriate scaling and shifting that best suit the data it is processing\n", - "- Note that we also add a smaller value (`eps`) before computing the square root of the variance; this is to avoid division-by-zero errors if the variance is 0\n", "\n", - "**Biased variance**\n", - "- In the variance calculation above, setting `unbiased=False` means using the formula $\frac{\sum_i (x_i - \bar{x})^2}{n}$ to compute the variance where n is the sample size (here, the number of features or columns); this formula does not include Bessel's correction (which uses `n-1` in the denominator), thus providing a biased estimate of the variance \n", - "- For LLMs, where the embedding dimension `n` is very large, the difference between using n and `n-1`\n", " is negligible\n", - "- However, GPT-2 was trained with a biased variance in the normalization layers, which is why we also adopted this setting for compatibility reasons with the pretrained weights that we will load in later chapters\n", "\n", - "- Let's now try out `LayerNorm` in practice:" + "- 现在让我们在实践中试用一下 `LayerNorm`:" ] }, { @@ -531,7 +535,7 @@ "id": "11190e7d-8c29-4115-824a-e03702f9dd54", "metadata": {}, "source": [ - "## 4.3 Implementing a feed forward network with GELU activations" + "## 4.3 使用GELU激活函数实现前馈神经网络" ] }, { @@ -539,11 +543,11 @@ "id": "b0585dfb-f21e-40e5-973f-2f63ad5cb169", "metadata": {}, "source": [ - "- In this section, we implement a small neural network submodule that is used as part of the transformer block in LLMs\n", - "- We start with the activation function\n", - "- In deep learning, ReLU (Rectified Linear Unit) activation functions are commonly used due to their simplicity and effectiveness in various neural network architectures\n", - "- In LLMs, various other types of activation functions are used beyond the traditional ReLU; two notable examples are GELU (Gaussian Error Linear Unit) and SwiGLU (Sigmoid-Weighted Linear Unit)\n", - "- GELU and SwiGLU are more complex, smooth activation functions incorporating Gaussian and sigmoid-gated linear units, respectively, offering better performance for deep learning models, unlike the simpler, piecewise linear function of ReLU" + "- 在这一节中,我们将实现一个小型神经网络子模块,该模块是LLM中Transformer block的组成部分\n", + "- 我们从激活函数开始\n", + "- 在深度学习中,ReLU(Rectified Linear Unit)激活函数因其简单且在各种神经网络架构中行之有效而被广泛使用\n", + "- 在LLM中,除了传统的ReLU之外,还会使用其他类型的激活函数;其中两个值得注意的例子是GELU(Gaussian Error Linear Unit)和SwiGLU(Sigmoid-Weighted Linear Unit)\n", + "- 与ReLU简单的分段线性函数不同,GELU和SwiGLU是更复杂、平滑的激活函数,它们分别结合了高斯门控线性单元和sigmoid门控线性单元,能为深度学习模型带来更好的性能" ] }, { @@ -551,9 +555,8 @@ "id": "7d482ce7-e493-4bfc-a820-3ea99f564ebc", "metadata": {}, "source": [
- "- GELU ([Hendrycks and Gimpel 2016](https://arxiv.org/abs/1606.08415)) can be implemented in several ways; the exact version is defined as GELU(x)=x⋅Φ(x), where Φ(x) is the cumulative distribution function of the standard Gaussian distribution.\n", - "- In practice, it's common to implement a computationally cheaper approximation: $\text{GELU}(x) \approx 0.5 \cdot x \cdot \left(1 + \tanh\left[\sqrt{\frac{2}{\pi}} \cdot \left(x + 0.044715 \cdot x^3\right)\right]\right)\n", "$ (the original GPT-2 model was also trained with this approximation)" + "- GELU ([Hendrycks and Gimpel 2016](https://arxiv.org/abs/1606.08415))有多种实现方式;其精确版本定义为$\text{GELU}(x)=x\cdot \Phi(x)$,其中$\Phi(x)$是标准高斯分布的累积分布函数。\n", + "- 在实际应用中,常常采用计算成本较低的近似形式:$\text{GELU}(x) \approx 0.5 \cdot x \cdot \left(1 + \tanh\left[\sqrt{\frac{2}{\pi}} \cdot \left(x + 0.044715 \cdot x^3\right)\right]\right)$(原始的GPT-2模型也是使用这个近似形式进行训练的)。" ] }, { @@ -618,10 +621,9 @@ "id": "1cd01662-14cb-43fd-bffd-2d702813de2d", "metadata": {}, "source": [ - "- As we can see, ReLU is a piecewise linear function that outputs the input directly if it is positive; otherwise, it outputs zero\n", - "- GELU is a smooth, non-linear function that approximates ReLU but with a non-zero gradient for negative values\n", "\n", - "- Next, let's implement the small neural network module, `FeedForward`, that we will be using in the LLM's transformer block later:" + "- 可以看到,ReLU是一个分段线性函数,如果输入是正值,它直接原样输出;否则,输出为零。\n", + "- GELU是一个平滑的非线性函数,近似于ReLU,但输入为负值时,梯度不为0。\n", + "- 接下来,让我们实现小型神经网络模块 `FeedForward`,稍后我们将在LLM的Transformer block中使用它:" ] }, {
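> 注:以下是一段补充的示意代码(非原文内容,仅基于上文给出的两个公式做简单对照,供参考),用于直观比较 GELU 的精确形式($x\cdot\Phi(x)$,借助 `torch.erf` 计算)与 GPT-2 所用的 tanh 近似形式。

```python
import math
import torch

def gelu_exact(x):
    # 精确形式:GELU(x) = x * Φ(x),Φ 为标准正态分布的累积分布函数
    return x * 0.5 * (1.0 + torch.erf(x / math.sqrt(2.0)))

def gelu_tanh_approx(x):
    # 上文给出的 tanh 近似形式(GPT-2 训练时使用的版本)
    return 0.5 * x * (1.0 + torch.tanh(
        math.sqrt(2.0 / math.pi) * (x + 0.044715 * x**3)
    ))

x = torch.linspace(-3.0, 3.0, 7)
print(gelu_exact(x))
print(gelu_tanh_approx(x))
# 两种形式的差异很小,通常在 1e-3 量级或更低
print(torch.max(torch.abs(gelu_exact(x) - gelu_tanh_approx(x))))
```

> 在较新版本的 PyTorch 中,`torch.nn.GELU` 默认即为精确形式,也可以通过 `approximate="tanh"` 选用上述近似。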