mirror of
https://github.com/datawhalechina/llms-from-scratch-cn.git
synced 2026-05-03 13:02:35 +00:00
更新第4章1,2,3,4节的翻译
@@ -5,7 +5,7 @@
"id": "ce9295b2-182b-490b-8325-83a67c4a001d",
"metadata": {},
"source": [
"# Chapter 4: Implementing a GPT model from Scratch To Generate Text "
"# 章节 4:从零开始实现 GPT 模型"
]
},
{
@@ -13,7 +13,7 @@
"id": "e7da97ed-e02f-4d7f-b68e-a0eba3716e02",
"metadata": {},
"source": [
"- In this chapter, we implement a GPT-like LLM architecture; the next chapter will focus on training this LLM"
"- 在本章中,我们将设计一个类似 GPT 的大型语言模型(LLM)架构;下一章则将聚焦于该模型的训练。"
]
},
{
@@ -37,10 +37,10 @@
"id": "ad72d1ff-d82d-4e33-a88e-3c1a8831797b",
"metadata": {},
"source": [
"- Chapter 1 discussed models like GPT and Llama, which generate words sequentially and are based on the decoder part of the original transformer architecture\n",
"- Therefore, these LLMs are often referred to as \"decoder-like\" LLMs\n",
"- Compared to conventional deep learning models, LLMs are larger, mainly due to their vast number of parameters, not the amount of code\n",
"- We'll see that many elements are repeated in an LLM's architecture"
"- 第1章探讨了如GPT与Llama等模型,这些模型基于transformer架构的decoder部分,并按顺序生成文本。\n",
"- 因此,这些LLM经常被称为decoder-only LLM。\n",
"- 与传统的深度学习模型相比,LLM的规模更大,这主要是因为它们庞大的参数量,而不是代码量。\n",
"- 我们将看到,LLM的架构中有许多重复的元素。"
]
},
{
@@ -56,10 +56,16 @@
"id": "0d43f5e2-fb51-434a-b9be-abeef6b98d99",
"metadata": {},
"source": [
"- In previous chapters, we used small embedding dimensions for token inputs and outputs for ease of illustration, ensuring they fit on a single page\n",
"- In this chapter, we consider embedding and model sizes akin to a small GPT-2 model\n",
"- We'll specifically code the architecture of the smallest GPT-2 model (124 million parameters), as outlined in Radford et al.'s [Language Models are Unsupervised Multitask Learners](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf) (note that the initial report lists it as 117M parameters, but this was later corrected in the model weight repository)\n",
"- Chapter 6 will show how to load pretrained weights into our implementation, which will be compatible with model sizes of 345, 762, and 1542 million parameters"
"- 在前几章中,为了方便展示并使内容能容纳在一页之内,我们使用了较小的嵌入(embedding)维度来处理token的输入和输出。\n",
"- 在本章中,我们将考虑与GPT2-small模型类似的嵌入和模型大小。\n",
"- 我们将具体实现最小的GPT2-small模型(124M参数)的架构,如Radford等人在[《Language Models are Unsupervised Multitask Learners》](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf)中概述的那样(注意,GPT2-small的参数量曾被错误地统计为117M参数,后被更正为124M)。\n",
"- 第6章将展示如何将预训练权重加载到我们实现的GPT2中,并兼容345、762和1542M参数的模型大小。\n",
"\n",
"> 译者注:GPT2的论文《Language Models are Unsupervised Multitask Learners》中错误地统计了GPT2系列模型的参数量,这一错误后续在模型仓库中被偷偷修正了。\n",
"> \n",
"> 错误的参数量:Small (117M)	Medium (345M)	Large (762M)	XL (1542M)\n",
">\n",
"> 正确的参数量:Small (124M)	Medium (355M)	Large (774M)	XL (1558M)"
]
},
{
@@ -67,7 +73,7 @@
"id": "21baa14d-24b8-4820-8191-a2808f7fbabc",
"metadata": {},
"source": [
"- Configuration details for the 124 million parameter GPT-2 model include:"
"- 124M参数GPT-2模型的配置细节包括:"
]
},
{
@@ -78,11 +84,11 @@
"outputs": [],
"source": [
"GPT_CONFIG_124M = {\n",
" \"vocab_size\": 50257, # Vocabulary size\n",
" \"ctx_len\": 1024, # Context length\n",
" \"emb_dim\": 768, # Embedding dimension\n",
" \"n_heads\": 12, # Number of attention heads\n",
" \"n_layers\": 12, # Number of layers\n",
" \"vocab_size\": 50257, # 词表大小\n",
" \"ctx_len\": 1024, # 上下文长度\n",
" \"emb_dim\": 768, # 嵌入维度\n",
" \"n_heads\": 12, # 注意力头(attention heads)的数量\n",
" \"n_layers\": 12, # 模型层数\n",
" \"drop_rate\": 0.1, # Dropout rate\n",
" \"qkv_bias\": False # Query-Key-Value bias\n",
"}"
@@ -93,14 +99,14 @@
"id": "c12fcd28-d210-4c57-8be6-06cfcd5d73a4",
"metadata": {},
"source": [
"- We use short variable names to avoid long lines of code later\n",
"- `\"vocab_size\"` indicates a vocabulary size of 50,257 words, supported by the BPE tokenizer discussed in Chapter 2\n",
"- `\"ctx_len\"` represents the model's maximum input token count, as enabled by positional embeddings covered in Chapter 2\n",
"- `\"emb_dim\"` is the embedding size for token inputs, converting each input token into a 768-dimensional vector\n",
"- `\"n_heads\"` is the number of attention heads in the multi-head attention mechanism implemented in Chapter 3\n",
"- `\"n_layers\"` is the number of transformer blocks within the model, which we'll implement in upcoming sections\n",
"- `\"drop_rate\"` is the dropout mechanism's intensity, discussed in Chapter 3; 0.1 means dropping 10% of hidden units during training to mitigate overfitting\n",
"- `\"qkv_bias\"` decides if the `Linear` layers in the multi-head attention mechanism (from Chapter 3) should include a bias vector when computing query (Q), key (K), and value (V) tensors; we'll disable this option, which is standard practice in modern LLMs; however, we'll revisit this later when loading pretrained GPT-2 weights from OpenAI into our reimplementation in Chapter 6"
"- 我们使用简短的变量名,以避免后续代码行过长\n",
"- \"vocab_size\" 表示词表大小为50257个词,由第二章介绍的BPE tokenizer(分词器)支持\n",
"- \"ctx_len\" 表示模型支持输入的最大token数量,这一数值由第二章中介绍的位置编码决定\n",
"- \"emb_dim\" 是对输入token的嵌入维度,这里会将输入的每个token都嵌入成768维的向量\n",
"- \"n_heads\" 是多头注意力机制中的注意力头数,这在第三章中实现过\n",
"- \"n_layers\" 是模型中transformer blocks的数量,我们将在接下来的部分中实现它。\n",
"- \"drop_rate\" 是第三章中讨论的dropout机制的强度;0.1表示在训练期间丢弃10%的隐藏神经元以缓解过拟合\n",
"- \"qkv_bias\" 决定第三章中的多头注意力机制中的Linear层在计算Query(Q),Key(K)和Value(V)张量时是否应包含偏置向量(bias);当代LLM通常不会启用这个选项,我们也不会;但在第六章中将OpenAI预训练的GPT-2权重加载到我们实现的模型中时,会再次讨论此选项。"
]
},
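An aside on the configuration above (not part of the notebook or the diff): the quoted values can be sanity-checked in plain Python. The `head_dim` derivation below is an illustrative assumption based on the Chapter 3 multi-head attention design, not code from the chapter.

```python
# Sanity-check sketch for the GPT-2 124M configuration quoted above.
GPT_CONFIG_124M = {
    "vocab_size": 50257,  # BPE vocabulary size (Chapter 2)
    "ctx_len": 1024,      # maximum number of input tokens
    "emb_dim": 768,       # embedding dimension per token
    "n_heads": 12,        # attention heads per multi-head attention module
    "n_layers": 12,       # number of transformer blocks
    "drop_rate": 0.1,     # dropout probability during training
    "qkv_bias": False,    # no bias in the Q/K/V Linear layers
}

# emb_dim must divide evenly among the attention heads (Chapter 3),
# so each of the 12 heads operates on a 64-dimensional slice.
assert GPT_CONFIG_124M["emb_dim"] % GPT_CONFIG_124M["n_heads"] == 0
head_dim = GPT_CONFIG_124M["emb_dim"] // GPT_CONFIG_124M["n_heads"]
print(head_dim)  # 64
```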
{
@@ -128,11 +134,11 @@
" self.pos_emb = nn.Embedding(cfg[\"ctx_len\"], cfg[\"emb_dim\"])\n",
" self.drop_emb = nn.Dropout(cfg[\"drop_rate\"])\n",
" \n",
" # Use a placeholder for TransformerBlock\n",
" # 先用占位模块代替 TransformerBlock\n",
" self.trf_blocks = nn.Sequential(\n",
" *[DummyTransformerBlock(cfg) for _ in range(cfg[\"n_layers\"])])\n",
" \n",
" # Use a placeholder for LayerNorm\n",
" # 先用占位模块代替 LayerNorm\n",
" self.final_norm = DummyLayerNorm(cfg[\"emb_dim\"])\n",
" self.out_head = nn.Linear(\n",
" cfg[\"emb_dim\"], cfg[\"vocab_size\"], bias=False\n",
@@ -153,20 +159,20 @@
"class DummyTransformerBlock(nn.Module):\n",
" def __init__(self, cfg):\n",
" super().__init__()\n",
" # A simple placeholder\n",
" # 一个简单的占位实现\n",
"\n",
" def forward(self, x):\n",
" # This block does nothing and just returns its input.\n",
" # 这个模块什么也不做,只是原样返回输入\n",
" return x\n",
"\n",
"\n",
"class DummyLayerNorm(nn.Module):\n",
" def __init__(self, normalized_shape, eps=1e-5):\n",
" super().__init__()\n",
" # The parameters here are just to mimic the LayerNorm interface.\n",
" # 这里的参数只是为了模拟 LayerNorm 接口。\n",
"\n",
" def forward(self, x):\n",
" # This layer does nothing and just returns its input.\n",
" # 这个模块什么也不做,只是原样返回输入\n",
" return x"
]
},
@@ -256,9 +262,9 @@
"id": "066cfb81-d59b-4d95-afe3-e43cf095f292",
"metadata": {},
"source": [
"- Layer normalization, also known as LayerNorm ([Ba et al. 2016](https://arxiv.org/abs/1607.06450)), centers the activations of a neural network layer around a mean of 0 and normalizes their variance to 1\n",
"- This stabilizes training and enables faster convergence to effective weights\n",
"- Layer normalization is applied both before and after the multi-head attention module within the transformer block, which we will implement later; it's also applied before the final output layer"
"- 层归一化(Layer normalization),也叫 LayerNorm ([Ba et al. 2016](https://arxiv.org/abs/1607.06450)),会将神经网络层的激活值规范到均值为0,并将其方差归一化为1。\n",
"- 这稳定了训练过程,并使模型能更快地收敛到有效的权重。\n",
"- Transformer block中,多头注意力模块的前后都会应用LayerNorm,稍后我们将实现它;同时,在最终输出层之前也会应用LayerNorm。"
]
},
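As a standalone illustration of the definition above (a plain-Python sketch, not the notebook's PyTorch code; the helper name `layer_norm` and the sample values are made up for illustration):

```python
# Layer normalization of one activation vector:
# center to mean 0, scale to (approximately) variance 1.
def layer_norm(x, eps=1e-5):
    mean = sum(v for v in x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)  # biased variance, as in LayerNorm
    return [(v - mean) / (var + eps) ** 0.5 for v in x]

acts = [0.2, 0.6, 1.0, 0.0, 0.4]      # pretend activations of one layer
out = layer_norm(acts)

mean_out = sum(out) / len(out)
var_out = sum(v ** 2 for v in out) / len(out)
print(round(mean_out, 6), round(var_out, 4))  # mean ≈ 0, variance ≈ 1
```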
{
@@ -274,7 +280,7 @@
"id": "5ab49940-6b35-4397-a80e-df8d092770a7",
"metadata": {},
"source": [
"- Let's see how layer normalization works by passing a small input sample through a simple neural network layer:"
"- 让我们将一个小的输入样本传入一个简单的神经网络层,看看LayerNorm是如何工作的:"
]
},
{
@@ -296,7 +302,7 @@
"source": [
"torch.manual_seed(123)\n",
"\n",
"# create 2 training examples with 5 dimensions (features) each\n",
"# 创建两个训练样例,每个样例有5个维度(特征)\n",
"batch_example = torch.randn(2, 5) \n",
"\n",
"layer = nn.Sequential(nn.Linear(5, 6), nn.ReLU())\n",
@@ -309,7 +315,7 @@
"id": "8fccc29e-71fc-4c16-898c-6137c6ea5d2e",
"metadata": {},
"source": [
"- Let's compute the mean and variance for each of the 2 inputs above:"
"- 计算上面两个输入的均值和方差:"
]
},
{
@@ -344,7 +350,7 @@
"id": "052eda3e-b395-48c4-acd4-eb8083bab958",
"metadata": {},
"source": [
"- The normalization is applied to each of the two inputs (rows) independently; using dim=-1 applies the calculation across the last dimension (in this case, the feature dimension) instead of the row dimension"
"- LayerNorm 会对两个输入(即每一行)分别进行归一化;使用`dim=-1`会在最后一个维度(这里是特征维度)而不是行维度(样本数)上进行计算"
]
},
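A plain-Python analogue of the `dim=-1` behaviour described above (illustrative only; the list values are invented, and real code would use `tensor.mean(dim=-1, keepdim=True)`):

```python
# Two "samples" with five "features" each, mirroring the (2, 5) batch above.
batch = [
    [0.2, 0.6, 1.0, 0.0, 0.4],
    [1.5, 0.5, 1.0, 2.0, 0.0],
]

# dim=-1: one statistic per row, each sample handled independently.
row_means = [sum(row) / len(row) for row in batch]
print(row_means)

# A dim=0 (column-wise) mean would instead mix the two samples together,
# which is NOT what layer normalization does.
col_means = [sum(col) / len(col) for col in zip(*batch)]
print(col_means)
```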
{
@@ -360,7 +366,7 @@
"id": "9f8ecbc7-eb14-4fa1-b5d0-7e1ff9694f99",
"metadata": {},
"source": [
"- Subtracting the mean and dividing by the square-root of the variance (standard deviation) centers the inputs to have a mean of 0 and a variance of 1 across the column (feature) dimension:"
"- 减去均值并除以方差的平方根(标准差)会使输入在列(特征)维度上的均值为0,方差为1:"
]
},
{
@@ -401,7 +407,7 @@
"id": "ac62b90c-7156-4979-9a79-ce1fb92969c1",
"metadata": {},
"source": [
"- Each input is centered at 0 and has a unit variance of 1; to improve readability, we can disable PyTorch's scientific notation:"
"- 每个输入的均值都为0,方差都为1;为了提高可读性,我们可以关闭PyTorch的科学计数法:"
]
},
{
@@ -434,8 +440,8 @@
"id": "944fb958-d4ed-43cc-858d-00052bb6b31a",
"metadata": {},
"source": [
"- Above, we normalized the features of each input\n",
"- Now, using the same idea, we can implement a `LayerNorm` class:"
"- 在上面,我们对每个输入的特征进行了归一化\n",
"- 现在,用相同的思路,我们可以实现一个`LayerNorm`类:"
]
},
{
@@ -531,7 +537,7 @@
"id": "11190e7d-8c29-4115-824a-e03702f9dd54",
"metadata": {},
"source": [
"## 4.3 Implementing a feed forward network with GELU activations"
"## 4.3 使用GELU激活函数实现前馈神经网络"
]
},
{
@@ -539,11 +545,11 @@
"id": "b0585dfb-f21e-40e5-973f-2f63ad5cb169",
"metadata": {},
"source": [
"- In this section, we implement a small neural network submodule that is used as part of the transformer block in LLMs\n",
"- We start with the activation function\n",
"- In deep learning, ReLU (Rectified Linear Unit) activation functions are commonly used due to their simplicity and effectiveness in various neural network architectures\n",
"- In LLMs, various other types of activation functions are used beyond the traditional ReLU; two notable examples are GELU (Gaussian Error Linear Unit) and SwiGLU (Sigmoid-Weighted Linear Unit)\n",
"- GELU and SwiGLU are more complex, smooth activation functions incorporating Gaussian and sigmoid-gated linear units, respectively, offering better performance for deep learning models, unlike the simpler, piecewise linear function of ReLU"
"- 在这一节中,我们将实现一个小型神经网络子模块,该模块将作为LLM中Transformer block的一部分\n",
"- 我们从激活函数开始\n",
"- 在深度学习中,ReLU(Rectified Linear Unit)激活函数因其简单性和在各种神经网络架构中的有效性而被广泛使用\n",
"- 在LLM中,除了ReLU之外,还使用了其他类型的激活函数;其中两个值得注意的例子是GELU(Gaussian Error Linear Unit)和SwiGLU(Sigmoid-Weighted Linear Unit)\n",
"- 与ReLU的简单分段线性函数不同,GELU和SwiGLU是更复杂、平滑的激活函数,它们分别结合了高斯和Sigmoid门控线性单元,为深度学习模型提供了更好的性能"
]
},
{
@@ -551,9 +557,8 @@
"id": "7d482ce7-e493-4bfc-a820-3ea99f564ebc",
"metadata": {},
"source": [
"- GELU ([Hendrycks and Gimpel 2016](https://arxiv.org/abs/1606.08415)) can be implemented in several ways; the exact version is defined as GELU(x)=x⋅Φ(x), where Φ(x) is the cumulative distribution function of the standard Gaussian distribution.\n",
"- In practice, it's common to implement a computationally cheaper approximation: $\text{GELU}(x) \approx 0.5 \cdot x \cdot \left(1 + \tanh\left[\sqrt{\frac{2}{\pi}} \cdot \left(x + 0.044715 \cdot x^3\right)\right]\right)\n",
"$ (the original GPT-2 model was also trained with this approximation)"
"- GELU ([Hendrycks and Gimpel 2016](https://arxiv.org/abs/1606.08415))有多种实现方式;其精确版本定义为$GELU(x)=x\cdot \Phi(x)$,其中$\Phi(x)$是标准高斯分布的累积分布函数。\n",
"- 在实际应用中,常常采用计算成本较低的近似形式:$\text{GELU}(x) \approx 0.5 \cdot x \cdot \left(1 + \tanh\left[\sqrt{\frac{2}{\pi}} \cdot \left(x + 0.044715 \cdot x^3\right)\right]\right)$(原始的GPT-2模型也是使用这个近似形式进行训练的)。"
]
},
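The exact form and the tanh approximation above can be compared directly; below is a standalone `math`-module sketch (the notebook itself uses PyTorch; the function names here are hypothetical, and the exact version uses erf to express the Gaussian CDF):

```python
import math

def gelu_exact(x):
    # GELU(x) = x * Phi(x), with Phi the standard-normal CDF (via erf)
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    # tanh approximation quoted above, used to train the original GPT-2
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(x, round(gelu_exact(x), 4), round(gelu_tanh(x), 4))
# The two versions agree to roughly 1e-3; note that gelu(-1) is small but
# non-zero, unlike relu(-1) == 0.
```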
{
@@ -618,10 +623,9 @@
"id": "1cd01662-14cb-43fd-bffd-2d702813de2d",
"metadata": {},
"source": [
"- As we can see, ReLU is a piecewise linear function that outputs the input directly if it is positive; otherwise, it outputs zero\n",
"- GELU is a smooth, non-linear function that approximates ReLU but with a non-zero gradient for negative values\n",
"\n",
"- Next, let's implement the small neural network module, `FeedForward`, that we will be using in the LLM's transformer block later:"
"- 可以看到,ReLU是一个分段线性函数,如果输入是正值,它直接原样输出;否则,输出为零。\n",
"- GELU是一个平滑的非线性函数,近似于ReLU,但在输入为负值时,梯度不为零。\n",
"- 接下来,让我们实现小型神经网络模块 `FeedForward`,稍后我们将在LLM的Transformer block中使用它:"
]
},
{