mirror of
https://github.com/datawhalechina/llms-from-scratch-cn.git
synced 2026-05-03 13:02:35 +00:00
更新第4章1,2,3,4节的翻译
@@ -5,7 +5,7 @@
"id": "ce9295b2-182b-490b-8325-83a67c4a001d",
"metadata": {},
"source": [
"# Chapter 4: Implementing a GPT model from Scratch To Generate Text "
"# 章节 4:从零开始实现 GPT 模型"
]
},
{
@@ -13,7 +13,7 @@
"id": "e7da97ed-e02f-4d7f-b68e-a0eba3716e02",
"metadata": {},
"source": [
"- In this chapter, we implement a GPT-like LLM architecture; the next chapter will focus on training this LLM"
"- 在本章中,我们将设计一个类似 GPT 的大型语言模型(LLM)架构;下一章则将聚焦于该模型的训练。"
]
},
{
@@ -37,10 +37,10 @@
"id": "ad72d1ff-d82d-4e33-a88e-3c1a8831797b",
"metadata": {},
"source": [
"- Chapter 1 discussed models like GPT and Llama, which generate words sequentially and are based on the decoder part of the original transformer architecture\n",
"- Therefore, these LLMs are often referred to as \"decoder-like\" LLMs\n",
"- Compared to conventional deep learning models, LLMs are larger, mainly due to their vast number of parameters, not the amount of code\n",
"- We'll see that many elements are repeated in an LLM's architecture"
"- 第1章探讨了如GPT与Llama等模型,这些模型基于transformer架构的decoder部分,并按顺序生成文本。\n",
"- 因此,这些LLM经常被称为decoder-only LLM。\n",
"- 与传统的深度学习模型相比,LLM的规模更大,这主要是因为它们庞大的参数量,而不是代码量。\n",
"- 我们将看到,LLM的架构中有许多重复的元素。"
]
},
{
@@ -56,10 +56,16 @@
"id": "0d43f5e2-fb51-434a-b9be-abeef6b98d99",
"metadata": {},
"source": [
"- In previous chapters, we used small embedding dimensions for token inputs and outputs for ease of illustration, ensuring they fit on a single page\n",
"- In this chapter, we consider embedding and model sizes akin to a small GPT-2 model\n",
"- We'll specifically code the architecture of the smallest GPT-2 model (124 million parameters), as outlined in Radford et al.'s [Language Models are Unsupervised Multitask Learners](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf) (note that the initial report lists it as 117M parameters, but this was later corrected in the model weight repository)\n",
"- Chapter 6 will show how to load pretrained weights into our implementation, which will be compatible with model sizes of 345, 762, and 1542 million parameters"
"- 在前几章中,为了方便展示并使内容能容纳在一页之内,我们使用了较小的嵌入(embedding)维度来处理token的输入和输出。\n",
"- 在本章中,我们将考虑与GPT2-small模型类似的嵌入和模型大小。\n",
"- 我们将具体实现最小的GPT2-small模型(124M参数)的架构,如Radford等人在[《Language Models are Unsupervised Multitask Learners》](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf)中概述的那样(注意,GPT2-small的参数量曾被错误地统计为117M参数,后被更正为124M)。\n",
"- 第6章将展示如何将预训练权重加载到我们实现的GPT2中,并兼容345、762和1542M参数的模型大小。\n",
"\n",
"> 译者注:GPT2的论文《Language Models are Unsupervised Multitask Learners》中错误地统计了GPT2系列模型的参数量,这一错误后续在模型仓库中被偷偷修正了。\n",
"> \n",
"> 错误的参数量:Small (117M)	Medium (345M)	Large (762M)	XL (1542M)\n",
">\n",
"> 正确的参数量:Small (124M)	Medium (355M)	Large (774M)	XL (1558M)"
]
},
{
@@ -67,7 +73,7 @@
"id": "21baa14d-24b8-4820-8191-a2808f7fbabc",
"metadata": {},
"source": [
"- Configuration details for the 124 million parameter GPT-2 model include:"
"- 124M参数GPT-2模型的配置细节包括:"
]
},
{
@@ -78,11 +84,11 @@
"outputs": [],
"source": [
"GPT_CONFIG_124M = {\n",
" \"vocab_size\": 50257, # Vocabulary size\n",
" \"ctx_len\": 1024, # Context length\n",
" \"emb_dim\": 768, # Embedding dimension\n",
" \"n_heads\": 12, # Number of attention heads\n",
" \"n_layers\": 12, # Number of layers\n",
" \"vocab_size\": 50257, # 词表大小\n",
" \"ctx_len\": 1024, # 上下文长度\n",
" \"emb_dim\": 768, # 嵌入维度\n",
" \"n_heads\": 12, # 注意力头(attention heads)的数量\n",
" \"n_layers\": 12, # 模型层数\n",
" \"drop_rate\": 0.1, # Dropout rate\n",
" \"qkv_bias\": False # Query-Key-Value bias\n",
"}"
@@ -93,14 +99,14 @@
"id": "c12fcd28-d210-4c57-8be6-06cfcd5d73a4",
"metadata": {},
"source": [
"- We use short variable names to avoid long lines of code later\n",
"- `\"vocab_size\"` indicates a vocabulary size of 50,257 words, supported by the BPE tokenizer discussed in Chapter 2\n",
"- `\"ctx_len\"` represents the model's maximum input token count, as enabled by positional embeddings covered in Chapter 2\n",
"- `\"emb_dim\"` is the embedding size for token inputs, converting each input token into a 768-dimensional vector\n",
"- `\"n_heads\"` is the number of attention heads in the multi-head attention mechanism implemented in Chapter 3\n",
"- `\"n_layers\"` is the number of transformer blocks within the model, which we'll implement in upcoming sections\n",
"- `\"drop_rate\"` is the dropout mechanism's intensity, discussed in Chapter 3; 0.1 means dropping 10% of hidden units during training to mitigate overfitting\n",
"- `\"qkv_bias\"` decides if the `Linear` layers in the multi-head attention mechanism (from Chapter 3) should include a bias vector when computing query (Q), key (K), and value (V) tensors; we'll disable this option, which is standard practice in modern LLMs; however, we'll revisit this later when loading pretrained GPT-2 weights from OpenAI into our reimplementation in Chapter 6"
"- 我们使用简短的变量名,以避免后续代码行过长\n",
"- \"vocab_size\" 表示词表大小为50257个词,由第二章介绍的BPE tokenizer(分词器)支持\n",
"- \"ctx_len\" 表示模型支持输入的最大token数量,这一数值由第二章中介绍的位置编码决定\n",
"- \"emb_dim\" 是对输入token的嵌入维度,这里会将输入的每个token都嵌入成768维的向量\n",
"- \"n_heads\" 是多头注意力机制中的注意力头数,这在第三章中实现过\n",
"- \"n_layers\" 是模型中transformer blocks的数量,我们将在接下来的部分中实现它。\n",
"- \"drop_rate\" 是第三章中讨论的dropout机制的强度;0.1表示在训练期间丢弃10%的隐藏神经元以缓解过拟合\n",
"- \"qkv_bias\" 决定第三章中的多头注意力机制中的Linear层在计算Query(Q),Key(K)和Value(V)张量时是否应包含偏置向量(bias);当代LLM通常不会启用这个选项,我们也不会;但在第六章中将OpenAI预训练的GPT-2权重加载到我们实现的模型中时,会再次讨论此选项。"
]
},
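An aside on the configuration above (not part of the notebook or the diff): the quoted values can be sanity-checked in plain Python. The `head_dim` derivation below is an illustrative assumption based on the Chapter 3 multi-head attention design, not code from the chapter.

```python
# Sanity-check sketch for the GPT-2 124M configuration quoted above.
GPT_CONFIG_124M = {
    "vocab_size": 50257,  # BPE vocabulary size (Chapter 2)
    "ctx_len": 1024,      # maximum number of input tokens
    "emb_dim": 768,       # embedding dimension per token
    "n_heads": 12,        # attention heads per multi-head attention module
    "n_layers": 12,       # number of transformer blocks
    "drop_rate": 0.1,     # dropout probability during training
    "qkv_bias": False,    # no bias in the Q/K/V Linear layers
}

# emb_dim must divide evenly among the attention heads (Chapter 3),
# so each of the 12 heads operates on a 64-dimensional slice.
assert GPT_CONFIG_124M["emb_dim"] % GPT_CONFIG_124M["n_heads"] == 0
head_dim = GPT_CONFIG_124M["emb_dim"] // GPT_CONFIG_124M["n_heads"]
print(head_dim)  # 64
```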
{
@@ -128,11 +134,11 @@
" self.pos_emb = nn.Embedding(cfg[\"ctx_len\"], cfg[\"emb_dim\"])\n",
" self.drop_emb = nn.Dropout(cfg[\"drop_rate\"])\n",
" \n",
" # Use a placeholder for TransformerBlock\n",
" # 先用占位模块代替 TransformerBlock\n",
" self.trf_blocks = nn.Sequential(\n",
" *[DummyTransformerBlock(cfg) for _ in range(cfg[\"n_layers\"])])\n",
" \n",
" # Use a placeholder for LayerNorm\n",
" # 先用占位模块代替 LayerNorm\n",
" self.final_norm = DummyLayerNorm(cfg[\"emb_dim\"])\n",
" self.out_head = nn.Linear(\n",
" cfg[\"emb_dim\"], cfg[\"vocab_size\"], bias=False\n",
@@ -153,20 +159,20 @@
"class DummyTransformerBlock(nn.Module):\n",
" def __init__(self, cfg):\n",
" super().__init__()\n",
" # A simple placeholder\n",
" # 一个简单的占位实现\n",
"\n",
" def forward(self, x):\n",
" # This block does nothing and just returns its input.\n",
" # 这个模块什么也不做,只是原样返回输入\n",
" return x\n",
"\n",
"\n",
"class DummyLayerNorm(nn.Module):\n",
" def __init__(self, normalized_shape, eps=1e-5):\n",
" super().__init__()\n",
" # The parameters here are just to mimic the LayerNorm interface.\n",
" # 这里的参数只是为了模拟 LayerNorm 接口。\n",
"\n",
" def forward(self, x):\n",
" # This layer does nothing and just returns its input.\n",
" # 这个模块什么也不做,只是原样返回输入\n",
" return x"
]
},
@@ -256,9 +262,9 @@
"id": "066cfb81-d59b-4d95-afe3-e43cf095f292",
"metadata": {},
"source": [
"- Layer normalization, also known as LayerNorm ([Ba et al. 2016](https://arxiv.org/abs/1607.06450)), centers the activations of a neural network layer around a mean of 0 and normalizes their variance to 1\n",
"- This stabilizes training and enables faster convergence to effective weights\n",
"- Layer normalization is applied both before and after the multi-head attention module within the transformer block, which we will implement later; it's also applied before the final output layer"
"- 层归一化(Layer normalization),也叫 LayerNorm ([Ba et al. 2016](https://arxiv.org/abs/1607.06450)),会将神经网络层的激活值规范到均值为0,并将其方差归一化为1。\n",
"- 这稳定了训练过程,并使模型能更快地收敛到有效的权重。\n",
"- Transformer block中,多头注意力模块的前后都会应用LayerNorm,稍后我们将实现它;同时,在最终输出层之前也会应用LayerNorm。"
]
},
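As a standalone illustration of the definition above (a plain-Python sketch, not the notebook's PyTorch code; the helper name `layer_norm` and the sample values are made up for illustration):

```python
# Layer normalization of one activation vector:
# center to mean 0, scale to (approximately) variance 1.
def layer_norm(x, eps=1e-5):
    mean = sum(v for v in x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)  # biased variance, as in LayerNorm
    return [(v - mean) / (var + eps) ** 0.5 for v in x]

acts = [0.2, 0.6, 1.0, 0.0, 0.4]      # pretend activations of one layer
out = layer_norm(acts)

mean_out = sum(out) / len(out)
var_out = sum(v ** 2 for v in out) / len(out)
print(round(mean_out, 6), round(var_out, 4))  # mean ≈ 0, variance ≈ 1
```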
{
@@ -274,7 +280,7 @@
"id": "5ab49940-6b35-4397-a80e-df8d092770a7",
"metadata": {},
"source": [
"- Let's see how layer normalization works by passing a small input sample through a simple neural network layer:"
"- 让我们将一个小的输入样本传入一个简单的神经网络层,看看LayerNorm是如何工作的:"
]
},
{
@@ -296,7 +302,7 @@
"source": [
"torch.manual_seed(123)\n",
"\n",
"# create 2 training examples with 5 dimensions (features) each\n",
"# 创建两个训练样例,每个样例有5个维度(特征)\n",
"batch_example = torch.randn(2, 5) \n",
"\n",
"layer = nn.Sequential(nn.Linear(5, 6), nn.ReLU())\n",
@@ -309,7 +315,7 @@
"id": "8fccc29e-71fc-4c16-898c-6137c6ea5d2e",
"metadata": {},
"source": [
"- Let's compute the mean and variance for each of the 2 inputs above:"
"- 计算上面两个输入的均值和方差:"
]
},
{
@@ -344,7 +350,7 @@
"id": "052eda3e-b395-48c4-acd4-eb8083bab958",
"metadata": {},
"source": [
"- The normalization is applied to each of the two inputs (rows) independently; using dim=-1 applies the calculation across the last dimension (in this case, the feature dimension) instead of the row dimension"
"- LayerNorm 会对两个输入(即每一行)分别进行归一化;使用`dim=-1`会在最后一个维度(这里是特征维度)而不是行维度(样本数)上进行计算"
]
},
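A plain-Python analogue of the `dim=-1` behaviour described above (illustrative only; the list values are invented, and real code would use `tensor.mean(dim=-1, keepdim=True)`):

```python
# Two "samples" with five "features" each, mirroring the (2, 5) batch above.
batch = [
    [0.2, 0.6, 1.0, 0.0, 0.4],
    [1.5, 0.5, 1.0, 2.0, 0.0],
]

# dim=-1: one statistic per row, each sample handled independently.
row_means = [sum(row) / len(row) for row in batch]
print(row_means)

# A dim=0 (column-wise) mean would instead mix the two samples together,
# which is NOT what layer normalization does.
col_means = [sum(col) / len(col) for col in zip(*batch)]
print(col_means)
```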
{
@@ -360,7 +366,7 @@
"id": "9f8ecbc7-eb14-4fa1-b5d0-7e1ff9694f99",
"metadata": {},
"source": [
"- Subtracting the mean and dividing by the square-root of the variance (standard deviation) centers the inputs to have a mean of 0 and a variance of 1 across the column (feature) dimension:"
"- 减去均值并除以方差的平方根(标准差)会使输入在列(特征)维度上的均值为0,方差为1:"
]
},
{
@@ -401,7 +407,7 @@
"id": "ac62b90c-7156-4979-9a79-ce1fb92969c1",
"metadata": {},
"source": [
"- Each input is centered at 0 and has a unit variance of 1; to improve readability, we can disable PyTorch's scientific notation:"
"- 每个输入的均值都为0,方差都为1;为了提高可读性,我们可以关闭PyTorch的科学计数法:"
]
},
{
@@ -434,8 +440,8 @@
"id": "944fb958-d4ed-43cc-858d-00052bb6b31a",
"metadata": {},
"source": [
"- Above, we normalized the features of each input\n",
"- Now, using the same idea, we can implement a `LayerNorm` class:"
"- 在上面,我们对每个输入的特征进行了归一化\n",
"- 现在,用相同的思路,我们可以实现一个`LayerNorm`类:"
]
},
{
@@ -531,7 +537,7 @@
"id": "11190e7d-8c29-4115-824a-e03702f9dd54",
"metadata": {},
"source": [
"## 4.3 Implementing a feed forward network with GELU activations"
"## 4.3 使用GELU激活函数实现前馈神经网络"
]
},
{
@@ -539,11 +545,11 @@
"id": "b0585dfb-f21e-40e5-973f-2f63ad5cb169",
"metadata": {},
"source": [
"- In this section, we implement a small neural network submodule that is used as part of the transformer block in LLMs\n",
"- We start with the activation function\n",
"- In deep learning, ReLU (Rectified Linear Unit) activation functions are commonly used due to their simplicity and effectiveness in various neural network architectures\n",
"- In LLMs, various other types of activation functions are used beyond the traditional ReLU; two notable examples are GELU (Gaussian Error Linear Unit) and SwiGLU (Sigmoid-Weighted Linear Unit)\n",
"- GELU and SwiGLU are more complex, smooth activation functions incorporating Gaussian and sigmoid-gated linear units, respectively, offering better performance for deep learning models, unlike the simpler, piecewise linear function of ReLU"
"- 在这一节中,我们将实现一个小型神经网络子模块,该模块将作为LLM中Transformer block的一部分\n",
"- 我们从激活函数开始\n",
"- 在深度学习中,ReLU(Rectified Linear Unit)激活函数因其简单性和在各种神经网络架构中的有效性而被广泛使用\n",
"- 在LLM中,除了ReLU之外,还使用了其他类型的激活函数;其中两个值得注意的例子是GELU(Gaussian Error Linear Unit)和SwiGLU(Sigmoid-Weighted Linear Unit)\n",
"- 与ReLU的简单分段线性函数不同,GELU和SwiGLU是更复杂、平滑的激活函数,它们分别结合了高斯和Sigmoid门控线性单元,为深度学习模型提供了更好的性能"
]
},
{
@@ -551,9 +557,8 @@
"id": "7d482ce7-e493-4bfc-a820-3ea99f564ebc",
"metadata": {},
"source": [
"- GELU ([Hendrycks and Gimpel 2016](https://arxiv.org/abs/1606.08415)) can be implemented in several ways; the exact version is defined as GELU(x)=x⋅Φ(x), where Φ(x) is the cumulative distribution function of the standard Gaussian distribution.\n",
"- In practice, it's common to implement a computationally cheaper approximation: $\text{GELU}(x) \approx 0.5 \cdot x \cdot \left(1 + \tanh\left[\sqrt{\frac{2}{\pi}} \cdot \left(x + 0.044715 \cdot x^3\right)\right]\right)\n",
"$ (the original GPT-2 model was also trained with this approximation)"
"- GELU ([Hendrycks and Gimpel 2016](https://arxiv.org/abs/1606.08415))有多种实现方式;其精确版本定义为$GELU(x)=x\cdot \Phi(x)$,其中$\Phi(x)$是标准高斯分布的累积分布函数。\n",
"- 在实际应用中,常常采用计算成本较低的近似形式:$\text{GELU}(x) \approx 0.5 \cdot x \cdot \left(1 + \tanh\left[\sqrt{\frac{2}{\pi}} \cdot \left(x + 0.044715 \cdot x^3\right)\right]\right)$(原始的GPT-2模型也是使用这个近似形式进行训练的)。"
]
},
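The exact form and the tanh approximation above can be compared directly; below is a standalone `math`-module sketch (the notebook itself uses PyTorch; the function names here are hypothetical, and the exact version uses erf to express the Gaussian CDF):

```python
import math

def gelu_exact(x):
    # GELU(x) = x * Phi(x), with Phi the standard-normal CDF (via erf)
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    # tanh approximation quoted above, used to train the original GPT-2
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(x, round(gelu_exact(x), 4), round(gelu_tanh(x), 4))
# The two versions agree to roughly 1e-3; note that gelu(-1) is small but
# non-zero, unlike relu(-1) == 0.
```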
{
@@ -618,10 +623,9 @@
"id": "1cd01662-14cb-43fd-bffd-2d702813de2d",
"metadata": {},
"source": [
"- As we can see, ReLU is a piecewise linear function that outputs the input directly if it is positive; otherwise, it outputs zero\n",
"- GELU is a smooth, non-linear function that approximates ReLU but with a non-zero gradient for negative values\n",
"\n",
"- Next, let's implement the small neural network module, `FeedForward`, that we will be using in the LLM's transformer block later:"
"- 可以看到,ReLU是一个分段线性函数,如果输入是正值,它直接原样输出;否则,输出为零。\n",
"- GELU是一个平滑的非线性函数,近似于ReLU,但在输入为负值时,梯度不为零。\n",
"- 接下来,让我们实现小型神经网络模块 `FeedForward`,稍后我们将在LLM的Transformer block中使用它:"
]
},
{