Merge branch 'datawhalechina:main' into main

This commit is contained in:
Beyondzjl
2024-03-03 15:25:56 +08:00
committed by GitHub
4 changed files with 355 additions and 311 deletions
+14 -1
@@ -1,6 +1,19 @@
<center>
# Hands-On LLM Implementation (Chinese Edition)
# LLMs From Scratch: Hands-on Building Your Own Large Language Models
</center>
[![GitHub stars](https://img.shields.io/github/stars/datawhalechina/llms-from-scratch-cn.svg?style=social)](https://github.com/datawhalechina/llms-from-scratch-cn)
[![GitHub forks](https://img.shields.io/github/forks/datawhalechina/llms-from-scratch-cn.svg?style=social)](https://github.com/datawhalechina/llms-from-scratch-cn)
[![GitHub issues](https://img.shields.io/github/issues/datawhalechina/llms-from-scratch-cn.svg)](https://github.com/datawhalechina/llms-from-scratch-cn/issues)
[![Code License](https://img.shields.io/badge/Code%20License-Apache_2.0-brightgreen.svg)](https://github.com/datawhalechina/llms-from-scratch-cn/blob/main/LICENSE.txt)
🤗 The "rasbt/LLMs-from-scratch" project on GitHub is a tutorial on implementing a ChatGPT-like large language model (LLM) from scratch. It contains the code for building, pretraining, and finetuning a GPT-like LLM and is the official code repository for the book *Build a Large Language Model (From Scratch)*. The book explains the inner workings of LLMs in detail and guides readers step by step through creating their own model, with clear text, diagrams, and examples at every stage. This approach, which trains a small but functional model for educational purposes, mirrors the one used to create large foundation models such as the one behind ChatGPT. The translated version aims to serve developers in China. 🎉
| Chapter Title | Main Code (quick access) | All Code + Supplements |
|------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------|-------------------------------|
File diff suppressed because it is too large.
@@ -5,7 +5,7 @@
"id": "51c9672d-8d0c-470d-ac2d-1271f8ec3f14",
"metadata": {},
"source": [
"# Chapter 3 Exercise solutions"
"# Chapter 3 习题解答"
]
},
{
@@ -13,12 +13,12 @@
"id": "33dfa199-9aee-41d4-a64b-7e3811b9a616",
"metadata": {},
"source": [
"# Exercise 3.1"
"# 3.1"
]
},
{
"cell_type": "code",
"execution_count": 5,
"execution_count": 1,
"id": "5fee2cf5-61c3-4167-81b5-44ea155bbaf2",
"metadata": {},
"outputs": [],
@@ -39,7 +39,7 @@
},
{
"cell_type": "code",
"execution_count": 58,
"execution_count": 2,
"id": "62ea289c-41cd-4416-89dd-dde6383a6f70",
"metadata": {},
"outputs": [],
@@ -72,7 +72,7 @@
},
{
"cell_type": "code",
"execution_count": 59,
"execution_count": 3,
"id": "7b035143-f4e8-45fb-b398-dec1bd5153d4",
"metadata": {},
"outputs": [],
@@ -103,7 +103,7 @@
},
{
"cell_type": "code",
"execution_count": 60,
"execution_count": 4,
"id": "7591d79c-c30e-406d-adfd-20c12eb448f6",
"metadata": {},
"outputs": [],
@@ -115,7 +115,7 @@
},
{
"cell_type": "code",
"execution_count": 61,
"execution_count": 5,
"id": "ddd0f54f-6bce-46cc-a428-17c2a56557d0",
"metadata": {},
"outputs": [
@@ -130,7 +130,7 @@
" [-0.5299, -0.1081]], grad_fn=<MmBackward0>)"
]
},
"execution_count": 61,
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
@@ -141,7 +141,7 @@
},
{
"cell_type": "code",
"execution_count": 62,
"execution_count": 6,
"id": "340908f8-1144-4ddd-a9e1-a1c5c3d592f5",
"metadata": {},
"outputs": [
@@ -156,7 +156,7 @@
" [-0.5299, -0.1081]], grad_fn=<MmBackward0>)"
]
},
"execution_count": 62,
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
@@ -170,15 +170,15 @@
"id": "33543edb-46b5-4b01-8704-f7f101230544",
"metadata": {},
"source": [
"# Exercise 3.2"
"# 3.2"
]
},
{
"cell_type": "markdown",
"id": "0588e209-1644-496a-8dae-7630b4ef9083",
"id": "1fc1a301",
"metadata": {},
"source": [
"If we want to have an output dimension of 2, as earlier in single-head attention, we can have to change the projection dimension `d_out` to 1:"
"如果我们想要多头注意力机制的输出和之前单头注意力机制一样为 2,我们可以将输出维度 `d_out` 设置为 1"
]
},
{
@@ -227,7 +227,7 @@
"id": "92bdabcb-06cf-4576-b810-d883bbd313ba",
"metadata": {},
"source": [
"# Exercise 3.3"
"# 3.3"
]
},
{
@@ -249,7 +249,7 @@
"id": "375d5290-8e8b-4149-958e-1efb58a69191",
"metadata": {},
"source": [
"Optionally, the number of parameters is as follows:"
"上述实现的参数量为:"
]
},
{
@@ -280,7 +280,9 @@
"id": "a56c1d47-9b95-4bd1-a517-580a6f779c52",
"metadata": {},
"source": [
"The GPT-2 model has 117M parameters in total, but as we can see, most of its parameters are not in the multi-head attention module itself."
"\n",
"\n",
"GPT-2 模型有 117M 的参数,但正如我们所见,绝大部分参数其实都不是来源于多头注意力机制(而是线性层)。"
]
}
],
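The parameter count discussed above can be sketched in plain Python. This is a hypothetical helper (`mha_param_count` is our name, not the notebook's), assuming the chapter-3 module consists of three bias-free Q/K/V projection matrices plus an output projection with bias:

```python
# Hypothetical helper (not from the notebook): counts the parameters of a
# multi-head attention block built from three Q/K/V projections plus an
# output projection, as assumed for the chapter-3 implementation.
def mha_param_count(d_in, d_out, qkv_bias=False):
    qkv = 3 * (d_in * d_out + (d_out if qkv_bias else 0))  # W_query, W_key, W_value
    out_proj = d_out * d_out + d_out                       # Linear layer with bias
    return qkv + out_proj

# GPT-2 small uses d_in = d_out = 768 and qkv_bias = False
print(mha_param_count(768, 768))  # -> 2360064
```

Multiplying by the 12 transformer blocks gives roughly 28M attention parameters, a small fraction of the total, consistent with the remark that most parameters sit elsewhere.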
@@ -300,7 +302,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.5"
"version": "3.9.18"
}
},
"nbformat": 4,
+70 -68
@@ -5,7 +5,7 @@
"id": "ce9295b2-182b-490b-8325-83a67c4a001d",
"metadata": {},
"source": [
"# Chapter 4: Implementing a GPT model from Scratch To Generate Text "
"# 章节 4:从零开始实现 GPT 模型"
]
},
{
@@ -13,7 +13,7 @@
"id": "e7da97ed-e02f-4d7f-b68e-a0eba3716e02",
"metadata": {},
"source": [
"- In this chapter, we implement a GPT-like LLM architecture; the next chapter will focus on training this LLM"
"- 在本章中,我们将设计一个类似 GPT 的大型语言模型(LLM)架构;下一章则将聚焦于该模型的训练。"
]
},
{
@@ -29,7 +29,7 @@
"id": "53fe99ab-0bcf-4778-a6b5-6db81fb826ef",
"metadata": {},
"source": [
"## 4.1 Coding an LLM architecture"
"## 4.1 设计LLM的架构"
]
},
{
@@ -37,10 +37,10 @@
"id": "ad72d1ff-d82d-4e33-a88e-3c1a8831797b",
"metadata": {},
"source": [
"- Chapter 1 discussed models like GPT and Llama, which generate words sequentially and are based on the decoder part of the original transformer architecture\n",
"- Therefore, these LLMs are often referred to as \"decoder-like\" LLMs\n",
"- Compared to conventional deep learning models, LLMs are larger, mainly due to their vast number of parameters, not the amount of code\n",
"- We'll see that many elements are repeated in an LLM's architecture"
"- 第1章探讨了如GPT与Llama等模型,这些模型基于transformer架构的decoder部分,并按顺序生成文本。\n",
"- 因此,这些LLM经常被称为decoder-only LLM\n",
"- 与传统的深度学习模型相比,LLM更大,这是因为它们有更多的参数,而不是代码量。\n",
"- 而在LLM的架构中,有许多元素是重复的。"
]
},
{
@@ -56,10 +56,16 @@
"id": "0d43f5e2-fb51-434a-b9be-abeef6b98d99",
"metadata": {},
"source": [
"- In previous chapters, we used small embedding dimensions for token inputs and outputs for ease of illustration, ensuring they fit on a single page\n",
"- In this chapter, we consider embedding and model sizes akin to a small GPT-2 model\n",
"- We'll specifically code the architecture of the smallest GPT-2 model (124 million parameters), as outlined in Radford et al.'s [Language Models are Unsupervised Multitask Learners](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf) (note that the initial report lists it as 117M parameters, but this was later corrected in the model weight repository)\n",
"- Chapter 6 will show how to load pretrained weights into our implementation, which will be compatible with model sizes of 345, 762, and 1542 million parameters"
"- 在前几章中,为了方便展示,我们使用了较小的嵌入(embedding)维度来处理token的输入和输出。\n",
"- 在本章中,我们将考虑与GPT2-small模型类似的嵌入和模型大小。\n",
"- 我们将具体实现最小的GPT2-small模型(124M参数)的架构,如Radford等人在[《Language Models are Unsupervised Multitask Learners](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf)中概述的那样(注意,GPT2-small的参数量曾被错误的统计为117M参数,后被更正为124M)。\n",
"- 第6章将展示如何将预训练权重加载到我们实现的GPT2中,并兼容345762和1542M参数的模型大小。\n",
"\n",
"> 译者注:GPT2的论文《Language Models are Unsupervised Multitask Learners》中错误统计了GPT2系列模型的参数量,这一错误后续在模型仓库中被偷偷修正了。\n",
"> \n",
"> 错误的参数量:Small (117M)\tMedium (345M)\tLarge (762M)\tXL (1542M)\n",
">\n",
"> 正确的参数量:Small (124M)\tMedium (355M)\tLarge (774M)\tXL (1558M)"
]
},
{
@@ -67,7 +73,7 @@
"id": "21baa14d-24b8-4820-8191-a2808f7fbabc",
"metadata": {},
"source": [
"- Configuration details for the 124 million parameter GPT-2 model include:"
"- 124M参数GPT-2模型的配置细节包括:"
]
},
{
@@ -78,11 +84,11 @@
"outputs": [],
"source": [
"GPT_CONFIG_124M = {\n",
" \"vocab_size\": 50257, # Vocabulary size\n",
" \"ctx_len\": 1024, # Context length\n",
" \"emb_dim\": 768, # Embedding dimension\n",
" \"n_heads\": 12, # Number of attention heads\n",
" \"n_layers\": 12, # Number of layers\n",
" \"vocab_size\": 50257, # 词表大小\n",
" \"ctx_len\": 1024, # 上下文长度\n",
" \"emb_dim\": 768, # 嵌入维度\n",
" \"n_heads\": 12, # 注意力头(attention heads)的数量\n",
" \"n_layers\": 12, # 模型层数\n",
" \"drop_rate\": 0.1, # Dropout rate\n",
" \"qkv_bias\": False # Query-Key-Value bias\n",
"}"
@@ -93,14 +99,14 @@
"id": "c12fcd28-d210-4c57-8be6-06cfcd5d73a4",
"metadata": {},
"source": [
"- We use short variable names to avoid long lines of code later\n",
"- `\"vocab_size\"` indicates a vocabulary size of 50,257 words, supported by the BPE tokenizer discussed in Chapter 2\n",
"- `\"ctx_len\"` represents the model's maximum input token count, as enabled by positional embeddings covered in Chapter 2\n",
"- `\"emb_dim\"` is the embedding size for token inputs, converting each input token into a 768-dimensional vector\n",
"- `\"n_heads\"` is the number of attention heads in the multi-head attention mechanism implemented in Chapter 3\n",
"- `\"n_layers\"` is the number of transformer blocks within the model, which we'll implement in upcoming sections\n",
"- `\"drop_rate\"` is the dropout mechanism's intensity, discussed in Chapter 3; 0.1 means dropping 10% of hidden units during training to mitigate overfitting\n",
"- `\"qkv_bias\"` decides if the `Linear` layers in the multi-head attention mechanism (from Chapter 3) should include a bias vector when computing query (Q), key (K), and value (V) tensors; we'll disable this option, which is standard practice in modern LLMs; however, we'll revisit this later when loading pretrained GPT-2 weights from OpenAI into our reimplementation in Chapter 6"
"- 我们使用简短的变量名以避免后续代码行的过长\n",
"- \"vocab_size\" 是一个BPE tokenizer(分词器),词表大小为50257个词,这在第二章介绍过\n",
"- \"ctx_len\" 表示模型支持输入的最大token数量,这数值由第二章中介绍的位置编码决定\n",
"- \"emb_dim\" 是对输入token的嵌入维度,这里会将输入的每个token都嵌入成768维的向量\n",
"- \"n_heads\" 是多头注意力机制中的注意力头数,这在第三章中实现过\n",
"- \"n_layers\" 是模型中transformer blocks的数量,我们将在接下来的部分中实现它。\n",
"- \"drop_rate\" 是第三章中讨论的dropout机制的强度;0.1表示在训练期间丢弃10%的隐藏神经元以缓解过拟合\n",
"- \"qkv_bias\" 决定第三章中的多头注意力机制中的Linear层在计算QueryQ),Key(K)和Value(V)张量时是否应包含偏置向量(bias);当代LLM通常不会启用这个选项,我们也不会;但在第六章中将OpenAI预训练的GPT-2权重加载到我们的实现的模型时,会再次讨论此选项。"
]
},
{
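As a quick sanity check of the configuration above, the embedding dimension must divide evenly across the attention heads. The head-size arithmetic below is our own addition, not part of the notebook; the dictionary values are copied from `GPT_CONFIG_124M` above:

```python
# Config values copied from GPT_CONFIG_124M above; the head_dim check is ours.
GPT_CONFIG_124M = {
    "vocab_size": 50257,  # Vocabulary size
    "ctx_len": 1024,      # Context length
    "emb_dim": 768,       # Embedding dimension
    "n_heads": 12,        # Number of attention heads
    "n_layers": 12,       # Number of layers
    "drop_rate": 0.1,     # Dropout rate
    "qkv_bias": False,    # Query-Key-Value bias
}

# emb_dim is split evenly across the attention heads
assert GPT_CONFIG_124M["emb_dim"] % GPT_CONFIG_124M["n_heads"] == 0
head_dim = GPT_CONFIG_124M["emb_dim"] // GPT_CONFIG_124M["n_heads"]
print(head_dim)  # -> 64
```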
@@ -128,11 +134,11 @@
" self.pos_emb = nn.Embedding(cfg[\"ctx_len\"], cfg[\"emb_dim\"])\n",
" self.drop_emb = nn.Dropout(cfg[\"drop_rate\"])\n",
" \n",
" # Use a placeholder for TransformerBlock\n",
" # 先用空白实现顶替下 TransformerBlock\n",
" self.trf_blocks = nn.Sequential(\n",
" *[DummyTransformerBlock(cfg) for _ in range(cfg[\"n_layers\"])])\n",
" \n",
" # Use a placeholder for LayerNorm\n",
" # 先用空白实现顶替下 LayerNorm\n",
" self.final_norm = DummyLayerNorm(cfg[\"emb_dim\"])\n",
" self.out_head = nn.Linear(\n",
" cfg[\"emb_dim\"], cfg[\"vocab_size\"], bias=False\n",
@@ -153,20 +159,20 @@
"class DummyTransformerBlock(nn.Module):\n",
" def __init__(self, cfg):\n",
" super().__init__()\n",
" # A simple placeholder\n",
" # \n",
"\n",
" def forward(self, x):\n",
" # This block does nothing and just returns its input.\n",
" # 先啥也别干,原样返回\n",
" return x\n",
"\n",
"\n",
"class DummyLayerNorm(nn.Module):\n",
" def __init__(self, normalized_shape, eps=1e-5):\n",
" super().__init__()\n",
" # The parameters here are just to mimic the LayerNorm interface.\n",
" # 这里的参数只是为了模拟 LayerNorm 接口。\n",
"\n",
" def forward(self, x):\n",
" # This layer does nothing and just returns its input.\n",
" # 先啥也别干,原样返回\n",
" return x"
]
},
@@ -248,7 +254,7 @@
"id": "f8332a00-98da-4eb4-b882-922776a89917",
"metadata": {},
"source": [
"## 4.2 Normalizing activations with layer normalization"
"## 4.2 对激活进行层归一化"
]
},
{
@@ -256,9 +262,9 @@
"id": "066cfb81-d59b-4d95-afe3-e43cf095f292",
"metadata": {},
"source": [
"- Layer normalization, also known as LayerNorm ([Ba et al. 2016](https://arxiv.org/abs/1607.06450)), centers the activations of a neural network layer around a mean of 0 and normalizes their variance to 1\n",
"- This stabilizes training and enables faster convergence to effective weights\n",
"- Layer normalization is applied both before and after the multi-head attention module within the transformer block, which we will implement later; it's also applied before the final output layer"
"- 层归一化(Layer normalization),也叫 LayerNorm ([Ba et al. 2016](https://arxiv.org/abs/1607.06450)),会将神经网络层的激活值规范到均值为0,并将其方差归一化为1。\n",
"- 这稳定了训练过程,并提高了模型的收敛速度。。\n",
"- Transformer block中多头注意力模块的输入和输出都会应用LayerNorm,一会会实现它;同时,在最终输出层之前也会应用LayerNorm。"
]
},
{
@@ -274,7 +280,7 @@
"id": "5ab49940-6b35-4397-a80e-df8d092770a7",
"metadata": {},
"source": [
"- Let's see how layer normalization works by passing a small input sample through a simple neural network layer:"
"- 咱们用一个简单的网络,输入一个样本看看LayerNorm是怎么工作的。"
]
},
{
@@ -296,7 +302,7 @@
"source": [
"torch.manual_seed(123)\n",
"\n",
"# create 2 training examples with 5 dimensions (features) each\n",
"# 创建两个训练样例,每个样例有5个维度(特征)\n",
"batch_example = torch.randn(2, 5) \n",
"\n",
"layer = nn.Sequential(nn.Linear(5, 6), nn.ReLU())\n",
@@ -309,7 +315,7 @@
"id": "8fccc29e-71fc-4c16-898c-6137c6ea5d2e",
"metadata": {},
"source": [
"- Let's compute the mean and variance for each of the 2 inputs above:"
"- 计算上面两个输入的均值和方差:"
]
},
{
@@ -344,7 +350,7 @@
"id": "052eda3e-b395-48c4-acd4-eb8083bab958",
"metadata": {},
"source": [
"- The normalization is applied to each of the two inputs (rows) independently; using dim=-1 applies the calculation across the last dimension (in this case, the feature dimension) instead of the row dimension"
"- LayerNorm 会对输入样本分别归一化(下图中的行); 使用`dim=-1`是在最后一个维度(特征维度)而不是行维度(样本数)上进行计算"
]
},
{
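The row-wise statistics described above can be illustrated without PyTorch. This pure-Python sketch (our own, with made-up sample values) mirrors what per-row `mean(dim=-1)` and `var(dim=-1)` compute:

```python
# Pure-Python illustration (our own toy values): dim=-1 statistics mean that
# each row (sample) gets its own mean and variance over its features.
rows = [[0.0, 1.0, 2.0, 3.0, 4.0],
        [2.0, 2.0, 2.0, 2.0, 2.0]]

def row_mean(xs):
    return sum(xs) / len(xs)

def row_var(xs):  # biased (unbiased=False) variance, as in the notebook
    m = row_mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

print([row_mean(r) for r in rows])  # -> [2.0, 2.0]
print([row_var(r) for r in rows])   # -> [2.0, 0.0]
```

The constant second row has zero variance, which is exactly the case the `eps` term in LayerNorm guards against.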
@@ -360,7 +366,7 @@
"id": "9f8ecbc7-eb14-4fa1-b5d0-7e1ff9694f99",
"metadata": {},
"source": [
"- Subtracting the mean and dividing by the square-root of the variance (standard deviation) centers the inputs to have a mean of 0 and a variance of 1 across the column (feature) dimension:"
"- 减去均值并除以方差的平方根(标准差)会使输入在列(特征)维度上的均值为0,方差为1:"
]
},
{
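The subtract-mean, divide-by-standard-deviation step can be sketched for a single row in plain Python (our own example; the notebook does this with tensor operations):

```python
import math

# One-row sketch of the normalization step: after subtracting the mean and
# dividing by the standard deviation, the row has mean 0 and variance ~1.
def normalize_row(xs, eps=1e-5):
    m = sum(xs) / len(xs)
    v = sum((x - m) ** 2 for x in xs) / len(xs)  # biased variance
    return [(x - m) / math.sqrt(v + eps) for x in xs]

out = normalize_row([0.0, 1.0, 2.0, 3.0, 4.0])
mean_out = sum(out) / len(out)
var_out = sum((x - mean_out) ** 2 for x in out) / len(out)
print(round(mean_out, 6), round(var_out, 4))  # -> 0.0 1.0
```

The variance comes out marginally below 1 because of the `eps` term in the denominator.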
@@ -401,7 +407,7 @@
"id": "ac62b90c-7156-4979-9a79-ce1fb92969c1",
"metadata": {},
"source": [
"- Each input is centered at 0 and has a unit variance of 1; to improve readability, we can disable PyTorch's scientific notation:"
"- 每个输入的均值都为0,方差都为1;为了提高可读性,我们可以关闭PyTorch的科学计数法:"
]
},
{
@@ -434,8 +440,8 @@
"id": "944fb958-d4ed-43cc-858d-00052bb6b31a",
"metadata": {},
"source": [
"- Above, we normalized the features of each input\n",
"- Now, using the same idea, we can implement a `LayerNorm` class:"
"- 在上面,我们对每个输入的特征进行了归一化\n",
"- 现在,用相同的思路,我们可以实现一个`LayerNorm`类:"
]
},
{
@@ -464,20 +470,18 @@
"id": "e56c3908-7544-4808-b8cb-5d0a55bcca72",
"metadata": {},
"source": [
"**Scale and shift**\n",
"**缩放和偏移**\n",
"- 注意,除了通过减去均值并除以方差执行归一化之外,我们还添加了两个可训练参数,一个是 `scale`,另一个是 `shift`。\n",
"- 初始的 scale(乘以1)和 shift(加0)值没有任何效果;然而,scale 和 shift 是可训练的参数,如果确定这样做可以改善模型在训练任务上的性能,LLM 在训练过程中会自动调整它们。\n",
"- 这使得模型能够学习适合其处理数据的适当缩放和偏移。\n",
"- 注意,在计算方差的平方根之前,我们还添加了一个较小的值(eps);这是为了避免在方差为0时发生分母为0的问题。\n",
"\n",
"- Note that in addition to performing the normalization by subtracting the mean and dividing by the variance, we added two trainable parameters, a `scale` and a `shift` parameter\n",
"- The initial `scale` (multiplying by 1) and `shift` (adding 0) values don't have any effect; however, `scale` and `shift` are trainable parameters that the LLM automatically adjusts during training if it is determined that doing so would improve the model's performance on its training task\n",
"- This allows the model to learn appropriate scaling and shifting that best suit the data it is processing\n",
"- Note that we also add a smaller value (`eps`) before computing the square root of the variance; this is to avoid division-by-zero errors if the variance is 0\n",
"**有偏方差**\n",
"- 在上面的方差计算中,设置 `unbiased=False` 意味着用 $\\frac{\\sum_i (x_i - \\bar{x})^2}{n}$ 来计算方差,其中 n 是样本大小(在这里是特征或列数);这个公式不包括 Bessel 修正(分母是 n-1),因此得到的方差是有偏估计。\n",
"- 因为LLM的嵌入维度很高,所以使用 n 或 n-1 (有偏或无偏)的区别不大。\n",
"- 但 GPT-2 在LayerNorm中使用了有偏方差进行训练,为了在后续章节能加载现有的预训练权重,咱需要`unbiased`这个变量做兼容。\n",
"\n",
"**Biased variance**\n",
"- In the variance calculation above, setting `unbiased=False` means using the formula $\\frac{\\sum_i (x_i - \\bar{x})^2}{n}$ to compute the variance where n is the sample size (here, the number of features or columns); this formula does not include Bessel's correction (which uses `n-1` in the denominator), thus providing a biased estimate of the variance \n",
"- For LLMs, where the embedding dimension `n` is very large, the difference between using n and `n-1`\n",
" is negligible\n",
"- However, GPT-2 was trained with a biased variance in the normalization layers, which is why we also adopted this setting for compatibility reasons with the pretrained weights that we will load in later chapters\n",
"\n",
"- Let's now try out `LayerNorm` in practice:"
"- 下面手动实现下 LayerNorm"
]
},
{
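The claim that the n vs. n-1 choice barely matters at LLM scale is easy to verify numerically. This comparison uses our own toy data; the two estimators differ by exactly the factor n/(n-1):

```python
# Biased vs. unbiased variance (our own toy data): the estimators differ
# only by the factor n/(n-1), which approaches 1 as n grows.
def variance(xs, unbiased):
    m = sum(xs) / len(xs)
    denom = len(xs) - 1 if unbiased else len(xs)
    return sum((x - m) ** 2 for x in xs) / denom

small = [1.0, 2.0, 3.0, 4.0]              # n = 4
big = [float(i % 7) for i in range(768)]  # n = 768, GPT-2's emb_dim

for xs in (small, big):
    ratio = variance(xs, True) / variance(xs, False)
    print(len(xs), ratio)  # n=4 -> 1.333..., n=768 -> ~1.0013
```

At n = 768 the two variances agree to about 0.13%, which is why the biased form is harmless for LayerNorm over embedding dimensions.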
@@ -531,7 +535,7 @@
"id": "11190e7d-8c29-4115-824a-e03702f9dd54",
"metadata": {},
"source": [
"## 4.3 Implementing a feed forward network with GELU activations"
"## 4.3 使用GELU激活函数实现前馈神经网络"
]
},
{
@@ -539,11 +543,11 @@
"id": "b0585dfb-f21e-40e5-973f-2f63ad5cb169",
"metadata": {},
"source": [
"- In this section, we implement a small neural network submodule that is used as part of the transformer block in LLMs\n",
"- We start with the activation function\n",
"- In deep learning, ReLU (Rectified Linear Unit) activation functions are commonly used due to their simplicity and effectiveness in various neural network architectures\n",
"- In LLMs, various other types of activation functions are used beyond the traditional ReLU; two notable examples are GELU (Gaussian Error Linear Unit) and SwiGLU (Sigmoid-Weighted Linear Unit)\n",
"- GELU and SwiGLU are more complex, smooth activation functions incorporating Gaussian and sigmoid-gated linear units, respectively, offering better performance for deep learning models, unlike the simpler, piecewise linear function of ReLU"
"- 在这一节中,我们将实现一个网络子模块,该模块将作为LLM中Transformer block的一部分\n",
"- 我们从激活函数开始\n",
"- 在深度学习中,由于ReLURectified Linear Unit)激活函数在各种神经网络架构中的简单性和有效性,它们经常被使用\n",
"- 在LLM中,除了ReLU之外,还使用了其他类型的激活函数;其中两个值得注意的例子是GELUGaussian Error Linear Unit)和SwiGLUSigmoid-Weighted Linear Unit\n",
"- GELUSwiGLU是更复杂的、平滑的激活函数,它们分别结合了高斯和Sigmoid门控线性单元,为深度学习模型提供了更好的性能,与ReLU的简单分段线性函数不同"
]
},
{
@@ -551,9 +555,8 @@
"id": "7d482ce7-e493-4bfc-a820-3ea99f564ebc",
"metadata": {},
"source": [
"- GELU ([Hendrycks and Gimpel 2016](https://arxiv.org/abs/1606.08415)) can be implemented in several ways; the exact version is defined as GELU(x)=x⋅Φ(x), where Φ(x) is the cumulative distribution function of the standard Gaussian distribution.\n",
"- In practice, it's common to implement a computationally cheaper approximation: $\\text{GELU}(x) \\approx 0.5 \\cdot x \\cdot \\left(1 + \\tanh\\left[\\sqrt{\\frac{2}{\\pi}} \\cdot \\left(x + 0.044715 \\cdot x^3\\right)\\right]\\right)\n",
"$ (the original GPT-2 model was also trained with this approximation)"
"- GELU ([Hendrycks and Gimpel 2016](https://arxiv.org/abs/1606.08415))用多种实现;其精确版本定义为$GELU(x)=x\\cdot \\phi(x)$,其中$\\phi(x)$是标准高斯分布的累积分布函数。\n",
"- 在实际应用中,常常采用计算成本较低的近似形式:$\\text{GELU}(x) \\approx 0.5 \\cdot x \\cdot \\left(1 + \\tanh\\left[\\sqrt{\\frac{2}{\\pi}} \\cdot \\left(x + 0.044715 \\cdot x^3\\right)\\right]\\right)$(原始的GPT-2模型也是使用这个近似形式进行训练的)。"
]
},
{
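The tanh approximation quoted above is easy to implement directly. This standalone sketch (ours, without PyTorch) checks a few representative inputs:

```python
import math

# The tanh approximation of GELU quoted above (the form GPT-2 trained with).
def gelu_approx(x):
    return 0.5 * x * (1.0 + math.tanh(
        math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

print(gelu_approx(0.0))   # exactly 0 by construction
print(gelu_approx(3.0))   # close to 3: large positive inputs pass through
print(gelu_approx(-3.0))  # small negative value: non-zero, unlike ReLU
```

The small negative output for negative inputs is what gives GELU a non-zero gradient where ReLU is flat.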
@@ -618,10 +621,9 @@
"id": "1cd01662-14cb-43fd-bffd-2d702813de2d",
"metadata": {},
"source": [
"- As we can see, ReLU is a piecewise linear function that outputs the input directly if it is positive; otherwise, it outputs zero\n",
"- GELU is a smooth, non-linear function that approximates ReLU but with a non-zero gradient for negative values\n",
"\n",
"- Next, let's implement the small neural network module, `FeedForward`, that we will be using in the LLM's transformer block later:"
"- 显然,ReLU是一个分段线性函数,如果输入是正值,它直接原样输出;否则,输出为零。\n",
"- GELU是一个平滑的非线性函数,近似于ReLU,但输入为负值时,梯度不为0。\n",
"- 接下来,让我们实现小型神经网络模块 FeedForward,稍后我们将在LLM的Transformer block中使用它:"
]
},
{