diff --git a/ch04/01_main-chapter-code/ch04.ipynb b/ch04/01_main-chapter-code/ch04.ipynb index 3f69b6c..cb24802 100644 --- a/ch04/01_main-chapter-code/ch04.ipynb +++ b/ch04/01_main-chapter-code/ch04.ipynb @@ -5,7 +5,7 @@ "id": "ce9295b2-182b-490b-8325-83a67c4a001d", "metadata": {}, "source": [ - "# Chapter 4: Implementing a GPT model from Scratch To Generate Text " + "# 第 4 章:从零开始实现 GPT 模型以生成文本" ] }, { @@ -13,7 +13,7 @@ "id": "e7da97ed-e02f-4d7f-b68e-a0eba3716e02", "metadata": {}, "source": [ - "- In this chapter, we implement a GPT-like LLM architecture; the next chapter will focus on training this LLM" + "- 在本章中,我们将实现一个类似 GPT 的大型语言模型(LLM)架构;下一章则将聚焦于该模型的训练。" ] }, { @@ -29,7 +29,7 @@ "id": "53fe99ab-0bcf-4778-a6b5-6db81fb826ef", "metadata": {}, "source": [ - "## 4.1 Coding an LLM architecture" + "## 4.1 实现LLM架构" ] }, { @@ -37,10 +37,10 @@ "id": "ad72d1ff-d82d-4e33-a88e-3c1a8831797b", "metadata": {}, "source": [ - "- Chapter 1 discussed models like GPT and Llama, which generate words sequentially and are based on the decoder part of the original transformer architecture\n", - "- Therefore, these LLMs are often referred to as \"decoder-like\" LLMs\n", - "- Compared to conventional deep learning models, LLMs are larger, mainly due to their vast number of parameters, not the amount of code\n", - "- We'll see that many elements are repeated in an LLM's architecture" + "- 第1章讨论了GPT与Llama等模型,这些模型基于transformer架构的decoder部分,按顺序逐词生成文本。\n", + "- 因此,这些LLM经常被称为类decoder(decoder-like)LLM。\n", + "- 与传统的深度学习模型相比,LLM的规模更大,这主要是因为它们拥有庞大的参数量,而不是代码量。\n", + "- 我们将看到,LLM的架构中有许多元素是重复的。" ] }, { @@ -56,10 +56,16 @@ "id": "0d43f5e2-fb51-434a-b9be-abeef6b98d99", "metadata": {}, "source": [ - "- In previous chapters, we used small embedding dimensions for token inputs and outputs for ease of illustration, ensuring they fit on a single page\n", - "- In this chapter, we consider embedding and model sizes akin to a small GPT-2 model\n", - "- We'll specifically code the architecture of the smallest GPT-2 model (124 million parameters), as outlined in Radford et al.'s [Language Models are Unsupervised Multitask Learners](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf) (note that the initial report lists it as 117M parameters, but this was later corrected in the model weight repository)\n", - "- Chapter 6 will show how to load pretrained weights into our implementation, which will be compatible with model sizes of 345, 762, and 1542 million parameters" + "- 在前几章中,为了方便展示,我们使用了较小的嵌入(embedding)维度来处理token的输入和输出。\n", + "- 在本章中,我们将考虑与GPT2-small模型类似的嵌入和模型大小。\n", + "- 我们将具体实现最小的GPT-2模型(即GPT2-small,124M参数)的架构,如Radford等人在[《Language Models are Unsupervised Multitask Learners》](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf)中概述的那样(注意,GPT2-small的参数量曾被错误地统计为117M,后被更正为124M)。\n", + "- 第6章将展示如何将预训练权重加载到我们实现的GPT2中,并兼容345、762和1542M参数的模型大小。\n", + "\n", + "> 译者注:GPT2的论文《Language Models are Unsupervised Multitask Learners》中错误统计了GPT2系列模型的参数量,这一错误后续在模型权重仓库中被修正了。\n", + "> \n", + "> 错误的参数量:Small (117M)  Medium (345M)  Large (762M)  XL (1542M)\n", + ">\n", + "> 正确的参数量:Small (124M)  Medium (355M)  Large (774M)  XL (1558M)" ] }, { @@ -67,7 +73,7 @@ "id": "21baa14d-24b8-4820-8191-a2808f7fbabc", "metadata": {}, "source": [ - "- Configuration details for the 124 million parameter GPT-2 model include:" + "- 124M参数GPT-2模型的配置细节包括:" ] }, { @@ -78,11 +84,11 @@ "outputs": [], "source": [ "GPT_CONFIG_124M = {\n",
- " \"vocab_size\": 50257, # Vocabulary size\n", - " \"ctx_len\": 1024, # Context length\n", - " \"emb_dim\": 768, # Embedding dimension\n", - " \"n_heads\": 12, # Number of attention heads\n", - " \"n_layers\": 12, # Number of layers\n", + " \"vocab_size\": 50257, # 词表大小\n", + " \"ctx_len\": 1024, # 上下文长度\n", + " \"emb_dim\": 768, # 嵌入维度\n", + " \"n_heads\": 12, # 注意力头(attention heads)的数量\n", + " \"n_layers\": 12, # 模型层数\n", " \"drop_rate\": 0.1, # Dropout rate\n", " \"qkv_bias\": False # Query-Key-Value bias\n", "}" ] }, { @@ -93,14 +99,14 @@ "id": "c12fcd28-d210-4c57-8be6-06cfcd5d73a4", "metadata": {}, "source": [ - "- We use short variable names to avoid long lines of code later\n", - "- `\"vocab_size\"` indicates a vocabulary size of 50,257 words, supported by the BPE tokenizer discussed in Chapter 2\n", - "- `\"ctx_len\"` represents the model's maximum input token count, as enabled by positional embeddings covered in Chapter 2\n", - "- `\"emb_dim\"` is the embedding size for token inputs, converting each input token into a 768-dimensional vector\n", - "- `\"n_heads\"` is the number of attention heads in the multi-head attention mechanism implemented in Chapter 3\n", - "- `\"n_layers\"` is the number of transformer blocks within the model, which we'll implement in upcoming sections\n", - "- `\"drop_rate\"` is the dropout mechanism's intensity, discussed in Chapter 3; 0.1 means dropping 10% of hidden units during training to mitigate overfitting\n", - "- `\"qkv_bias\"` decides if the `Linear` layers in the multi-head attention mechanism (from Chapter 3) should include a bias vector when computing query (Q), key (K), and value (V) tensors; we'll disable this option, which is standard practice in modern LLMs; however, we'll revisit this later when loading pretrained GPT-2 weights from OpenAI into our reimplementation in Chapter 6" + "- 我们使用简短的变量名,以避免后续代码行过长\n", + "- \"vocab_size\" 表示词表大小为50257个词,由第二章中介绍的BPE tokenizer(分词器)支持\n", + "- \"ctx_len\" 表示模型支持输入的最大token数量,该数值由第二章中介绍的位置嵌入(positional embeddings)决定\n", + "- \"emb_dim\" 是对输入token的嵌入维度,这里会将输入的每个token都嵌入成768维的向量\n", + "- \"n_heads\" 是多头注意力机制中的注意力头数,这在第三章中实现过\n", + "- \"n_layers\" 是模型中transformer blocks的数量,我们将在接下来的部分中实现它。\n", + "- \"drop_rate\" 是第三章中讨论的dropout机制的强度;0.1表示在训练期间丢弃10%的隐藏神经元以缓解过拟合\n", + "- \"qkv_bias\" 决定第三章中的多头注意力机制中的Linear层在计算Query(Q),Key(K)和Value(V)张量时是否应包含偏置向量(bias);当代LLM通常不会启用这个选项,我们也不会;但在第六章中将OpenAI预训练的GPT-2权重加载到我们实现的模型时,会再次讨论此选项。" ] }, { @@ -128,11 +134,11 @@ " self.pos_emb = nn.Embedding(cfg[\"ctx_len\"], cfg[\"emb_dim\"])\n", " self.drop_emb = nn.Dropout(cfg[\"drop_rate\"])\n", " \n", - " # Use a placeholder for TransformerBlock\n", + " # 先用占位符(placeholder)代替 TransformerBlock\n", " self.trf_blocks = nn.Sequential(\n", " *[DummyTransformerBlock(cfg) for _ in range(cfg[\"n_layers\"])])\n", " \n", - " # Use a placeholder for LayerNorm\n", + " # 先用占位符(placeholder)代替 LayerNorm\n", " self.final_norm = DummyLayerNorm(cfg[\"emb_dim\"])\n", " self.out_head = nn.Linear(\n", " cfg[\"emb_dim\"], cfg[\"vocab_size\"], bias=False\n", @@ -153,20 +159,20 @@ "class DummyTransformerBlock(nn.Module):\n", " def __init__(self, cfg):\n", " super().__init__()\n", - " # A simple placeholder\n", + " # 一个简单的占位符\n", "\n", " def forward(self, x):\n", - " # This block does nothing and just returns its input.\n", + " # 此模块不做任何处理,仅原样返回输入\n", " return x\n", "\n", "\n", "class DummyLayerNorm(nn.Module):\n", " def __init__(self, normalized_shape, eps=1e-5):\n", " super().__init__()\n", - " # The parameters here are just to mimic the LayerNorm interface.\n", + " # 这里的参数只是为了模拟 LayerNorm 接口。\n", "\n", " def forward(self, x):\n",
- " # This layer does nothing and just returns its input.\n", + " # 此层不做任何处理,仅原样返回输入\n", " return x" ] }, { @@ -248,7 +254,7 @@ "id": "f8332a00-98da-4eb4-b882-922776a89917", "metadata": {}, "source": [ - "## 4.2 Normalizing activations with layer normalization" + "## 4.2 对激活值进行层归一化" ] }, { @@ -256,9 +262,9 @@ "id": "066cfb81-d59b-4d95-afe3-e43cf095f292", "metadata": {}, "source": [ - "- Layer normalization, also known as LayerNorm ([Ba et al. 2016](https://arxiv.org/abs/1607.06450)), centers the activations of a neural network layer around a mean of 0 and normalizes their variance to 1\n", - "- This stabilizes training and enables faster convergence to effective weights\n", - "- Layer normalization is applied both before and after the multi-head attention module within the transformer block, which we will implement later; it's also applied before the final output layer" + "- 层归一化(Layer normalization),也叫 LayerNorm ([Ba et al. 2016](https://arxiv.org/abs/1607.06450)),会将神经网络层的激活值规范到均值为0,并将其方差归一化为1。\n", + "- 这可以稳定训练过程,并使模型更快地收敛到有效的权重。\n", + "- 在Transformer block中,多头注意力模块的前后都会应用LayerNorm(我们稍后会实现该模块);在最终输出层之前也会应用LayerNorm。" ] }, { @@ -274,7 +280,7 @@ "id": "5ab49940-6b35-4397-a80e-df8d092770a7", "metadata": {}, "source": [ - "- Let's see how layer normalization works by passing a small input sample through a simple neural network layer:" + "- 让我们将一个小的输入样本传入一个简单的神经网络层,看看层归一化是如何工作的:" ] }, { @@ -296,7 +302,7 @@ "source": [ "torch.manual_seed(123)\n", "\n", - "# create 2 training examples with 5 dimensions (features) each\n", + "# 创建两个训练样例,每个样例有5个维度(特征)\n", "batch_example = torch.randn(2, 5) \n", "\n", "layer = nn.Sequential(nn.Linear(5, 6), nn.ReLU())\n", "out = layer(batch_example)\n", "print(out)" ] }, { @@ -309,7 +315,7 @@ "id": "8fccc29e-71fc-4c16-898c-6137c6ea5d2e", "metadata": {}, "source": [ - "- Let's compute the mean and variance for each of the 2 inputs above:" + "- 计算上面两个输入的均值和方差:" ] }, { @@ -344,7 +350,7 @@ "id": "052eda3e-b395-48c4-acd4-eb8083bab958", "metadata": {}, "source": [ - "- The normalization is applied to each of the two inputs (rows) independently; using dim=-1 applies the calculation across the last dimension (in this case, the feature dimension) instead of the row dimension" + "- 归一化是对两个输入(即每一行)分别独立进行的;使用`dim=-1`表示在最后一个维度(此处为特征维度)而不是行维度上进行计算" ] }, { @@ -360,7 +366,7 @@ "id": "9f8ecbc7-eb14-4fa1-b5d0-7e1ff9694f99", "metadata": {}, "source": [ - "- Subtracting the mean and dividing by the square-root of the variance (standard deviation) centers the inputs to have a mean of 0 and a variance of 1 across the column (feature) dimension:" + "- 减去均值并除以方差的平方根(标准差)会使输入在列(特征)维度上的均值为0,方差为1:" ] }, { @@ -401,7 +407,7 @@ "id": "ac62b90c-7156-4979-9a79-ce1fb92969c1", "metadata": {}, "source": [ - "- Each input is centered at 0 and has a unit variance of 1; to improve readability, we can disable PyTorch's scientific notation:" + "- 每个输入的均值都为0,方差都为1;为了提高可读性,我们可以关闭PyTorch的科学计数法:" ] }, { @@ -434,8 +440,8 @@ "id": "944fb958-d4ed-43cc-858d-00052bb6b31a", "metadata": {}, "source": [ - "- Above, we normalized the features of each input\n", - "- Now, using the same idea, we can implement a `LayerNorm` class:" + "- 在上面,我们对每个输入的特征进行了归一化\n", + "- 现在,用相同的思路,我们可以实现一个`LayerNorm`类:" ] }, { @@ -464,20 +470,18 @@ "id": "e56c3908-7544-4808-b8cb-5d0a55bcca72", "metadata": {}, "source": [ - "**Scale and shift**\n", + "**缩放和偏移**\n", + "- 注意,除了通过减去均值并除以方差执行归一化之外,我们还添加了两个可训练参数,一个是 `scale`,另一个是 `shift`。\n", + "- 初始的 scale(乘以1)和 shift(加0)值没有任何效果;然而,scale 和 shift 是可训练的参数,如果确定这样做可以改善模型在训练任务上的性能,LLM 在训练过程中会自动调整它们。\n", + "- 这使得模型能够学习适合其处理数据的适当缩放和偏移。\n",
+ "- 注意,在计算方差的平方根之前,我们还添加了一个较小的值(eps);这是为了避免在方差为0时发生分母为0的问题。\n", "\n", + "**有偏方差**\n", + "- 在上面的方差计算中,设置 `unbiased=False` 意味着用 $\frac{\sum_i (x_i - \bar{x})^2}{n}$ 来计算方差,其中 n 是样本大小(在这里是特征或列数);这个公式不包括 Bessel 修正(分母是 n-1),因此得到的方差是有偏估计。\n", + "- 因为LLM的嵌入维度很高,所以使用 n 或 n-1 (有偏或无偏)的区别不大。\n", + "- 但 GPT-2 的LayerNorm是用有偏方差训练的,因此我们也采用这一设置,以便与后续章节中将要加载的预训练权重保持兼容。\n", "\n", - "- Note that in addition to performing the normalization by subtracting the mean and dividing by the variance, we added two trainable parameters, a `scale` and a `shift` parameter\n", - "- The initial `scale` (multiplying by 1) and `shift` (adding 0) values don't have any effect; however, `scale` and `shift` are trainable parameters that the LLM automatically adjusts during training if it is determined that doing so would improve the model's performance on its training task\n", - "- This allows the model to learn appropriate scaling and shifting that best suit the data it is processing\n", - "- Note that we also add a smaller value (`eps`) before computing the square root of the variance; this is to avoid division-by-zero errors if the variance is 0\n", "\n", - "**Biased variance**\n", - "- In the variance calculation above, setting `unbiased=False` means using the formula $\frac{\sum_i (x_i - \bar{x})^2}{n}$ to compute the variance where n is the sample size (here, the number of features or columns); this formula does not include Bessel's correction (which uses `n-1` in the denominator), thus providing a biased estimate of the variance \n", - "- For LLMs, where the embedding dimension `n` is very large, the difference between using n and `n-1`\n", " is negligible\n", - "- However, GPT-2 was trained with a biased variance in the normalization layers, which is why we also adopted this setting for compatibility reasons with the pretrained weights that we will load in later chapters\n", "\n", - "- Let's now try out `LayerNorm` in practice:" + "- 现在让我们在实践中试用一下 `LayerNorm`:" ] }, { @@ -531,7 +535,7 @@ "id": "11190e7d-8c29-4115-824a-e03702f9dd54", "metadata": {}, "source": [ - "## 4.3 Implementing a feed forward network with GELU activations" + "## 4.3 使用GELU激活函数实现前馈神经网络" ] }, { @@ -539,11 +543,11 @@ "id": "b0585dfb-f21e-40e5-973f-2f63ad5cb169", "metadata": {}, "source": [ - "- In this section, we implement a small neural network submodule that is used as part of the transformer block in LLMs\n", - "- We start with the activation function\n", - "- In deep learning, ReLU (Rectified Linear Unit) activation functions are commonly used due to their simplicity and effectiveness in various neural network architectures\n", - "- In LLMs, various other types of activation functions are used beyond the traditional ReLU; two notable examples are GELU (Gaussian Error Linear Unit) and SwiGLU (Sigmoid-Weighted Linear Unit)\n", - "- GELU and SwiGLU are more complex, smooth activation functions incorporating Gaussian and sigmoid-gated linear units, respectively, offering better performance for deep learning models, unlike the simpler, piecewise linear function of ReLU" + "- 在这一节中,我们将实现一个小型神经网络子模块,该模块是LLM中Transformer block的组成部分\n", + "- 我们从激活函数开始\n", + "- 在深度学习中,ReLU(Rectified Linear Unit)激活函数因其简单且在各种神经网络架构中行之有效而被广泛使用\n", + "- 在LLM中,除了传统的ReLU之外,还会使用其他类型的激活函数;其中两个值得注意的例子是GELU(Gaussian Error Linear Unit)和SwiGLU(Sigmoid-Weighted Linear Unit)\n", + "- 与ReLU简单的分段线性函数不同,GELU和SwiGLU是更复杂、平滑的激活函数,它们分别结合了高斯门控线性单元和sigmoid门控线性单元,能为深度学习模型带来更好的性能" ] }, { @@ -551,9 +555,8 @@ "id": "7d482ce7-e493-4bfc-a820-3ea99f564ebc", "metadata": {}, "source": [
- "- GELU ([Hendrycks and Gimpel 2016](https://arxiv.org/abs/1606.08415)) can be implemented in several ways; the exact version is defined as GELU(x)=x⋅Φ(x), where Φ(x) is the cumulative distribution function of the standard Gaussian distribution.\n", - "- In practice, it's common to implement a computationally cheaper approximation: $\text{GELU}(x) \approx 0.5 \cdot x \cdot \left(1 + \tanh\left[\sqrt{\frac{2}{\pi}} \cdot \left(x + 0.044715 \cdot x^3\right)\right]\right)\n", "$ (the original GPT-2 model was also trained with this approximation)" + "- GELU ([Hendrycks and Gimpel 2016](https://arxiv.org/abs/1606.08415))有多种实现方式;其精确版本定义为$\text{GELU}(x)=x\cdot \Phi(x)$,其中$\Phi(x)$是标准高斯分布的累积分布函数。\n", + "- 在实际应用中,常常采用计算成本较低的近似形式:$\text{GELU}(x) \approx 0.5 \cdot x \cdot \left(1 + \tanh\left[\sqrt{\frac{2}{\pi}} \cdot \left(x + 0.044715 \cdot x^3\right)\right]\right)$(原始的GPT-2模型也是使用这个近似形式进行训练的)。" ] }, { @@ -618,10 +621,9 @@ "id": "1cd01662-14cb-43fd-bffd-2d702813de2d", "metadata": {}, "source": [ - "- As we can see, ReLU is a piecewise linear function that outputs the input directly if it is positive; otherwise, it outputs zero\n", - "- GELU is a smooth, non-linear function that approximates ReLU but with a non-zero gradient for negative values\n", "\n", - "- Next, let's implement the small neural network module, `FeedForward`, that we will be using in the LLM's transformer block later:" + "- 可以看到,ReLU是一个分段线性函数,如果输入是正值,它直接原样输出;否则,输出为零。\n", + "- GELU是一个平滑的非线性函数,近似于ReLU,但输入为负值时,梯度不为0。\n", + "- 接下来,让我们实现小型神经网络模块 `FeedForward`,稍后我们将在LLM的Transformer block中使用它:" ] }, {
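> 注:以下是一段补充的示意代码(非原文内容,仅基于上文给出的两个公式做简单对照,供参考),用于直观比较 GELU 的精确形式($x\cdot\Phi(x)$,借助 `torch.erf` 计算)与 GPT-2 所用的 tanh 近似形式。

```python
import math
import torch

def gelu_exact(x):
    # 精确形式:GELU(x) = x * Φ(x),Φ 为标准正态分布的累积分布函数
    return x * 0.5 * (1.0 + torch.erf(x / math.sqrt(2.0)))

def gelu_tanh_approx(x):
    # 上文给出的 tanh 近似形式(GPT-2 训练时使用的版本)
    return 0.5 * x * (1.0 + torch.tanh(
        math.sqrt(2.0 / math.pi) * (x + 0.044715 * x**3)
    ))

x = torch.linspace(-3.0, 3.0, 7)
print(gelu_exact(x))
print(gelu_tanh_approx(x))
# 两种形式的差异很小,通常在 1e-3 量级或更低
print(torch.max(torch.abs(gelu_exact(x) - gelu_tanh_approx(x))))
```

> 在较新版本的 PyTorch 中,`torch.nn.GELU` 默认即为精确形式,也可以通过 `approximate="tanh"` 选用上述近似。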