mirror of
https://github.com/datawhalechina/llms-from-scratch-cn.git
synced 2026-05-01 11:58:17 +08:00
add the new chapters for part 4
This commit is contained in:
parent
9633433eb0
commit
fcf6ca84ff
489
Translated_Book/ch04/4.6 编码GPT模型.ipynb
Normal file
{
"cells": [
{
"cell_type": "markdown",
"id": "46618527-15ac-4c32-ad85-6cfea83e006e",
"metadata": {},
"source": [
"## 4.6 Coding the GPT model"
]
},
{
"cell_type": "markdown",
"id": "dec7d03d-9ff3-4ca3-ad67-01b67c2f5457",
"metadata": {},
"source": [
"- We are almost there: now let's plug the transformer block into the architecture we coded at the beginning of this chapter so that we obtain a usable GPT architecture\n",
"- Note that the transformer block is repeated multiple times; in the case of the smallest 124M GPT-2 model, we repeat it 12 times:"
]
},
{
"cell_type": "markdown",
"id": "9b7b362d-f8c5-48d2-8ebd-722480ac5073",
"metadata": {},
"source": [
"<img src=\"https://sebastianraschka.com/images/LLMs-from-scratch-images/ch04_compressed/15.webp\" width=\"100%\">"
]
},
{
"cell_type": "markdown",
"id": "324e4b5d-ed89-4fdf-9a52-67deee0593bc",
"metadata": {},
"source": [
"- The corresponding code implementation, where `cfg[\"n_layers\"] = 12`:"
]
},
{
"cell_type": "code",
"execution_count": 23,
"id": "c61de39c-d03c-4a32-8b57-f49ac3834857",
"metadata": {},
"outputs": [],
"source": [
"class GPTModel(nn.Module):\n",
"    def __init__(self, cfg):\n",
"        super().__init__()\n",
"        self.tok_emb = nn.Embedding(cfg[\"vocab_size\"], cfg[\"emb_dim\"])\n",
"        self.pos_emb = nn.Embedding(cfg[\"context_length\"], cfg[\"emb_dim\"])\n",
"        self.drop_emb = nn.Dropout(cfg[\"drop_rate\"])\n",
"\n",
"        self.trf_blocks = nn.Sequential(\n",
"            *[TransformerBlock(cfg) for _ in range(cfg[\"n_layers\"])])\n",
"\n",
"        self.final_norm = LayerNorm(cfg[\"emb_dim\"])\n",
"        self.out_head = nn.Linear(\n",
"            cfg[\"emb_dim\"], cfg[\"vocab_size\"], bias=False\n",
"        )\n",
"\n",
"    def forward(self, in_idx):\n",
"        batch_size, seq_len = in_idx.shape\n",
"        tok_embeds = self.tok_emb(in_idx)\n",
"        pos_embeds = self.pos_emb(torch.arange(seq_len, device=in_idx.device))\n",
"        x = tok_embeds + pos_embeds  # Shape [batch_size, num_tokens, emb_size]\n",
"        x = self.drop_emb(x)\n",
"        x = self.trf_blocks(x)\n",
"        x = self.final_norm(x)\n",
"        logits = self.out_head(x)\n",
"        return logits"
]
},
{
"cell_type": "markdown",
"id": "2750270f-c45d-4410-8767-a6adbd05d5c3",
"metadata": {},
"source": [
"- Using the configuration of the 124M parameter model, we can now instantiate this GPT model with random initial weights as follows:"
]
},
{
"cell_type": "code",
"execution_count": 24,
"id": "ef94fd9c-4e9d-470d-8f8e-dd23d1bb1f64",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Input batch:\n",
" tensor([[6109, 3626, 6100, 345],\n",
"         [6109, 1110, 6622, 257]])\n",
"\n",
"Output shape: torch.Size([2, 4, 50257])\n",
"tensor([[[ 0.3613, 0.4222, -0.0711, ..., 0.3483, 0.4661, -0.2838],\n",
"         [-0.1792, -0.5660, -0.9485, ..., 0.0477, 0.5181, -0.3168],\n",
"         [ 0.7120, 0.0332, 0.1085, ..., 0.1018, -0.4327, -0.2553],\n",
"         [-1.0076, 0.3418, -0.1190, ..., 0.7195, 0.4023, 0.0532]],\n",
"\n",
"        [[-0.2564, 0.0900, 0.0335, ..., 0.2659, 0.4454, -0.6806],\n",
"         [ 0.1230, 0.3653, -0.2074, ..., 0.7705, 0.2710, 0.2246],\n",
"         [ 1.0558, 1.0318, -0.2800, ..., 0.6936, 0.3205, -0.3178],\n",
"         [-0.1565, 0.3926, 0.3288, ..., 1.2630, -0.1858, 0.0388]]],\n",
"       grad_fn=<UnsafeViewBackward0>)\n"
]
}
],
"source": [
"torch.manual_seed(123)\n",
"model = GPTModel(GPT_CONFIG_124M)\n",
"\n",
"out = model(batch)\n",
"print(\"Input batch:\\n\", batch)\n",
"print(\"\\nOutput shape:\", out.shape)\n",
"print(out)"
]
},
{
"cell_type": "markdown",
"id": "6d616e7a-568b-4921-af29-bd3f4683cd2e",
"metadata": {},
"source": [
"\n",
"- We will train this model in the next chapter\n",
"- However, a quick note about its size: we previously referred to it as a 124M parameter model; we can double check this number as follows:\n"
]
},
{
"cell_type": "code",
"execution_count": 25,
"id": "84fb8be4-9d3b-402b-b3da-86b663aac33a",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Total number of parameters: 163,009,536\n"
]
}
],
"source": [
"total_params = sum(p.numel() for p in model.parameters())\n",
"print(f\"Total number of parameters: {total_params:,}\")"
]
},
{
"cell_type": "markdown",
"id": "b67d13dd-dd01-4ba6-a2ad-31ca8a9fd660",
"metadata": {},
"source": [
"- In the original GPT-2 paper, the researchers applied weight tying, which means that they reused the token embedding layer (`tok_emb`) as the output layer, that is, setting `self.out_head.weight = self.tok_emb.weight`\n",
"- The token embedding layer projects the 50,257-dimensional one-hot encoded input tokens to a 768-dimensional embedding representation\n",
"- The output layer projects the 768-dimensional embeddings back into a 50,257-dimensional representation so that we can convert these back into words (more on this in the next section)\n",
"- Hence, the embedding and output layer have the same number of weight parameters, as we can see from the shapes of their weight matrices\n",
"- However, a quick note about its size: we previously referred to it as a 124M parameter model; we can check this number as follows"
]
},
{
"cell_type": "code",
"execution_count": 26,
"id": "e3b43233-e9b8-4f5a-b72b-a263ec686982",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Token embedding layer shape: torch.Size([50257, 768])\n",
"Output layer shape: torch.Size([50257, 768])\n"
]
}
],
"source": [
"print(\"Token embedding layer shape:\", model.tok_emb.weight.shape)\n",
"print(\"Output layer shape:\", model.out_head.weight.shape)"
]
},
{
"cell_type": "markdown",
"id": "f02259f6-6f79-4c89-a866-4ebeae1c3289",
"metadata": {},
"source": [
"\n",
"- In the original GPT-2 paper, the researchers reused the token embedding matrix as the output matrix\n",
"- Correspondingly, if we subtract the number of parameters of the output layer, we get a 124M parameter model:\n"
]
},
{
"cell_type": "code",
"execution_count": 27,
"id": "95a22e02-50d3-48b3-a4e0-d9863343c164",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Number of trainable parameters considering weight tying: 124,412,160\n"
]
}
],
"source": [
"total_params_gpt2 = total_params - sum(p.numel() for p in model.out_head.parameters())\n",
"print(f\"Number of trainable parameters considering weight tying: {total_params_gpt2:,}\")"
]
},
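{
"cell_type": "markdown",
"id": "wt-sketch-markdown-0001",
"metadata": {},
"source": [
"- As an added illustration (a sketch, not part of the original chapter code; it assumes the `model` instance from above), weight tying could be applied by pointing the output head at the token-embedding weights; counting the parameters afterwards then reproduces the 124M figure, because `model.parameters()` skips duplicate (shared) tensors:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "wt-sketch-code-0001",
"metadata": {},
"outputs": [],
"source": [
"# Sketch: tie the output head to the token embedding (both have shape (50257, 768))\n",
"model.out_head.weight = model.tok_emb.weight\n",
"\n",
"# model.parameters() deduplicates shared tensors, so the tied weight is counted once\n",
"tied_params = sum(p.numel() for p in model.parameters())\n",
"print(f\"Trainable parameters with weight tying applied: {tied_params:,}\")"
]
},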
{
"cell_type": "markdown",
"id": "40b03f80-b94c-46e7-9d42-d0df399ff3db",
"metadata": {},
"source": [
"\n",
"- In practice, I have found it easier to train models without weight tying, which is why we did not implement it here\n",
"- However, we will revisit and apply this weight-tying idea when we load the pretrained weights in chapter 5\n",
"- Lastly, we can compute the memory requirements of the model as follows, which can serve as a useful reference point:\n"
]
},
{
"cell_type": "code",
"execution_count": 28,
"id": "5131a752-fab8-4d70-a600-e29870b33528",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Total size of the model: 621.83 MB\n"
]
}
],
"source": [
"# Calculate the total size in bytes (assuming float32, 4 bytes per parameter)\n",
"total_size_bytes = total_params * 4\n",
"\n",
"# Convert to megabytes\n",
"total_size_mb = total_size_bytes / (1024 * 1024)\n",
"\n",
"print(f\"Total size of the model: {total_size_mb:.2f} MB\")"
]
},
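{
"cell_type": "markdown",
"id": "mem-sketch-markdown-0001",
"metadata": {},
"source": [
"- As a brief aside (an added sketch, not from the original text): the figure above assumes float32; in 16-bit precision (float16 or bfloat16, 2 bytes per parameter) the same weights would take half the memory:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "mem-sketch-code-0001",
"metadata": {},
"outputs": [],
"source": [
"# Sketch: same parameter count, but 2 bytes per parameter instead of 4\n",
"total_size_16bit_mb = total_params * 2 / (1024 * 1024)\n",
"print(f\"Total size of the model in 16-bit precision: {total_size_16bit_mb:.2f} MB\")"
]
},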
{
"cell_type": "markdown",
"id": "309a3be4-c20a-4657-b4e0-77c97510b47c",
"metadata": {},
"source": [
"- Exercise: you can try the other configurations below, which are also referenced in the [GPT-2 paper](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf).\n",
"\n",
"- **GPT2-small** (the 124M configuration we already implemented):\n",
"    - \"emb_dim\" = 768\n",
"    - \"n_layers\" = 12\n",
"    - \"n_heads\" = 12\n",
"\n",
"- **GPT2-medium:**\n",
"    - \"emb_dim\" = 1024\n",
"    - \"n_layers\" = 24\n",
"    - \"n_heads\" = 16\n",
"\n",
"- **GPT2-large:**\n",
"    - \"emb_dim\" = 1280\n",
"    - \"n_layers\" = 36\n",
"    - \"n_heads\" = 20\n",
"\n",
"- **GPT2-XL:**\n",
"    - \"emb_dim\" = 1600\n",
"    - \"n_layers\" = 48\n",
"    - \"n_heads\" = 25"
]
},
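{
"cell_type": "markdown",
"id": "cfg-sketch-markdown-0001",
"metadata": {},
"source": [
"- For convenience, the exercise configurations above can be collected in a dictionary of overrides (a sketch; the name `model_configs` is an assumption, and it relies on the `GPT_CONFIG_124M` dictionary and `GPTModel` class from earlier in this chapter):"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cfg-sketch-code-0001",
"metadata": {},
"outputs": [],
"source": [
"# Sketch: each entry only lists the values that differ from the 124M base config\n",
"model_configs = {\n",
"    \"gpt2-small\": {\"emb_dim\": 768, \"n_layers\": 12, \"n_heads\": 12},\n",
"    \"gpt2-medium\": {\"emb_dim\": 1024, \"n_layers\": 24, \"n_heads\": 16},\n",
"    \"gpt2-large\": {\"emb_dim\": 1280, \"n_layers\": 36, \"n_heads\": 20},\n",
"    \"gpt2-xl\": {\"emb_dim\": 1600, \"n_layers\": 48, \"n_heads\": 25},\n",
"}\n",
"\n",
"# e.g., instantiate GPT2-medium by merging the overrides into the base config\n",
"GPT_CONFIG_MEDIUM = {**GPT_CONFIG_124M, **model_configs[\"gpt2-medium\"]}\n",
"model_medium = GPTModel(GPT_CONFIG_MEDIUM)"
]
},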
{
"cell_type": "markdown",
"id": "da5d9bc0-95ab-45d4-9378-417628d86e35",
"metadata": {},
"source": [
"## 4.7 Generating text"
]
},
{
"cell_type": "markdown",
"id": "48da5deb-6ee0-4b9b-8dd2-abed7ed65172",
"metadata": {},
"source": [
"- LLMs like the GPT model we implemented above are used to generate one word at a time"
]
},
{
"cell_type": "markdown",
"id": "caade12a-fe97-480f-939c-87d24044edff",
"metadata": {},
"source": [
"<img src=\"https://sebastianraschka.com/images/LLMs-from-scratch-images/ch04_compressed/16.webp\" width=\"400px\">"
]
},
{
"cell_type": "markdown",
"id": "a7061524-a3bd-4803-ade6-2e3b7b79ac13",
"metadata": {},
"source": [
"- The following `generate_text_simple` function implements greedy decoding, which is a simple and fast method to generate text\n",
"- In greedy decoding, at each step, the model chooses the word (or token) with the highest probability as its next output (the highest logit corresponds to the highest probability, so we technically wouldn't even have to compute the softmax function explicitly)\n",
"- In the next chapter, we will implement a more advanced `generate_text` function\n",
"- The figure below depicts how the GPT model, given an input context, generates the next word token"
]
},
{
"cell_type": "markdown",
"id": "7ee0f32c-c18c-445e-b294-a879de2aa187",
"metadata": {},
"source": [
"<img src=\"https://sebastianraschka.com/images/LLMs-from-scratch-images/ch04_compressed/17.webp\" width=\"600px\">"
]
},
{
"cell_type": "code",
"execution_count": 29,
"id": "c9b428a9-8764-4b36-80cd-7d4e00595ba6",
"metadata": {},
"outputs": [],
"source": [
"def generate_text_simple(model, idx, max_new_tokens, context_size):\n",
"    # idx is (batch, n_tokens) array of indices in the current context\n",
"    for _ in range(max_new_tokens):\n",
"\n",
"        # Crop current context if it exceeds the supported context size\n",
"        # E.g., if LLM supports only 5 tokens, and the context size is 10\n",
"        # then only the last 5 tokens are used as context\n",
"        idx_cond = idx[:, -context_size:]\n",
"\n",
"        # Get the predictions\n",
"        with torch.no_grad():\n",
"            logits = model(idx_cond)\n",
"\n",
"        # Focus only on the last time step\n",
"        # (batch, n_tokens, vocab_size) becomes (batch, vocab_size)\n",
"        logits = logits[:, -1, :]\n",
"\n",
"        # Apply softmax to get probabilities\n",
"        probas = torch.softmax(logits, dim=-1)  # (batch, vocab_size)\n",
"\n",
"        # Get the idx of the vocab entry with the highest probability value\n",
"        idx_next = torch.argmax(probas, dim=-1, keepdim=True)  # (batch, 1)\n",
"\n",
"        # Append sampled index to the running sequence\n",
"        idx = torch.cat((idx, idx_next), dim=1)  # (batch, n_tokens+1)\n",
"\n",
"    return idx"
]
},
{
"cell_type": "markdown",
"id": "6515f2c1-3cc7-421c-8d58-cc2f563b7030",
"metadata": {},
"source": [
"- The `generate_text_simple` above implements an iterative process, where it creates one token at a time\n",
"\n",
"<img src=\"https://sebastianraschka.com/images/LLMs-from-scratch-images/ch04_compressed/18.webp\" width=\"600px\">"
]
},
{
"cell_type": "markdown",
"id": "f682eac4-f9bd-438b-9dec-6b1cc7bc05ce",
"metadata": {},
"source": [
"- Let's prepare an input example:"
]
},
{
"cell_type": "code",
"execution_count": 30,
"id": "3d7e3e94-df0f-4c0f-a6a1-423f500ac1d3",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"encoded: [15496, 11, 314, 716]\n",
"encoded_tensor.shape: torch.Size([1, 4])\n"
]
}
],
"source": [
"start_context = \"Hello, I am\"\n",
"\n",
"encoded = tokenizer.encode(start_context)\n",
"print(\"encoded:\", encoded)\n",
"\n",
"encoded_tensor = torch.tensor(encoded).unsqueeze(0)\n",
"print(\"encoded_tensor.shape:\", encoded_tensor.shape)"
]
},
{
"cell_type": "code",
"execution_count": 31,
"id": "a72a9b60-de66-44cf-b2f9-1e638934ada4",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Output: tensor([[15496, 11, 314, 716, 27018, 24086, 47843, 30961, 42348, 7267]])\n",
"Output length: 10\n"
]
}
],
"source": [
"model.eval()  # disable dropout\n",
"\n",
"out = generate_text_simple(\n",
"    model=model,\n",
"    idx=encoded_tensor,\n",
"    max_new_tokens=6,\n",
"    context_size=GPT_CONFIG_124M[\"context_length\"]\n",
")\n",
"\n",
"print(\"Output:\", out)\n",
"print(\"Output length:\", len(out[0]))"
]
},
{
"cell_type": "markdown",
"id": "1d131c00-1787-44ba-bec3-7c145497b2c3",
"metadata": {},
"source": [
"- Remove batch dimension and convert back into text:"
]
},
{
"cell_type": "code",
"execution_count": 32,
"id": "053d99f6-5710-4446-8d52-117fb34ea9f6",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Hello, I am Featureiman Byeswickattribute argue\n"
]
}
],
"source": [
"decoded_text = tokenizer.decode(out.squeeze(0).tolist())\n",
"print(decoded_text)"
]
},
{
"cell_type": "markdown",
"id": "9a894003-51f6-4ccc-996f-3b9c7d5a1d70",
"metadata": {},
"source": [
"- Note that the model is untrained; hence the random output texts above\n",
"- We will train the model in the next chapter"
]
},
{
"cell_type": "markdown",
"id": "a35278b6-9e5c-480f-83e5-011a1173648f",
"metadata": {},
"source": [
"## Summary and takeaways\n",
"\n",
"- See the [./gpt.py](./gpt.py) script, a self-contained script containing the GPT model we implement in this Jupyter notebook\n",
"- You can find the exercise solutions in [./exercise-solutions.ipynb](./exercise-solutions.ipynb)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.5"
}
},
"nbformat": 4,
"nbformat_minor": 5
}