mirror of https://github.com/datawhalechina/llms-from-scratch-cn.git
synced 2026-05-01 11:58:17 +08:00

add the new chapters for part 4

This commit is contained in:
parent 6f214b20d5
commit 9633433eb0

489
Translated_Book/ch04/4.6 编码GPT模型-Copy1.ipynb
Normal file
@ -0,0 +1,489 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "46618527-15ac-4c32-ad85-6cfea83e006e",
   "metadata": {},
   "source": [
    "## 4.6 Coding the GPT model"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "dec7d03d-9ff3-4ca3-ad67-01b67c2f5457",
   "metadata": {},
   "source": [
    "- We are almost there: now let's plug the transformer block into the architecture we coded at the beginning of this chapter so that we obtain a usable GPT architecture\n",
    "- Note that the transformer block is repeated multiple times; in the case of the smallest 124M GPT-2 model, we repeat it 12 times:"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9b7b362d-f8c5-48d2-8ebd-722480ac5073",
   "metadata": {},
   "source": [
    "<img src=\"https://sebastianraschka.com/images/LLMs-from-scratch-images/ch04_compressed/15.webp\" width=\"100%\">"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "324e4b5d-ed89-4fdf-9a52-67deee0593bc",
   "metadata": {},
   "source": [
    "- The corresponding code implementation, where `cfg[\"n_layers\"] = 12`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "id": "c61de39c-d03c-4a32-8b57-f49ac3834857",
   "metadata": {},
   "outputs": [],
   "source": [
    "class GPTModel(nn.Module):\n",
    "    def __init__(self, cfg):\n",
    "        super().__init__()\n",
    "        self.tok_emb = nn.Embedding(cfg[\"vocab_size\"], cfg[\"emb_dim\"])\n",
    "        self.pos_emb = nn.Embedding(cfg[\"context_length\"], cfg[\"emb_dim\"])\n",
    "        self.drop_emb = nn.Dropout(cfg[\"drop_rate\"])\n",
    "\n",
    "        self.trf_blocks = nn.Sequential(\n",
    "            *[TransformerBlock(cfg) for _ in range(cfg[\"n_layers\"])])\n",
    "\n",
    "        self.final_norm = LayerNorm(cfg[\"emb_dim\"])\n",
    "        self.out_head = nn.Linear(\n",
    "            cfg[\"emb_dim\"], cfg[\"vocab_size\"], bias=False\n",
    "        )\n",
    "\n",
    "    def forward(self, in_idx):\n",
    "        batch_size, seq_len = in_idx.shape\n",
    "        tok_embeds = self.tok_emb(in_idx)\n",
    "        pos_embeds = self.pos_emb(torch.arange(seq_len, device=in_idx.device))\n",
    "        x = tok_embeds + pos_embeds  # Shape [batch_size, num_tokens, emb_size]\n",
    "        x = self.drop_emb(x)\n",
    "        x = self.trf_blocks(x)\n",
    "        x = self.final_norm(x)\n",
    "        logits = self.out_head(x)\n",
    "        return logits"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2750270f-c45d-4410-8767-a6adbd05d5c3",
   "metadata": {},
   "source": [
    "- Using the configuration of the 124M-parameter model, we can now instantiate this GPT model with random initial weights as follows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "id": "ef94fd9c-4e9d-470d-8f8e-dd23d1bb1f64",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Input batch:\n",
      " tensor([[6109, 3626, 6100,  345],\n",
      "        [6109, 1110, 6622,  257]])\n",
      "\n",
      "Output shape: torch.Size([2, 4, 50257])\n",
      "tensor([[[ 0.3613,  0.4222, -0.0711,  ...,  0.3483,  0.4661, -0.2838],\n",
      "         [-0.1792, -0.5660, -0.9485,  ...,  0.0477,  0.5181, -0.3168],\n",
      "         [ 0.7120,  0.0332,  0.1085,  ...,  0.1018, -0.4327, -0.2553],\n",
      "         [-1.0076,  0.3418, -0.1190,  ...,  0.7195,  0.4023,  0.0532]],\n",
      "\n",
      "        [[-0.2564,  0.0900,  0.0335,  ...,  0.2659,  0.4454, -0.6806],\n",
      "         [ 0.1230,  0.3653, -0.2074,  ...,  0.7705,  0.2710,  0.2246],\n",
      "         [ 1.0558,  1.0318, -0.2800,  ...,  0.6936,  0.3205, -0.3178],\n",
      "         [-0.1565,  0.3926,  0.3288,  ...,  1.2630, -0.1858,  0.0388]]],\n",
      "       grad_fn=<UnsafeViewBackward0>)\n"
     ]
    }
   ],
   "source": [
    "torch.manual_seed(123)\n",
    "model = GPTModel(GPT_CONFIG_124M)\n",
    "\n",
    "out = model(batch)\n",
    "print(\"Input batch:\\n\", batch)\n",
    "print(\"\\nOutput shape:\", out.shape)\n",
    "print(out)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6d616e7a-568b-4921-af29-bd3f4683cd2e",
   "metadata": {},
   "source": [
    "\n",
    "- We will train this model in the next chapter\n",
    "- However, a quick note about its size: we previously referred to it as a 124M-parameter model; we can double-check this number as follows:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "id": "84fb8be4-9d3b-402b-b3da-86b663aac33a",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Total number of parameters: 163,009,536\n"
     ]
    }
   ],
   "source": [
    "total_params = sum(p.numel() for p in model.parameters())\n",
    "print(f\"Total number of parameters: {total_params:,}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b67d13dd-dd01-4ba6-a2ad-31ca8a9fd660",
   "metadata": {},
   "source": [
    "- In the original GPT-2 paper, the researchers applied weight tying, which means that they reused the token embedding layer (`tok_emb`) as the output layer, i.e., they set `self.out_head.weight = self.tok_emb.weight`\n",
    "- The token embedding layer projects the 50,257-dimensional one-hot encoded input tokens to a 768-dimensional embedding representation\n",
    "- The output layer projects the 768-dimensional embeddings back into a 50,257-dimensional representation so that we can convert these back into words (more on that in the next section)\n",
    "- The embedding and output layers therefore have the same number of weight parameters, as we can see from the shapes of their weight matrices:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "id": "e3b43233-e9b8-4f5a-b72b-a263ec686982",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Token embedding layer shape: torch.Size([50257, 768])\n",
      "Output layer shape: torch.Size([50257, 768])\n"
     ]
    }
   ],
   "source": [
    "print(\"Token embedding layer shape:\", model.tok_emb.weight.shape)\n",
    "print(\"Output layer shape:\", model.out_head.weight.shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f02259f6-6f79-4c89-a866-4ebeae1c3289",
   "metadata": {},
   "source": [
    "\n",
    "- In the original GPT-2 paper, the researchers reused the token embedding matrix as the output matrix\n",
    "- Correspondingly, if we subtract the number of parameters of the output layer, we get a 124M-parameter model:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "id": "95a22e02-50d3-48b3-a4e0-d9863343c164",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of trainable parameters considering weight tying: 124,412,160\n"
     ]
    }
   ],
   "source": [
    "total_params_gpt2 = total_params - sum(p.numel() for p in model.out_head.parameters())\n",
    "print(f\"Number of trainable parameters considering weight tying: {total_params_gpt2:,}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "40b03f80-b94c-46e7-9d42-d0df399ff3db",
   "metadata": {},
   "source": [
    "\n",
    "- In practice, I found it easier to train the model without weight tying, which is why we did not implement it here\n",
    "- However, we will revisit and apply this weight-tying idea later when we load pretrained weights in chapter 5\n",
    "- Lastly, we can compute the memory requirements of the model as follows, which can serve as a useful reference point:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "id": "5131a752-fab8-4d70-a600-e29870b33528",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Total size of the model: 621.83 MB\n"
     ]
    }
   ],
   "source": [
    "# Calculate the total size in bytes (assuming float32, 4 bytes per parameter)\n",
    "total_size_bytes = total_params * 4\n",
    "\n",
    "# Convert to megabytes\n",
    "total_size_mb = total_size_bytes / (1024 * 1024)\n",
    "\n",
    "print(f\"Total size of the model: {total_size_mb:.2f} MB\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "309a3be4-c20a-4657-b4e0-77c97510b47c",
   "metadata": {},
   "source": [
    "- Exercise: you can try out the following other configurations, which are also referenced in the [GPT-2 paper](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf).\n",
    "\n",
    "- **GPT2-small** (the 124M configuration we already implemented):\n",
    "    - \"emb_dim\" = 768\n",
    "    - \"n_layers\" = 12\n",
    "    - \"n_heads\" = 12\n",
    "\n",
    "- **GPT2-medium:**\n",
    "    - \"emb_dim\" = 1024\n",
    "    - \"n_layers\" = 24\n",
    "    - \"n_heads\" = 16\n",
    "\n",
    "- **GPT2-large:**\n",
    "    - \"emb_dim\" = 1280\n",
    "    - \"n_layers\" = 36\n",
    "    - \"n_heads\" = 20\n",
    "\n",
    "- **GPT2-XL:**\n",
    "    - \"emb_dim\" = 1600\n",
    "    - \"n_layers\" = 48\n",
    "    - \"n_heads\" = 25"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "da5d9bc0-95ab-45d4-9378-417628d86e35",
   "metadata": {},
   "source": [
    "## 4.7 Generating text"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "48da5deb-6ee0-4b9b-8dd2-abed7ed65172",
   "metadata": {},
   "source": [
    "- LLMs like the GPT model we implemented above are used to generate one word at a time"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "caade12a-fe97-480f-939c-87d24044edff",
   "metadata": {},
   "source": [
    "<img src=\"https://sebastianraschka.com/images/LLMs-from-scratch-images/ch04_compressed/16.webp\" width=\"400px\">"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a7061524-a3bd-4803-ade6-2e3b7b79ac13",
   "metadata": {},
   "source": [
    "- The following `generate_text_simple` function implements greedy decoding, which is a simple and fast method to generate text\n",
    "- In greedy decoding, at each step, the model chooses the word (or token) with the highest probability as its next output (the highest logit corresponds to the highest probability, so we technically wouldn't even have to compute the softmax function explicitly)\n",
    "- In the next chapter, we will implement a more advanced `generate_text` function\n",
    "- The figure below depicts how the GPT model, given an input context, generates the next word token"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7ee0f32c-c18c-445e-b294-a879de2aa187",
   "metadata": {},
   "source": [
    "<img src=\"https://sebastianraschka.com/images/LLMs-from-scratch-images/ch04_compressed/17.webp\" width=\"600px\">"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "id": "c9b428a9-8764-4b36-80cd-7d4e00595ba6",
   "metadata": {},
   "outputs": [],
   "source": [
    "def generate_text_simple(model, idx, max_new_tokens, context_size):\n",
    "    # idx is (batch, n_tokens) array of indices in the current context\n",
    "    for _ in range(max_new_tokens):\n",
    "\n",
    "        # Crop current context if it exceeds the supported context size\n",
    "        # E.g., if LLM supports only 5 tokens, and the context size is 10\n",
    "        # then only the last 5 tokens are used as context\n",
    "        idx_cond = idx[:, -context_size:]\n",
    "\n",
    "        # Get the predictions\n",
    "        with torch.no_grad():\n",
    "            logits = model(idx_cond)\n",
    "\n",
    "        # Focus only on the last time step\n",
    "        # (batch, n_tokens, vocab_size) becomes (batch, vocab_size)\n",
    "        logits = logits[:, -1, :]\n",
    "\n",
    "        # Apply softmax to get probabilities\n",
    "        probas = torch.softmax(logits, dim=-1)  # (batch, vocab_size)\n",
    "\n",
    "        # Get the idx of the vocab entry with the highest probability value\n",
    "        idx_next = torch.argmax(probas, dim=-1, keepdim=True)  # (batch, 1)\n",
    "\n",
    "        # Append sampled index to the running sequence\n",
    "        idx = torch.cat((idx, idx_next), dim=1)  # (batch, n_tokens+1)\n",
    "\n",
    "    return idx"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6515f2c1-3cc7-421c-8d58-cc2f563b7030",
   "metadata": {},
   "source": [
    "- The `generate_text_simple` above implements an iterative process, where it creates one token at a time\n",
    "\n",
    "<img src=\"https://sebastianraschka.com/images/LLMs-from-scratch-images/ch04_compressed/18.webp\" width=\"600px\">"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f682eac4-f9bd-438b-9dec-6b1cc7bc05ce",
   "metadata": {},
   "source": [
    "- Let's prepare an input example:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "id": "3d7e3e94-df0f-4c0f-a6a1-423f500ac1d3",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "encoded: [15496, 11, 314, 716]\n",
      "encoded_tensor.shape: torch.Size([1, 4])\n"
     ]
    }
   ],
   "source": [
    "start_context = \"Hello, I am\"\n",
    "\n",
    "encoded = tokenizer.encode(start_context)\n",
    "print(\"encoded:\", encoded)\n",
    "\n",
    "encoded_tensor = torch.tensor(encoded).unsqueeze(0)\n",
    "print(\"encoded_tensor.shape:\", encoded_tensor.shape)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "id": "a72a9b60-de66-44cf-b2f9-1e638934ada4",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Output: tensor([[15496,    11,   314,   716, 27018, 24086, 47843, 30961, 42348,  7267]])\n",
      "Output length: 10\n"
     ]
    }
   ],
   "source": [
    "model.eval()  # disable dropout\n",
    "\n",
    "out = generate_text_simple(\n",
    "    model=model,\n",
    "    idx=encoded_tensor,\n",
    "    max_new_tokens=6,\n",
    "    context_size=GPT_CONFIG_124M[\"context_length\"]\n",
    ")\n",
    "\n",
    "print(\"Output:\", out)\n",
    "print(\"Output length:\", len(out[0]))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1d131c00-1787-44ba-bec3-7c145497b2c3",
   "metadata": {},
   "source": [
    "- Remove batch dimension and convert back into text:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "id": "053d99f6-5710-4446-8d52-117fb34ea9f6",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Hello, I am Featureiman Byeswickattribute argue\n"
     ]
    }
   ],
   "source": [
    "decoded_text = tokenizer.decode(out.squeeze(0).tolist())\n",
    "print(decoded_text)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9a894003-51f6-4ccc-996f-3b9c7d5a1d70",
   "metadata": {},
   "source": [
    "- Note that the model is untrained; hence the random output texts above\n",
    "- We will train the model in the next chapter"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a35278b6-9e5c-480f-83e5-011a1173648f",
   "metadata": {},
   "source": [
    "## Summary and takeaways\n",
    "\n",
    "- See the [./gpt.py](./gpt.py) script, a self-contained script containing the GPT model we implement in this Jupyter notebook\n",
    "- You can find the exercise solutions in [./exercise-solutions.ipynb](./exercise-solutions.ipynb)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
228
Translated_Book/ch04/4.7 生成文本.ipynb
Normal file
@ -0,0 +1,228 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "da5d9bc0-95ab-45d4-9378-417628d86e35",
   "metadata": {},
   "source": [
    "## 4.7 Generating text"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "48da5deb-6ee0-4b9b-8dd2-abed7ed65172",
   "metadata": {},
   "source": [
    "- LLMs like the GPT model we implemented above are used to generate one word at a time"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "caade12a-fe97-480f-939c-87d24044edff",
   "metadata": {},
   "source": [
    "<img src=\"https://sebastianraschka.com/images/LLMs-from-scratch-images/ch04_compressed/16.webp\" width=\"100%\">"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a7061524-a3bd-4803-ade6-2e3b7b79ac13",
   "metadata": {},
   "source": [
    "\n",
    "- The following `generate_text_simple` function implements greedy decoding, which is a simple and fast method to generate text\n",
    "- In greedy decoding, at each step, the model chooses the word (or token) with the highest probability as its next output (the highest logit corresponds to the highest probability, so we technically wouldn't even have to compute the softmax function explicitly)\n",
    "- In the next chapter, we will implement a more advanced `generate_text` function\n",
    "- The figure below depicts how the GPT model, given an input context, generates the next word token\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7ee0f32c-c18c-445e-b294-a879de2aa187",
   "metadata": {},
   "source": [
    "<img src=\"https://sebastianraschka.com/images/LLMs-from-scratch-images/ch04_compressed/17.webp\" width=\"100%\">"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "id": "c9b428a9-8764-4b36-80cd-7d4e00595ba6",
   "metadata": {},
   "outputs": [],
   "source": [
    "def generate_text_simple(model, idx, max_new_tokens, context_size):\n",
    "    # idx is (batch, n_tokens) array of indices in the current context\n",
    "    for _ in range(max_new_tokens):\n",
    "\n",
    "        # Crop current context if it exceeds the supported context size\n",
    "        # E.g., if LLM supports only 5 tokens, and the context size is 10\n",
    "        # then only the last 5 tokens are used as context\n",
    "        idx_cond = idx[:, -context_size:]\n",
    "\n",
    "        # Get the predictions\n",
    "        with torch.no_grad():\n",
    "            logits = model(idx_cond)\n",
    "\n",
    "        # Focus only on the last time step\n",
    "        # (batch, n_tokens, vocab_size) becomes (batch, vocab_size)\n",
    "        logits = logits[:, -1, :]\n",
    "\n",
    "        # Apply softmax to get probabilities\n",
    "        probas = torch.softmax(logits, dim=-1)  # (batch, vocab_size)\n",
    "\n",
    "        # Get the idx of the vocab entry with the highest probability value\n",
    "        idx_next = torch.argmax(probas, dim=-1, keepdim=True)  # (batch, 1)\n",
    "\n",
    "        # Append sampled index to the running sequence\n",
    "        idx = torch.cat((idx, idx_next), dim=1)  # (batch, n_tokens+1)\n",
    "\n",
    "    return idx"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6515f2c1-3cc7-421c-8d58-cc2f563b7030",
   "metadata": {},
   "source": [
    "\n",
    "- The `generate_text_simple` above implements an iterative process, where it creates one token at a time\n",
    "\n",
    "<img src=\"https://sebastianraschka.com/images/LLMs-from-scratch-images/ch04_compressed/18.webp\" width=\"100%\">"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f682eac4-f9bd-438b-9dec-6b1cc7bc05ce",
   "metadata": {},
   "source": [
    "- Let's prepare an input example:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "id": "3d7e3e94-df0f-4c0f-a6a1-423f500ac1d3",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "encoded: [15496, 11, 314, 716]\n",
      "encoded_tensor.shape: torch.Size([1, 4])\n"
     ]
    }
   ],
   "source": [
    "start_context = \"Hello, I am\"\n",
    "\n",
    "encoded = tokenizer.encode(start_context)\n",
    "print(\"encoded:\", encoded)\n",
    "\n",
    "encoded_tensor = torch.tensor(encoded).unsqueeze(0)\n",
    "print(\"encoded_tensor.shape:\", encoded_tensor.shape)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "id": "a72a9b60-de66-44cf-b2f9-1e638934ada4",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Output: tensor([[15496,    11,   314,   716, 27018, 24086, 47843, 30961, 42348,  7267]])\n",
      "Output length: 10\n"
     ]
    }
   ],
   "source": [
    "model.eval()  # disable dropout\n",
    "\n",
    "out = generate_text_simple(\n",
    "    model=model,\n",
    "    idx=encoded_tensor,\n",
    "    max_new_tokens=6,\n",
    "    context_size=GPT_CONFIG_124M[\"context_length\"]\n",
    ")\n",
    "\n",
    "print(\"Output:\", out)\n",
    "print(\"Output length:\", len(out[0]))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1d131c00-1787-44ba-bec3-7c145497b2c3",
   "metadata": {},
   "source": [
    "- Remove batch dimension and convert back into text:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "id": "053d99f6-5710-4446-8d52-117fb34ea9f6",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Hello, I am Featureiman Byeswickattribute argue\n"
     ]
    }
   ],
   "source": [
    "decoded_text = tokenizer.decode(out.squeeze(0).tolist())\n",
    "print(decoded_text)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9a894003-51f6-4ccc-996f-3b9c7d5a1d70",
   "metadata": {},
   "source": [
    "\n",
    "- Note that the model is untrained; hence the random output texts above\n",
    "- We will train the model in the next chapter\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a35278b6-9e5c-480f-83e5-011a1173648f",
   "metadata": {},
   "source": [
    "## Summary and takeaways\n",
    "\n",
    "- See the [./gpt.py](./gpt.py) script, a self-contained script containing the GPT model we implement in this Jupyter notebook\n",
    "- You can find the exercise solutions in [./exercise-solutions.ipynb](./exercise-solutions.ipynb)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}