llms-from-scratch-cn/Translated_Book/ch04/4.5 在transfomer模块中连接注意力层和线性层.ipynb
2024-04-24 19:43:03 +08:00

363 lines
12 KiB
Plaintext
Raw Permalink Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

{
"cells": [
{
"cell_type": "markdown",
"id": "4be7fb0b",
"metadata": {},
"source": [
"# 4.5 在transfomer模块中连接注意力层和线性层"
]
},
{
"cell_type": "markdown",
"id": "8adbb40c",
"metadata": {},
"source": [
"在本节中我们将实现transfomer模块这是GPT和其他大型语言模型LLM架构的基本构建块。\n",
"该模块在 1.24 亿参数的 GPT-2 架构中重复了十几次结合了我们之前介绍过的几个概念多头注意力、层归一化、dropout、前馈层和 GELU 激活函数,如图 4.13 所示。\n",
"在下一节中我们将把这个ransfomer模块连接到 GPT 架构的其余部分。"
]
},
{
"cell_type": "markdown",
"id": "d49496c5",
"metadata": {},
"source": [
"**图 4.13 transfomer模块的图示。\n",
"图的底部显示了输入的token这些标记已经被嵌入到768维的向量中。\n",
"每一行对应一个token的向量表示。\n",
"transfomer模块的输出是与输入具有相同维度的向量然后可以将其馈送到 LLM 中的后续层。**"
]
},
{
"cell_type": "markdown",
"id": "41853c9f",
"metadata": {},
"source": [
"![fig4.13](https://github.com/datawhalechina/llms-from-scratch-cn/blob/main/Translated_Book/img/fig-4-13.jpg?raw=true)"
]
},
{
"cell_type": "markdown",
"id": "5bda05b2",
"metadata": {},
"source": [
"如图 4.13 所示transfomer模块结合了多个组件包括第 3 章中的屏蔽多头注意力模块和我们在第 4.3 节中实现的前馈模块。"
]
},
{
"cell_type": "markdown",
"id": "996e660c",
"metadata": {},
"source": [
"当transfomer模块处理输入序列时序列中的每个元素例如单词或子词标记都由固定大小的向量表示在图 4.13 的情况下,为 768 维)。\n",
"transfomer模块内部的操作包括多头注意力和前馈层都旨在以保持其维度的方式转换这些向量。"
]
},
{
"cell_type": "markdown",
"id": "0f9ea84b",
"metadata": {},
"source": [
"多头注意力模块中的自注意力机制的思想是,它能够识别并分析输入序列中各元素之间的关系。\n",
"与此同时,前馈网络在每个位置独立地修改数据。\n",
"这种组合不仅可以更细致地理解和处理输入,还可以增强模型处理复杂数据模式的整体能力。"
]
},
{
"cell_type": "markdown",
"id": "09ad2b03",
"metadata": {},
"source": [
"在下面代码中,我们可以按照如下方法创建 Transformer模块"
]
},
{
"cell_type": "markdown",
"id": "7b878325",
"metadata": {},
"source": [
"### 代码示例4.6 GPT 的transfomer模块组件"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "f57d53a1",
"metadata": {},
"outputs": [],
"source": [
"import torch\n",
"import torch.nn as nn\n",
"from torch.nn import LayerNorm"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "d8a3199b",
"metadata": {},
"outputs": [],
"source": [
"class TransformerBlock(nn.Module):\n",
" def __init__(self, cfg):\n",
" super().__init__()\n",
" self.att = MultiHeadAttention(\n",
" d_in=cfg[\"emb_dim\"],\n",
" d_out=cfg[\"emb_dim\"],\n",
" context_length=cfg[\"context_length\"],\n",
" num_heads=cfg[\"n_heads\"],\n",
" dropout=cfg[\"drop_rate\"],\n",
" qkv_bias=cfg[\"qkv_bias\"])\n",
"\n",
" self.ff = FeedForward(cfg)\n",
" self.norm1 = LayerNorm(cfg[\"emb_dim\"])\n",
" self.norm2 = LayerNorm(cfg[\"emb_dim\"])\n",
" self.drop_resid = nn.Dropout(cfg[\"drop_rate\"])\n",
" \n",
" def forward(self, x):\n",
" #A\n",
" shortcut = x\n",
" x = self.norm1(x)\n",
" x = self.att(x)\n",
" x = self.drop_resid(x)\n",
" x = x + shortcut # Add the original input back\n",
" \n",
" shortcut = x #B\n",
" x = self.norm2(x)\n",
" x = self.ff(x)\n",
" x = self.drop_resid(x)\n",
" x = x + shortcut #C\n",
" return x"
]
},
{
"cell_type": "markdown",
"id": "4d1f6344",
"metadata": {},
"source": [
"上述给出的代码在 PyTorch 中定义了一个 TransformerBlock 类包括一个多头注意力机制MultiHeadAttention和一个前馈网络FeedForward这两者都是根据提供的配置字典例如 GPT_CONFIG_124M来配置的。"
]
},
{
"cell_type": "markdown",
"id": "73c84673",
"metadata": {},
"source": [
"在这两个组件之前应用层归一化 (LayerNorm),并在它们之后应用 dropout以正则化模型并防止过度拟合。\n",
"这也称为 Pre-LayerNorm。\n",
"在较旧的架构中例如原始的transfomer模块层归一化是在自注意力和前馈网络之后应用的被称为 Post-LayerNorm这通常会导致较差的训练动态。"
]
},
{
"cell_type": "markdown",
"id": "238742e6",
"metadata": {},
"source": [
"该类还实现了前向传递,其中每个组件后面都有一个快捷连接,该连接将块的输入添加到其输出。\n",
"一关键特性有助于训练过程中梯度在网络中的流动并如4.4节所解释的那样,改善了深度模型的学习效果。"
]
},
{
"cell_type": "markdown",
"id": "64d75abd",
"metadata": {},
"source": [
"使用我们之前定义的 GPT_CONFIG_124M 字典让我们实例化一个transfomer模块并为其提供一些示例数据。"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "b4592fed",
"metadata": {},
"outputs": [],
"source": [
"class FeedForward(nn.Module):\n",
" def __init__(self, cfg):\n",
" super().__init__()\n",
" self.linear1 = nn.Linear(cfg[\"emb_dim\"], cfg[\"emb_dim\"] * 4)\n",
" self.relu = nn.ReLU()\n",
" self.linear2 = nn.Linear(cfg[\"emb_dim\"] * 4, cfg[\"emb_dim\"])\n",
" self.dropout = nn.Dropout(cfg[\"drop_rate\"])\n",
"\n",
" def forward(self, x):\n",
" x = self.relu(self.linear1(x))\n",
" x = self.dropout(x)\n",
" x = self.linear2(x)\n",
" return x\n",
" \n",
"class MultiHeadAttention(nn.Module):\n",
" def __init__(self, d_in, d_out,\n",
" context_length, dropout, num_heads, qkv_bias=False):\n",
" super().__init__()\n",
" assert d_out % num_heads == 0, \"d_out must be divisible by num_heads\"\n",
" self.d_out = d_out\n",
" self.num_heads = num_heads\n",
" self.head_dim = d_out // num_heads #A\n",
" self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)\n",
" self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)\n",
" self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)\n",
" self.out_proj = nn.Linear(d_out, d_out) #B\n",
" self.dropout = nn.Dropout(dropout)\n",
" self.register_buffer(\n",
" 'mask',\n",
" torch.triu(torch.ones(context_length, context_length), diagonal=1)\n",
" )\n",
" def forward(self, x):\n",
" b, num_tokens, d_in = x.shape\n",
" keys = self.W_key(x) #C\n",
" queries = self.W_query(x) #C\n",
" values = self.W_value(x) #C\n",
" \n",
" keys = keys.view(b, num_tokens, self.num_heads, self.head_dim) #D\n",
" values = values.view(b, num_tokens, self.num_heads, self.head_dim) #D\n",
" queries = queries.view(b, num_tokens, self.num_heads, self.head_dim)#D\n",
"\n",
" keys = keys.transpose(1, 2) #E\n",
" queries = queries.transpose(1, 2) #E\n",
" values = values.transpose(1, 2) #E\n",
"\n",
" attn_scores = queries @ keys.transpose(2, 3) #F\n",
" mask_bool = self.mask.bool()[:num_tokens, :num_tokens] #G\n",
"\n",
" attn_scores.masked_fill_(mask_bool, -torch.inf) #H\n",
"\n",
" attn_weights = torch.softmax(\n",
" attn_scores / keys.shape[-1]**0.5, dim=-1)\n",
" attn_weights = self.dropout(attn_weights)\n",
"\n",
" context_vec = (attn_weights @ values).transpose(1, 2) #I\n",
" #J\n",
" context_vec = context_vec.contiguous().view(b, num_tokens, self.d_out)\n",
" context_vec = self.out_proj(context_vec) #K\n",
" return context_vec\n",
"\n",
"GPT_CONFIG_124M = {\n",
" \"vocab_size\": 50257, # Vocabulary size\n",
" \"context_length\": 1024, # Context length\n",
" \"emb_dim\": 768, # Embedding dimension\n",
" \"n_heads\": 12, # Number of attention heads\n",
" \"n_layers\": 12, # Number of layers\n",
" \"drop_rate\": 0.1, # Dropout rate\n",
" \"qkv_bias\": False # Query-Key-Value bias\n",
"}"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "c0bc169c",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Input shape: torch.Size([2, 4, 768])\n",
"Output shape: torch.Size([2, 4, 768])\n"
]
}
],
"source": [
"torch.manual_seed(123)\n",
"x = torch.rand(2, 4, 768) #A\n",
"block = TransformerBlock(GPT_CONFIG_124M)\n",
"output = block(x)\n",
"print(\"Input shape:\", x.shape)\n",
"print(\"Output shape:\", output.shape)"
]
},
{
"cell_type": "markdown",
"id": "55d9528c",
"metadata": {},
"source": [
"输出如下所示:"
]
},
{
"cell_type": "markdown",
"id": "4988db16",
"metadata": {},
"source": [
"Input shape: torch.Size([2, 4, 768]) \\\n",
"Output shape: torch.Size([2, 4, 768])"
]
},
{
"cell_type": "markdown",
"id": "4c115f72",
"metadata": {},
"source": [
"从代码输出中我们可以看到transfomer模块在其输出中保持了输入的维度这表明transfomer架构在整个网络中处理数据序列时不改变其形状。"
]
},
{
"cell_type": "markdown",
"id": "9bd353ef",
"metadata": {},
"source": [
"在整个transfomer模块架构中保持形状并不是偶然的而是其设计的一个关键方面。\n",
"这种设计使其能够有效应用于广泛的序列到序列任务,其中每个输出向量直接对应于输入向量,从而保持一对一的关系。\n",
"然而,正如我们在第 3 章中学到的那样,输出是一个上下文向量,它封装了整个输入序列的信息。\n",
"这意味着虽然序列的物理维度长度和特征大小在通过transfomer模块时保持不变但每个输出向量的内容被重新编码以整合来自整个输入序列的上下文信息。"
]
},
{
"cell_type": "markdown",
"id": "bd1279d7",
"metadata": {},
"source": [
"在本节中实现的transfomer模块让我们拥有了所有构建块正如图4.14所示这些构建块是在下一节中实现GPT架构所需要的。"
]
},
{
"cell_type": "markdown",
"id": "2e82d8a5",
"metadata": {},
"source": [
"**图 4.14 展示了迄今为止我们在本章中实现的不同概念的心智模型。**"
]
},
{
"cell_type": "markdown",
"id": "21776f36",
"metadata": {},
"source": [
"![fig4.14](https://github.com/datawhalechina/llms-from-scratch-cn/blob/main/Translated_Book/img/fig-4-14.jpg?raw=true)"
]
},
{
"cell_type": "markdown",
"id": "7379f576",
"metadata": {},
"source": [
"如图 4.14 所示transfomer模块结合了层归一化、前馈网络包括 GELU 激活)和快捷连接,我们在本章早些时候已经介绍过。\n",
"如我们将在即将到来的章节中看到的这个transfomer模块将构成我们将要实现的GPT架构的主要组成部分。"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python (cell)",
"language": "python",
"name": "cell"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.13"
}
},
"nbformat": 4,
"nbformat_minor": 5
}