mirror of
https://github.com/datawhalechina/llms-from-scratch-cn.git
synced 2026-05-01 11:58:17 +08:00
674 lines
25 KiB
Plaintext
674 lines
25 KiB
Plaintext
{
|
||
"cells": [
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"# 3.3 关注自注意力机制输入的不同部分\n",
|
||
"\n",
|
||
"我们接下来将深入了解自注意力(self-attention)机制的工作原理,并学习如何从零开始编写它的代码。自注意力机制是所有基于 Transformer 架构的大语言模型的核心组成部分。需要指出的是,理解这一概念需要高度集中精神和注意力,但一旦你掌握了其基本原理,就相当于攻克了本书中最为艰难的部分并在某种意义上实现了大语言模型。\n"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"### 自注意力机制中的“自”\n",
|
||
"\n",
|
||
"在自注意力机制中,“自”(self)是指该机制能够通过分析单一输入序列内不同位置的联系来计算注意力权重。它能够评估和学习输入本身各部分之间的关系与依赖,比如一句话中的词语或一个图像中的像素。这与传统的注意力机制形成对比,后者主要关注两个不同序列之间的元素关系,例如在序列对序列模型中,注意力可能位于一个输入序列与一个输出序列之间,如图 3.5 所示。\n",
|
||
"\n",
|
||
"自注意力机制可能看起来比较复杂,尤其是如果你是第一次接触它的话。因此,我们将在下一小节首先介绍一个简化版本的自注意力机制。之后,在第 3.4 节中,我们将实现带有可训练权重的自注意力机制,这种机制被广泛应用于大语言模型中。\n",
|
||
"\n"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"## 3.3.1 无需可训练权重的简单自注意力机制\n",
|
||
"\n",
|
||
"在本节中,我们将实现一个简化版本的自注意力机制,不涉及任何可训练的权重,该版本在图 3.7 中有所概述。本节的目的是在下一节 3.4 添加可训练权重之前,先阐释自注意力中的几个关键概念。\n",
|
||
"\n",
|
||
"**图 3.7 自注意力的目标是为每个输入元素计算一个上下文向量,该向量结合了来自所有其他输入元素的信息。在该图中所示的示例中,我们计算了上下文向量 z(2)。计算 z(2) 的每个输入元素的重要性或贡献由注意力权重 α21 到 α2T 决定。在计算 z(2) 时,注意力权重是针对输入元素 x(2) 及所有其他输入计算的。这些注意力权重的具体计算方法将在本节后面讨论。**\n",
|
||
"\n",
|
||
"\n",
|
||
"\n",
|
||
"\n",
|
||
"图 3.7 显示了一个输入序列, 标记为 x, 包含 T 个元素, 从 x(1) 到 x(T)。通常, 这样的序列代表了文本, 如句子, 它已经被转换为 Token 嵌入, 正如第 2 章所解释的。\n",
|
||
"\n",
|
||
"以一个输入文本 \"Your journey starts with one step.\" 为例。在这种情况下, 每个序列元素, 如 x(1), 对应于一个代表特定 Token \"Your\" 的 d 维嵌入向量。在图 3.7 中, 这些输入向量显示为三维嵌入。\n",
|
||
"\n",
|
||
"在自注意力机制中, 我们的目标是为输入序列中每个元素 x(i) 计算上下文向量 z(i)。可以将上下文向量理解为一个信息更丰富的嵌入向量。\n",
|
||
"\n",
|
||
"以 x(2) 的嵌入向量为例, 它对应于 Token \"journey\", 以及其对应的上下文向量 z(2), 如图 3.7 底部所示。这个增强的上下文向量 z(2) 包含了关于 x(2) 以及序列中所有其他元素 x(1) 到 x(T) 的信息。\n",
|
||
"\n",
|
||
"自注意力机制在这里扮演着关键角色。它的作用是通过整合序列中所有其他元素的信息, 为输入序列的每一个元素(如句子中的每一个词)创造出更丰富的表征。这对于大语言模型来说至关重要, 因为它们需要理解句子中词与词之间的联系和重要性。接下来,我们将引入可训练的权重,使大语言模型能够学习如何构建这些上下文向量,从而有效地帮助模型生成下一个Token。\n",
|
||
"\n",
|
||
"在本节中, 我们逐步实现了一个简化的自注意力机制, 以计算这些权重和相应的上下文向量。\n",
|
||
"\n",
|
||
"请参考以下输入句子, 它已经按照第 2 章的讨论被嵌入为三维向量。为了示范的需要, 我们选择了一个较小的嵌入尺寸, 确保它在页面上显示时不会被截断。\n",
|
||
"\n",
|
||
"\n"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 1,
|
||
"metadata": {
|
||
"trusted": true
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"import torch\n",
|
||
"inputs = torch.tensor(\n",
|
||
" [[0.43, 0.15, 0.89], # Your (x^1)\n",
|
||
" [0.55, 0.87, 0.66], # journey (x^2)\n",
|
||
" [0.57, 0.85, 0.64], # starts (x^3)\n",
|
||
" [0.22, 0.58, 0.33], # with (x^4)\n",
|
||
" [0.77, 0.25, 0.10], # one (x^5)\n",
|
||
" [0.05, 0.80, 0.55]] # step (x^6) \n",
|
||
")"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"实现自注意力机制的第一步是计算中间变量 ω,这些变量被称为注意力得分,如图 3.8 所示。\n",
|
||
"\n",
|
||
"**图 3.8 本节的总体目标是通过使用第二个输入序列 x(2) 作为查询来演示上下文向量 z(2) 的计算过程。此图展示了第一个中间步骤,即通过点积计算查询 x(2) 与所有其他输入元素之间的注意力得分 ω。(请注意,图中的数字为了减少视觉杂乱,小数点后数字被截断到一位。)**\n",
|
||
"\n",
|
||
"\n",
|
||
"\n",
|
||
"图 3.8 展示了我们如何计算查询 Token 与每个输入 Token 之间的中间注意力得分。我们通过计算查询 x(2) 与其他每个输入 Token 的点积来确定这些得分:\n",
|
||
" "
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 2,
|
||
"metadata": {
|
||
"trusted": true
|
||
},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"tensor([0.9544, 1.4950, 1.4754, 0.8434, 0.7070, 1.0865])\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"query = inputs[1] #A \n",
|
||
"attn_scores_2 = torch.empty(inputs.shape[0])\n",
|
||
"for i, x_i in enumerate(inputs):\n",
|
||
" attn_scores_2[i] = torch.dot(x_i, query)\n",
|
||
"print(attn_scores_2)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"计算出的注意力得分如下:\n",
|
||
"```python\n",
|
||
"tensor([0.9544, 1.4950, 1.4754, 0.8434, 0.7070, 1.0865])\n",
|
||
"```"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"### 理解点积\n",
|
||
"\n",
|
||
"点积是一个简单直接的操作,它通过对两个向量的对应元素进行相乘然后求和来完成,示例如下:"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 3,
|
||
"metadata": {
|
||
"trusted": true
|
||
},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"tensor(0.9544)\n",
|
||
"tensor(0.9544)\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"res = 0.\n",
|
||
"\n",
|
||
"for idx, element in enumerate(inputs[0]):\n",
|
||
" res += inputs[0][idx] * query[idx]\n",
|
||
"print(res)\n",
|
||
"print(torch.dot(inputs[0], query))"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"从输出结果可以看出,逐元素相乘后的和与点积的计算结果一致:\n",
|
||
"```python\n",
|
||
"tensor(0.9544)\n",
|
||
"tensor(0.9544)\n",
|
||
"```"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": []
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"点积不仅仅是一个数学工具,它还能衡量两个向量的相似度。点积越高,表示两个向量的对齐程度或相似度越高。在自注意力机制中,点积用于衡量序列中各元素之间的关注程度:点积值越高,两个元素之间的相似性和注意力得分就越高。"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"接下来的步骤,如图 3.9 所示,我们将对之前计算的每个注意力得分进行标准化处理。\n",
|
||
"\n",
|
||
"**图 3.9 在根据输入查询 x(2) 计算出注意力分数 ω21 到 ω2T 后,下一步是将这些分数归一化,以得到注意力权重 α21 到 α2T。**\n",
|
||
"\n",
|
||
"\n",
|
||
"\n",
|
||
"如图 3.9 所示,进行归一化的主要目的是获取总和为 1 的注意力权重。这种归一化操作是常规做法,它不仅便于我们理解数据,还有助于保持大语言模型训练的稳定性。以下是实现这一归一化步骤的简单方法:"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 5,
|
||
"metadata": {
|
||
"trusted": true
|
||
},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"Attention weights: tensor([0.1455, 0.2278, 0.2249, 0.1285, 0.1077, 0.1656])\n",
|
||
"Sum: tensor(1.0000)\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"attn_weights_2_tmp = attn_scores_2 / attn_scores_2.sum()\n",
|
||
"print(\"Attention weights:\", attn_weights_2_tmp)\n",
|
||
"print(\"Sum:\", attn_weights_2_tmp.sum())"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"如输出所示,现在的注意力权重加起来等于 1:\n",
|
||
"```python\n",
|
||
"Attention weights: tensor([0.1455, 0.2278, 0.2249, 0.1285, 0.1077, 0.1656])\n",
|
||
"Sum: tensor(1.0000)\n",
|
||
"```"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"在实际应用中,通常推荐使用 softmax 函数来进行归一化。这种方法在处理极端值时表现更佳,且在训练过程中提供了更优的梯度特性。下面是一个基本的 softmax 函数实现,用于归一化注意力得分:"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 6,
|
||
"metadata": {
|
||
"trusted": true
|
||
},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"Attention weights: tensor([0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581])\n",
|
||
"Sum: tensor(1.0000)\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"def softmax_naive(x):\n",
|
||
" return torch.exp(x) / torch.exp(x).sum(dim=0) \n",
|
||
" \n",
|
||
"attn_weights_2_naive = softmax_naive(attn_scores_2)\n",
|
||
"print(\"Attention weights:\", attn_weights_2_naive)\n",
|
||
"print(\"Sum:\", attn_weights_2_naive.sum())"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"如输出所示,softmax 函数能够实现使得注意力权重的总和达到 1 的目标:\n",
|
||
"```python\n",
|
||
"Attention weights: tensor([0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581])\n",
|
||
"Sum: tensor(1.)\n",
|
||
"```"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"此外,softmax 函数确保注意力权重始终为正值。这意味着输出可以被解释为概率或相对重要性,高权重代表更大的重要性。\n",
|
||
"\n",
|
||
"值得注意的是,这种简单的 softmax 函数实现(softmax_naive)在处理大或小输入值时可能面临数值不稳定的问题,例如溢出和下溢。因此,在实际应用中,推荐使用 PyTorch 的 softmax 函数实现,这种实现方法已经针对性能进行了深入优化:"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 7,
|
||
"metadata": {
|
||
"trusted": true
|
||
},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"Attention weights: tensor([0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581])\n",
|
||
"Sum: tensor(1.)\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"attn_weights_2 = torch.softmax(attn_scores_2, dim=0)\n",
|
||
"print(\"Attention weights:\", attn_weights_2)\n",
|
||
"print(\"Sum:\", attn_weights_2.sum())"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"从结果来看,这与我们之前使用的简单 softmax_native 函数得到的结果是一致的:\n",
|
||
"```python\n",
|
||
"Attention weights: tensor([0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581])\n",
|
||
"Sum: tensor(1.)\n",
|
||
"```"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"现在我们已经计算了归一化的注意力权重,接下来就是图 3.10 所示的最后一步:通过将嵌入的输入 Token x(i) 与相应的注意力权重相乘,然后将结果向量求和,计算出上下文向量 z(2)。\n",
|
||
"\n",
|
||
"**图 3.10 在计算并归一化注意力分数以获取查询 x(2) 的注意力权重后,下一步是计算上下文向量 z(2)。这个上下文向量是所有输入向量 x(1) 到 x(T) 通过注意力权重加权的组合。**\n",
|
||
"\n",
|
||
"\n",
|
||
"\n",
|
||
"图 3.10 中展示的上下文向量 z(2) 是通过所有输入向量的加权求和计算得到的。具体操作是将每个输入向量乘以其相应的注意力权重:"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 8,
|
||
"metadata": {
|
||
"trusted": true
|
||
},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"tensor([0.4419, 0.6515, 0.5683])\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"query = inputs[1] # 2nd input token is the query\n",
|
||
"context_vec_2 = torch.zeros(query.shape)\n",
|
||
"for i,x_i in enumerate(inputs):\n",
|
||
" context_vec_2 += attn_weights_2[i]*x_i\n",
|
||
"print(context_vec_2)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"计算的结果如下:\n",
|
||
"```python\n",
|
||
"tensor([0.4419, 0.6515, 0.5683])\n",
|
||
"```"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"下一节中,我们将扩展这一过程,同时计算所有上下文向量。"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"## 3.3.2 计算所有输入 Token 的注意力权重\n",
|
||
"\n",
|
||
"前一节,我们为输入 2 计算了注意力权重和上下文向量,如图 3.11 中高亮的行所示。现在我们将这种计算扩展到所有输入的注意力权重和上下文向量。\n",
|
||
"\n",
|
||
"**图 3.11 的高亮行显示了我们之前为第二个输入元素作为查询所计算的注意力权重。本节将扩展这一计算过程以获得所有其他注意力权重。**\n",
|
||
"\n",
|
||
"\n",
|
||
"\n",
|
||
"我们将遵循之前相同的三个步骤,如图 3.12 所总结,但在代码中进行了修改,计算所有上下文向量而非仅仅是第二个 z(2)。\n",
|
||
"\n",
|
||
"\n",
|
||
"\n",
|
||
"首先,在图 3.12 的第一步中,我们添加了一个额外的循环来计算所有输入对的点积。"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 9,
|
||
"metadata": {
|
||
"trusted": true
|
||
},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"tensor([[0.9995, 0.9544, 0.9422, 0.4753, 0.4576, 0.6310],\n",
|
||
" [0.9544, 1.4950, 1.4754, 0.8434, 0.7070, 1.0865],\n",
|
||
" [0.9422, 1.4754, 1.4570, 0.8296, 0.7154, 1.0605],\n",
|
||
" [0.4753, 0.8434, 0.8296, 0.4937, 0.3474, 0.6565],\n",
|
||
" [0.4576, 0.7070, 0.7154, 0.3474, 0.6654, 0.2935],\n",
|
||
" [0.6310, 1.0865, 1.0605, 0.6565, 0.2935, 0.9450]])\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"attn_scores = torch.empty(6, 6)\n",
|
||
"for i, x_i in enumerate(inputs):\n",
|
||
" for j, x_j in enumerate(inputs):\n",
|
||
" attn_scores[i, j] = torch.dot(x_i, x_j)\n",
|
||
"print(attn_scores)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"得到的注意力分数如下:\n",
|
||
"```python\n",
|
||
"tensor([[0.9995, 0.9544, 0.9422, 0.4753, 0.4576, 0.6310],\n",
|
||
" [0.9544, 1.4950, 1.4754, 0.8434, 0.7070, 1.0865],\n",
|
||
" [0.9422, 1.4754, 1.4570, 0.8296, 0.7154, 1.0605],\n",
|
||
" [0.4753, 0.8434, 0.8296, 0.4937, 0.3474, 0.6565],\n",
|
||
" [0.4576, 0.7070, 0.7154, 0.3474, 0.6654, 0.2935],\n",
|
||
" [0.6310, 1.0865, 1.0605, 0.6565, 0.2935, 0.9450]])\n",
|
||
"```"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"如图 3.11 所示,前述张量中的每个元素代表每对输入之间的注意力分数。需要注意的是,图 3.11 中的值已经归一化,这也是为什么它们与前述张量中的未归一化注意力分数不同。我们将稍后处理归一化问题。\n",
|
||
"\n",
|
||
"在计算前面的注意力分数张量时,我们使用了 Python 中的 for 循环。然而,for 循环通常较慢,我们可以通过矩阵乘法达到同样的效果:"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 10,
|
||
"metadata": {
|
||
"trusted": true
|
||
},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"tensor([[0.9995, 0.9544, 0.9422, 0.4753, 0.4576, 0.6310],\n",
|
||
" [0.9544, 1.4950, 1.4754, 0.8434, 0.7070, 1.0865],\n",
|
||
" [0.9422, 1.4754, 1.4570, 0.8296, 0.7154, 1.0605],\n",
|
||
" [0.4753, 0.8434, 0.8296, 0.4937, 0.3474, 0.6565],\n",
|
||
" [0.4576, 0.7070, 0.7154, 0.3474, 0.6654, 0.2935],\n",
|
||
" [0.6310, 1.0865, 1.0605, 0.6565, 0.2935, 0.9450]])\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"attn_scores = inputs @ inputs.T\n",
|
||
"print(attn_scores)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"我们可以看到结果与之前相同:\n",
|
||
"```python\n",
|
||
"tensor([[0.9995, 0.9544, 0.9422, 0.4753, 0.4576, 0.6310],\n",
|
||
" [0.9544, 1.4950, 1.4754, 0.8434, 0.7070, 1.0865],\n",
|
||
" [0.9422, 1.4754, 1.4570, 0.8296, 0.7154, 1.0605],\n",
|
||
" [0.4753, 0.8434, 0.8296, 0.4937, 0.3474, 0.6565],\n",
|
||
" [0.4576, 0.7070, 0.7154, 0.3474, 0.6654, 0.2935],\n",
|
||
" [0.6310, 1.0865, 1.0605, 0.6565, 0.2935, 0.9450]])\n",
|
||
"```"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"在第二步中,我们将每行的值归一化,使其总和为 1:"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 11,
|
||
"metadata": {
|
||
"trusted": true
|
||
},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"tensor([[0.2098, 0.2006, 0.1981, 0.1242, 0.1220, 0.1452],\n",
|
||
" [0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581],\n",
|
||
" [0.1390, 0.2369, 0.2326, 0.1242, 0.1108, 0.1565],\n",
|
||
" [0.1435, 0.2074, 0.2046, 0.1462, 0.1263, 0.1720],\n",
|
||
" [0.1526, 0.1958, 0.1975, 0.1367, 0.1879, 0.1295],\n",
|
||
" [0.1385, 0.2184, 0.2128, 0.1420, 0.0988, 0.1896]])\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"attn_weights = torch.softmax(attn_scores, dim=1)\n",
|
||
"print(attn_weights)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"这返回了与图 3.10 所显示的值匹配的注意力权重张量:\n",
|
||
"```python\n",
|
||
"tensor([[0.2098, 0.2006, 0.1981, 0.1242, 0.1220, 0.1452],\n",
|
||
" [0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581],\n",
|
||
" [0.1390, 0.2369, 0.2326, 0.1242, 0.1108, 0.1565],\n",
|
||
" [0.1435, 0.2074, 0.2046, 0.1462, 0.1263, 0.1720],\n",
|
||
" [0.1526, 0.1958, 0.1975, 0.1367, 0.1879, 0.1295],\n",
|
||
" [0.1385, 0.2184, 0.2128, 0.1420, 0.0988, 0.1896]])\n",
|
||
"```"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"在我们进行到图 3.12 所示的第三步、也是最后一步之前,让我们简单确认一下这些行的确都加起来等于 1:"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 12,
|
||
"metadata": {
|
||
"trusted": true
|
||
},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"Row 2 sum: 1.0\n",
|
||
"All row sums: tensor([1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000])\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"row_2_sum = sum([0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581])\n",
|
||
"print(\"Row 2 sum:\", row_2_sum)\n",
|
||
"print(\"All row sums:\", attn_weights.sum(dim=1))"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"返回的结果如下:\n",
|
||
"```python\n",
|
||
"Row 2 sum: 1.0\n",
|
||
"All row sums: tensor([1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0\n",
|
||
"```"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"在第三步也是最后一步中,我们将利用这些注意力权重通过矩阵乘法来生成所有的上下文向量:"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 13,
|
||
"metadata": {
|
||
"trusted": true
|
||
},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"tensor([[0.4421, 0.5931, 0.5790],\n",
|
||
" [0.4419, 0.6515, 0.5683],\n",
|
||
" [0.4431, 0.6496, 0.5671],\n",
|
||
" [0.4304, 0.6298, 0.5510],\n",
|
||
" [0.4671, 0.5910, 0.5266],\n",
|
||
" [0.4177, 0.6503, 0.5645]])\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"all_context_vecs = attn_weights @ inputs\n",
|
||
"print(all_context_vecs)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"trusted": true
|
||
},
|
||
"source": [
|
||
"在得到的输出张量中,每一行都包含一个三维的上下文向量:\n",
|
||
"```python\n",
|
||
"tensor([[0.4421, 0.5931, 0.5790],\n",
|
||
" [0.4419, 0.6515, 0.5683],\n",
|
||
" [0.4431, 0.6496, 0.5671],\n",
|
||
" [0.4304, 0.6298, 0.5510],\n",
|
||
" [0.4671, 0.5910, 0.5266],\n",
|
||
" [0.4177, 0.6503, 0.5645]])\n",
|
||
"```"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"我们可以通过将第二行与我们在 3.3.1 节中之前计算的上下文向量 z(2) 进行对比,来核实代码是否正确:"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 14,
|
||
"metadata": {
|
||
"trusted": true
|
||
},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"Previous 2nd context vector: tensor([0.4419, 0.6515, 0.5683])\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"print(\"Previous 2nd context vector:\", context_vec_2)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"根据结果,我们可以看到之前计算的 context_vec_2 与前一个张量中的第二行完全匹配:\n",
|
||
"```python\n",
|
||
"Previous 2nd context vector: tensor([0.4419, 0.6515, 0.5683])\n",
|
||
"```"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"至此,我们完成了一个简单自注意力机制的代码演示。在下一节中,我们将添加可训练的权重,使大语言模型能够从数据中学习,并提高其在特定任务上的性能。"
|
||
]
|
||
}
|
||
],
|
||
"metadata": {
|
||
"kernelspec": {
|
||
"display_name": "minitorch",
|
||
"language": "python",
|
||
"name": "python3"
|
||
},
|
||
"language_info": {
|
||
"codemirror_mode": {
|
||
"name": "ipython",
|
||
"version": 3
|
||
},
|
||
"file_extension": ".py",
|
||
"mimetype": "text/x-python",
|
||
"name": "python",
|
||
"nbconvert_exporter": "python",
|
||
"pygments_lexer": "ipython3",
|
||
"version": "3.8.12"
|
||
}
|
||
},
|
||
"nbformat": 4,
|
||
"nbformat_minor": 4
|
||
}
|