mirror of
https://github.com/datawhalechina/llms-from-scratch-cn.git
synced 2026-04-25 08:58:17 +08:00
Merge remote-tracking branch 'origin/main'
commit 75267fbf8e
2
.gitignore
vendored
@@ -173,5 +173,5 @@ gpt2/
 # be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
 # and can be added to the global gitignore or merged into this file. For a more nuclear
 # option (not recommended) you can uncomment the following to ignore the entire idea folder.
 #.idea/
-/Model_Architecture_Discussions/ChatGLM3/weights/
+.idea/
401
Translated_Book/ch02/2.4添加特殊上下文tokens.ipynb
Normal file
@@ -0,0 +1,401 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "3cdf73ca",
"metadata": {},
"source": [
"# 2.4 Adding special context tokens"
]
},
{
"cell_type": "markdown",
"id": "1019c5ac",
"metadata": {},
"source": [
"In the previous section, we implemented a simple tokenizer and applied it to a passage from the training set. In this section, we will modify that tokenizer to handle unknown words."
]
},
{
"cell_type": "markdown",
"id": "7ba5b6a1",
"metadata": {},
"source": [
"We will also discuss the use and addition of special context tokens, which can enhance the model's understanding of context or other relevant information in the text. For example, such special tokens can include markers for unknown words and for document boundaries."
]
},
{
"cell_type": "markdown",
"id": "8b9bb7be",
"metadata": {},
"source": [
"In particular, we will modify the vocabulary and the tokenizer implemented in the previous section into SimpleTokenizerV2, adding support for two new tokens, <|unk|> and <|endoftext|>, as shown in Figure 2.9."
]
},
{
"cell_type": "markdown",
"id": "e8457f8c",
"metadata": {},
"source": [
"**Figure 2.9 We add special tokens to the vocabulary to handle specific contexts. For example, we add an <|unk|> token to represent new and unknown words that were not part of the training data and are therefore not part of the existing vocabulary. In addition, we add an <|endoftext|> token, which we can use to separate two unrelated text sources.**"
]
},
{
"cell_type": "markdown",
"id": "490fa60b",
"metadata": {},
"source": [
""
]
},
{
"cell_type": "markdown",
"id": "233806c3",
"metadata": {},
"source": [
"As shown in Figure 2.9, we can modify the tokenizer to use an <|unk|> token whenever it encounters a word that is not part of the vocabulary. Furthermore, we add a token between unrelated texts. For example, when training GPT-like LLMs on multiple independent documents or books, it is common to insert a token before each document or book that follows a previous text source, as illustrated in Figure 2.10.\n",
"This helps the LLM understand that, although these text sources are concatenated for training, they are in fact unrelated."
]
},
{
"cell_type": "markdown",
"id": "86ded03a",
"metadata": {},
"source": [
"**Figure 2.10 When working with multiple independent text sources, we add <|endoftext|> tokens between these texts. These <|endoftext|> tokens act as markers, signaling the start and end of a particular segment, which allows the LLM to process and understand the text more effectively.**"
]
},
{
"cell_type": "markdown",
"id": "acc76cd1",
"metadata": {},
"source": [
""
]
},
{
"cell_type": "markdown",
"id": "5792917b",
"metadata": {},
"source": [
"Let's now modify the vocabulary to include these two special tokens, <|unk|> and <|endoftext|>, by adding them to the list of unique words that we created in the previous section:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "38439456",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1161\n"
]
}
],
"source": [
"all_tokens = sorted(list(set(preprocessed)))\n",
"all_tokens.extend([\"<|endoftext|>\", \"<|unk|>\"])\n",
"vocab = {token:integer for integer,token in enumerate(all_tokens)}\n",
"print(len(vocab.items()))\n"
]
},
{
"cell_type": "markdown",
"id": "6c7a776a",
"metadata": {},
"source": [
"Based on the output of the print statement, the new vocabulary size is 1161 (the vocabulary size in the previous section was 1159)."
]
},
{
"cell_type": "markdown",
"id": "7688ecd4",
"metadata": {},
"source": [
"As an additional quick check, let's print the last five entries of the updated vocabulary:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "f08e3a9d",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"('younger', 1156)\n",
"('your', 1157)\n",
"('yourself', 1158)\n",
"('<|endoftext|>', 1159)\n",
"('<|unk|>', 1160)\n"
]
}
],
"source": [
"for i, item in enumerate(list(vocab.items())[-5:]):\n",
"    print(item)"
]
},
{
"cell_type": "markdown",
"id": "9bd79d47",
"metadata": {},
"source": [
"The code above prints the following:"
]
},
{
"cell_type": "markdown",
"id": "4eb990f9",
"metadata": {},
"source": [
"('younger', 1156)\n",
"('your', 1157)\n",
"('yourself', 1158)\n",
"('<|endoftext|>', 1159)\n",
"('<|unk|>', 1160)"
]
},
{
"cell_type": "markdown",
"id": "725cfe23",
"metadata": {},
"source": [
"Based on the output above, we can confirm that the two new special tokens were indeed successfully incorporated into the vocabulary. Next, we adjust the tokenizer from listing 2.3 accordingly, as shown in listing 2.4:"
]
},
{
"cell_type": "markdown",
"id": "930dca4b",
"metadata": {},
"source": [
"**Listing 2.4 A simple text tokenizer that handles unknown words**"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "31a26133",
"metadata": {},
"outputs": [],
"source": [
"import re\n",
"\n",
"class SimpleTokenizerV2:\n",
"    def __init__(self, vocab):\n",
"        self.str_to_int = vocab\n",
"        self.int_to_str = { i:s for s,i in vocab.items()}\n",
"    def encode(self, text):\n",
"        preprocessed = re.split(r'([,.?_!\"()\\']|--|\\s)', text)\n",
"        preprocessed = [item.strip() for item in preprocessed if item.strip()]\n",
"        preprocessed = [item if item in self.str_to_int else \"<|unk|>\" for item in preprocessed] #A Replace unknown words with <|unk|> tokens\n",
"        ids = [self.str_to_int[s] for s in preprocessed]\n",
"        return ids\n",
"    def decode(self, ids):\n",
"        text = \" \".join([self.int_to_str[i] for i in ids])\n",
"        text = re.sub(r'\\s+([,.?!\"()\\'])', r'\\1', text) #B Remove spaces before the specified punctuation\n",
"        return text\n"
]
},
{
"cell_type": "markdown",
"id": "273cda30",
"metadata": {},
"source": [
"Compared to the SimpleTokenizerV1 we implemented in listing 2.3 in the previous section, the new SimpleTokenizerV2 replaces unknown words with <|unk|> tokens."
]
},
{
"cell_type": "markdown",
"id": "f7f3040c",
"metadata": {},
"source": [
"Let's now try this new tokenizer in practice. For this, we will use a simple text sample concatenated from two independent and unrelated sentences:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "19a04bd4",
"metadata": {},
"outputs": [],
"source": [
"text1 = \"Hello, do you like tea?\"\n",
"text2 = \"In the sunlit terraces of the palace.\"\n",
"text = \" <|endoftext|> \".join((text1, text2))\n",
"print(text)"
]
},
{
"cell_type": "markdown",
"id": "bcb6a7dc",
"metadata": {},
"source": [
"The output is as follows:"
]
},
{
"cell_type": "markdown",
"id": "157b8c26",
"metadata": {},
"source": [
"'Hello, do you like tea? <|endoftext|> In the sunlit terraces of the palace.'"
]
},
{
"cell_type": "markdown",
"id": "c64915c5",
"metadata": {},
"source": [
"Next, let's tokenize this sample text with SimpleTokenizerV2, using the vocab we previously created in listing 2.2:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "162d3403",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[1160, 5, 362, 1155, 642, 1000, 10, 1159, 57, 1013, 981, 1009, 738, 1013, 1160]\n"
]
}
],
"source": [
"tokenizer = SimpleTokenizerV2(vocab)\n",
"print(tokenizer.encode(text))\n"
]
},
{
"cell_type": "markdown",
"id": "9f532e55",
"metadata": {},
"source": [
"This prints the following token IDs:"
]
},
{
"cell_type": "markdown",
"id": "55d238cf",
"metadata": {},
"source": [
"[1160, 5, 362, 1155, 642, 1000, 10, 1159, 57, 1013, 981, 1009, 738, 1013, 1160]"
]
},
{
"cell_type": "markdown",
"id": "a7988b14",
"metadata": {},
"source": [
"We can see that the list of token IDs contains 1159, the ID of the <|endoftext|> separator token, as well as two occurrences of 1160, which is used for unknown words."
]
},
{
"cell_type": "markdown",
"id": "ac7c9571",
"metadata": {},
"source": [
"Let's de-tokenize the text as a quick sanity check:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "643173b0",
"metadata": {},
"outputs": [],
"source": [
"print(tokenizer.decode(tokenizer.encode(text)))"
]
},
{
"cell_type": "markdown",
"id": "1f6c9917",
"metadata": {},
"source": [
"The output is shown below:"
]
},
{
"cell_type": "markdown",
"id": "203d93b6",
"metadata": {},
"source": [
"'<|unk|>, do you like tea? <|endoftext|> In the sunlit terraces of the <|unk|>.'"
]
},
{
"cell_type": "markdown",
"id": "3ca2a152",
"metadata": {},
"source": [
"By comparing the de-tokenized text above with the original input text, we know that the training dataset, Edith Wharton's short story The Verdict, does not contain the words \"Hello\" and \"palace\".\n",
"\n",
"So far, we have discussed tokenization, an essential step in processing text as input to LLMs. Depending on the LLM, some researchers also consider additional special tokens, such as the following (a minimal padding sketch follows this list):\n",
"\n",
"- [BOS] (beginning of sequence): This token marks the start of a text, indicating to the LLM where a piece of content begins.</br>\n",
"- [EOS] (end of sequence): This token is placed at the end of a text and is especially useful when concatenating multiple unrelated texts, similar to <|endoftext|>. For example, when combining two different Wikipedia articles or books, the [EOS] token indicates where one article ends and the next one begins.</br>\n",
"- [PAD] (padding): When training LLMs with batch sizes greater than 1, a batch may contain texts of varying lengths. To ensure that all texts have the same length, the shorter texts are extended or \"padded\" with the [PAD] token up to the length of the longest text in the batch.</br>\n"
]
},
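{
"cell_type": "markdown",
"id": "pad-demo-md",
"metadata": {},
"source": [
"To make the [PAD] idea concrete, here is a minimal sketch (not part of the book's code) that pads a batch of token-ID sequences to a common length; the pad ID of 0 is a hypothetical choice:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "pad-demo-code",
"metadata": {},
"outputs": [],
"source": [
"# Minimal illustration of [PAD]-style batching (pad_id is an assumed, reserved ID).\n",
"batch = [[5, 1, 3], [2, 4]]  # two sequences of different lengths\n",
"pad_id = 0  # hypothetical ID reserved for the [PAD] token\n",
"max_len = max(len(seq) for seq in batch)\n",
"padded = [seq + [pad_id] * (max_len - len(seq)) for seq in batch]\n",
"print(padded)  # [[5, 1, 3], [2, 4, 0]]\n"
]
},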
{
"cell_type": "markdown",
"id": "c6120349",
"metadata": {},
"source": [
"Note that the tokenizer used for GPT models does not need any of the tokens mentioned above; for simplicity, it only uses an <|endoftext|> token, which is analogous to the [EOS] token mentioned above. Moreover, <|endoftext|> is also used for padding. However, as we will explore in later chapters, when training on batched inputs we typically use a mask, meaning we do not attend to padded tokens. Therefore, the specific token chosen for padding becomes inconsequential."
]
},
{
"cell_type": "markdown",
"id": "d3367570",
"metadata": {},
"source": [
"In addition, the tokenizer used for GPT models does not use an <|unk|> token for out-of-vocabulary words. Instead, GPT models use a byte pair encoding tokenizer, which breaks words down into subword units, as we will discuss in the next section."
]
},
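{
"cell_type": "markdown",
"id": "bpe-preview-md",
"metadata": {},
"source": [
"As a quick preview of the next section (a sketch, assuming the tiktoken package is installed): the GPT-2 byte pair encoding tokenizer can encode an arbitrary out-of-vocabulary word by splitting it into known subword units, so no <|unk|> token is needed:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "bpe-preview-code",
"metadata": {},
"outputs": [],
"source": [
"import tiktoken\n",
"\n",
"# BPE sketch: an unknown, made-up word is split into known subword units.\n",
"bpe_tokenizer = tiktoken.get_encoding(\"gpt2\")\n",
"ids = bpe_tokenizer.encode(\"Akwirw ier\")\n",
"print(ids)\n",
"print([bpe_tokenizer.decode([i]) for i in ids])\n"
]
},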
{
"cell_type": "code",
"execution_count": null,
"id": "ce78d469",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "llm_from_scratch",
"language": "python",
"name": "llm_from_scratch"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.14"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
@@ -75,7 +75,7 @@
 "import tiktoken\n",
 "\n",
 "tokenizer = tiktoken.get_encoding(\"gpt2\")\n",
-"with open(\"/Users/zhihu123/Project/other/llms-from-scratch-cn/ch02/01_main-chapter-code/the-verdict.txt\", \"r\", encoding=\"utf-8\") as f:\n",
+"with open(\"the-verdict.txt\", \"r\", encoding=\"utf-8\") as f:\n",
 "    raw_text = f.read()\n",
 "enc_text = tokenizer.encode(raw_text)\n",
 "print(len(enc_text))"
@@ -441,7 +441,7 @@
 }
 ],
 "source": [
-"with open(\"/Users/zhihu123/Project/other/llms-from-scratch-cn/ch02/01_main-chapter-code/the-verdict.txt\", \"r\", encoding=\"utf-8\") as f:\n",
+"with open(\"the-verdict.txt\", \"r\", encoding=\"utf-8\") as f:\n",
 "    raw_text = f.read()\n",
 "    dataloader = create_dataloader_v1(\n",
 "        raw_text, batch_size=1, max_length=4, stride=1, shuffle=False)\n",
214
Translated_Book/ch02/2.7 构建词符嵌入.ipynb
Normal file
@@ -0,0 +1,214 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "2cd2fcda-2fda-4aa8-8bc8-de1e496f9db1",
"metadata": {},
"source": [
"## 2.7 Creating token embeddings"
]
},
{
"cell_type": "markdown",
"id": "1a301068-6ab2-44ff-a915-1ba11688274f",
"metadata": {},
"source": [
"- The data we have been preparing for the large language model (LLM) is almost ready\n",
"- The last remaining step is to embed the tokens into a continuous vector representation using an embedding layer. Tokens by themselves cannot be computed on; they must be mapped into a continuous vector space before any subsequent operations, and the result of this mapping is the token's embedding\n",
"- Usually, the embedding layers that perform this conversion are part of the LLM itself and are continually adjusted and optimized during model training."
]
},
{
"cell_type": "markdown",
"id": "e85089aa-8671-4e5f-a2b3-ef252004ee4c",
"metadata": {},
"source": [
"<img src=\"https://github.com/datawhalechina/llms-from-scratch-cn/blob/main/Translated_Book/img/fig-2-15.jpg?raw=true\" width=\"400px\">"
]
},
{
"cell_type": "markdown",
"id": "44e014ca-1fc5-4b90-b6fa-c2097bb92c0b",
"metadata": {},
"source": [
"- Suppose we have the following four input examples after tokenization, with input IDs 2, 3, 5, and 1:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "15a6304c-9474-4470-b85d-3991a49fa653",
"metadata": {},
"outputs": [],
"source": [
"import torch\n",
"input_ids = torch.tensor([2, 3, 5, 1])"
]
},
{
"cell_type": "markdown",
"id": "14da6344-2c71-4837-858d-dd120005ba05",
"metadata": {},
"source": [
"- For simplicity, suppose we have a small vocabulary of only 6 words, and we want to create embeddings of size 3:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "93cb2cee-9aa6-4bb8-8977-c65661d16eda",
"metadata": {},
"outputs": [],
"source": [
"vocab_size = 6\n",
"output_dim = 3\n",
"\n",
"torch.manual_seed(123)\n",
"embedding_layer = torch.nn.Embedding(vocab_size, output_dim)"
]
},
{
"cell_type": "markdown",
"id": "4ff241f6-78eb-4e4a-a55f-5b2b6196d5b0",
"metadata": {},
"source": [
"- This results in a 6x3 weight matrix:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "a686eb61-e737-4351-8f1c-222913d47468",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Parameter containing:\n",
"tensor([[ 0.3374, -0.1778, -0.1690],\n",
"        [ 0.9178,  1.5810,  1.3010],\n",
"        [ 1.2753, -0.2010, -0.1606],\n",
"        [-0.4015,  0.9666, -1.1481],\n",
"        [-1.1589,  0.3255, -0.6315],\n",
"        [-2.8400, -0.7849, -1.4096]], requires_grad=True)\n"
]
}
],
"source": [
"print(embedding_layer.weight)"
]
},
{
"cell_type": "markdown",
"id": "5e54d5f1",
"metadata": {},
"source": [
"- For those familiar with one-hot encoding, the embedding layer approach above is essentially just a more efficient way of implementing one-hot encoding followed by matrix multiplication in a fully connected layer; see the supplementary code in [./embedding_vs_matmul](https://github.com/datawhalechina/llms-from-scratch-cn/tree/main/ch02/03_bonus_embedding-vs-matmul)\n",
"- Because the embedding layer is just a more efficient implementation equivalent to the one-hot encoding and matrix multiplication approach, it can be seen as a neural network layer that can be optimized via backpropagation (a small demonstration follows below)"
]
},
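{
"cell_type": "markdown",
"id": "onehot-demo-md",
"metadata": {},
"source": [
"- As a minimal sketch of this equivalence (assuming the cells above have been run, so `input_ids`, `vocab_size`, and `embedding_layer` are defined): multiplying a one-hot matrix by the weight matrix selects the same rows that the embedding look-up returns:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "onehot-demo-code",
"metadata": {},
"outputs": [],
"source": [
"# Sketch: one-hot encoding followed by a matmul reproduces the embedding look-up.\n",
"onehot = torch.nn.functional.one_hot(input_ids, num_classes=vocab_size).float()\n",
"print(torch.allclose(onehot @ embedding_layer.weight, embedding_layer(input_ids)))  # True\n"
]
},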
{
"cell_type": "markdown",
"id": "4b0d58c3-83c0-4205-aca2-9c48b19fd4a7",
"metadata": {},
"source": [
"- To convert a token with ID 3 into a 3-dimensional vector, we do the following:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "e43600ba-f287-4746-8ddf-d0f71a9023ca",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"tensor([[-0.4015,  0.9666, -1.1481]], grad_fn=<EmbeddingBackward0>)\n"
]
}
],
"source": [
"print(embedding_layer(torch.tensor([3])))"
]
},
{
"cell_type": "markdown",
"id": "a7bbf625-4f36-491d-87b4-3969efb784b0",
"metadata": {},
"source": [
"- Note that the above is the 4th row in the `embedding_layer` weight matrix\n",
"- To embed all four `input_ids` values above, we do the following:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "50280ead-0363-44c8-8c35-bb885d92c8b7",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"tensor([[ 1.2753, -0.2010, -0.1606],\n",
"        [-0.4015,  0.9666, -1.1481],\n",
"        [-2.8400, -0.7849, -1.4096],\n",
"        [ 0.9178,  1.5810,  1.3010]], grad_fn=<EmbeddingBackward0>)\n"
]
}
],
"source": [
"print(embedding_layer(input_ids))"
]
},
{
"cell_type": "markdown",
"id": "be97ced4-bd13-42b7-866a-4d699a17e155",
"metadata": {},
"source": [
"- An embedding layer is essentially a look-up operation:"
]
},
{
"cell_type": "markdown",
"id": "f33c2741-bf1b-4c60-b7fd-61409d556646",
"metadata": {},
"source": [
"<img src=\"https://github.com/datawhalechina/llms-from-scratch-cn/blob/main/Translated_Book/img/fig-2-16.jpg?raw=true\" width=\"500px\">"
]
},
{
"cell_type": "markdown",
"id": "08218d9f-aa1a-4afb-a105-72ff96a54e73",
"metadata": {},
"source": [
"- **You may be interested in the bonus content comparing embedding layers with regular linear layers: [../03_bonus_embedding-vs-matmul](https://github.com/datawhalechina/llms-from-scratch-cn/tree/main/ch02/03_bonus_embedding-vs-matmul)**"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.13"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
1950
Translated_Book/ch05/5.1 在未标记的数据上进行预训练.ipynb
Normal file
File diff suppressed because it is too large
BIN
Translated_Book/img/fig-2-20.png
Normal file
Binary file not shown.
After Width: | Height: | Size: 86 KiB
BIN
Translated_Book/img/fig-2-21.png
Normal file
Binary file not shown.
After Width: | Height: | Size: 139 KiB