Merge branch 'main' into main

tan90º 2024-05-12 20:55:53 +08:00 committed by GitHub
commit cdd57ed068
No known key found for this signature in database
GPG Key ID: B5690EEEBB952194
7 changed files with 2568 additions and 3 deletions

2
.gitignore vendored

@ -173,5 +173,5 @@ gpt2/
# be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
# and can be added to the global gitignore or merged into this file. For a more nuclear
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
#.idea/
/Model_Architecture_Discussions/ChatGLM3/weights/
.idea/


@ -0,0 +1,401 @@
{
"cells": [
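{
"cell_type": "markdown",
"id": "3cdf73ca",
"metadata": {},
"source": [
"# 2.4 Adding special context tokens"
]
},
{
"cell_type": "markdown",
"id": "1019c5ac",
"metadata": {},
"source": [
"In the previous section, we implemented a simple tokenizer and applied it to a passage from the training set. In this section, we will modify this tokenizer to handle unknown words."
]
},
{
"cell_type": "markdown",
"id": "7ba5b6a1",
"metadata": {},
"source": [
"We will also discuss the use and addition of special context tokens that can enhance a model's understanding of context or other relevant information in the text. For example, these special tokens can include markers for unknown words and document boundaries."
]
},
{
"cell_type": "markdown",
"id": "8b9bb7be",
"metadata": {},
"source": [
"In particular, we will modify the vocabulary and tokenizer implemented in the previous section into SimpleTokenizerV2, which supports two new tokens, <|unk|> and <|endoftext|>, as shown in Figure 2.9."
]
},
{
"cell_type": "markdown",
"id": "e8457f8c",
"metadata": {},
"source": [
"**Figure 2.9 We add special tokens to the vocabulary to handle certain contexts. For example, we add an <|unk|> token to represent new and unknown words that were not part of the training data and are therefore not part of the existing vocabulary. In addition, we add an <|endoftext|> token that we can use to separate two unrelated text sources.**"
]
},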
{
"cell_type": "markdown",
"id": "490fa60b",
"metadata": {},
"source": [
"![fig2.20](https://github.com/datawhalechina/llms-from-scratch-cn/blob/main/Translated_Book/img/fig-2-20.jpg?raw=true)"
]
},
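{
"cell_type": "markdown",
"id": "233806c3",
"metadata": {},
"source": [
"As shown in Figure 2.9, we can modify the tokenizer to use an <|unk|> token if it encounters a word that is not part of the vocabulary. In addition, we add a token between unrelated texts. For example, when training GPT-like LLMs on multiple independent documents or books, it is common to insert a token before each document or book that follows a previous text source, as shown in Figure 2.10.\n",
"This helps the LLM understand that, although these text sources are concatenated for training, they are in fact unrelated."
]
},
{
"cell_type": "markdown",
"id": "86ded03a",
"metadata": {},
"source": [
"**Figure 2.10 When working with multiple independent text sources, we add <|endoftext|> tokens between these texts. These <|endoftext|> tokens act as markers that signal the start or end of a particular segment, allowing the LLM to process and understand the text more effectively.**"
]
},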
{
"cell_type": "markdown",
"id": "acc76cd1",
"metadata": {},
"source": [
"![fig2.21](https://github.com/datawhalechina/llms-from-scratch-cn/blob/main/Translated_Book/img/fig-2-21.jpg?raw=true)"
]
},
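{
"cell_type": "markdown",
"id": "5792917b",
"metadata": {},
"source": [
"Let's now modify the vocabulary to include these two special tokens, <|unk|> and <|endoftext|>, by adding them to the list of all unique words that we created in the previous section:"
]
},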
{
"cell_type": "code",
"execution_count": null,
"id": "38439456",
"metadata": {},
"outputs": [],
"source": [
"import re\n",
"# Rebuild `preprocessed` (assumes the-verdict.txt is in the working directory and the same split as in the previous section)\n",
"with open(\"the-verdict.txt\", \"r\", encoding=\"utf-8\") as f:\n",
"    preprocessed = [t.strip() for t in re.split(r'([,.?_!\"()\\']|--|\\s)', f.read()) if t.strip()]\n",
"all_tokens = sorted(list(set(preprocessed)))\n",
"all_tokens.extend([\"<|endoftext|>\", \"<|unk|>\"])\n",
"vocab = {token:integer for integer,token in enumerate(all_tokens)}\n",
"print(len(vocab.items()))\n"
]
},
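{
"cell_type": "markdown",
"id": "6c7a776a",
"metadata": {},
"source": [
"According to the output of the print statement, the new vocabulary size is 1161 (the vocabulary size in the previous section was 1159)."
]
},
{
"cell_type": "markdown",
"id": "7688ecd4",
"metadata": {},
"source": [
"As an additional quick check, let's print the last 5 entries of the updated vocabulary:"
]
},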
{
"cell_type": "code",
"execution_count": null,
"id": "f08e3a9d",
"metadata": {},
"outputs": [],
"source": [
"for i, item in enumerate(list(vocab.items())[-5:]):\n",
"    print(item)"
]
},
{
"cell_type": "markdown",
"id": "9bd79d47",
"metadata": {},
"source": [
"The code above prints the following:"
]
},
{
"cell_type": "markdown",
"id": "4eb990f9",
"metadata": {},
"source": [
"('younger', 1156)\n",
"('your', 1157)\n",
"('yourself', 1158)\n",
"('<|endoftext|>', 1159)\n",
"('<|unk|>', 1160)"
]
},
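{
"cell_type": "markdown",
"id": "725cfe23",
"metadata": {},
"source": [
"Based on the output above, we can confirm that the two new special tokens were indeed successfully incorporated into the vocabulary. Next, we adjust the tokenizer from code listing 2.3 accordingly, as shown in listing 2.4:"
]
},
{
"cell_type": "markdown",
"id": "930dca4b",
"metadata": {},
"source": [
"**Listing 2.4 A simple text tokenizer that handles unknown words**"
]
},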
{
"cell_type": "code",
"execution_count": 2,
"id": "31a26133",
"metadata": {},
"outputs": [],
"source": [
"import re\n",
"\n",
"class SimpleTokenizerV2:\n",
"    def __init__(self, vocab):\n",
"        self.str_to_int = vocab\n",
"        self.int_to_str = {i: s for s, i in vocab.items()}\n",
"\n",
"    def encode(self, text):\n",
"        preprocessed = re.split(r'([,.?_!\"()\\']|--|\\s)', text)\n",
"        preprocessed = [item.strip() for item in preprocessed if item.strip()]\n",
"        # Replace words that are not in the vocabulary with <|unk|>\n",
"        preprocessed = [item if item in self.str_to_int else \"<|unk|>\" for item in preprocessed]\n",
"        ids = [self.str_to_int[s] for s in preprocessed]\n",
"        return ids\n",
"\n",
"    def decode(self, ids):\n",
"        text = \" \".join([self.int_to_str[i] for i in ids])\n",
"        # Remove the space in front of punctuation marks\n",
"        text = re.sub(r'\\s+([,.?!\"()\\'])', r'\\1', text)\n",
"        return text\n"
]
},
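{
"cell_type": "markdown",
"id": "273cda30",
"metadata": {},
"source": [
"Compared to the SimpleTokenizerV1 we implemented in code listing 2.3 in the previous section, the new SimpleTokenizerV2 replaces unknown words with <|unk|> tokens."
]
},
{
"cell_type": "markdown",
"id": "f7f3040c",
"metadata": {},
"source": [
"Now let's try this new tokenizer in practice. For this, we will use a simple text sample that is concatenated from two independent and unrelated sentences:"
]
},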
{
"cell_type": "code",
"execution_count": null,
"id": "19a04bd4",
"metadata": {},
"outputs": [],
"source": [
"text1 = \"Hello, do you like tea?\"\n",
"text2 = \"In the sunlit terraces of the palace.\"\n",
"text = \" <|endoftext|> \".join((text1, text2))\n",
"print(text)"
]
},
{
"cell_type": "markdown",
"id": "bcb6a7dc",
"metadata": {},
"source": [
"The output is as follows:"
]
},
{
"cell_type": "markdown",
"id": "157b8c26",
"metadata": {},
"source": [
"'Hello, do you like tea? <|endoftext|> In the sunlit terraces of the palace.'"
]
},
{
"cell_type": "markdown",
"id": "c64915c5",
"metadata": {},
"source": [
"Next, let's tokenize the sample text using SimpleTokenizerV2 with the vocab we created earlier in listing 2.2:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "162d3403",
"metadata": {},
"outputs": [],
"source": [
"tokenizer = SimpleTokenizerV2(vocab)\n",
"print(tokenizer.encode(text))\n"
]
},
{
"cell_type": "markdown",
"id": "9f532e55",
"metadata": {},
"source": [
"This prints the following token IDs:"
]
},
{
"cell_type": "markdown",
"id": "55d238cf",
"metadata": {},
"source": [
"[1160, 5, 362, 1155, 642, 1000, 10, 1159, 57, 1013, 981, 1009, 738, 1013, 1160, 7]"
]
},
{
"cell_type": "markdown",
"id": "a7988b14",
"metadata": {},
"source": [
"We can see that the list of token IDs contains 1159 for the <|endoftext|> separator token, as well as two 1160 tokens, which stand for unknown words."
]
},
{
"cell_type": "markdown",
"id": "ac7c9571",
"metadata": {},
"source": [
"Let's de-tokenize the text for a quick sanity check:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "643173b0",
"metadata": {},
"outputs": [],
"source": [
"print(tokenizer.decode(tokenizer.encode(text)))"
]
},
{
"cell_type": "markdown",
"id": "1f6c9917",
"metadata": {},
"source": [
"The output is as follows:"
]
},
{
"cell_type": "markdown",
"id": "203d93b6",
"metadata": {},
"source": [
"'<|unk|>, do you like tea? <|endoftext|> In the sunlit terraces of the <|unk|>.'"
]
},
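{
"cell_type": "markdown",
"id": "3ca2a152",
"metadata": {},
"source": [
"Comparing the de-tokenized text above with the original input text, we know that the training dataset, Edith Wharton's short story The Verdict, does not contain the words \"Hello\" and \"palace\".\n",
"\n",
"So far, we have discussed tokenization as an essential step in processing text as input to LLMs. Depending on the LLM, some researchers also consider additional special tokens, such as the following:\n",
"\n",
"- [BOS] (beginning of sequence): This token marks the start of a text. It signals to the LLM where a piece of content begins.\n",
"- [EOS] (end of sequence): This token is positioned at the end of a text and is especially useful when concatenating multiple unrelated texts, similar to <|endoftext|>. For example, when combining two different Wikipedia articles or books, the [EOS] token indicates where one article ends and the next one begins.\n",
"- [PAD] (padding): When training LLMs with batch sizes larger than 1, the batch may contain texts of different lengths. To ensure that all texts have the same length, the shorter texts are extended, or \"padded\", with the [PAD] token up to the length of the longest text in the batch.\n"
]
},
{
"cell_type": "markdown",
"id": "c6120349",
"metadata": {},
"source": [
"Note that the tokenizer used for GPT models does not need any of the tokens mentioned above; for simplicity, it only uses an <|endoftext|> token. The <|endoftext|> token is analogous to the [EOS] token mentioned above. In addition, <|endoftext|> is also used for padding. However, as we will explore in subsequent chapters, when training on batched inputs we typically use a mask, which means that we do not attend to padded tokens. The specific token chosen for padding therefore becomes irrelevant."
]
},
{
"cell_type": "markdown",
"id": "d3367570",
"metadata": {},
"source": [
"Moreover, the tokenizer used for GPT models also does not use an <|unk|> token for out-of-vocabulary words. Instead, GPT models use a byte pair encoding tokenizer, which breaks words down into subword units. We will discuss this in the next section."
]
},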
{
"cell_type": "code",
"execution_count": null,
"id": "ce78d469",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "llm_from_scratch",
"language": "python",
"name": "llm_from_scratch"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.14"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
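
The unknown-word handling in the notebook above can be tried without `the-verdict.txt`. The following is a minimal sketch, assuming a tiny hand-written vocabulary (`words` is hypothetical and not part of the commit) and standalone `encode`/`decode` helper functions that mirror the methods of `SimpleTokenizerV2`. It only illustrates how `<|unk|>` and `<|endoftext|>` behave; it does not reproduce the token IDs shown in the notebook.

```python
import re

# Toy vocabulary for illustration only (hypothetical); the notebook builds
# its vocabulary from the-verdict.txt instead.
words = ["do", "you", "like", "tea", "the", "of", "In", ",", ".", "?"]
vocab = {token: idx for idx, token in enumerate(sorted(words))}

# Append the two special tokens at the end, in the same order as the notebook.
for special in ("<|endoftext|>", "<|unk|>"):
    vocab[special] = len(vocab)


def encode(text, vocab):
    """Split text into tokens and map each one to its ID, falling back to <|unk|>."""
    tokens = re.split(r'([,.?_!"()\']|--|\s)', text)
    tokens = [t.strip() for t in tokens if t.strip()]
    return [vocab.get(t, vocab["<|unk|>"]) for t in tokens]


def decode(ids, vocab):
    """Map IDs back to tokens and remove the space in front of punctuation."""
    id_to_token = {i: t for t, i in vocab.items()}
    text = " ".join(id_to_token[i] for i in ids)
    return re.sub(r'\s+([,.?!"()\'])', r'\1', text)


text = "Hello, do you like tea? <|endoftext|> In the sunlit terraces of the palace."
ids = encode(text, vocab)
print(ids)
print(decode(ids, vocab))
```

With this toy vocabulary, "Hello", "sunlit", "terraces", and "palace" all map to the `<|unk|>` ID, so the round trip loses those words, which is the limitation that motivates byte pair encoding.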


@ -75,7 +75,7 @@
"import tiktoken\n",
"\n",
"tokenizer = tiktoken.get_encoding(\"gpt2\")\n",
"with open(\"/Users/zhihu123/Project/other/llms-from-scratch-cn/ch02/01_main-chapter-code/the-verdict.txt\", \"r\", encoding=\"utf-8\") as f:\n",
"with open(\"the-verdict.txt\", \"r\", encoding=\"utf-8\") as f:\n",
" raw_text = f.read()\n",
"enc_text = tokenizer.encode(raw_text)\n",
"print(len(enc_text))"
@ -441,7 +441,7 @@
}
],
"source": [
"with open(\"/Users/zhihu123/Project/other/llms-from-scratch-cn/ch02/01_main-chapter-code/the-verdict.txt\", \"r\", encoding=\"utf-8\") as f:\n",
"with open(\"the-verdict.txt\", \"r\", encoding=\"utf-8\") as f:\n",
" raw_text = f.read()\n",
" dataloader = create_dataloader_v1(\n",
" raw_text, batch_size=1, max_length=4, stride=1, shuffle=False)\n",


@ -0,0 +1,214 @@
{
"cells": [
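{
"cell_type": "markdown",
"id": "2cd2fcda-2fda-4aa8-8bc8-de1e496f9db1",
"metadata": {},
"source": [
"## 2.7 Creating token embeddings"
]
},
{
"cell_type": "markdown",
"id": "1a301068-6ab2-44ff-a915-1ba11688274f",
"metadata": {},
"source": [
"- The data for the large language model (LLM) is now almost ready.\n",
"- The final step is to embed the tokens into continuous vector representations using an embedding layer. Token IDs cannot be used in computations directly; they must first be mapped into a continuous vector space, and the result of this mapping is the embedding of the token.\n",
"- These embedding layers are usually part of the LLM itself and are updated (trained) during model training."
]
},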
{
"cell_type": "markdown",
"id": "e85089aa-8671-4e5f-a2b3-ef252004ee4c",
"metadata": {},
"source": [
"<img src=\"https://github.com/datawhalechina/llms-from-scratch-cn/blob/main/Translated_Book/img/fig-2-15.jpg?raw=true\" width=\"400px\">"
]
},
{
"cell_type": "markdown",
"id": "44e014ca-1fc5-4b90-b6fa-c2097bb92c0b",
"metadata": {},
"source": [
"- Suppose we have the following four input examples with input IDs 2, 3, 5, and 1 (after tokenization):"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "15a6304c-9474-4470-b85d-3991a49fa653",
"metadata": {},
"outputs": [],
"source": [
"import torch\n",
"input_ids = torch.tensor([2, 3, 5, 1])"
]
},
{
"cell_type": "markdown",
"id": "14da6344-2c71-4837-858d-dd120005ba05",
"metadata": {},
"source": [
"- For simplicity, suppose we have a small vocabulary of only 6 words and we want to create embeddings of size 3:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "93cb2cee-9aa6-4bb8-8977-c65661d16eda",
"metadata": {},
"outputs": [],
"source": [
"vocab_size = 6\n",
"output_dim = 3\n",
"\n",
"torch.manual_seed(123)\n",
"embedding_layer = torch.nn.Embedding(vocab_size, output_dim)"
]
},
{
"cell_type": "markdown",
"id": "4ff241f6-78eb-4e4a-a55f-5b2b6196d5b0",
"metadata": {},
"source": [
"- This results in a 6x3 weight matrix:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "a686eb61-e737-4351-8f1c-222913d47468",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Parameter containing:\n",
"tensor([[ 0.3374, -0.1778, -0.1690],\n",
" [ 0.9178, 1.5810, 1.3010],\n",
" [ 1.2753, -0.2010, -0.1606],\n",
" [-0.4015, 0.9666, -1.1481],\n",
" [-1.1589, 0.3255, -0.6315],\n",
" [-2.8400, -0.7849, -1.4096]], requires_grad=True)\n"
]
}
],
"source": [
"print(embedding_layer.weight)"
]
},
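{
"cell_type": "markdown",
"id": "5e54d5f1",
"metadata": {},
"source": [
"- For those familiar with one-hot encoding, the embedding layer approach above is essentially just a more efficient way of implementing one-hot encoding followed by matrix multiplication in a fully connected layer. This is described in the supplementary code in [./embedding_vs_matmul](https://github.com/datawhalechina/llms-from-scratch-cn/tree/main/ch02/03_bonus_embedding-vs-matmul).\n",
"- Because the embedding layer is just a more efficient implementation that is equivalent to the one-hot encoding and matrix multiplication approach, it can be seen as a neural network layer that can be optimized via backpropagation."
]
},
{
"cell_type": "markdown",
"id": "4b0d58c3-83c0-4205-aca2-9c48b19fd4a7",
"metadata": {},
"source": [
"- To convert a token with ID 3 into a 3-dimensional vector, we do the following:"
]
},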
{
"cell_type": "code",
"execution_count": 4,
"id": "e43600ba-f287-4746-8ddf-d0f71a9023ca",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"tensor([[-0.4015, 0.9666, -1.1481]], grad_fn=<EmbeddingBackward0>)\n"
]
}
],
"source": [
"print(embedding_layer(torch.tensor([3])))"
]
},
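{
"cell_type": "markdown",
"id": "a7bbf625-4f36-491d-87b4-3969efb784b0",
"metadata": {},
"source": [
"- Note that the output above is the 4th row of the `embedding_layer` weight matrix (row index 3, since indexing starts at 0).\n",
"- To embed all four `input_ids` values above, we do the following:"
]
},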
{
"cell_type": "code",
"execution_count": 5,
"id": "50280ead-0363-44c8-8c35-bb885d92c8b7",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"tensor([[ 1.2753, -0.2010, -0.1606],\n",
" [-0.4015, 0.9666, -1.1481],\n",
" [-2.8400, -0.7849, -1.4096],\n",
" [ 0.9178, 1.5810, 1.3010]], grad_fn=<EmbeddingBackward0>)\n"
]
}
],
"source": [
"print(embedding_layer(input_ids))"
]
},
{
"cell_type": "markdown",
"id": "be97ced4-bd13-42b7-866a-4d699a17e155",
"metadata": {},
"source": [
"- An embedding layer is essentially a lookup operation:"
]
},
{
"cell_type": "markdown",
"id": "f33c2741-bf1b-4c60-b7fd-61409d556646",
"metadata": {},
"source": [
"<img src=\"https://github.com/datawhalechina/llms-from-scratch-cn/blob/main/Translated_Book/img/fig-2-16.jpg?raw=true\" width=\"500px\">"
]
},
{
"cell_type": "markdown",
"id": "08218d9f-aa1a-4afb-a105-72ff96a54e73",
"metadata": {},
"source": [
"- **You may be interested in the bonus content that compares embedding layers with regular linear layers: [../03_bonus_embedding-vs-matmul](https://github.com/datawhalechina/llms-from-scratch-cn/tree/main/ch02/03_bonus_embedding-vs-matmul)**"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.13"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
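
The notebook above states that an embedding layer is a lookup that is equivalent to one-hot encoding followed by matrix multiplication. The following is a minimal sketch of that equivalence, assuming plain Python lists in place of `torch` tensors so that it runs without PyTorch. The weight values are copied from the printed `embedding_layer.weight` output in the notebook, and the helper names `lookup` and `one_hot_matmul` are hypothetical.

```python
# Weight matrix copied from the notebook's printed embedding_layer.weight (6 x 3).
weights = [
    [0.3374, -0.1778, -0.1690],
    [0.9178, 1.5810, 1.3010],
    [1.2753, -0.2010, -0.1606],
    [-0.4015, 0.9666, -1.1481],
    [-1.1589, 0.3255, -0.6315],
    [-2.8400, -0.7849, -1.4096],
]
input_ids = [2, 3, 5, 1]


def lookup(ids, weights):
    """Embedding as a table lookup: pick row i for each token ID i."""
    return [list(weights[i]) for i in ids]


def one_hot_matmul(ids, weights):
    """Same result via a one-hot vector multiplied with the weight matrix."""
    vocab_size, dim = len(weights), len(weights[0])
    out = []
    for i in ids:
        one_hot = [1.0 if k == i else 0.0 for k in range(vocab_size)]
        row = [
            sum(one_hot[k] * weights[k][j] for k in range(vocab_size))
            for j in range(dim)
        ]
        out.append(row)
    return out


embedded = lookup(input_ids, weights)
print(embedded[1])  # row for token ID 3
print(embedded == one_hot_matmul(input_ids, weights))
```

The lookup touches one row per token, while the one-hot version multiplies by a vector that is almost entirely zeros, which is why the lookup form is the more efficient of the two.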

File diff suppressed because it is too large

Binary file not shown. Size: 86 KiB

Binary file not shown. Size: 139 KiB