mirror of
https://github.com/datawhalechina/llms-from-scratch-cn.git
synced 2026-05-01 11:58:17 +08:00
{
"cells": [
{
"cell_type": "markdown",
"id": "3cdf73ca",
"metadata": {},
"source": [
"# 2.4 Adding special context tokens"
]
},
{
"cell_type": "markdown",
"id": "1019c5ac",
"metadata": {},
"source": [
"In the previous section, we implemented a simple tokenizer and applied it to a passage from the training set. In this section, we will modify that tokenizer to handle unknown words."
]
},
{
"cell_type": "markdown",
"id": "7ba5b6a1",
"metadata": {},
"source": [
"We will also discuss the use and addition of special context tokens, which can enhance the model's understanding of context or other relevant information in the text. For example, these special tokens can include markers for unknown words and document boundaries."
]
},
{
"cell_type": "markdown",
"id": "8b9bb7be",
"metadata": {},
"source": [
"In particular, we will modify the vocabulary and the tokenizer implemented in the previous section into SimpleTokenizerV2 to support two new tokens, <|unk|> and <|endoftext|>, as illustrated in Figure 2.9."
]
},
{
"cell_type": "markdown",
"id": "e8457f8c",
"metadata": {},
"source": [
"**Figure 2.9 We add special tokens to the vocabulary to handle specific contexts. For example, we add an <|unk|> token to represent new and unknown words that were not part of the training data and are thus not part of the existing vocabulary. Furthermore, we add an <|endoftext|> token that we can use to separate two unrelated text sources.**"
]
},
{
"cell_type": "markdown",
"id": "490fa60b",
"metadata": {},
"source": [
""
]
},
{
"cell_type": "markdown",
"id": "233806c3",
"metadata": {},
"source": [
"As shown in Figure 2.9, we can modify the tokenizer to use an <|unk|> token if it encounters a word that is not part of the vocabulary. Furthermore, we add tokens between unrelated texts. For example, when training GPT-like LLMs on multiple independent documents or books, it is common to insert a token before each document or book that follows a previous text source, as illustrated in Figure 2.10.\n",
"This helps the LLM understand that, although these text sources are concatenated for training, they are in fact unrelated."
]
},
{
"cell_type": "markdown",
"id": "86ded03a",
"metadata": {},
"source": [
"**Figure 2.10 When working with multiple independent text sources, we add <|endoftext|> tokens between these texts. These <|endoftext|> tokens act as markers, signaling the start and end of a particular segment, which allows the LLM to process and understand the text more effectively.**"
]
},
{
"cell_type": "markdown",
"id": "acc76cd1",
"metadata": {},
"source": [
""
]
},
{
"cell_type": "markdown",
"id": "5792917b",
"metadata": {},
"source": [
"Let's now modify the vocabulary to include these two special tokens, <|unk|> and <|endoftext|>, by adding them to the list of unique words that we created in the previous section:"
]
},
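{
"cell_type": "markdown",
"id": "setup-preprocessed-note",
"metadata": {},
"source": [
"The next code cell relies on the `preprocessed` token list built in the previous section. As a minimal sketch to make this notebook self-contained (assuming the short story was saved as \"the-verdict.txt\" there), it can be recreated as follows:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "setup-preprocessed-code",
"metadata": {},
"outputs": [],
"source": [
"import re\n",
"\n",
"# Recreate `preprocessed` from the previous section\n",
"# (assumes \"the-verdict.txt\" is present in the working directory)\n",
"with open(\"the-verdict.txt\", \"r\", encoding=\"utf-8\") as f:\n",
"    raw_text = f.read()\n",
"preprocessed = re.split(r'([,.?_!\"()\\']|--|\\s)', raw_text)\n",
"preprocessed = [item.strip() for item in preprocessed if item.strip()]"
]
},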
{
"cell_type": "code",
"execution_count": 4,
"id": "38439456",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1161\n"
]
}
],
"source": [
"all_tokens = sorted(list(set(preprocessed)))\n",
"all_tokens.extend([\"<|endoftext|>\", \"<|unk|>\"])\n",
"vocab = {token:integer for integer,token in enumerate(all_tokens)}\n",
"print(len(vocab.items()))\n"
]
},
{
"cell_type": "markdown",
"id": "6c7a776a",
"metadata": {},
"source": [
"Based on the output of the print statement above, the new vocabulary size is 1161 (the vocabulary size in the previous section was 1159)."
]
},
{
"cell_type": "markdown",
"id": "7688ecd4",
"metadata": {},
"source": [
"As an additional quick check, let's print the last five entries of the updated vocabulary:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "f08e3a9d",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"('younger', 1156)\n",
"('your', 1157)\n",
"('yourself', 1158)\n",
"('<|endoftext|>', 1159)\n",
"('<|unk|>', 1160)\n"
]
}
],
"source": [
"for i, item in enumerate(list(vocab.items())[-5:]):\n",
"    print(item)"
]
},
{
"cell_type": "markdown",
"id": "9bd79d47",
"metadata": {},
"source": [
"The code above prints the following:"
]
},
{
"cell_type": "markdown",
"id": "4eb990f9",
"metadata": {},
"source": [
"('younger', 1156)\n",
"('your', 1157)\n",
"('yourself', 1158)\n",
"('<|endoftext|>', 1159)\n",
"('<|unk|>', 1160)"
]
},
{
"cell_type": "markdown",
"id": "725cfe23",
"metadata": {},
"source": [
"Based on the output above, we can confirm that the two new special tokens were indeed successfully merged into the vocabulary. Next, we adjust the tokenizer from Listing 2.3 accordingly, as shown in Listing 2.4:"
]
},
{
"cell_type": "markdown",
"id": "930dca4b",
"metadata": {},
"source": [
"**Listing 2.4 A simple text tokenizer that handles unknown words**"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "31a26133",
"metadata": {},
"outputs": [],
"source": [
"class SimpleTokenizerV2:\n",
"    def __init__(self, vocab):\n",
"        self.str_to_int = vocab\n",
"        self.int_to_str = { i:s for s,i in vocab.items()}\n",
"\n",
"    def encode(self, text):\n",
"        preprocessed = re.split(r'([,.?_!\"()\\']|--|\\s)', text)\n",
"        preprocessed = [item.strip() for item in preprocessed if item.strip()]\n",
"        preprocessed = [item if item in self.str_to_int else \"<|unk|>\" for item in preprocessed] #A Replace unknown words with <|unk|> tokens\n",
"        ids = [self.str_to_int[s] for s in preprocessed]\n",
"        return ids\n",
"\n",
"    def decode(self, ids):\n",
"        text = \" \".join([self.int_to_str[i] for i in ids])\n",
"        text = re.sub(r'\\s+([,.?!\"()\\'])', r'\\1', text) #B Remove spaces before the specified punctuation\n",
"        return text\n"
]
},
{
"cell_type": "markdown",
"id": "273cda30",
"metadata": {},
"source": [
"Compared to SimpleTokenizerV1, which we implemented in Listing 2.3 in the previous section, the new SimpleTokenizerV2 replaces unknown words with <|unk|> tokens."
]
},
{
"cell_type": "markdown",
"id": "f7f3040c",
"metadata": {},
"source": [
"Let's now try this new tokenizer in practice. For this, we will use a simple text sample concatenated from two independent and unrelated sentences:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "19a04bd4",
"metadata": {},
"outputs": [],
"source": [
"text1 = \"Hello, do you like tea?\"\n",
"text2 = \"In the sunlit terraces of the palace.\"\n",
"text = \" <|endoftext|> \".join((text1, text2))\n",
"print(text)"
]
},
{
"cell_type": "markdown",
"id": "bcb6a7dc",
"metadata": {},
"source": [
"The output is as follows:"
]
},
{
"cell_type": "markdown",
"id": "157b8c26",
"metadata": {},
"source": [
"'Hello, do you like tea? <|endoftext|> In the sunlit terraces of the palace.'"
]
},
{
"cell_type": "markdown",
"id": "c64915c5",
"metadata": {},
"source": [
"Next, let's tokenize this sample text using SimpleTokenizerV2 with the vocab we previously created in Listing 2.2:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "162d3403",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[1160, 5, 362, 1155, 642, 1000, 10, 1159, 57, 1013, 981, 1009, 738, 1013, 1160]\n"
]
}
],
"source": [
"tokenizer = SimpleTokenizerV2(vocab)\n",
"print(tokenizer.encode(text))\n"
]
},
{
"cell_type": "markdown",
"id": "9f532e55",
"metadata": {},
"source": [
"This prints the following token IDs:"
]
},
{
"cell_type": "markdown",
"id": "55d238cf",
"metadata": {},
"source": [
"[1160, 5, 362, 1155, 642, 1000, 10, 1159, 57, 1013, 981, 1009, 738, 1013, 1160]"
]
},
{
"cell_type": "markdown",
"id": "a7988b14",
"metadata": {},
"source": [
"We can see that the list of token IDs contains 1159, the <|endoftext|> separator token, as well as two 1160 tokens, which are used for unknown words."
]
},
{
"cell_type": "markdown",
"id": "ac7c9571",
"metadata": {},
"source": [
"For a quick sanity check, let's de-tokenize the text:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "643173b0",
"metadata": {},
"outputs": [],
"source": [
"print(tokenizer.decode(tokenizer.encode(text)))"
]
},
{
"cell_type": "markdown",
"id": "1f6c9917",
"metadata": {},
"source": [
"The output looks like this:"
]
},
{
"cell_type": "markdown",
"id": "203d93b6",
"metadata": {},
"source": [
"'<|unk|>, do you like tea? <|endoftext|> In the sunlit terraces of the <|unk|>.'"
]
},
{
"cell_type": "markdown",
"id": "3ca2a152",
"metadata": {},
"source": [
"By comparing the de-tokenized text above with the original input text, we know that the training dataset, Edith Wharton's short story The Verdict, did not contain the words \"Hello\" and \"palace\".\n",
"\n",
"So far, we have discussed tokenization as an essential step in processing text as input to LLMs. Depending on the LLM, some researchers also consider additional special tokens, such as the following:\n",
"\n",
"- [BOS] (beginning of sequence): This token marks the start of a text, signifying to the LLM where a piece of content begins.\n",
"- [EOS] (end of sequence): This token is positioned at the end of a text and is especially useful when concatenating multiple unrelated texts, similar to <|endoftext|>. For instance, when combining two different Wikipedia articles or books, the [EOS] token indicates where one article ends and the next one begins.\n",
"- [PAD] (padding): When training LLMs with batch sizes larger than one, the batch might contain texts of varying lengths. To ensure all texts have the same length, the shorter texts are extended or \"padded\" using the [PAD] token, up to the length of the longest text in the batch (see the sketch after this list).\n"
]
},
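{
"cell_type": "markdown",
"id": "pad-sketch-note",
"metadata": {},
"source": [
"To make the padding mechanics concrete, here is a minimal, hypothetical sketch (the `batch` contents below are made-up token-ID sequences for illustration) that pads shorter sequences to the length of the longest sequence in a batch:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "pad-sketch-code",
"metadata": {},
"outputs": [],
"source": [
"# Hypothetical example: pad token-ID sequences to equal length.\n",
"# `pad_id` stands in for the ID of a [PAD] token; as noted below,\n",
"# GPT models simply reuse <|endoftext|> (ID 1159 in our vocabulary).\n",
"batch = [[1160, 5, 362], [57, 1013], [738]]\n",
"pad_id = 1159\n",
"\n",
"max_len = max(len(seq) for seq in batch)\n",
"padded = [seq + [pad_id] * (max_len - len(seq)) for seq in batch]\n",
"print(padded)\n",
"# [[1160, 5, 362], [57, 1013, 1159], [738, 1159, 1159]]"
]
},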
{
"cell_type": "markdown",
"id": "c6120349",
"metadata": {},
"source": [
"Note that the tokenizer used for GPT models does not need any of the tokens mentioned above; it only uses an <|endoftext|> token for simplicity. <|endoftext|> is analogous to the [EOS] token mentioned above, and <|endoftext|> is also used for padding. However, as we will explore in subsequent chapters, when training on batched inputs we typically use a mask, meaning we don't attend to padded tokens. Hence, the specific token chosen for padding becomes inconsequential."
]
},
{
"cell_type": "markdown",
"id": "d3367570",
"metadata": {},
"source": [
"Moreover, the tokenizer used for GPT models also doesn't use an <|unk|> token for out-of-vocabulary words. Instead, GPT models use a byte pair encoding tokenizer, which breaks words down into subword units, as we will discuss in the next section."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ce78d469",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "llm_from_scratch",
"language": "python",
"name": "llm_from_scratch"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.14"
}
},
"nbformat": 4,
"nbformat_minor": 5
}