{
"cells": [
{
"cell_type": "markdown",
"id": "3cdf73ca",
"metadata": {},
"source": [
"# 2.4 添加特殊的上下文tokens"
]
},
{
"cell_type": "markdown",
"id": "1019c5ac",
"metadata": {},
"source": [
"在上一节中,我们实现了一个简单的分词器,并将其应用于训练集中的一段。在本节中,我们将修改这个分词器来处理未知单词。"
]
},
{
"cell_type": "markdown",
"id": "7ba5b6a1",
"metadata": {},
"source": [
"我们还将讨论特殊上下文标记的使用和添加,这些标记可以增强模型对文本中上下文或其他相关信息的理解。例如,这些特殊标记可以包括未知单词和文档边界的标记。"
]
},
{
"cell_type": "markdown",
"id": "8b9bb7be",
"metadata": {},
"source": [
"特别是我们将修改在上一节SimpleTokenizerV2中实现的词汇表和标记器以支持两个新标记<|UNK|>和<|内文|如图2.9所示。"
]
},
{
"cell_type": "markdown",
"id": "e8457f8c",
"metadata": {},
"source": [
"**图2.9 我们在词汇表中添加特殊的标记来处理特定的上下文。例如,我们添加<|UNK|> token表示新的和未知的单词这些单词不是训练数据的一部分因此也不是现有词汇表的一部分。此外我们还添加了一个<|内文|> token我们可以使用它来分隔两个不相关的文本源。**"
]
},
{
"cell_type": "markdown",
"id": "490fa60b",
"metadata": {},
"source": [
"![fig2.20](../img/fig-2-20.png)"
]
},
{
"cell_type": "markdown",
"id": "233806c3",
"metadata": {},
"source": [
"如图2.9所示我们可以修改tokenizer以使用<|UNK|> token如果它遇到一个不属于词汇表的单词。此外我们在不相关的文本之间添加标记。例如当在多个独立的文档或书籍上训练类似GPT的LLMs时通常会在前一个文本源之后的每个文档或书籍之前插入一个令牌如图2.10所示。\n",
"这有助于LLM理解尽管这些文本源是为了训练而连接的但它们实际上是不相关的。"
]
},
{
"cell_type": "markdown",
"id": "86ded03a",
"metadata": {},
"source": [
"**图2.10当处理多个独立的文本源时,我们在这些文本间添加叫做<|endoftext|>的tokens。这些<|endoftext|>tokens作为标记标志着一个特定段落的开始和结束这使得LLM能更有效地处理和理解文本。**"
]
},
{
"cell_type": "markdown",
"id": "acc76cd1",
"metadata": {},
"source": [
"![fig2.21](../img/fig-2-21.png)"
]
},
{
"cell_type": "markdown",
"id": "5792917b",
"metadata": {},
"source": [
"现在让我们修改词汇表以包含这两个特殊的token<unk>以及<|endoftext|>,并将它们添加到我们在上一节中创建的唯一词表中:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "38439456",
"metadata": {},
"outputs": [
{
"ename": "NameError",
"evalue": "name 'preprocessed' is not defined",
"output_type": "error",
"traceback": [
"\u001b[1;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[1;31mNameError\u001b[0m Traceback (most recent call last)",
"Cell \u001b[1;32mIn[4], line 1\u001b[0m\n\u001b[1;32m----> 1\u001b[0m all_tokens \u001b[38;5;241m=\u001b[39m \u001b[38;5;28msorted\u001b[39m(\u001b[38;5;28mlist\u001b[39m(\u001b[38;5;28mset\u001b[39m(\u001b[43mpreprocessed\u001b[49m)))\n\u001b[0;32m 2\u001b[0m all_tokens\u001b[38;5;241m.\u001b[39mextend([\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124m<|endoftext|>\u001b[39m\u001b[38;5;124m\"\u001b[39m, \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124m<|unk|>\u001b[39m\u001b[38;5;124m\"\u001b[39m])\n\u001b[0;32m 3\u001b[0m vocab \u001b[38;5;241m=\u001b[39m {token:integer \u001b[38;5;28;01mfor\u001b[39;00m integer,token \u001b[38;5;129;01min\u001b[39;00m \u001b[38;5;28menumerate\u001b[39m(all_tokens)}\n",
"\u001b[1;31mNameError\u001b[0m: name 'preprocessed' is not defined"
]
}
],
"source": [
"all_tokens = sorted(list(set(preprocessed)))\n",
"all_tokens.extend([\"<|endoftext|>\", \"<|unk|>\"])\n",
"vocab = {token:integer for integer,token in enumerate(all_tokens)}\n",
"print(len(vocab.items()))\n"
]
},
{
"cell_type": "markdown",
"id": "6c7a776a",
"metadata": {},
"source": [
"根据print语句的输出新的词表大小为1161上一节中的词表大小为1159。"
]
},
{
"cell_type": "markdown",
"id": "7688ecd4",
"metadata": {},
"source": [
"作为额外的快速检查让我们打印更新词汇表的最后5个词"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "f08e3a9d",
"metadata": {},
"outputs": [
{
"ename": "NameError",
"evalue": "name 'vocab' is not defined",
"output_type": "error",
"traceback": [
"\u001b[1;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[1;31mNameError\u001b[0m Traceback (most recent call last)",
"Cell \u001b[1;32mIn[5], line 1\u001b[0m\n\u001b[1;32m----> 1\u001b[0m \u001b[38;5;28;01mfor\u001b[39;00m i, item \u001b[38;5;129;01min\u001b[39;00m \u001b[38;5;28menumerate\u001b[39m(\u001b[38;5;28mlist\u001b[39m(\u001b[43mvocab\u001b[49m\u001b[38;5;241m.\u001b[39mitems())[\u001b[38;5;241m-\u001b[39m\u001b[38;5;241m5\u001b[39m:]):\n\u001b[0;32m 2\u001b[0m \u001b[38;5;28mprint\u001b[39m(item)\n",
"\u001b[1;31mNameError\u001b[0m: name 'vocab' is not defined"
]
}
],
"source": [
"for i, item in enumerate(list(vocab.items())[-5:]):\n",
" print(item)"
]
},
{
"cell_type": "markdown",
"id": "9bd79d47",
"metadata": {},
"source": [
"上面的代码打印如下内容:"
]
},
{
"cell_type": "markdown",
"id": "4eb990f9",
"metadata": {},
"source": [
"('younger', 1156)\n",
"('your', 1157)\n",
"('yourself', 1158)\n",
"('<|endoftext|>', 1159)\n",
"('<|unk|>', 1160)"
]
},
{
"cell_type": "markdown",
"id": "725cfe23",
"metadata": {},
"source": [
"根据上面的代码输出我们可以确认这两个新的特殊token确实成功地合并到了词表中。接下来我们相应地调整代码清单2.3中的tokenizer如清单2.4所示:"
]
},
{
"cell_type": "markdown",
"id": "930dca4b",
"metadata": {},
"source": [
"**清单2.4一个处理未知单词的简单文本标记器**"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "31a26133",
"metadata": {},
"outputs": [],
"source": [
"class SimpleTokenizerV2:\n",
" def __init__(self, vocab):\n",
" self.str_to_int = vocab\n",
" self.int_to_str = { i:s for s,i in vocab.items()}\n",
" def encode(self, text):\n",
" preprocessed = re.split(r'([,.?_!\"()\\']|--|\\s)', text)\n",
" preprocessed = [item.strip() for item in preprocessed if item.strip()]\n",
" preprocessed = [item if item in self.str_to_int else \"<|unk|>\" for item in preprocessed] #A\n",
" ids = [self.str_to_int[s] for s in preprocessed]\n",
" return ids\n",
" def decode(self, ids):\n",
" text = \" \".join([self.int_to_str[i] for i in ids])\n",
" text = re.sub(r'\\s+([,.?!\"()\\'])', r'\\1', text) #B\n",
" return text\n"
]
},
{
"cell_type": "markdown",
"id": "273cda30",
"metadata": {},
"source": [
"与我们在上一节的代码清单2.3中实现的SimpleTokenizerV1相比新的SimpleTokenizerV2将未知单词替换为<|UNK|>token。"
]
},
{
"cell_type": "markdown",
"id": "f7f3040c",
"metadata": {},
"source": [
"现在让我们在实践中尝试这个新的标记器。为此,我们将使用一个简单的文本示例,它是由两个独立且不相关的句子连接而成的:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "19a04bd4",
"metadata": {},
"outputs": [],
"source": [
"text1 = \"Hello, do you like tea?\"\n",
"text2 = \"In the sunlit terraces of the palace.\"\n",
"text = \" <|endoftext|> \".join((text1, text2))\n",
"print(text)"
]
},
{
"cell_type": "markdown",
"id": "bcb6a7dc",
"metadata": {},
"source": [
"输出如下:"
]
},
{
"cell_type": "markdown",
"id": "157b8c26",
"metadata": {},
"source": [
"'Hello, do you like tea? <|endoftext|> In the sunlit terraces of the palace.'"
]
},
{
"cell_type": "markdown",
"id": "c64915c5",
"metadata": {},
"source": [
"接下来让我们使用SimpleTokenizerV2对我们之前在清单2.2中创建的vocab进行tokenizer"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "162d3403",
"metadata": {},
"outputs": [
{
"ename": "NameError",
"evalue": "name 'vocab' is not defined",
"output_type": "error",
"traceback": [
"\u001b[1;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[1;31mNameError\u001b[0m Traceback (most recent call last)",
"Cell \u001b[1;32mIn[3], line 1\u001b[0m\n\u001b[1;32m----> 1\u001b[0m tokenizer \u001b[38;5;241m=\u001b[39m SimpleTokenizerV2(\u001b[43mvocab\u001b[49m)\n\u001b[0;32m 2\u001b[0m \u001b[38;5;28mprint\u001b[39m(tokenizer\u001b[38;5;241m.\u001b[39mencode(text))\n",
"\u001b[1;31mNameError\u001b[0m: name 'vocab' is not defined"
]
}
],
"source": [
"tokenizer = SimpleTokenizerV2(vocab)\n",
"print(tokenizer.encode(text))\n"
]
},
{
"cell_type": "markdown",
"id": "9f532e55",
"metadata": {},
"source": [
"这将打印以下token ID"
]
},
{
"cell_type": "markdown",
"id": "55d238cf",
"metadata": {},
"source": [
"[1160, 5, 362, 1155, 642, 1000, 10, 1159, 57, 1013, 981, 1009, 738, 1013, 1160]"
]
},
{
"cell_type": "markdown",
"id": "a7988b14",
"metadata": {},
"source": [
"我们可以看到token ID列表包含1159即<|endoftext|>分隔符标记以及两个1160用于标记未知单词。"
]
},
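{
"cell_type": "markdown",
"id": "b7e1a2f0",
"metadata": {},
"source": [
"As a minimal cross-check (an added sketch, assuming the `vocab` built above is in scope), we can look up the IDs of the two special tokens directly; since they were appended last, they should map to 1159 and 1160:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c4d5e6f7",
"metadata": {},
"outputs": [],
"source": [
"# Look up the IDs assigned to the special tokens in the vocabulary\n",
"print(vocab[\"<|endoftext|>\"])  # expected: 1159\n",
"print(vocab[\"<|unk|>\"])  # expected: 1160"
]
},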
{
"cell_type": "markdown",
"id": "ac7c9571",
"metadata": {},
"source": [
"让我们对文本进行去de-tokenize操作以进行快速的健全性检查"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "643173b0",
"metadata": {},
"outputs": [],
"source": [
"print(tokenizer.decode(tokenizer.encode(text)))"
]
},
{
"cell_type": "markdown",
"id": "1f6c9917",
"metadata": {},
"source": [
"输出如下所示:"
]
},
{
"cell_type": "markdown",
"id": "203d93b6",
"metadata": {},
"source": [
"'<|unk|>, do you like tea? <|endoftext|> In the sunlit terraces of the <|unk|>.'"
]
},
{
"cell_type": "markdown",
"id": "3ca2a152",
"metadata": {},
"source": [
"通过将上面的de-tokenize文本与原始输入文本进行比较我们知道训练数据集Edith Wharton的短篇小说The Verdict不包含单词\"Hello\"和\"palace\"。\n",
"\n",
"到目前为止我们已经讨论了tokenization这是处理作为LLMs输入的文本的重要步骤。根据LLM的不同一些研究人员还会考虑其他的特殊token例如\n",
"\n",
"·[BOS]beginning of sequence此标记标记文本的开始。LLM表示一段内容开始的位置。</br>\n",
"·[EOS]end of sequence这个标记位于文本的末尾在连接多个不相关的文本时特别有用类似于<|内文|>.例如,当合并两个不同的维基百科文章或书籍时,[EOS]令牌指示一篇文章结束的位置和下一篇文章开始的位置。</br>\n",
"·[PAD]padding当训练批量大小大于1的LLMs时该批可能包含不同长度的文本。为了确保所有文本具有相同的长度使用[PAD]标记扩展或“填充”较短的文本,直到批次中最长文本的长度。</br>\n"
]
},
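{
"cell_type": "markdown",
"id": "d8f9a0b1",
"metadata": {},
"source": [
"To make the [PAD] idea concrete, here is a minimal sketch (not from the book) that pads a batch of token-ID sequences to the length of the longest sequence; `pad_id` is a hypothetical ID reserved for the padding token:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e1f2a3b4",
"metadata": {},
"outputs": [],
"source": [
"# Hypothetical padding sketch: extend shorter sequences with a reserved pad ID\n",
"batch = [[5, 362, 1155], [642]]  # token-ID sequences of unequal length\n",
"pad_id = 0  # hypothetical ID reserved for [PAD]\n",
"max_len = max(len(seq) for seq in batch)\n",
"padded = [seq + [pad_id] * (max_len - len(seq)) for seq in batch]\n",
"print(padded)  # [[5, 362, 1155], [642, 0, 0]]"
]
},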
{
"cell_type": "markdown",
"id": "c6120349",
"metadata": {},
"source": [
"注意用于GPT模型的标记器不需要上面提到的任何标记而只使用<|内文|> token for simplicity.的<|内文|”这是一个类似于上面提到的[EOS]令牌。此外,<|内文|“也是用来填充的。然而,正如我们将在后续章节中探索的那样,当在批量输入上训练时,我们通常使用掩码,这意味着我们不关注填充的令牌。因此,选择用于填充的特定令牌变得无关紧要。"
]
},
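{
"cell_type": "markdown",
"id": "f5a6b7c8",
"metadata": {},
"source": [
"As a small illustration of the masking idea (a simplified sketch, not the book's implementation), a mask can simply flag which positions hold real tokens versus padding, so that padded positions can be ignored:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a9b0c1d2",
"metadata": {},
"outputs": [],
"source": [
"# Simplified masking sketch: True marks real tokens, False marks padding\n",
"pad_id = 0  # hypothetical pad ID, matching the sketch above\n",
"seq = [5, 362, 1155, 0, 0]  # a padded sequence\n",
"mask = [tok != pad_id for tok in seq]\n",
"print(mask)  # [True, True, True, False, False]"
]
},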
{
"cell_type": "markdown",
"id": "d3367570",
"metadata": {},
"source": [
"此外用于GPT模型的tokenizer也不使用<|UNK|>用于词表外的单词的标记。相反GPT模型使用字节对编码标记器它将单词分解为子单词单元这部分我们将在下一节中讨论。"
]
},
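{
"cell_type": "markdown",
"id": "b3c4d5e6",
"metadata": {},
"source": [
"As a brief, optional preview of the next section (this assumes the `tiktoken` package is installed), a byte pair encoding tokenizer handles words like \"Hello\" without any <|unk|> token by breaking them into subword units:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c7d8e9f0",
"metadata": {},
"outputs": [],
"source": [
"# Preview: GPT-2's byte pair encoding tokenizer via tiktoken (covered in the next section)\n",
"import tiktoken\n",
"bpe_tokenizer = tiktoken.get_encoding(\"gpt2\")\n",
"print(bpe_tokenizer.encode(\"Hello, do you like tea?\"))"
]
},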
{
"cell_type": "code",
"execution_count": null,
"id": "ce78d469",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "llm_from_scratch",
"language": "python",
"name": "llm_from_scratch"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.14"
}
},
"nbformat": 4,
"nbformat_minor": 5
}