mirror of
https://github.com/datawhalechina/llms-from-scratch-cn.git
synced 2026-05-01 11:58:17 +08:00
{
"cells": [
{
"cell_type": "markdown",
"id": "3cdf73ca",
"metadata": {},
"source": [
"# 2.4 Adding special context tokens"
]
},
{
"cell_type": "markdown",
"id": "1019c5ac",
"metadata": {},
"source": [
"In the previous section, we implemented a simple tokenizer and applied it to a passage from the training set. In this section, we will modify that tokenizer to handle unknown words."
]
},
{
"cell_type": "markdown",
"id": "7ba5b6a1",
"metadata": {},
"source": [
"We will also discuss the use and addition of special context tokens, which can enhance the model's understanding of context or other relevant information in the text. For example, these special tokens can include markers for unknown words and document boundaries."
]
},
{
"cell_type": "markdown",
"id": "8b9bb7be",
"metadata": {},
"source": [
"In particular, we will modify the vocabulary and the tokenizer implemented in the previous section into SimpleTokenizerV2 to support two new tokens, <|unk|> and <|endoftext|>, as illustrated in Figure 2.9."
]
},
{
"cell_type": "markdown",
"id": "e8457f8c",
"metadata": {},
"source": [
"**Figure 2.9 We add special tokens to the vocabulary to handle specific contexts. For example, we add an <|unk|> token to represent new and unknown words that were not part of the training data and are thus not part of the existing vocabulary. Furthermore, we add an <|endoftext|> token that we can use to separate two unrelated text sources.**"
]
},
{
"cell_type": "markdown",
"id": "490fa60b",
"metadata": {},
"source": [
""
]
},
{
"cell_type": "markdown",
"id": "233806c3",
"metadata": {},
"source": [
"As shown in Figure 2.9, we can modify the tokenizer to use an <|unk|> token if it encounters a word that is not part of the vocabulary. Furthermore, we add tokens between unrelated texts. For example, when training GPT-like LLMs on multiple independent documents or books, it is common to insert a token before each document or book that follows a previous text source, as illustrated in Figure 2.10.\n",
"This helps the LLM understand that, although these text sources are concatenated for training, they are in fact unrelated."
]
},
{
"cell_type": "markdown",
"id": "86ded03a",
"metadata": {},
"source": [
"**Figure 2.10 When working with multiple independent text sources, we add <|endoftext|> tokens between these texts. These <|endoftext|> tokens act as markers, signaling the start and end of a particular segment, which allows the LLM to process and understand the text more effectively.**"
]
},
{
"cell_type": "markdown",
"id": "acc76cd1",
"metadata": {},
"source": [
""
]
},
{
"cell_type": "markdown",
"id": "5792917b",
"metadata": {},
"source": [
"Let's now modify the vocabulary to include these two special tokens, <|unk|> and <|endoftext|>, by adding them to the list of unique words that we created in the previous section:"
]
},
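{
"cell_type": "markdown",
"id": "setup-preprocessed-note",
"metadata": {},
"source": [
"The next code cell relies on the `preprocessed` token list built in the previous section. As a minimal sketch to make this notebook self-contained (assuming the short story was saved as \"the-verdict.txt\" there), it can be recreated as follows:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "setup-preprocessed-code",
"metadata": {},
"outputs": [],
"source": [
"import re\n",
"\n",
"# Recreate `preprocessed` from the previous section\n",
"# (assumes \"the-verdict.txt\" is present in the working directory)\n",
"with open(\"the-verdict.txt\", \"r\", encoding=\"utf-8\") as f:\n",
"    raw_text = f.read()\n",
"preprocessed = re.split(r'([,.?_!\"()\\']|--|\\s)', raw_text)\n",
"preprocessed = [item.strip() for item in preprocessed if item.strip()]"
]
},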
{
"cell_type": "code",
"execution_count": 4,
"id": "38439456",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1161\n"
]
}
],
"source": [
"all_tokens = sorted(list(set(preprocessed)))\n",
"all_tokens.extend([\"<|endoftext|>\", \"<|unk|>\"])\n",
"vocab = {token:integer for integer,token in enumerate(all_tokens)}\n",
"print(len(vocab.items()))\n"
]
},
{
"cell_type": "markdown",
"id": "6c7a776a",
"metadata": {},
"source": [
"Based on the output of the print statement above, the new vocabulary size is 1161 (the vocabulary size in the previous section was 1159)."
]
},
{
"cell_type": "markdown",
"id": "7688ecd4",
"metadata": {},
"source": [
"As an additional quick check, let's print the last five entries of the updated vocabulary:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "f08e3a9d",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"('younger', 1156)\n",
"('your', 1157)\n",
"('yourself', 1158)\n",
"('<|endoftext|>', 1159)\n",
"('<|unk|>', 1160)\n"
]
}
],
"source": [
"for i, item in enumerate(list(vocab.items())[-5:]):\n",
"    print(item)"
]
},
{
"cell_type": "markdown",
"id": "9bd79d47",
"metadata": {},
"source": [
"The code above prints the following:"
]
},
{
"cell_type": "markdown",
"id": "4eb990f9",
"metadata": {},
"source": [
"('younger', 1156)\n",
"('your', 1157)\n",
"('yourself', 1158)\n",
"('<|endoftext|>', 1159)\n",
"('<|unk|>', 1160)"
]
},
{
"cell_type": "markdown",
"id": "725cfe23",
"metadata": {},
"source": [
"Based on the output above, we can confirm that the two new special tokens were indeed successfully merged into the vocabulary. Next, we adjust the tokenizer from Listing 2.3 accordingly, as shown in Listing 2.4:"
]
},
{
"cell_type": "markdown",
"id": "930dca4b",
"metadata": {},
"source": [
"**Listing 2.4 A simple text tokenizer that handles unknown words**"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "31a26133",
"metadata": {},
"outputs": [],
"source": [
"class SimpleTokenizerV2:\n",
"    def __init__(self, vocab):\n",
"        self.str_to_int = vocab\n",
"        self.int_to_str = { i:s for s,i in vocab.items()}\n",
"\n",
"    def encode(self, text):\n",
"        preprocessed = re.split(r'([,.?_!\"()\\']|--|\\s)', text)\n",
"        preprocessed = [item.strip() for item in preprocessed if item.strip()]\n",
"        preprocessed = [item if item in self.str_to_int else \"<|unk|>\" for item in preprocessed] #A Replace unknown words with <|unk|> tokens\n",
"        ids = [self.str_to_int[s] for s in preprocessed]\n",
"        return ids\n",
"\n",
"    def decode(self, ids):\n",
"        text = \" \".join([self.int_to_str[i] for i in ids])\n",
"        text = re.sub(r'\\s+([,.?!\"()\\'])', r'\\1', text) #B Remove spaces before the specified punctuation\n",
"        return text\n"
]
},
{
"cell_type": "markdown",
"id": "273cda30",
"metadata": {},
"source": [
"Compared to SimpleTokenizerV1, which we implemented in Listing 2.3 in the previous section, the new SimpleTokenizerV2 replaces unknown words with <|unk|> tokens."
]
},
{
"cell_type": "markdown",
"id": "f7f3040c",
"metadata": {},
"source": [
"Let's now try this new tokenizer in practice. For this, we will use a simple text sample concatenated from two independent and unrelated sentences:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "19a04bd4",
"metadata": {},
"outputs": [],
"source": [
"text1 = \"Hello, do you like tea?\"\n",
"text2 = \"In the sunlit terraces of the palace.\"\n",
"text = \" <|endoftext|> \".join((text1, text2))\n",
"print(text)"
]
},
{
"cell_type": "markdown",
"id": "bcb6a7dc",
"metadata": {},
"source": [
"The output is as follows:"
]
},
{
"cell_type": "markdown",
"id": "157b8c26",
"metadata": {},
"source": [
"'Hello, do you like tea? <|endoftext|> In the sunlit terraces of the palace.'"
]
},
{
"cell_type": "markdown",
"id": "c64915c5",
"metadata": {},
"source": [
"Next, let's tokenize this sample text using SimpleTokenizerV2 with the vocab we previously created in Listing 2.2:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "162d3403",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[1160, 5, 362, 1155, 642, 1000, 10, 1159, 57, 1013, 981, 1009, 738, 1013, 1160]\n"
]
}
],
"source": [
"tokenizer = SimpleTokenizerV2(vocab)\n",
"print(tokenizer.encode(text))\n"
]
},
{
"cell_type": "markdown",
"id": "9f532e55",
"metadata": {},
"source": [
"This prints the following token IDs:"
]
},
{
"cell_type": "markdown",
"id": "55d238cf",
"metadata": {},
"source": [
"[1160, 5, 362, 1155, 642, 1000, 10, 1159, 57, 1013, 981, 1009, 738, 1013, 1160]"
]
},
{
"cell_type": "markdown",
"id": "a7988b14",
"metadata": {},
"source": [
"We can see that the list of token IDs contains 1159, the <|endoftext|> separator token, as well as two 1160 tokens, which are used for unknown words."
]
},
{
"cell_type": "markdown",
"id": "ac7c9571",
"metadata": {},
"source": [
"For a quick sanity check, let's de-tokenize the text:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "643173b0",
"metadata": {},
"outputs": [],
"source": [
"print(tokenizer.decode(tokenizer.encode(text)))"
]
},
{
"cell_type": "markdown",
"id": "1f6c9917",
"metadata": {},
"source": [
"The output looks like this:"
]
},
{
"cell_type": "markdown",
"id": "203d93b6",
"metadata": {},
"source": [
"'<|unk|>, do you like tea? <|endoftext|> In the sunlit terraces of the <|unk|>.'"
]
},
{
"cell_type": "markdown",
"id": "3ca2a152",
"metadata": {},
"source": [
"By comparing the de-tokenized text above with the original input text, we know that the training dataset, Edith Wharton's short story The Verdict, did not contain the words \"Hello\" and \"palace\".\n",
"\n",
"So far, we have discussed tokenization as an essential step in processing text as input to LLMs. Depending on the LLM, some researchers also consider additional special tokens, such as the following:\n",
"\n",
"- [BOS] (beginning of sequence): This token marks the start of a text, signifying to the LLM where a piece of content begins.\n",
"- [EOS] (end of sequence): This token is positioned at the end of a text and is especially useful when concatenating multiple unrelated texts, similar to <|endoftext|>. For instance, when combining two different Wikipedia articles or books, the [EOS] token indicates where one article ends and the next one begins.\n",
"- [PAD] (padding): When training LLMs with batch sizes larger than one, the batch might contain texts of varying lengths. To ensure all texts have the same length, the shorter texts are extended or \"padded\" using the [PAD] token, up to the length of the longest text in the batch (see the sketch after this list).\n"
]
},
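{
"cell_type": "markdown",
"id": "pad-sketch-note",
"metadata": {},
"source": [
"To make the padding mechanics concrete, here is a minimal, hypothetical sketch (the `batch` contents below are made-up token-ID sequences for illustration) that pads shorter sequences to the length of the longest sequence in a batch:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "pad-sketch-code",
"metadata": {},
"outputs": [],
"source": [
"# Hypothetical example: pad token-ID sequences to equal length.\n",
"# `pad_id` stands in for the ID of a [PAD] token; as noted below,\n",
"# GPT models simply reuse <|endoftext|> (ID 1159 in our vocabulary).\n",
"batch = [[1160, 5, 362], [57, 1013], [738]]\n",
"pad_id = 1159\n",
"\n",
"max_len = max(len(seq) for seq in batch)\n",
"padded = [seq + [pad_id] * (max_len - len(seq)) for seq in batch]\n",
"print(padded)\n",
"# [[1160, 5, 362], [57, 1013, 1159], [738, 1159, 1159]]"
]
},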
{
"cell_type": "markdown",
"id": "c6120349",
"metadata": {},
"source": [
"Note that the tokenizer used for GPT models does not need any of the tokens mentioned above; it only uses an <|endoftext|> token for simplicity. <|endoftext|> is analogous to the [EOS] token mentioned above, and <|endoftext|> is also used for padding. However, as we will explore in subsequent chapters, when training on batched inputs we typically use a mask, meaning we don't attend to padded tokens. Hence, the specific token chosen for padding becomes inconsequential."
]
},
{
"cell_type": "markdown",
"id": "d3367570",
"metadata": {},
"source": [
"Moreover, the tokenizer used for GPT models also doesn't use an <|unk|> token for out-of-vocabulary words. Instead, GPT models use a byte pair encoding tokenizer, which breaks words down into subword units, as we will discuss in the next section."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ce78d469",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "llm_from_scratch",
"language": "python",
"name": "llm_from_scratch"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.14"
}
},
"nbformat": 4,
"nbformat_minor": 5
}