llms-from-scratch-cn/Translated_Book/ch02/2.2文本分词（序列化）.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "03fa84bd",
   "metadata": {},
   "source": [
    "# 2.2 文本分词（序列化）"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4506a3e2",
   "metadata": {},
   "source": [
    "本节内容涵盖了我们如何将输入文本分割成单独的token，这是为大型语言模型（LLM）创建嵌入的必需预处理步骤。\n",
    "这些标记可能是单个单词或特殊字符，包括标点符号，如图2.4所示。"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c8ad846a",
   "metadata": {},
   "source": [
    "**图 2.4 本节中涉及的文本处理步骤在大型语言模型（LLM）中的视图。\n",
    "在这里，我们将输入文本分割成单独的token，这些token可能是单词或特殊字符，例如标点符号。\n",
    "在接下来的章节中，我们将把文本转换成token ID 并创建token嵌入。**"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f2df060a",
   "metadata": {},
   "source": [
    "![fig2.4](../img/fig-2-4.jpg)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ee689dac",
   "metadata": {},
   "source": [
    "我们将为大型语言模型（LLM）训练分词的文本，是由伊迪丝·沃顿（Edith Wharton）创作的一部短篇小说《判决》（The Verdict），该作品版权已进入公共领域，因此我们可以用于LLM训练任务。\n",
    "这篇文章可以在 Wikisource 上找到，网址为 https://en.wikisource.org/wiki/The_Verdict ，您可以将其复制并粘贴到文本文件中，我已将其复制到名为“the-verdict.txt”的文本文件中，以便使用 Python 的标准文件读取工具加载："
   ]
  },
  {
   "cell_type": "markdown",
   "id": "df040cd5",
   "metadata": {},
   "source": [
    "### 代码示例 2.1：使用Python代码将短篇小说作为文本示例进行加载"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "68baa9b9",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Total number of characters: 20479\n",
      "I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no \n"
     ]
    }
   ],
   "source": [
    "import requests\n",
    "url = \"https://raw.githubusercontent.com/rasbt/LLMs-from-scratch/main/ch02/01_main-chapter-code/the-verdict.txt\"\n",
    "response = requests.get(url)\n",
    "raw_text = response.text\n",
    "print(\"Total number of characters:\", len(raw_text))\n",
    "print(raw_text[:99])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d1183553",
   "metadata": {},
   "source": [
    "或者，您也可以在本书的 GitHub 仓库中找到名为“the-verdict.txt”的文件，\n",
    "仓库地址为：https://github.com/rasbt/LLMs-from-scratch/tree/main/ch02/01_main-chapter-code"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9b4109f8",
   "metadata": {},
   "source": [
    "print 命令用于打印文件的总字符数，我们随后打印文件的前100个字符，以此来进行示例说明："
   ]
  },
  {
   "cell_type": "markdown",
   "id": "36b04498",
   "metadata": {},
   "source": [
    "Total number of character:20479\\\n",
    "I HAD always thought Jack Gisburn rather a cheap genius--thougha good fellow enough--so it was no"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4ecc9cb7",
   "metadata": {},
   "source": [
    "我们的目标是将这篇包含20,479个字符的短篇小说分词成单独的单词和特殊字符，以便在接下来的章节中将其转换成嵌入向量，用于大型语言模型（LLM）的训练。"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f02d0829",
   "metadata": {},
   "source": [
    "###  样本文本的大小"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e9598248",
   "metadata": {},
   "source": [
    "请注意，在运行大型语言模型（LLM）时，通常会处理数百万篇文章和数十万本书——数千兆字节的文本量。\n",
    "然而，出于教学目的，使用如单本书这样的小型文本样本就已足够。这样既可以清楚地展示文本处理的主要步骤，也能确保在普通消费级硬件上在合理时间内运行。"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e8d7b1cd",
   "metadata": {},
   "source": [
    "我们如何最好地分割这段文本以获得一个token列表？\n",
    "对此，我们将进行一次简短的探索，并使用 Python 的正则表达式库 re 模块来进行示例说明。\n",
    "（请注意，您不必学习或记住任何正则表达式的语法，因为我们将在本章后面转用预构建的分词器。）"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d914f2ab",
   "metadata": {},
   "source": [
    "我们使用一些简单的示例文本，可以使用下面的的 re.split 命令来按空白字符分割文本："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "683ea6c8",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['Hello,', ' ', 'world.', ' ', 'This,', ' ', 'is', ' ', 'a', ' ', 'test.']\n"
     ]
    }
   ],
   "source": [
    "import re\n",
    "text = \"Hello, world. This, is a test.\"\n",
    "result = re.split(r'(\\s)', text)\n",
    "print(result)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f3632203",
   "metadata": {},
   "source": [
    "结果是一个包含单个单词、空格和标点符号的列表："
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a209f23a",
   "metadata": {},
   "source": [
    "['Hello,', ' ', 'world.', ' ', 'This,', ' ', 'is', ' ', 'a', '\n",
    "', 'test.']"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "28d1cd96",
   "metadata": {},
   "source": [
    "请注意，上述简单的分词方案主要用于将示例文本分解成单独的词，但仍有一些单词与我们希望单独列出的标点符号相连。"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "72303a4c",
   "metadata": {},
   "source": [
    "让我们修改正则表达式，在空格（\\s）以及逗号和句号（[,.]）处进行分割："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "7d7c3026",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['Hello', ',', '', ' ', 'world', '.', '', ' ', 'This', ',', '', ' ', 'is', ' ', 'a', ' ', 'test', '.', '']\n"
     ]
    }
   ],
   "source": [
    "result = re.split(r'([,.]|\\s)', text)\n",
    "print(result)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cf8254d5",
   "metadata": {},
   "source": [
    "我们可以看到，单词和标点符号现在正如我们所想要的成为了列表中的独立条目："
   ]
  },
  {
   "cell_type": "markdown",
   "id": "850e0524",
   "metadata": {},
   "source": [
    "['Hello', ',', '', ' ', 'world.', ' ', 'This', ',', '', ' ',\n",
    "'is', ' ', 'a', ' ', 'test.']"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ff3ef273",
   "metadata": {},
   "source": [
    "还有一个小问题是，列表中仍包含空白字符。\n",
    "我们可以选择安全地移除这些多余的字符，操作如下："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "504c9f05",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['Hello', ',', 'world', '.', 'This', ',', 'is', 'a', 'test', '.']\n"
     ]
    }
   ],
   "source": [
    "result = [item.strip() for item in result if item.strip()]\n",
    "print(result)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4cdb7d9d",
   "metadata": {},
   "source": [
    "结果产生的无空白输出如下："
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6aef4466",
   "metadata": {},
   "source": [
    "['Hello', ',', 'world.', 'This', ',', 'is', 'a', 'test.']"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d49a92c0",
   "metadata": {},
   "source": [
    "### Removing whitespaces or not"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4f530be2",
   "metadata": {},
   "source": [
    "### 是否移除空格"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bee6f354",
   "metadata": {},
   "source": [
    "在开发一个简单的分词器时，是否应该将空格编码为单独的字符或者直接移除它们，这取决于我们的应用及其需求。\n",
    "移除空格可以减少内存和计算需求。然而，保留空格在我们训练对文本的精确结构敏感的模型时可能是有用的\n",
    "（例如，Python代码对缩进和间距非常敏感）。\n",
    "这里，我们为了简化和简洁化分词输出而移除空格。\n",
    "随后，我们将进入到一个包括空格的分词方案。"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7db9c8ef",
   "metadata": {},
   "source": [
    "我们在上面部分设计的分词方案在简单样本文本上表现良好。\n",
    "现在，让我们进一步修改它，让它也能处理其他类型的标点符号，\n",
    "比如问号、引号和我们在伊迪丝·沃顿的短篇小说前100个字符中能看到的双破折号，以及其他额外的特殊字符："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "2faa1386",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['Hello', ',', 'world', '.', 'Is', 'this', '--', 'a', 'test', '?']\n"
     ]
    }
   ],
   "source": [
    "text = \"Hello, world. Is this-- a test?\"\n",
    "result = re.split(r'([,.?_!\"()\\']|--|\\s)', text)\n",
    "result = [item.strip() for item in result if item.strip()]\n",
    "print(result)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bcf02c46",
   "metadata": {},
   "source": [
    "结果输出如下所示："
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1aa080e9",
   "metadata": {},
   "source": [
    "['Hello', ',', 'world', '.', 'Is', 'this', '--', 'a', 'test',\n",
    "'?']"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ca1ded4a",
   "metadata": {},
   "source": [
    "正如我们可以从图2.5中总结的结果看到的，\n",
    "我们的分词方案现在可以成功处理文本中的各种特殊字符。"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0bd4baf9",
   "metadata": {},
   "source": [
    "**图2.5 我们到目前为止实现的分词方案将文本分割为单独的单词和标点符号。在此图中显示的具体示例中，样本文本被分割成10个单独的标记。**"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f85ecff5",
   "metadata": {},
   "source": [
    "![fig2.5](../img/fig-2-5.jpg)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "15bddfbd",
   "metadata": {},
   "source": [
    "现在我们已经让一个基础的分词器开始运行了，\n",
    "让我们将它部署到埃迪斯·华顿的整个短篇小说上："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "d5356685",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "4649\n"
     ]
    }
   ],
   "source": [
    "preprocessed = re.split(r'([,.?_!\"()\\']|--|\\s)', raw_text)\n",
    "preprocessed = [item.strip() for item in preprocessed if\n",
    "item.strip()]\n",
    "print(len(preprocessed))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9c492a63",
   "metadata": {},
   "source": [
    "上述打印语句输出的是4649，这是该文本中的token数量（不包括空格）。"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7c93d099",
   "metadata": {},
   "source": [
    "让我们打印前30个标记以进行快速目测："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "a0865898",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['I', 'HAD', 'always', 'thought', 'Jack', 'Gisburn', 'rather', 'a', 'cheap', 'genius', '--', 'though', 'a', 'good', 'fellow', 'enough', '--', 'so', 'it', 'was', 'no', 'great', 'surprise', 'to', 'me', 'to', 'hear', 'that', ',', 'in']\n"
     ]
    }
   ],
   "source": [
    "print(preprocessed[:30])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6f42d2f8",
   "metadata": {},
   "source": [
    "结果输出显示，我们的分词器似乎很好地处理了文本，因为所有的单词和特殊字符都被整齐地分开了："
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7c2d4bfe",
   "metadata": {},
   "source": [
    "['I', 'HAD', 'always', 'thought', 'Jack', 'Gisburn', 'rather',\n",
    "'a', 'cheap', 'genius', '--', 'though', 'a', 'good', 'fellow',\n",
    "'enough', '--', 'so', 'it', 'was', 'no', 'great', 'surprise',\n",
    "'to', 'me', 'to', 'hear', 'that', ',', 'in']"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python (cell)",
   "language": "python",
   "name": "cell"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}