161
Translated_Book/ch02/2.1理解词嵌入.ipynb
Normal file
@ -0,0 +1,161 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "3cdf73ca",
|
||||
"metadata": {},
|
||||
"source": [
"# 2.1 Understanding word embeddings"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "1019c5ac",
|
||||
"metadata": {},
|
||||
"source": [
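"Deep neural network models, including large language models (LLMs), cannot process raw text directly.\n",
"Because text is categorical data, it is not compatible with the mathematical operations used to implement and train neural networks.\n",
"We therefore need a way to represent words as continuous-valued vectors.\n",
"(Readers unfamiliar with vectors and tensors in a computational context can learn more in Appendix A, section A.2.2, Understanding tensors.)"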
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "7ba5b6a1",
|
||||
"metadata": {},
|
||||
"source": [
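"The concept of converting data into a vector format is often referred to as embedding.\n",
"As shown in figure 2.2, we can embed different types of data, such as video, audio, and text, using a specific neural network layer or another pretrained neural network model."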
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "e8457f8c",
|
||||
"metadata": {},
|
||||
"source": [
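"**Figure 2.2 \n",
"Deep learning models cannot process raw data formats such as video, audio, and text.\n",
"We therefore use an embedding model to transform the raw data into a dense vector representation that deep learning architectures can readily understand and process.\n",
"Specifically, this figure illustrates converting raw data into a three-dimensional numerical vector.\n",
"Note that different data formats require distinct embedding models.\n",
"For example, an embedding model designed for text is not suitable for embedding audio or video data.**"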
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "490fa60b",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
""
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "233806c3",
|
||||
"metadata": {},
|
||||
"source": [
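"At its core, an embedding is a mapping from discrete objects, such as words, images, or even entire documents, to points in a continuous vector space. The primary purpose of embeddings is to convert non-numeric data into a format that neural networks can process."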
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "86ded03a",
|
||||
"metadata": {},
|
||||
"source": [
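"While word embeddings are the most common form of text embedding, there are also embeddings for sentences, paragraphs, or whole documents.\n",
"Sentence or paragraph embeddings are popular choices for retrieval-augmented generation, which combines generation (such as producing text) with retrieval (such as searching an external knowledge base) to pull in relevant information while generating text. These techniques are beyond the scope of this book.\n",
"Since our goal is to train GPT-like LLMs, which learn to generate text one word at a time, this chapter focuses on word embeddings."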
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "acc76cd1",
|
||||
"metadata": {},
|
||||
"source": [
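"Several algorithms and frameworks have been developed to generate word embeddings.\n",
"One of the earlier and most popular examples is the Word2Vec approach.\n",
"Word2Vec trains a neural network architecture to generate word embeddings by predicting the context of a word given the target word, or vice versa.\n",
"The main idea behind Word2Vec is that words that appear in similar contexts tend to have similar meanings.\n",
"Consequently, when word embeddings are projected into two dimensions for visualization, similar terms cluster together, as shown in figure 2.3."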
|
||||
]
|
||||
},
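One way to make "similar words sit close together" concrete is cosine similarity between embedding vectors. The following is a minimal sketch, not Word2Vec itself: the three 2-D vectors are invented purely for illustration, and `cosine_similarity` is a hypothetical helper name.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 2-D embeddings, made up for this example (not learned values).
eagle = [0.9, 0.8]
duck = [0.8, 0.9]
berlin = [-0.7, 0.6]

bird_pair = cosine_similarity(eagle, duck)
bird_city = cosine_similarity(eagle, berlin)

# With these toy vectors, the two birds score higher than bird versus city.
print(bird_pair > bird_city)
```

With real embeddings the vectors would come from a trained model rather than being typed in by hand, but the comparison would be carried out the same way.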
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "5792917b",
|
||||
"metadata": {},
|
||||
"source": [
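"**Figure 2.3 If word embeddings are two-dimensional, we can plot them in a two-dimensional scatterplot for visualization, as shown here. With word embedding techniques such as Word2Vec, words corresponding to similar concepts often appear close to each other in the embedding space. For instance, different types of birds appear closer to each other in the embedding space than to countries and cities.**"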
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "92e1e8d6",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
""
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "6c7a776a",
|
||||
"metadata": {},
|
||||
"source": [
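"Word embeddings can have varying dimensions, from one to thousands.\n",
"As shown in figure 2.3, we can choose two-dimensional word embeddings for visualization purposes.\n",
"A higher dimensionality might capture more nuanced relationships between words, but at the cost of computational efficiency."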
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "7688ecd4",
|
||||
"metadata": {},
|
||||
"source": [
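"While we can use pretrained models such as Word2Vec to generate embeddings for machine learning models, LLMs commonly produce their own embeddings, which are part of the input layer and are updated during training.\n",
"The advantage of optimizing the embeddings as part of LLM training, instead of using Word2Vec, is that the embeddings are optimized for the specific task and data at hand.\n",
"We will implement such embedding layers later in this chapter.\n",
"Furthermore, as we discuss in chapter 3, LLMs can also create contextualized output embeddings."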
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "2eb1d9a8",
|
||||
"metadata": {},
|
||||
"source": [
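"Unfortunately, high-dimensional embeddings are hard to visualize because our sensory perception and common graphical representations are inherently limited to three dimensions or fewer, which is why figure 2.3 shows two-dimensional embeddings in a two-dimensional scatterplot.\n",
"When working with LLMs, however, we typically use embeddings with a much higher dimensionality than shown in figure 2.3.\n",
"For both GPT-2 and GPT-3, the embedding size (often referred to as the dimensionality of the model's hidden states) varies based on the specific model variant and size.\n",
"It is a tradeoff between performance and efficiency.\n",
"To give concrete examples, the smallest GPT-2 (117 million parameters) and GPT-3 (125 million parameters) models use an embedding size of 768 dimensions. The largest GPT-3 model (175 billion parameters) uses an embedding size of 12,288 dimensions."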
|
||||
]
|
||||
},
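To put rough numbers on the efficiency side of this tradeoff, the sketch below estimates how many parameters the token embedding table alone would hold at the two embedding sizes quoted above. The vocabulary size of 50,257 is an assumption (the size of the GPT-2 byte pair encoding vocabulary); it is not stated in this section, and `embedding_table_params` is a hypothetical helper name.

```python
# Assumption: a vocabulary of 50,257 tokens, as in GPT-2's BPE tokenizer.
VOCAB_SIZE = 50_257

def embedding_table_params(vocab_size, embedding_dim):
    """An embedding table stores one embedding_dim-sized vector per token."""
    return vocab_size * embedding_dim

small = embedding_table_params(VOCAB_SIZE, 768)
large = embedding_table_params(VOCAB_SIZE, 12_288)

print(f"768 dimensions:    {small:,} parameters")
print(f"12,288 dimensions: {large:,} parameters")
```

Because the table grows linearly with the embedding size, going from 768 to 12,288 dimensions multiplies its parameter count by 16 under this assumption.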
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "725cfe23",
|
||||
"metadata": {},
|
||||
"source": [
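"The upcoming sections of this chapter walk through the steps required to prepare the embeddings used by an LLM: splitting text into words, converting words into tokens, and turning tokens into embedding vectors."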
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python (cell)",
|
||||
"language": "python",
|
||||
"name": "cell"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.10.13"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 5
|
||||
}
|
||||
488
Translated_Book/ch02/2.2文本分词(序列化).ipynb
Normal file
@ -0,0 +1,488 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "03fa84bd",
|
||||
"metadata": {},
|
||||
"source": [
"# 2.2 Tokenizing text"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "4506a3e2",
|
||||
"metadata": {},
|
||||
"source": [
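"This section covers how we split input text into individual tokens, a required preprocessing step for creating embeddings for an LLM.\n",
"These tokens are either individual words or special characters, including punctuation characters, as shown in figure 2.4."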
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "c8ad846a",
|
||||
"metadata": {},
|
||||
"source": [
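"**Figure 2.4 A view of the text processing steps covered in this section in the context of an LLM.\n",
"Here, we split an input text into individual tokens, which are either words or special characters, such as punctuation characters.\n",
"In upcoming sections, we will convert the text into token IDs and create token embeddings.**"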
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "f2df060a",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
""
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "ee689dac",
|
||||
"metadata": {},
|
||||
"source": [
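"The text we will tokenize for LLM training is a short story by Edith Wharton called The Verdict, which has been released into the public domain and is thus permitted to be used for LLM training tasks.\n",
"The text is available on Wikisource at https://en.wikisource.org/wiki/The_Verdict , and you can copy and paste it into a text file named \"the-verdict.txt\". The listing below instead downloads a copy of that file from the book's GitHub repository with the requests library:"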
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "df040cd5",
|
||||
"metadata": {},
|
||||
"source": [
"### Listing 2.1 Reading in a short story as text sample into Python"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 1,
|
||||
"id": "68baa9b9",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Total number of characters: 20479\n",
|
||||
"I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no \n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"import requests\n",
|
||||
"url = \"https://raw.githubusercontent.com/rasbt/LLMs-from-scratch/main/ch02/01_main-chapter-code/the-verdict.txt\"\n",
|
||||
"response = requests.get(url)\n",
|
||||
"raw_text = response.text\n",
|
||||
"print(\"Total number of characters:\", len(raw_text))\n",
|
||||
"print(raw_text[:99])"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "d1183553",
|
||||
"metadata": {},
|
||||
"source": [
"Alternatively, you can find the \"the-verdict.txt\" file in this book's GitHub repository at\n",
"https://github.com/rasbt/LLMs-from-scratch/tree/main/ch02/01_main-chapter-code"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "9b4109f8",
|
||||
"metadata": {},
|
||||
"source": [
"The print command prints the total number of characters followed by the first 99 characters of the file for illustration purposes:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "36b04498",
|
||||
"metadata": {},
|
||||
"source": [
"Total number of characters: 20479\\\n",
"I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "4ecc9cb7",
|
||||
"metadata": {},
|
||||
"source": [
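"Our goal is to tokenize this 20,479-character short story into individual words and special characters that we can then turn into embeddings for LLM training in the upcoming chapters."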
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "f02d0829",
|
||||
"metadata": {},
|
||||
"source": [
"### Text sample sizes"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "e9598248",
|
||||
"metadata": {},
|
||||
"source": [
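"Note that it is common to process millions of articles and hundreds of thousands of books, many gigabytes of text, when working with LLMs.\n",
"However, for educational purposes, a smaller text sample such as a single book is sufficient to illustrate the main ideas behind the text processing steps and to make it possible to run everything in reasonable time on consumer hardware."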
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "e8d7b1cd",
|
||||
"metadata": {},
|
||||
"source": [
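"How can we best split this text to obtain a list of tokens?\n",
"For this, we go on a small excursion and use Python's regular expression library re for illustration purposes.\n",
"(Note that you do not have to learn or memorize any regular expression syntax, since we will transition to a prebuilt tokenizer later in this chapter.)"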
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "d914f2ab",
|
||||
"metadata": {},
|
||||
"source": [
"Using some simple example text, we can use the re.split command with the following syntax to split a text on whitespace characters:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 2,
|
||||
"id": "683ea6c8",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"['Hello,', ' ', 'world.', ' ', 'This,', ' ', 'is', ' ', 'a', ' ', 'test.']\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"import re\n",
|
||||
"text = \"Hello, world. This, is a test.\"\n",
|
||||
"result = re.split(r'(\\s)', text)\n",
|
||||
"print(result)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "f3632203",
|
||||
"metadata": {},
|
||||
"source": [
"The result is a list of individual words, whitespaces, and punctuation characters:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "a209f23a",
|
||||
"metadata": {},
|
||||
"source": [
"['Hello,', ' ', 'world.', ' ', 'This,', ' ', 'is', ' ', 'a', ' ', 'test.']"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "28d1cd96",
|
||||
"metadata": {},
|
||||
"source": [
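"Note that the simple tokenization scheme above mostly works for separating the example text into individual words; however, some words are still connected to punctuation characters that we want to have as separate list entries."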
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "72303a4c",
|
||||
"metadata": {},
|
||||
"source": [
"Let's modify the regular expression to split on whitespaces (\\s) as well as commas and periods ([,.]):"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 3,
|
||||
"id": "7d7c3026",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"['Hello', ',', '', ' ', 'world', '.', '', ' ', 'This', ',', '', ' ', 'is', ' ', 'a', ' ', 'test', '.', '']\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"result = re.split(r'([,.]|\\s)', text)\n",
|
||||
"print(result)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "cf8254d5",
|
||||
"metadata": {},
|
||||
"source": [
"We can see that the words and punctuation characters are now separate list entries, just as we wanted:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "850e0524",
|
||||
"metadata": {},
|
||||
"source": [
"['Hello', ',', '', ' ', 'world', '.', '', ' ', 'This', ',', '', ' ', 'is', ' ', 'a', ' ', 'test', '.', '']"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "ff3ef273",
|
||||
"metadata": {},
|
||||
"source": [
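"A small remaining issue is that the list still includes whitespace characters and empty strings.\n",
"Optionally, we can remove these redundant entries safely as follows:"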
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 4,
|
||||
"id": "504c9f05",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"['Hello', ',', 'world', '.', 'This', ',', 'is', 'a', 'test', '.']\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"result = [item.strip() for item in result if item.strip()]\n",
|
||||
"print(result)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "4cdb7d9d",
|
||||
"metadata": {},
|
||||
"source": [
"The resulting whitespace-free output looks as follows:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "6aef4466",
|
||||
"metadata": {},
|
||||
"source": [
"['Hello', ',', 'world', '.', 'This', ',', 'is', 'a', 'test', '.']"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "d49a92c0",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Removing whitespaces or not"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "bee6f354",
|
||||
"metadata": {},
|
||||
"source": [
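"When developing a simple tokenizer, whether we should encode whitespaces as separate characters or just remove them depends on our application and its requirements.\n",
"Removing whitespaces reduces the memory and computing requirements. However, keeping whitespaces can be useful if we train models that are sensitive to the exact structure of the text\n",
"(for example, Python code, which is sensitive to indentation and spacing).\n",
"Here, we remove whitespaces for simplicity and brevity of the tokenized outputs.\n",
"Later, we will switch to a tokenization scheme that includes whitespaces."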
|
||||
]
|
||||
},
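The tradeoff can be seen on a two-line Python snippet. This is a small sketch under simple assumptions: it splits on whitespace only (not on punctuation), and the sample string is made up for illustration.

```python
import re

# Made-up sample: a Python snippet whose meaning depends on indentation.
code = "if x:\n    return 1"

# Split on single whitespace characters and keep them as separate pieces.
pieces = re.split(r'(\s)', code)

# Variant 1: drop whitespace (and empty strings), as done in this section.
without_ws = [p for p in pieces if p.strip()]

# Variant 2: keep whitespace tokens and drop only the empty strings.
with_ws = [p for p in pieces if p != ""]

print(without_ws)                 # the newline and indentation are gone
print("".join(with_ws) == code)   # the original text can be rebuilt exactly
```

Keeping the whitespace tokens yields a longer token list, which is the memory and compute cost mentioned above, but it is the only one of the two variants from which the indentation can be recovered.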
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "7db9c8ef",
|
||||
"metadata": {},
|
||||
"source": [
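"The tokenization scheme we devised above works well on the simple sample text.\n",
"Let's modify it a bit further so that it can also handle other types of punctuation,\n",
"such as question marks, quotation marks, and the double dashes we saw in the opening characters of Edith Wharton's short story, along with additional special characters:"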
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 5,
|
||||
"id": "2faa1386",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"['Hello', ',', 'world', '.', 'Is', 'this', '--', 'a', 'test', '?']\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"text = \"Hello, world. Is this-- a test?\"\n",
|
||||
"result = re.split(r'([,.?_!\"()\\']|--|\\s)', text)\n",
|
||||
"result = [item.strip() for item in result if item.strip()]\n",
|
||||
"print(result)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "bcf02c46",
|
||||
"metadata": {},
|
||||
"source": [
"The resulting output is as follows:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "1aa080e9",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"['Hello', ',', 'world', '.', 'Is', 'this', '--', 'a', 'test',\n",
|
||||
"'?']"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "ca1ded4a",
|
||||
"metadata": {},
|
||||
"source": [
"As we can see from the results summarized in figure 2.5,\n",
"our tokenization scheme can now handle the various special characters in the text successfully."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "0bd4baf9",
|
||||
"metadata": {},
|
||||
"source": [
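"**Figure 2.5 The tokenization scheme we implemented so far splits text into individual words and punctuation characters. In the specific example shown in this figure, the sample text is split into 10 individual tokens.**"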
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "f85ecff5",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
""
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "15bddfbd",
|
||||
"metadata": {},
|
||||
"source": [
"Now that we have a basic tokenizer working,\n",
"let's apply it to Edith Wharton's entire short story:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 6,
|
||||
"id": "d5356685",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"4649\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"preprocessed = re.split(r'([,.?_!\"()\\']|--|\\s)', raw_text)\n",
|
||||
"preprocessed = [item.strip() for item in preprocessed if\n",
|
||||
"item.strip()]\n",
|
||||
"print(len(preprocessed))"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "9c492a63",
|
||||
"metadata": {},
|
||||
"source": [
"The print statement above outputs 4649, which is the number of tokens in this text (without whitespaces)."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "7c93d099",
|
||||
"metadata": {},
|
||||
"source": [
"Let's print the first 30 tokens for a quick visual check:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 7,
|
||||
"id": "a0865898",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"['I', 'HAD', 'always', 'thought', 'Jack', 'Gisburn', 'rather', 'a', 'cheap', 'genius', '--', 'though', 'a', 'good', 'fellow', 'enough', '--', 'so', 'it', 'was', 'no', 'great', 'surprise', 'to', 'me', 'to', 'hear', 'that', ',', 'in']\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"print(preprocessed[:30])"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "6f42d2f8",
|
||||
"metadata": {},
|
||||
"source": [
"The resulting output shows that our tokenizer appears to handle the text well, since all words and special characters are neatly separated:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "7c2d4bfe",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"['I', 'HAD', 'always', 'thought', 'Jack', 'Gisburn', 'rather',\n",
|
||||
"'a', 'cheap', 'genius', '--', 'though', 'a', 'good', 'fellow',\n",
|
||||
"'enough', '--', 'so', 'it', 'was', 'no', 'great', 'surprise',\n",
|
||||
"'to', 'me', 'to', 'hear', 'that', ',', 'in']"
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python (cell)",
|
||||
"language": "python",
|
||||
"name": "cell"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.10.13"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 5
|
||||
}
|
||||
505
Translated_Book/ch02/2.3将令牌转换为令牌 ID.ipynb
Normal file
@ -0,0 +1,505 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "bda8a8f9",
|
||||
"metadata": {},
|
||||
"source": [
"# 2.3 Converting tokens into token IDs"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "be50c3e5",
|
||||
"metadata": {},
|
||||
"source": [
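"In the previous section, we tokenized a short story by Edith Wharton into individual tokens.\n",
"In this section, we will convert these tokens from Python strings to an integer representation to produce the so-called token IDs.\n",
"This conversion is an intermediate step before converting the token IDs into embedding vectors."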
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "c244c8c8",
|
||||
"metadata": {},
|
||||
"source": [
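"To map the previously generated tokens into token IDs, we first have to build a so-called vocabulary.\n",
"This vocabulary defines how we map each unique word and special character to a unique integer, as shown in figure 2.6."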
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "3e13d6b8",
|
||||
"metadata": {},
|
||||
"source": [
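"**Figure 2.6 We build a vocabulary by tokenizing the entire text in a training dataset into individual tokens.\n",
"These individual tokens are then sorted alphabetically, and duplicate tokens are removed.\n",
"The unique tokens are then aggregated into a vocabulary that defines a mapping from each unique token to a unique integer value.\n",
"The depicted vocabulary is purposefully small for illustration purposes and contains no punctuation or special characters for simplicity.**"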
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "e843aae2",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
""
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "a4b01652",
|
||||
"metadata": {},
|
||||
"source": [
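"In the previous section, we tokenized Edith Wharton's short story and assigned it to a Python variable called preprocessed.\n",
"Let's now create a list of all unique tokens and sort them alphabetically to determine the vocabulary size:"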
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 1,
|
||||
"id": "b90a181f",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"1159\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"import re\n",
|
||||
"import requests\n",
|
||||
"\n",
|
||||
"url = \"https://raw.githubusercontent.com/rasbt/LLMs-from-scratch/main/ch02/01_main-chapter-code/the-verdict.txt\"\n",
|
||||
"response = requests.get(url)\n",
|
||||
"raw_text = response.text\n",
|
||||
"preprocessed = re.split(r'([,.?_!\"()\\']|--|\\s)', raw_text)\n",
|
||||
"preprocessed = [item.strip() for item in preprocessed if\n",
|
||||
"item.strip()]\n",
|
||||
"\n",
|
||||
"all_words = sorted(list(set(preprocessed)))\n",
|
||||
"vocab_size = len(all_words)\n",
|
||||
"print(vocab_size)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "0a6c385a",
|
||||
"metadata": {},
|
||||
"source": [
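"After determining that the vocabulary size is 1,159 via the code above, we create the vocabulary and print its first 52 entries (indices 0 through 51) for illustration purposes."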
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "ba8b545d",
|
||||
"metadata": {},
|
||||
"source": [
"### Listing 2.2 Creating a vocabulary"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 2,
|
||||
"id": "32897865",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"('!', 0)\n",
|
||||
"('\"', 1)\n",
|
||||
"(\"'\", 2)\n",
|
||||
"('(', 3)\n",
|
||||
"(')', 4)\n",
|
||||
"(',', 5)\n",
|
||||
"('--', 6)\n",
|
||||
"('.', 7)\n",
|
||||
"(':', 8)\n",
|
||||
"(';', 9)\n",
|
||||
"('?', 10)\n",
|
||||
"('A', 11)\n",
|
||||
"('Ah', 12)\n",
|
||||
"('Among', 13)\n",
|
||||
"('And', 14)\n",
|
||||
"('Are', 15)\n",
|
||||
"('Arrt', 16)\n",
|
||||
"('As', 17)\n",
|
||||
"('At', 18)\n",
|
||||
"('Be', 19)\n",
|
||||
"('Begin', 20)\n",
|
||||
"('Burlington', 21)\n",
|
||||
"('But', 22)\n",
|
||||
"('By', 23)\n",
|
||||
"('Carlo', 24)\n",
|
||||
"('Carlo;', 25)\n",
|
||||
"('Chicago', 26)\n",
|
||||
"('Claude', 27)\n",
|
||||
"('Come', 28)\n",
|
||||
"('Croft', 29)\n",
|
||||
"('Destroyed', 30)\n",
|
||||
"('Devonshire', 31)\n",
|
||||
"('Don', 32)\n",
|
||||
"('Dubarry', 33)\n",
|
||||
"('Emperors', 34)\n",
|
||||
"('Florence', 35)\n",
|
||||
"('For', 36)\n",
|
||||
"('Gallery', 37)\n",
|
||||
"('Gideon', 38)\n",
|
||||
"('Gisburn', 39)\n",
|
||||
"('Gisburns', 40)\n",
|
||||
"('Grafton', 41)\n",
|
||||
"('Greek', 42)\n",
|
||||
"('Grindle', 43)\n",
|
||||
"('Grindle:', 44)\n",
|
||||
"('Grindles', 45)\n",
|
||||
"('HAD', 46)\n",
|
||||
"('Had', 47)\n",
|
||||
"('Hang', 48)\n",
|
||||
"('Has', 49)\n",
|
||||
"('He', 50)\n",
|
||||
"('Her', 51)\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
"vocab = {token: integer for integer, token in enumerate(all_words)}\n",
"for i, item in enumerate(vocab.items()):\n",
"    print(item)\n",
"    if i > 50:  # stop once entries 0 through 51 have been printed\n",
"        break"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "fa2e173b",
|
||||
"metadata": {},
|
||||
"source": [
"The output is as follows:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "24e78b4a",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"('!', 0) \\\n",
|
||||
"('\"', 1) \\\n",
|
||||
"(\"'\", 2) \\\n",
|
||||
"... \\\n",
"('Has', 49) \\\n",
"('He', 50) \\\n",
"('Her', 51)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "a1d6aa26",
|
||||
"metadata": {},
|
||||
"source": [
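"As we can see from the output above, the dictionary contains individual tokens associated with unique integer labels.\n",
"Our next goal is to apply this vocabulary to convert new text into token IDs, as illustrated in figure 2.7."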
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "96f260d9",
|
||||
"metadata": {},
|
||||
"source": [
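"**Figure 2.7 Starting with a new text sample, we tokenize the text and use the vocabulary to convert the text tokens into token IDs.\n",
"The vocabulary is built from the entire training set and can be applied to the training set itself and any new text samples.\n",
"The depicted vocabulary contains no punctuation or special characters for simplicity.**"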
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "187ca144",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
""
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "105417eb",
|
||||
"metadata": {},
|
||||
"source": [
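"Later in this book, when we want to convert the outputs of an LLM from numbers back into text, we also need a way to turn token IDs into text.\n",
"For this, we can create an inverse version of the vocabulary that maps token IDs back to the corresponding text tokens."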
|
||||
]
|
||||
},
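The inverse mapping can be sketched with a dictionary comprehension that swaps keys and values. The three-token vocabulary below is hypothetical and only stands in for the real vocabulary built from the short story.

```python
# Hypothetical toy vocabulary (token string -> token ID), for illustration only.
toy_vocab = {"brown": 0, "fox": 1, "quick": 2}

# Swap keys and values to obtain the inverse mapping (token ID -> token string).
toy_inverse = {token_id: token for token, token_id in toy_vocab.items()}

# Round trip: tokens -> IDs -> tokens.
ids = [toy_vocab[token] for token in ["quick", "brown", "fox"]]
decoded = " ".join(toy_inverse[token_id] for token_id in ids)

print(ids)      # [2, 0, 1]
print(decoded)  # quick brown fox
```

The swap is lossless here because every token has a unique ID, so no two keys collide in the inverse dictionary.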
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "a8f1de2d",
|
||||
"metadata": {},
|
||||
"source": [
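"Let's implement a complete tokenizer class in Python with an encode method that splits text into tokens and carries out the string-to-integer mapping to produce token IDs via the vocabulary.\n",
"In addition, we implement a decode method that carries out the reverse integer-to-string mapping to convert the token IDs back into text."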
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "1f5ac1bf",
|
||||
"metadata": {},
|
||||
"source": [
"The code for this tokenizer implementation is shown in listing 2.3:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "38b4407e",
|
||||
"metadata": {},
|
||||
"source": [
"### Listing 2.3 Implementing a simple text tokenizer"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 3,
|
||||
"id": "fd2603c6",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
"class SimpleTokenizerV1:\n",
"    def __init__(self, vocab):\n",
"        self.str_to_int = vocab  # vocabulary: token string -> token ID\n",
"        self.int_to_str = {i: s for s, i in vocab.items()}  # inverse vocabulary: token ID -> token string\n",
"\n",
"    def encode(self, text):  # text -> list of token IDs\n",
"        preprocessed = re.split(r'([,.?_!\"()\\']|--|\\s)', text)\n",
"        preprocessed = [item.strip() for item in preprocessed\n",
"                        if item.strip()]\n",
"        ids = [self.str_to_int[s] for s in preprocessed]\n",
"        return ids\n",
"\n",
"    def decode(self, ids):  # list of token IDs -> text\n",
"        text = \" \".join([self.int_to_str[i] for i in ids])\n",
"        text = re.sub(r'\\s+([,.?!\"()\\'])', r'\\1', text)  # remove spaces before punctuation\n",
"        return text"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "33b5e459",
|
||||
"metadata": {},
|
||||
"source": [
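"Using the SimpleTokenizerV1 Python class above, we can now instantiate new tokenizer objects via an existing vocabulary, which we can then use to encode and decode text, as illustrated in figure 2.8."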
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "643e495f",
|
||||
"metadata": {},
|
||||
"source": [
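"**Figure 2.8 Tokenizer implementations share two common methods: an encode method and a decode method.\n",
"The encode method takes in the sample text, splits it into individual tokens, and converts the tokens into token IDs via the vocabulary.\n",
"The decode method takes in token IDs, converts them back into text tokens, and concatenates the text tokens into natural text.**"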
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "cdae01fd",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
""
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "6e245b96",
|
||||
"metadata": {},
|
||||
"source": [
"Let's instantiate a new tokenizer object from the SimpleTokenizerV1 class and tokenize a passage from Edith Wharton's short story to try it out in practice:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 4,
|
||||
"id": "623bb612",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"[1, 58, 2, 872, 1013, 615, 541, 763, 5, 1155, 608, 5, 1, 69, 7, 39, 873, 1136, 773, 812, 7]\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"tokenizer = SimpleTokenizerV1(vocab)\n",
|
||||
"text = \"\"\"\"It's the last he painted, you know,\" Mrs. Gisburn\n",
|
||||
"said with pardonable pride.\"\"\"\n",
|
||||
"ids = tokenizer.encode(text)\n",
|
||||
"print(ids)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "aef8106f",
|
||||
"metadata": {},
|
||||
"source": [
"The code above prints the following token IDs:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "b3d2e527",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"[1, 58, 2, 872, 1013, 615, 541, 763, 5, 1155, 608, 5, 1, 69, 7,\n",
|
||||
"39, 873, 1136, 773, 812, 7]"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "535622e0",
|
||||
"metadata": {},
|
||||
"source": [
"Next, let's see if we can turn these token IDs back into text using the decode method:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 5,
|
||||
"id": "ba0bc417",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"'\" It\\' s the last he painted, you know,\" Mrs. Gisburn said with pardonable pride.'"
|
||||
]
|
||||
},
|
||||
"execution_count": 5,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"tokenizer.decode(ids)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "0234e096",
|
||||
"metadata": {},
|
||||
"source": [
"This outputs the following text:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "bf82cb8b",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"'\" It\\' s the last he painted, you know,\" Mrs. Gisburn said\n",
|
||||
"with pardonable pride.'"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "eae48908",
|
||||
"metadata": {},
|
||||
"source": [
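"Based on the output above, we can see that the decode method converted the token IDs back into the original text, apart from extra spaces after the opening quotation mark and the apostrophe."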
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "84d7a151",
|
||||
"metadata": {},
|
||||
"source": [
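"So far, so good.\n",
"We implemented a tokenizer capable of tokenizing and detokenizing text based on a snippet from the training set.\n",
"Let's now apply it to a new text sample that is not contained in the training set:"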
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 6,
|
||||
"id": "fe01788d",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"ename": "KeyError",
|
||||
"evalue": "'Hello'",
|
||||
"output_type": "error",
|
||||
"traceback": [
|
||||
"\u001b[1;31m---------------------------------------------------------------------------\u001b[0m",
|
||||
"\u001b[1;31mKeyError\u001b[0m Traceback (most recent call last)",
|
||||
"Cell \u001b[1;32mIn[6], line 2\u001b[0m\n\u001b[0;32m 1\u001b[0m text \u001b[38;5;241m=\u001b[39m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mHello, do you like tea?\u001b[39m\u001b[38;5;124m\"\u001b[39m\n\u001b[1;32m----> 2\u001b[0m \u001b[43mtokenizer\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mencode\u001b[49m\u001b[43m(\u001b[49m\u001b[43mtext\u001b[49m\u001b[43m)\u001b[49m\n",
|
||||
"Cell \u001b[1;32mIn[3], line 10\u001b[0m, in \u001b[0;36mSimpleTokenizerV1.encode\u001b[1;34m(self, text)\u001b[0m\n\u001b[0;32m 7\u001b[0m preprocessed \u001b[38;5;241m=\u001b[39m re\u001b[38;5;241m.\u001b[39msplit(\u001b[38;5;124mr\u001b[39m\u001b[38;5;124m'\u001b[39m\u001b[38;5;124m([,.?_!\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124m()\u001b[39m\u001b[38;5;130;01m\\'\u001b[39;00m\u001b[38;5;124m]|--|\u001b[39m\u001b[38;5;124m\\\u001b[39m\u001b[38;5;124ms)\u001b[39m\u001b[38;5;124m'\u001b[39m, text)\n\u001b[0;32m 8\u001b[0m preprocessed \u001b[38;5;241m=\u001b[39m [item\u001b[38;5;241m.\u001b[39mstrip() \u001b[38;5;28;01mfor\u001b[39;00m item \u001b[38;5;129;01min\u001b[39;00m preprocessed\n\u001b[0;32m 9\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m item\u001b[38;5;241m.\u001b[39mstrip()]\n\u001b[1;32m---> 10\u001b[0m ids \u001b[38;5;241m=\u001b[39m [\u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mstr_to_int[s] \u001b[38;5;28;01mfor\u001b[39;00m s \u001b[38;5;129;01min\u001b[39;00m preprocessed]\n\u001b[0;32m 11\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m ids\n",
|
||||
"Cell \u001b[1;32mIn[3], line 10\u001b[0m, in \u001b[0;36m<listcomp>\u001b[1;34m(.0)\u001b[0m\n\u001b[0;32m 7\u001b[0m preprocessed \u001b[38;5;241m=\u001b[39m re\u001b[38;5;241m.\u001b[39msplit(\u001b[38;5;124mr\u001b[39m\u001b[38;5;124m'\u001b[39m\u001b[38;5;124m([,.?_!\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124m()\u001b[39m\u001b[38;5;130;01m\\'\u001b[39;00m\u001b[38;5;124m]|--|\u001b[39m\u001b[38;5;124m\\\u001b[39m\u001b[38;5;124ms)\u001b[39m\u001b[38;5;124m'\u001b[39m, text)\n\u001b[0;32m 8\u001b[0m preprocessed \u001b[38;5;241m=\u001b[39m [item\u001b[38;5;241m.\u001b[39mstrip() \u001b[38;5;28;01mfor\u001b[39;00m item \u001b[38;5;129;01min\u001b[39;00m preprocessed\n\u001b[0;32m 9\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m item\u001b[38;5;241m.\u001b[39mstrip()]\n\u001b[1;32m---> 10\u001b[0m ids \u001b[38;5;241m=\u001b[39m [\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mstr_to_int\u001b[49m\u001b[43m[\u001b[49m\u001b[43ms\u001b[49m\u001b[43m]\u001b[49m \u001b[38;5;28;01mfor\u001b[39;00m s \u001b[38;5;129;01min\u001b[39;00m preprocessed]\n\u001b[0;32m 11\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m ids\n",
|
||||
"\u001b[1;31mKeyError\u001b[0m: 'Hello'"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"text = \"Hello, do you like tea?\"\n",
|
||||
"tokenizer.encode(text)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "55442846",
|
||||
"metadata": {},
|
||||
"source": [
"Executing the code above results in the following error:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "d41b8330",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"...\n",
|
||||
"KeyError: 'Hello'"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "3777ed61",
|
||||
"metadata": {},
|
||||
"source": [
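"The problem is that the word \"Hello\" was not used in The Verdict short story.\n",
"Hence, it is not contained in the vocabulary we built earlier.\n",
"This highlights the need to consider large and diverse training sets to extend the vocabulary when working on LLMs."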
|
||||
]
|
||||
},
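Before encoding new text, it can help to check which tokens would trigger this KeyError. The sketch below is one possible diagnostic, not part of the tokenizer class: `find_unknown_tokens` is a hypothetical helper, and the small vocabulary is a made-up stand-in for the one built from the short story. It reuses the same split pattern as the encode method.

```python
import re

def find_unknown_tokens(text, vocab):
    """Return the tokens of `text` that have no entry in `vocab`."""
    tokens = re.split(r'([,.?_!"()\']|--|\s)', text)
    tokens = [token.strip() for token in tokens if token.strip()]
    return [token for token in tokens if token not in vocab]

# Hypothetical stand-in vocabulary; the real one comes from the training text.
toy_vocab = {",": 0, "?": 1, "do": 2, "like": 3, "tea": 4, "you": 5}

unknown = find_unknown_tokens("Hello, do you like tea?", toy_vocab)
print(unknown)  # ['Hello']
```

A check like this only reports the problem; it does not make the text encodable, since the missing words still have no token IDs.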
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "68448d76",
|
||||
"metadata": {},
|
||||
"source": [
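"In the next section, we will test the tokenizer further on text that contains unknown words,\n",
"and we will also discuss additional special tokens that can be used to provide further context for an LLM during training."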
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python (cell)",
|
||||
"language": "python",
|
||||
"name": "cell"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.10.13"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 5
|
||||
}
|
||||
100
Translated_Book/ch02/2.文本数据处理.ipynb
Normal file
@ -0,0 +1,100 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "fa309c4a",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# 2 文本数据处理"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "8c769445",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"**本章内容**"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "30ef7704",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"- 为大型语言模型训练准备文本\n",
|
||||
"- 将文本分割为词汇和子词汇令牌(token)\n",
|
||||
"- 字节对编码是一种更高级的文本分词方法\n",
|
||||
"- 使用滑动窗口方法抽样训练示例\n",
|
||||
"- 将令牌(token)转换为向量输入大型语言模型"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "e53c667f",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"在前一章中,我们探讨了大型语言模型(LLM)的一般结构,并了解到它们是在大量文本上进行预训练的。\n",
|
||||
"具体来说,我们关注的是基于变transomer架构的仅解码器模式的大型语言模型(LLMs),这种架构是ChatGPT以及其他流行的类似GPT的LLM的基础。"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "0a008526",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"在预训练阶段,大型语言模型(LLMs)逐词处理文本。\n",
|
||||
"使用下一个词预测任务训练具有数百万到数十亿参数的大型语言模型,能够产生卓越能力的模型。\n",
|
||||
"这些模型可以进一步微调,以遵循一般指令或执行特定的目标任务。\n",
|
||||
"但在在接下来的章节中,部署和训练大型语言模型(LLMs)之前,我们需要先准备训练数据集,这也是本章的主要内容,如图2.1所示。"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "7b4d6b77",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"**图 2.1 展示了构建大型语言模型(LLM)的三个主要阶段:在通用文本数据集上预训练LLM,以及在标注数据集上对其进行微调。本章将解释并构建为LLM提供预训练文本数据的数据准备和采样流程。**"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "972f6e5a",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
""
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "2b28b74e",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"在本章中,您将学习如何为训练大型语言模型(LLMs)准备输入文本。\n",
|
||||
"这包括将文本分割成单个词汇和子词汇令牌(token),然后将它们编码成向量表示,供大型语言模型(LLM)使用。\n",
|
||||
"您还将了解字节对编码等高级分词方案,这些方案已经在GPT等流行的大型语言模型(LLMs)中得到应用。\n",
|
||||
"最后,我们将介绍采样和数据加载策略,这些策略用于生成后续章节中训练大型语言模型(LLMs所需的输入输出对。"
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python (cell)",
|
||||
"language": "python",
|
||||
"name": "cell"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.10.13"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 5
|
||||
}
|
||||
BIN Translated_Book/img/cover-1.jpg (Normal file, size: 83 KiB)
BIN Translated_Book/img/cover-2.jpg (Normal file, size: 84 KiB)
BIN Translated_Book/img/fig-1-1.jpg (Normal file, size: 68 KiB)
BIN Translated_Book/img/fig-1-2.jpg (Normal file, size: 101 KiB)
BIN Translated_Book/img/fig-1-3.jpg (Normal file, size: 104 KiB)
BIN Translated_Book/img/fig-1-4.jpg (Normal file, size: 126 KiB)
BIN Translated_Book/img/fig-1-5.jpg (Normal file, size: 106 KiB)
BIN Translated_Book/img/fig-1-6.png (Normal file, size: 132 KiB)
BIN Translated_Book/img/fig-1-7.jpg (Normal file, size: 56 KiB)
BIN Translated_Book/img/fig-1-8.jpg (Normal file, size: 104 KiB)
BIN Translated_Book/img/fig-1-9.jpg (Normal file, size: 98 KiB)
BIN Translated_Book/img/fig-2-1.jpg (Normal file, size: 106 KiB)
BIN Translated_Book/img/fig-2-10.jpg (Normal file, size: 150 KiB)
BIN Translated_Book/img/fig-2-11.jpg (Normal file, size: 90 KiB)
BIN Translated_Book/img/fig-2-12.jpg (Normal file, size: 134 KiB)
BIN Translated_Book/img/fig-2-13.jpg (Normal file, size: 107 KiB)
BIN Translated_Book/img/fig-2-14.jpg (Normal file, size: 135 KiB)
BIN Translated_Book/img/fig-2-15.jpg (Normal file, size: 87 KiB)
BIN Translated_Book/img/fig-2-16.jpg (Normal file, size: 116 KiB)
BIN Translated_Book/img/fig-2-17.jpg (Normal file, size: 126 KiB)
BIN Translated_Book/img/fig-2-18.jpg (Normal file, size: 70 KiB)
BIN Translated_Book/img/fig-2-19.jpg (Normal file, size: 148 KiB)
BIN Translated_Book/img/fig-2-2.jpg (Normal file, size: 86 KiB)
BIN Translated_Book/img/fig-2-3.jpg (Normal file, size: 78 KiB)
BIN Translated_Book/img/fig-2-4.jpg (Normal file, size: 88 KiB)
BIN Translated_Book/img/fig-2-5.jpg (Normal file, size: 43 KiB)
BIN Translated_Book/img/fig-2-6.jpg (Normal file, size: 134 KiB)
BIN Translated_Book/img/fig-2-7.jpg (Normal file, size: 121 KiB)
BIN Translated_Book/img/fig-2-8.jpg (Normal file, size: 110 KiB)
BIN Translated_Book/img/fig-2-9.jpg (Normal file, size: 82 KiB)
BIN Translated_Book/img/fig-3-1.jpg (Normal file, size: 76 KiB)
BIN Translated_Book/img/fig-3-10.jpg (Normal file, size: 87 KiB)
BIN Translated_Book/img/fig-3-11.jpg (Normal file, size: 112 KiB)
BIN Translated_Book/img/fig-3-12.jpg (Normal file, size: 85 KiB)
BIN Translated_Book/img/fig-3-13.jpg (Normal file, size: 110 KiB)
BIN Translated_Book/img/fig-3-14.jpg (Normal file, size: 82 KiB)
BIN Translated_Book/img/fig-3-15.jpg (Normal file, size: 73 KiB)
BIN Translated_Book/img/fig-3-16.jpg (Normal file, size: 86 KiB)
BIN Translated_Book/img/fig-3-17.jpg (Normal file, size: 89 KiB)
BIN Translated_Book/img/fig-3-18.jpg (Normal file, size: 197 KiB)
BIN Translated_Book/img/fig-3-19.jpg (Normal file, size: 127 KiB)
BIN Translated_Book/img/fig-3-2.jpg (Normal file, size: 81 KiB)
BIN Translated_Book/img/fig-3-20.jpg (Normal file, size: 46 KiB)
BIN Translated_Book/img/fig-3-21.jpg (Normal file, size: 43 KiB)
BIN Translated_Book/img/fig-3-22.jpg (Normal file, size: 172 KiB)
BIN Translated_Book/img/fig-3-23.jpg (Normal file, size: 62 KiB)
BIN Translated_Book/img/fig-3-24.jpg (Normal file, size: 128 KiB)
BIN Translated_Book/img/fig-3-25.jpg (Normal file, size: 84 KiB)
BIN Translated_Book/img/fig-3-26.jpg (Normal file, size: 116 KiB)
BIN Translated_Book/img/fig-3-3.jpg (Normal file, size: 114 KiB)
BIN Translated_Book/img/fig-3-4.jpg (Normal file, size: 70 KiB)
BIN Translated_Book/img/fig-3-5.jpg (Normal file, size: 83 KiB)
BIN Translated_Book/img/fig-3-6.jpg (Normal file, size: 102 KiB)
BIN Translated_Book/img/fig-3-7.jpg (Normal file, size: 112 KiB)
BIN Translated_Book/img/fig-3-8.jpg (Normal file, size: 80 KiB)
BIN Translated_Book/img/fig-3-9.jpg (Normal file, size: 66 KiB)
BIN Translated_Book/img/fig-4-1.jpg (Normal file, size: 90 KiB)
BIN Translated_Book/img/fig-4-10.jpg (Normal file, size: 168 KiB)
BIN Translated_Book/img/fig-4-11.jpg (Normal file, size: 93 KiB)
BIN Translated_Book/img/fig-4-12.jpg (Normal file, size: 138 KiB)
BIN Translated_Book/img/fig-4-13.jpg (Normal file, size: 168 KiB)
BIN Translated_Book/img/fig-4-14.jpg (Normal file, size: 88 KiB)
BIN Translated_Book/img/fig-4-15.jpg (Normal file, size: 169 KiB)
BIN Translated_Book/img/fig-4-16.jpg (Normal file, size: 156 KiB)
BIN Translated_Book/img/fig-4-17.jpg (Normal file, size: 113 KiB)
BIN Translated_Book/img/fig-4-18.jpg (Normal file, size: 101 KiB)
BIN Translated_Book/img/fig-4-2.jpg (Normal file, size: 123 KiB)
BIN Translated_Book/img/fig-4-3.jpg (Normal file, size: 115 KiB)
BIN Translated_Book/img/fig-4-4.jpg (Normal file, size: 151 KiB)
BIN Translated_Book/img/fig-4-5.jpg (Normal file, size: 154 KiB)
BIN Translated_Book/img/fig-4-6.jpg (Normal file, size: 149 KiB)
BIN Translated_Book/img/fig-4-7.jpg (Normal file, size: 92 KiB)
BIN Translated_Book/img/fig-4-8.jpg (Normal file, size: 62 KiB)
BIN Translated_Book/img/fig-4-9.jpg (Normal file, size: 130 KiB)
BIN Translated_Book/img/fig-5-1.jpg (Normal file, size: 102 KiB)
BIN Translated_Book/img/fig-5-10.jpg (Normal file, size: 93 KiB)
BIN Translated_Book/img/fig-5-11.jpg (Normal file, size: 203 KiB)
BIN Translated_Book/img/fig-5-12.jpg (Normal file, size: 70 KiB)
BIN Translated_Book/img/fig-5-13.jpg (Normal file, size: 91 KiB)
BIN Translated_Book/img/fig-5-14.jpg (Normal file, size: 72 KiB)
BIN Translated_Book/img/fig-5-15.jpg (Normal file, size: 109 KiB)
BIN Translated_Book/img/fig-5-16.jpg (Normal file, size: 94 KiB)
BIN Translated_Book/img/fig-5-17.jpg (Normal file, size: 138 KiB)
BIN Translated_Book/img/fig-5-2.jpg (Normal file, size: 102 KiB)
BIN Translated_Book/img/fig-5-3.png (Normal file, size: 118 KiB)
BIN Translated_Book/img/fig-5-4.jpg (Normal file, size: 112 KiB)
BIN Translated_Book/img/fig-5-5.png (Normal file, size: 158 KiB)
BIN Translated_Book/img/fig-5-6.jpg (Normal file, size: 93 KiB)
BIN Translated_Book/img/fig-5-7.jpg (Normal file, size: 96 KiB)
BIN Translated_Book/img/fig-5-8.jpg (Normal file, size: 95 KiB)
BIN Translated_Book/img/fig-5-9.jpg (Normal file, size: 214 KiB)
BIN Translated_Book/img/fig-A-1.jpg (Normal file, size: 94 KiB)
BIN Translated_Book/img/fig-A-10.jpg (Normal file, size: 80 KiB)
BIN Translated_Book/img/fig-A-11.jpg (Normal file, size: 97 KiB)
BIN Translated_Book/img/fig-A-12.jpg (Normal file, size: 72 KiB)
BIN Translated_Book/img/fig-A-13.jpg (Normal file, size: 66 KiB)