mirror of
https://github.com/datawhalechina/llms-from-scratch-cn.git
synced 2026-05-01 11:58:17 +08:00
506 lines
16 KiB
Plaintext
506 lines
16 KiB
Plaintext
{
|
||
"cells": [
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "bda8a8f9",
|
||
"metadata": {},
|
||
"source": [
|
||
"# 2.3 将token转换为token ID"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "be50c3e5",
|
||
"metadata": {},
|
||
"source": [
|
||
"在前一小节中,我们将伊迪丝·沃顿的一个短篇故事分割成了单个token。\n",
|
||
"在本节中,我们将把这些token)从 Python 字符串转换成整数表示,生成所谓的token ID。\n",
|
||
"这种转换是在将token ID 转换成嵌入向量之前的中间步骤。"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "c244c8c8",
|
||
"metadata": {},
|
||
"source": [
|
||
"为了将先前生成的token映射到token ID,我们首先需要构建一个所谓的词汇表。\n",
|
||
"这个词汇表定义了我们如何将每个独特的词和特殊字符映射到一个独特的整数,如图 2.6 所示。"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "3e13d6b8",
|
||
"metadata": {},
|
||
"source": [
|
||
"**图 2.6 我们通过将训练数据集中的整个文本分割成单个token来构建词汇表。\n",
|
||
"这些单独的token随后按字母顺序进行排序,并移除重复的token。\n",
|
||
"然后,将这些独特的token聚集成一个词汇表,该词汇表定义了从每个独特token到一个独特整数值的映射。\n",
|
||
"所展示的词汇表为了说明目的故意保持词汇量较小,并且为了简化没有包含标点符号或特殊字符。**"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "e843aae2",
|
||
"metadata": {},
|
||
"source": [
|
||
""
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "a4b01652",
|
||
"metadata": {},
|
||
"source": [
|
||
"在前一节中,我们对伊迪丝·沃顿的短篇故事进行了分词,并将其赋值给一个名为“preprocessed”的 Python 变量。\n",
|
||
"现在,让我们创建一个包含所有独特token的列表,并按字母顺序排序以确定词汇表的大小:"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 1,
|
||
"id": "b90a181f",
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"1159\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"import re\n",
|
||
"import requests\n",
|
||
"\n",
|
||
"url = \"https://raw.githubusercontent.com/rasbt/LLMs-from-scratch/main/ch02/01_main-chapter-code/the-verdict.txt\"\n",
|
||
"response = requests.get(url)\n",
|
||
"raw_text = response.text\n",
|
||
"preprocessed = re.split(r'([,.?_!\"()\\']|--|\\s)', raw_text)\n",
|
||
"preprocessed = [item.strip() for item in preprocessed if\n",
|
||
"item.strip()]\n",
|
||
"\n",
|
||
"all_words = sorted(list(set(preprocessed)))\n",
|
||
"vocab_size = len(all_words)\n",
|
||
"print(vocab_size)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "0a6c385a",
|
||
"metadata": {},
|
||
"source": [
|
||
"在通过上述代码确定词汇表有1159个单词后,我们创建词汇表并打印其前50个单词用来展示说明。"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "ba8b545d",
|
||
"metadata": {},
|
||
"source": [
|
||
"### 代码示例 2.2 创建词汇表"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 2,
|
||
"id": "32897865",
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"('!', 0)\n",
|
||
"('\"', 1)\n",
|
||
"(\"'\", 2)\n",
|
||
"('(', 3)\n",
|
||
"(')', 4)\n",
|
||
"(',', 5)\n",
|
||
"('--', 6)\n",
|
||
"('.', 7)\n",
|
||
"(':', 8)\n",
|
||
"(';', 9)\n",
|
||
"('?', 10)\n",
|
||
"('A', 11)\n",
|
||
"('Ah', 12)\n",
|
||
"('Among', 13)\n",
|
||
"('And', 14)\n",
|
||
"('Are', 15)\n",
|
||
"('Arrt', 16)\n",
|
||
"('As', 17)\n",
|
||
"('At', 18)\n",
|
||
"('Be', 19)\n",
|
||
"('Begin', 20)\n",
|
||
"('Burlington', 21)\n",
|
||
"('But', 22)\n",
|
||
"('By', 23)\n",
|
||
"('Carlo', 24)\n",
|
||
"('Carlo;', 25)\n",
|
||
"('Chicago', 26)\n",
|
||
"('Claude', 27)\n",
|
||
"('Come', 28)\n",
|
||
"('Croft', 29)\n",
|
||
"('Destroyed', 30)\n",
|
||
"('Devonshire', 31)\n",
|
||
"('Don', 32)\n",
|
||
"('Dubarry', 33)\n",
|
||
"('Emperors', 34)\n",
|
||
"('Florence', 35)\n",
|
||
"('For', 36)\n",
|
||
"('Gallery', 37)\n",
|
||
"('Gideon', 38)\n",
|
||
"('Gisburn', 39)\n",
|
||
"('Gisburns', 40)\n",
|
||
"('Grafton', 41)\n",
|
||
"('Greek', 42)\n",
|
||
"('Grindle', 43)\n",
|
||
"('Grindle:', 44)\n",
|
||
"('Grindles', 45)\n",
|
||
"('HAD', 46)\n",
|
||
"('Had', 47)\n",
|
||
"('Hang', 48)\n",
|
||
"('Has', 49)\n",
|
||
"('He', 50)\n",
|
||
"('Her', 51)\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"vocab = {token:integer for integer,token in\n",
|
||
"enumerate(all_words)}\n",
|
||
"for i, item in enumerate(vocab.items()):\n",
|
||
" print(item)\n",
|
||
" if i > 50:\n",
|
||
" break"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "fa2e173b",
|
||
"metadata": {},
|
||
"source": [
|
||
"输出结果如下:"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "24e78b4a",
|
||
"metadata": {},
|
||
"source": [
|
||
"('!', 0) \\\n",
|
||
"('\"', 1) \\\n",
|
||
"(\"'\", 2) \\\n",
|
||
"... \\\n",
|
||
"('Has', 49) \\\n",
|
||
"('He', 50)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "a1d6aa26",
|
||
"metadata": {},
|
||
"source": [
|
||
"正如我们从上面输出结果中看到的,这个字典包含了与独特整数标签相关联的单个token。\n",
|
||
"我们的下一个目标是应用这个词汇表,将新文本转换为tokenID,如图 2.7 所示。"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "96f260d9",
|
||
"metadata": {},
|
||
"source": [
|
||
"**图 2.7 从一个新的文本样本开始,我们对文本进行分词,并使用词汇表将文本token转换为token ID。\n",
|
||
"这个词汇表是基于整个训练集构建的,并且可以应用于训练集本身及任何新的文本示例。\n",
|
||
"接下来所展示的词汇表为了简化起见,将不包含标点符号或特殊字符。**"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "187ca144",
|
||
"metadata": {},
|
||
"source": [
|
||
""
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "105417eb",
|
||
"metadata": {},
|
||
"source": [
|
||
"在本书的后面,当我们想要将大型语言模型(LLM)的输出从数字转换回文本时,我们也需要一种方法将token ID 转换回文本。\n",
|
||
"为此,我们可以创建一个词汇表的逆向版本,将token ID 映射回对应的token标记。"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "a8f1de2d",
|
||
"metadata": {},
|
||
"source": [
|
||
"让我们通过 Python 来实现一个完整的分词器类,其中包括一个 编码(encode) 方法,该方法将文本分割成token,并通过词汇表执行字符串到整数的映射以生成token ID。\n",
|
||
"此外,我们还实现一个 解码(decode) 方法,该方法执行整数到字符串的反向映射,将token ID 转换回文本。"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "1f5ac1bf",
|
||
"metadata": {},
|
||
"source": [
|
||
"这个分词器实现的代码如代码示例 2.3 所示:"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "38b4407e",
|
||
"metadata": {},
|
||
"source": [
|
||
"### 代码示例2.3 实现一个简单的文本分词器"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 3,
|
||
"id": "fd2603c6",
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"class SimpleTokenizerV1:\n",
|
||
" def __init__(self, vocab):\n",
|
||
" self.str_to_int = vocab #A\n",
|
||
" self.int_to_str = {i:s for s,i in vocab.items()} #B\n",
|
||
" \n",
|
||
" def encode(self, text): #C\n",
|
||
" preprocessed = re.split(r'([,.?_!\"()\\']|--|\\s)', text)\n",
|
||
" preprocessed = [item.strip() for item in preprocessed\n",
|
||
"if item.strip()]\n",
|
||
" ids = [self.str_to_int[s] for s in preprocessed]\n",
|
||
" return ids\n",
|
||
" \n",
|
||
" def decode(self, ids): #D\n",
|
||
" text = \" \".join([self.int_to_str[i] for i in ids]) \n",
|
||
" text = re.sub(r'\\s+([,.?!\"()\\'])', r'\\1', text) #E\n",
|
||
" return text"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "33b5e459",
|
||
"metadata": {},
|
||
"source": [
|
||
"使用上述的 SimpleTokenizerV1 Python 类,我们现在可以通过现有的词汇表实例化新的分词器对象,然后我们可以使用它来编码和解码文本,如图 2.8 所示。"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "643e495f",
|
||
"metadata": {},
|
||
"source": [
|
||
"**图 2.8 分词器实现共有两个常用方法:一个是编码方法,另一个是解码方法。\n",
|
||
"编码方法接收样本文本,将其分割为单独的标记,并通过词汇表将这些标记转换为标记 ID。\n",
|
||
"解码方法接收token ID,将它们转换回文本token,并将这些文本token连接成自然文本。**"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "cdae01fd",
|
||
"metadata": {},
|
||
"source": [
|
||
""
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "6e245b96",
|
||
"metadata": {},
|
||
"source": [
|
||
"让我们从 SimpleTokenizerV1 类实例化一个新的分词器对象,并使用它来对伊迪丝·沃顿的一段短篇故事进行分词,接下来在实践中尝试一下:"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 4,
|
||
"id": "623bb612",
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"[1, 58, 2, 872, 1013, 615, 541, 763, 5, 1155, 608, 5, 1, 69, 7, 39, 873, 1136, 773, 812, 7]\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"tokenizer = SimpleTokenizerV1(vocab)\n",
|
||
"text = \"\"\"\"It's the last he painted, you know,\" Mrs. Gisburn\n",
|
||
"said with pardonable pride.\"\"\"\n",
|
||
"ids = tokenizer.encode(text)\n",
|
||
"print(ids)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "aef8106f",
|
||
"metadata": {},
|
||
"source": [
|
||
"上述代码打印出以下代码的令牌(token) ID:"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "b3d2e527",
|
||
"metadata": {},
|
||
"source": [
|
||
"[1, 58, 2, 872, 1013, 615, 541, 763, 5, 1155, 608, 5, 1, 69, 7,\n",
|
||
"39, 873, 1136, 773, 812, 7]"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "535622e0",
|
||
"metadata": {},
|
||
"source": [
|
||
"接下来,让我们看看是否可以使用解码方法将这些令牌(token) ID 转换回文本:"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 5,
|
||
"id": "ba0bc417",
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"'\" It\\' s the last he painted, you know,\" Mrs. Gisburn said with pardonable pride.'"
|
||
]
|
||
},
|
||
"execution_count": 5,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"tokenizer.decode(ids)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "0234e096",
|
||
"metadata": {},
|
||
"source": [
|
||
"这将输出以下文本:"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "bf82cb8b",
|
||
"metadata": {},
|
||
"source": [
|
||
"'\" It\\' s the last he painted, you know,\" Mrs. Gisburn said\n",
|
||
"with pardonable pride.'"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "eae48908",
|
||
"metadata": {},
|
||
"source": [
|
||
"根据上面的输出,我们可以看到解码方法成功地将tokenID 转换回原始文本。"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "84d7a151",
|
||
"metadata": {},
|
||
"source": [
|
||
"到目前为止,一切顺利。\n",
|
||
"这样,我们就构建了一个分词器,能够根据训练集中的一个片段对文本进行分词和解码。\n",
|
||
"现在,让我们将其应用于训练集中未包含的一个新文本示例:"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 6,
|
||
"id": "fe01788d",
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"ename": "KeyError",
|
||
"evalue": "'Hello'",
|
||
"output_type": "error",
|
||
"traceback": [
|
||
"\u001b[1;31m---------------------------------------------------------------------------\u001b[0m",
|
||
"\u001b[1;31mKeyError\u001b[0m Traceback (most recent call last)",
|
||
"Cell \u001b[1;32mIn[6], line 2\u001b[0m\n\u001b[0;32m 1\u001b[0m text \u001b[38;5;241m=\u001b[39m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mHello, do you like tea?\u001b[39m\u001b[38;5;124m\"\u001b[39m\n\u001b[1;32m----> 2\u001b[0m \u001b[43mtokenizer\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mencode\u001b[49m\u001b[43m(\u001b[49m\u001b[43mtext\u001b[49m\u001b[43m)\u001b[49m\n",
|
||
"Cell \u001b[1;32mIn[3], line 10\u001b[0m, in \u001b[0;36mSimpleTokenizerV1.encode\u001b[1;34m(self, text)\u001b[0m\n\u001b[0;32m 7\u001b[0m preprocessed \u001b[38;5;241m=\u001b[39m re\u001b[38;5;241m.\u001b[39msplit(\u001b[38;5;124mr\u001b[39m\u001b[38;5;124m'\u001b[39m\u001b[38;5;124m([,.?_!\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124m()\u001b[39m\u001b[38;5;130;01m\\'\u001b[39;00m\u001b[38;5;124m]|--|\u001b[39m\u001b[38;5;124m\\\u001b[39m\u001b[38;5;124ms)\u001b[39m\u001b[38;5;124m'\u001b[39m, text)\n\u001b[0;32m 8\u001b[0m preprocessed \u001b[38;5;241m=\u001b[39m [item\u001b[38;5;241m.\u001b[39mstrip() \u001b[38;5;28;01mfor\u001b[39;00m item \u001b[38;5;129;01min\u001b[39;00m preprocessed\n\u001b[0;32m 9\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m item\u001b[38;5;241m.\u001b[39mstrip()]\n\u001b[1;32m---> 10\u001b[0m ids \u001b[38;5;241m=\u001b[39m [\u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mstr_to_int[s] \u001b[38;5;28;01mfor\u001b[39;00m s \u001b[38;5;129;01min\u001b[39;00m preprocessed]\n\u001b[0;32m 11\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m ids\n",
|
||
"Cell \u001b[1;32mIn[3], line 10\u001b[0m, in \u001b[0;36m<listcomp>\u001b[1;34m(.0)\u001b[0m\n\u001b[0;32m 7\u001b[0m preprocessed \u001b[38;5;241m=\u001b[39m re\u001b[38;5;241m.\u001b[39msplit(\u001b[38;5;124mr\u001b[39m\u001b[38;5;124m'\u001b[39m\u001b[38;5;124m([,.?_!\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124m()\u001b[39m\u001b[38;5;130;01m\\'\u001b[39;00m\u001b[38;5;124m]|--|\u001b[39m\u001b[38;5;124m\\\u001b[39m\u001b[38;5;124ms)\u001b[39m\u001b[38;5;124m'\u001b[39m, text)\n\u001b[0;32m 8\u001b[0m preprocessed \u001b[38;5;241m=\u001b[39m [item\u001b[38;5;241m.\u001b[39mstrip() \u001b[38;5;28;01mfor\u001b[39;00m item \u001b[38;5;129;01min\u001b[39;00m preprocessed\n\u001b[0;32m 9\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m item\u001b[38;5;241m.\u001b[39mstrip()]\n\u001b[1;32m---> 10\u001b[0m ids \u001b[38;5;241m=\u001b[39m [\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mstr_to_int\u001b[49m\u001b[43m[\u001b[49m\u001b[43ms\u001b[49m\u001b[43m]\u001b[49m \u001b[38;5;28;01mfor\u001b[39;00m s \u001b[38;5;129;01min\u001b[39;00m preprocessed]\n\u001b[0;32m 11\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m ids\n",
|
||
"\u001b[1;31mKeyError\u001b[0m: 'Hello'"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"text = \"Hello, do you like tea?\"\n",
|
||
"tokenizer.encode(text)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "55442846",
|
||
"metadata": {},
|
||
"source": [
|
||
"执行上述代码将导致以下错误:"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "d41b8330",
|
||
"metadata": {},
|
||
"source": [
|
||
"...\n",
|
||
"KeyError: 'Hello'"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "3777ed61",
|
||
"metadata": {},
|
||
"source": [
|
||
"问题在于单词 \"Hello\" 没有出现在《判决》这部短篇故事中。\n",
|
||
"所以,这个单词不包含在我们之前构建的词汇表中。\n",
|
||
"这突显出在处理大型语言模型(LLM)时,考虑使用大规模且多样化的训练集以扩展词汇表的重要性。"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "68448d76",
|
||
"metadata": {},
|
||
"source": [
|
||
"在下一节中,我们将进一步测试分词器处理包含未知词汇的文本,\n",
|
||
"并且我们还将讨论可以用来在训练过程中为大型语言模型(LLM)提供更多上下文的额外特殊token。"
|
||
]
|
||
}
|
||
],
|
||
"metadata": {
|
||
"kernelspec": {
|
||
"display_name": "Python (cell)",
|
||
"language": "python",
|
||
"name": "cell"
|
||
},
|
||
"language_info": {
|
||
"codemirror_mode": {
|
||
"name": "ipython",
|
||
"version": 3
|
||
},
|
||
"file_extension": ".py",
|
||
"mimetype": "text/x-python",
|
||
"name": "python",
|
||
"nbconvert_exporter": "python",
|
||
"pygments_lexer": "ipython3",
|
||
"version": "3.10.13"
|
||
}
|
||
},
|
||
"nbformat": 4,
|
||
"nbformat_minor": 5
|
||
}
|