llms-from-scratch-cn/Translated_Book/ch02/2.文本数据处理.ipynb
2026-03-26 12:16:45 +08:00

{
"cells": [
{
"cell_type": "markdown",
"id": "fa309c4a",
"metadata": {},
"source": [
"# 2 Working with Text Data"
]
},
{
"cell_type": "markdown",
"id": "8c769445",
"metadata": {},
"source": [
"**This chapter covers**"
]
},
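{
"cell_type": "markdown",
"id": "30ef7704",
"metadata": {},
"source": [
"- Preparing text for large language model training\n",
"- Splitting text into word and subword tokens\n",
"- Byte pair encoding as a more advanced way of tokenizing text\n",
"- Sampling training examples with a sliding window approach\n",
"- Converting tokens into vectors that are fed into a large language model"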
]
},
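{
"cell_type": "markdown",
"id": "e53c667f",
"metadata": {},
"source": [
"In the previous chapter, we explored the general structure of large language models (LLMs) and learned that they are pretrained on vast amounts of text.\n",
"Specifically, we focused on decoder-only LLMs based on the transformer architecture, which underlies ChatGPT and other popular GPT-like LLMs."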
]
},
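{
"cell_type": "markdown",
"id": "0a008526",
"metadata": {},
"source": [
"During the pretraining stage, LLMs process text one word at a time.\n",
"Training LLMs with millions to billions of parameters on a next-word prediction task yields models with impressive capabilities.\n",
"These models can then be further fine-tuned to follow general instructions or to perform specific target tasks.\n",
"But before we can build and train LLMs in the upcoming chapters, we need to prepare the training dataset, which is the focus of this chapter, as illustrated in Figure 2.1."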
]
},
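{
"cell_type": "markdown",
"id": "7b4d6b77",
"metadata": {},
"source": [
"**Figure 2.1 The three main stages: building an LLM, pretraining it on a general text dataset, and fine-tuning it on a labeled dataset. This chapter explains and builds the data preparation and sampling pipeline that supplies the LLM with text data for pretraining.**"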
]
},
{
"cell_type": "markdown",
"id": "972f6e5a",
"metadata": {},
"source": [
"![fig2.1](../img/fig-2-1.jpg)"
]
},
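{
"cell_type": "markdown",
"id": "2b28b74e",
"metadata": {},
"source": [
"In this chapter, you will learn how to prepare input text for training LLMs.\n",
"This involves splitting text into individual word and subword tokens, which are then encoded into vector representations for the LLM.\n",
"You will also learn about advanced tokenization schemes such as byte pair encoding, which is used in popular LLMs such as GPT.\n",
"Finally, we will cover a sampling and data loading strategy that produces the input-output pairs needed to train LLMs in subsequent chapters."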
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python (cell)",
"language": "python",
"name": "cell"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.13"
}
},
"nbformat": 4,
"nbformat_minor": 5
}