20240228

2026-06-06 00:04:42 +00:00 · 2024-02-28 23:31:55 +08:00
parent 2033051002
commit 1af5c3b2e9
118 changed files with 69148 additions and 92 deletions
@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) 2023-2024 Sebastian Raschka
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
@@ -1,10 +1,16 @@
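-# Project Name
+# Build an LLM from Scratch (Chinese Edition)

-Write the project's introductory information here, for example:
+The GitHub project "rasbt/LLMs-from-scratch" is a tutorial on implementing a ChatGPT-like large language model (LLM) from scratch. It contains the code for coding, pretraining, and fine-tuning a GPT-like LLM, and it is the official code repository for the book *Build a Large Language Model (From Scratch)*. The book explains how LLMs work internally and guides readers step by step through building their own, with clear text, diagrams, and examples at every stage. The approach used to train and develop a small but functional model for educational purposes mirrors the approach used to create large foundation models such as the one behind ChatGPT. This translated edition is intended to serve developers in China.

- Project background, motivation, and other introductory content
- Project table of contents
- ....
+- Target audience
+    - Technical background: This project suits readers with some programming experience, especially developers and researchers interested in large language models (LLMs).
+    - Learning goals: It suits learners who want to understand in depth how LLMs work and are willing to invest the time to build and train their own LLM from scratch.
+    - Application areas: It suits developers interested in natural language processing and artificial intelligence, as well as people who want to apply LLMs in educational or research settings.
+
+- Highlights
+    - Systematic learning: The project offers a structured learning path, from theoretical foundations to hands-on coding, that helps learners understand LLMs thoroughly.
+    - Practice-oriented: Rather than only covering theory or API usage, the project emphasizes practice, so learners master LLM development and training by doing.
+    - Accessible depth: The project explains complex concepts with clear language, diagrams, and examples, so learners without a specialist background can follow along.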

 ## Roadmap

@@ -22,11 +28,17 @@

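 | Name | Role | Bio |
 | :----| :---- | :---- |
-| 小明 | Project lead | 小明 |
-| 小红 | Chapter 1 contributor | Friend of 小明 |
-| 小强 | Chapter 2 contributor | Friend of 小明 |
+| 陈可为 | Project lead | Huazhong University of Science and Technology |
+| 王训志 | Chapter 2 contributor |  |
+| 汪健麟 | Chapter 2 contributor |  |
+| 张友东 | Chapter 3 contributor |  |
+| 邹雨衡 | Chapter 3 contributor |  |
+| 陈嘉诺 | Chapter 4 contributor |  |
+| 高立业 | Chapter 4 contributor |  |
+| 周景林 | Appendix contributor |  |
+| 陈可为 | Appendix contributor |  |
+

-*Note: the table headers can be customized, but the project lead must be identified in the list.*

 ## Follow Us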

@@ -0,0 +1,111 @@
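+# Python Setup Tips
+
+
+
+There are several ways to install Python and set up your computing environment. Here, I describe my personal preferences.
+
+(I use a computer running macOS, but this workflow is similar on Linux machines and may work on other operating systems as well.)
+
+<br>
+<br>
+
+## 1. Download and install Miniforge
+
+Download Miniforge from its GitHub repository [here](https://github.com/conda-forge/miniforge).
+
+<img src="figures/download.png" alt="download" width="600px">
+
+Depending on your operating system, this should download either an `.sh` file (macOS, Linux) or an `.exe` file (Windows).
+
+For the `.sh` file, open your command-line terminal and execute the following command:
+
+```bash
+sh ~/Desktop/Miniforge3-MacOSX-arm64.sh
+```
+
+where `Desktop/` is the folder the Miniforge installer was downloaded to. On your computer, you may have to replace it with `Downloads/`.
+
+<img src="figures/miniforge-install.png" alt="miniforge-install" width="600px">
+
+Next, step through the installer prompts, confirming with "Enter".
+
+If you work with many packages, Conda can be slow because of its thorough but complex dependency resolution and its handling of large package indexes and metadata. To speed it up, you can apply the following setting, which switches dependency resolution to the faster libmamba solver, implemented in C++:
+
+```
+conda config --set solver libmamba
+```
+
+<br>
+<br>
+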
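+## 2. Create a new virtual environment
+
+After the installation completes, I recommend creating a new virtual environment called `LLMs`, which you can do by executing
+
+```bash
+conda create -n LLMs python=3.10
+```
+
+<img src="figures/new-env.png" alt="new-env" width="600px">
+
+> Many scientific computing libraries do not support the newest Python release right away. When installing PyTorch, it is therefore advisable to use a Python version that is one or two releases behind the latest. For example, if the latest Python version is 3.13, Python 3.11 or 3.12 is recommended.
+
+Next, activate your new virtual environment (you have to do this every time you open a new terminal window or tab):
+
+```bash
+conda activate LLMs
+```
+
+<img src="figures/activate-env.png" alt="activate-env" width="600px">
+
+<br>
+<br>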
+
+## Optional: styling your terminal
+
+If you want to style your terminal similar to mine so that you can see which virtual environment is active, check out the [Oh My Zsh](https://github.com/ohmyzsh/ohmyzsh) project.
+
+<br>
+<br>
+
+## 3. Install new Python libraries
+
+
+
+To install new Python libraries, you can now use the `conda` package installer. For example, you can install [JupyterLab](https://jupyter.org/install) and [watermark](https://github.com/rasbt/watermark) as follows:
+
+```bash
+conda install jupyterlab watermark
+```
+
+<img src="figures/conda-install.png" alt="conda-install" width="600px">
+
+You can also still use `pip` to install libraries. By default, `pip` should be linked to your new `LLMs` conda environment:
+
+<img src="figures/check-pip.png" alt="check-pip" width="600px">
+
+<br>
+<br>
+
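+## 4. Install PyTorch
+
+PyTorch can be installed with pip just like any other Python library or package. For example:
+
+```bash
+pip install torch==2.0.1
+```
+
+However, because PyTorch is a comprehensive library featuring CPU- and GPU-compatible code, the installation may require additional settings and explanation (see *A.1.3 Installing PyTorch* in the book for more information).
+
+It is also highly recommended to consult the installation guide menu on the official PyTorch website at [https://pytorch.org](https://pytorch.org).
+
+<img src="figures/pytorch-installer.jpg" width="600px">
+
+
+
+---
+
+
+
+
+Any questions? Please feel free to reach out in the [Discussion Forum](https://github.com/rasbt/LLMs-from-scratch/discussions).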
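
After activating the environment, it can help to confirm from inside Python which interpreter is actually running. The following is a minimal sketch and not part of the original setup guide: the function name `describe_environment` is hypothetical, and the sketch assumes that `conda activate` sets the `CONDA_DEFAULT_ENV` environment variable.

```python
import os
import sys


def describe_environment():
    """Collect details that show which Python installation is active."""
    return {
        # Full path of the interpreter running this code
        "executable": sys.executable,
        # Root directory of the active environment
        "prefix": sys.prefix,
        # Assumed to be set by `conda activate`; None if no conda env is active
        "conda_env": os.environ.get("CONDA_DEFAULT_ENV"),
        # Interpreter version as "major.minor.micro"
        "python_version": ".".join(str(n) for n in sys.version_info[:3]),
    }


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

If `conda_env` does not show the environment you created, or `executable` points outside that environment's folder, the terminal session is probably still using a different Python installation.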
@@ -0,0 +1,111 @@
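+# Python Setup Tips
+
+
+
+There are several ways to install Python and set up your computing environment. Here, I describe my personal preferences.
+
+(I use a computer running macOS, but this workflow is similar on Linux machines and may work on other operating systems as well.)
+
+<br>
+<br>
+
+## 1. Download and install Miniforge
+
+Download Miniforge from its GitHub repository [here](https://github.com/conda-forge/miniforge).
+
+<img src="figures/download.png" alt="download" width="600px">
+
+Depending on your operating system, this should download either an `.sh` file (macOS, Linux) or an `.exe` file (Windows).
+
+For the `.sh` file, open your command-line terminal and execute the following command:
+
+```bash
+sh ~/Desktop/Miniforge3-MacOSX-arm64.sh
+```
+
+where `Desktop/` is the folder the Miniforge installer was downloaded to. On your computer, you may have to replace it with `Downloads/`.
+
+<img src="figures/miniforge-install.png" alt="miniforge-install" width="600px">
+
+Next, step through the installer prompts, confirming with "Enter".
+
+If you work with many packages, Conda can be slow because of its thorough but complex dependency resolution and its handling of large package indexes and metadata. To speed it up, you can apply the following setting, which switches dependency resolution to the faster libmamba solver, implemented in C++:
+
+```
+conda config --set solver libmamba
+```
+
+<br>
+<br>
+
+## 2. Create a new virtual environment
+
+After the installation completes, I recommend creating a new virtual environment called `LLMs`, which you can do by executing
+
+```bash
+conda create -n LLMs python=3.10
+```
+
+<img src="figures/new-env.png" alt="new-env" width="600px">
+
+> Many scientific computing libraries do not support the newest Python release right away. When installing PyTorch, it is therefore advisable to use a Python version that is one or two releases behind the latest. For example, if the latest Python version is 3.13, Python 3.11 or 3.12 is recommended.
+
+Next, activate your new virtual environment (you have to do this every time you open a new terminal window or tab):
+
+```bash
+conda activate LLMs
+```
+
+<img src="figures/activate-env.png" alt="activate-env" width="600px">
+
+<br>
+<br>
+
+## Optional: styling your terminal
+
+If you want to style your terminal similar to mine so that you can see which virtual environment is active, check out the [Oh My Zsh](https://github.com/ohmyzsh/ohmyzsh) project.
+
+<br>
+<br>
+
+## 3. Install new Python libraries
+
+
+
+To install new Python libraries, you can now use the `conda` package installer. For example, you can install [JupyterLab](https://jupyter.org/install) and [watermark](https://github.com/rasbt/watermark) as follows:
+
+```bash
+conda install jupyterlab watermark
+```
+
+<img src="figures/conda-install.png" alt="conda-install" width="600px">
+
+You can also still use `pip` to install libraries. By default, `pip` should be linked to your new `LLMs` conda environment:
+
+<img src="figures/check-pip.png" alt="check-pip" width="600px">
+
+<br>
+<br>
+
+## 4. Install PyTorch
+
+PyTorch can be installed with pip just like any other Python library or package. For example:
+
+```bash
+pip install torch==2.0.1
+```
+
+However, because PyTorch is a comprehensive library featuring CPU- and GPU-compatible code, the installation may require additional settings and explanation (see *A.1.3 Installing PyTorch* in the book for more information).
+
+It is also highly recommended to consult the installation guide menu on the official PyTorch website at [https://pytorch.org](https://pytorch.org).
+
+<img src="figures/pytorch-installer.jpg" width="600px">
+
+
+
+---
+
+
+
+
+Any questions? Please feel free to reach out in the [Discussion Forum](https://github.com/rasbt/LLMs-from-scratch/discussions).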
@@ -0,0 +1,69 @@
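+# Libraries Used In This Book
+
+This document provides more information on double-checking your installed Python version and packages. (See the [../01_optional-python-setup-preferences](../01_optional-python-setup-preferences) folder for more information on installing Python and Python packages.)
+
+We use the following main libraries in this book. Newer versions of these libraries are likely compatible as well. However, if you run into any problems with the code, you can try these library versions as a fallback:
+
+-  numpy  1.24.3
+-  scipy 1.10.1
+-  pandas  2.0.2
+-  matplotlib  3.7.1
+-  jupyterlab  4.0
+-  watermark  2.4.2
+-  torch  2.0.1
+-  tiktoken  0.5.1
+
+To install these requirements most conveniently, you can use the `requirements.txt` file:
+
+```
+pip install -r requirements.txt
+```
+
+Then, after the installation completes, check that all packages are installed and meet the minimum versions by running:
+
+```
+python python_environment_check.py
+```
+
+<img src="figures/check_1.jpg" width="600px">
+
+It is also recommended to check the versions in JupyterLab by running the `jupyter_environment_check.ipynb` notebook in this directory, which should ideally give you the same results as above.
+
+<img src="figures/check_2.jpg" width="500px">
+
+If you see the following issues, your JupyterLab instance is likely connected to the wrong conda environment:
+
+<img src="figures/jupyter-issues.jpg" width="450px">
+
+
+In this case, you can use `watermark` with the `--conda` flag to check whether you opened the JupyterLab instance in the right conda environment:
+
+<img src="figures/watermark.jpg" width="350px">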
+
+
+<br>
+<br>
+
+
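+## Installing PyTorch
+
+PyTorch can be installed with pip just like any other Python library or package. For example:
+
+```bash
+pip install torch==2.0.1
+```
+
+However, because PyTorch is a comprehensive library featuring CPU- and GPU-compatible code, the installation may require additional settings and explanation (see *A.1.3 Installing PyTorch* in the book for more information).
+
+It is also highly recommended to consult the installation guide menu on the official PyTorch website at [https://pytorch.org](https://pytorch.org).
+
+<img src="figures/pytorch-installer.jpg" width="600px">
+
+
+
+---
+
+
+
+
+If you have any questions, please feel free to reach out in the [Discussion Forum](https://github.com/rasbt/LLMs-from-scratch/discussions).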
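
The version check described above boils down to reading lines such as `numpy >= 1.24.3` and comparing the installed version against the stated minimum. The following is a simplified sketch of that idea using only the standard library. It is not the repository's `python_environment_check.py`: the helper names are hypothetical, and the comparison only looks at the leading numeric parts of a version, so pre-release suffixes and local tags such as `+cu118` are ignored.

```python
from importlib.metadata import PackageNotFoundError, version
from itertools import takewhile


def parse_requirement(line):
    """Split a line such as 'numpy >= 1.24.3' into (name, minimum_version)."""
    name, _, minimum = line.strip().partition(">=")
    return name.strip(), minimum.strip()


def as_tuple(ver):
    """Turn '2.0.1+cu118' into (2, 0, 1), keeping only leading digits per part."""
    numbers = []
    for piece in ver.split("+")[0].split("."):
        digits = "".join(takewhile(str.isdigit, piece))
        if not digits:
            break  # stop at the first non-numeric part
        numbers.append(int(digits))
    return tuple(numbers)


def meets_minimum(installed, minimum):
    """Compare two versions numerically, padding the shorter one with zeros."""
    a, b = as_tuple(installed), as_tuple(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check(line):
    """Return an [OK]/[FAIL] message for one requirement line."""
    name, minimum = parse_requirement(line)
    try:
        installed = version(name)  # looks up the installed distribution
    except PackageNotFoundError:
        return f"[FAIL] {name} is not installed"
    if meets_minimum(installed, minimum):
        return f"[OK] {name} {installed}"
    return f"[FAIL] {name} {installed}, please upgrade to >= {minimum}"


if __name__ == "__main__":
    for requirement in ["numpy >= 1.24.3", "torch >= 2.0.1"]:
        print(check(requirement))
```

For anything beyond a quick sanity check, a full version parser such as the one in the `packaging` library, which the repository's script relies on, handles pre-releases and other edge cases that this sketch skips.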
@@ -0,0 +1,62 @@
+{
+ "cells": [
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "id": "67f6f7ed-b67d-465b-bf6f-a99b0d996930",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "[OK] Your Python version is 3.11.4\n",
+      "[OK] numpy 1.25.2\n",
+      "[OK] scipy 1.11.1\n",
+      "[OK] pandas 2.0.3\n",
+      "[OK] matplotlib 3.7.2\n",
+      "[OK] jupyterlab 4.0.4\n",
+      "[OK] watermark 2.4.3\n",
+      "[OK] torch 2.0.1\n",
+      "[OK] tiktoken 0.5.1\n"
+     ]
+    }
+   ],
+   "source": [
+    "from python_environment_check import check_packages, get_requirements_dict\n",
+    "\n",
+    "d = get_requirements_dict()\n",
+    "check_packages(d)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "d5ca05fc-98e0-4bba-a95e-350e1764a12c",
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.11.4"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
@@ -0,0 +1,67 @@
+from os.path import dirname, join, realpath
+from packaging.version import parse as version_parse
+import platform
+import sys
+
+if version_parse(platform.python_version()) < version_parse('3.9'):
+    print('[FAIL] We recommend Python 3.9 or newer but'
+          ' found version %s' % (sys.version))
+else:
+    print('[OK] Your Python version is %s' % (platform.python_version()))
+
+
+def get_packages(pkgs):
+    versions = []
+    for p in pkgs:
+        try:
+            imported = __import__(p)
+            try:
+                versions.append(imported.__version__)
+            except AttributeError:
+                try:
+                    versions.append(imported.version)
+                except AttributeError:
+                    try:
+                        versions.append(imported.version_info)
+                    except AttributeError:
+                        try:
+                            # Last resort: read the version from the installed
+                            # distribution metadata (standard library, Python 3.8+)
+                            import importlib.metadata
+                            versions.append(importlib.metadata.version(p))
+                        except ImportError:
+                            # PackageNotFoundError is a subclass of ImportError
+                            versions.append('0.0')
+        except ImportError:
+            print(f'[FAIL]: {p} is not installed and/or cannot be imported.')
+            versions.append('N/A')
+    return versions
+
+
+def get_requirements_dict():
+    PROJECT_ROOT = dirname(realpath(__file__))
+    REQUIREMENTS_FILE = join(PROJECT_ROOT, "requirements.txt")
+    d = {}
+    with open(REQUIREMENTS_FILE) as f:
+        for line in f:
+            line = line.split(" ")
+            d[line[0]] = line[-1]
+    return d
+
+
+def check_packages(d):
+    versions = get_packages(d.keys())
+
+    for (pkg_name, suggested_ver), actual_ver in zip(d.items(), versions):
+        if actual_ver == 'N/A':
+            continue
+        actual_ver, suggested_ver = version_parse(actual_ver), version_parse(suggested_ver)
+        if actual_ver < suggested_ver:
+            print(f'[FAIL] {pkg_name} {actual_ver}, please upgrade to >= {suggested_ver}')
+        else:
+            print(f'[OK] {pkg_name} {actual_ver}')
+
+
+if __name__ == '__main__':
+    d = get_requirements_dict()
+    check_packages(d)
@@ -0,0 +1,8 @@
+numpy >= 1.24.3
+scipy >= 1.10.1
+pandas >= 2.0.2
+matplotlib >= 3.7.1
+jupyterlab >= 4.0
+watermark >= 2.4.2
+torch >= 2.0.1
+tiktoken >= 0.5.1
@@ -0,0 +1,69 @@
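+# Libraries Used In This Book
+
+This document provides more information on double-checking your installed Python version and packages. (See the [../01_optional-python-setup-preferences](../01_optional-python-setup-preferences) folder for more information on installing Python and Python packages.)
+
+We use the following main libraries in this book. Newer versions of these libraries are likely compatible as well. However, if you run into any problems with the code, you can try these library versions as a fallback:
+
+-  numpy  1.24.3
+-  scipy 1.10.1
+-  pandas  2.0.2
+-  matplotlib  3.7.1
+-  jupyterlab  4.0
+-  watermark  2.4.2
+-  torch  2.0.1
+-  tiktoken  0.5.1
+
+To install these requirements most conveniently, you can use the `requirements.txt` file:
+
+```
+pip install -r requirements.txt
+```
+
+Then, after the installation completes, check that all packages are installed and meet the minimum versions by running:
+
+```
+python python_environment_check.py
+```
+
+<img src="figures/check_1.jpg" width="600px">
+
+It is also recommended to check the versions in JupyterLab by running the `jupyter_environment_check.ipynb` notebook in this directory, which should ideally give you the same results as above.
+
+<img src="figures/check_2.jpg" width="500px">
+
+If you see the following issues, your JupyterLab instance is likely connected to the wrong conda environment:
+
+<img src="figures/jupyter-issues.jpg" width="450px">
+
+
+In this case, you can use `watermark` with the `--conda` flag to check whether you opened the JupyterLab instance in the right conda environment:
+
+<img src="figures/watermark.jpg" width="350px">
+
+
+<br>
+<br>
+
+
+## Installing PyTorch
+
+PyTorch can be installed with pip just like any other Python library or package. For example:
+
+```bash
+pip install torch==2.0.1
+```
+
+However, because PyTorch is a comprehensive library featuring CPU- and GPU-compatible code, the installation may require additional settings and explanation (see *A.1.3 Installing PyTorch* in the book for more information).
+
+It is also highly recommended to consult the installation guide menu on the official PyTorch website at [https://pytorch.org](https://pytorch.org).
+
+<img src="figures/pytorch-installer.jpg" width="600px">
+
+
+
+---
+
+
+
+
+If you have any questions, please feel free to reach out in the [Discussion Forum](https://github.com/rasbt/LLMs-from-scratch/discussions).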
@@ -0,0 +1,62 @@
+{
+ "cells": [
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "id": "67f6f7ed-b67d-465b-bf6f-a99b0d996930",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "[OK] Your Python version is 3.11.4\n",
+      "[OK] numpy 1.25.2\n",
+      "[OK] scipy 1.11.1\n",
+      "[OK] pandas 2.0.3\n",
+      "[OK] matplotlib 3.7.2\n",
+      "[OK] jupyterlab 4.0.4\n",
+      "[OK] watermark 2.4.3\n",
+      "[OK] torch 2.0.1\n",
+      "[OK] tiktoken 0.5.1\n"
+     ]
+    }
+   ],
+   "source": [
+    "from python_environment_check import check_packages, get_requirements_dict\n",
+    "\n",
+    "d = get_requirements_dict()\n",
+    "check_packages(d)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "d5ca05fc-98e0-4bba-a95e-350e1764a12c",
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.11.5"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
@@ -0,0 +1,67 @@
+from os.path import dirname, join, realpath
+from packaging.version import parse as version_parse
+import platform
+import sys
+
+if version_parse(platform.python_version()) < version_parse('3.9'):
+    print('[FAIL] We recommend Python 3.9 or newer but'
+          ' found version %s' % (sys.version))
+else:
+    print('[OK] Your Python version is %s' % (platform.python_version()))
+
+
+def get_packages(pkgs):
+    versions = []
+    for p in pkgs:
+        try:
+            imported = __import__(p)
+            try:
+                versions.append(imported.__version__)
+            except AttributeError:
+                try:
+                    versions.append(imported.version)
+                except AttributeError:
+                    try:
+                        versions.append(imported.version_info)
+                    except AttributeError:
+                        try:
+                            # Last resort: read the version from the installed
+                            # distribution metadata (standard library, Python 3.8+)
+                            import importlib.metadata
+                            versions.append(importlib.metadata.version(p))
+                        except ImportError:
+                            # PackageNotFoundError is a subclass of ImportError
+                            versions.append('0.0')
+        except ImportError:
+            print(f'[FAIL]: {p} is not installed and/or cannot be imported.')
+            versions.append('N/A')
+    return versions
+
+
+def get_requirements_dict():
+    PROJECT_ROOT = dirname(realpath(__file__))
+    REQUIREMENTS_FILE = join(PROJECT_ROOT, "requirements.txt")
+    d = {}
+    with open(REQUIREMENTS_FILE) as f:
+        for line in f:
+            line = line.split(" ")
+            d[line[0]] = line[-1]
+    return d
+
+
+def check_packages(d):
+    versions = get_packages(d.keys())
+
+    for (pkg_name, suggested_ver), actual_ver in zip(d.items(), versions):
+        if actual_ver == 'N/A':
+            continue
+        actual_ver, suggested_ver = version_parse(actual_ver), version_parse(suggested_ver)
+        if actual_ver < suggested_ver:
+            print(f'[FAIL] {pkg_name} {actual_ver}, please upgrade to >= {suggested_ver}')
+        else:
+            print(f'[OK] {pkg_name} {actual_ver}')
+
+
+if __name__ == '__main__':
+    d = get_requirements_dict()
+    check_packages(d)
@@ -0,0 +1,8 @@
+numpy >= 1.24.3
+scipy >= 1.10.1
+pandas >= 2.0.2
+matplotlib >= 3.7.1
+jupyterlab >= 4.0
+watermark >= 2.4.2
+torch >= 2.0.1
+tiktoken >= 0.5.1
@@ -0,0 +1,178 @@
+# Appendix A: Introduction to PyTorch (Part 3)
+
+import torch
+import torch.nn.functional as F
+from torch.utils.data import Dataset, DataLoader
+
+# NEW imports:
+import os
+import torch.multiprocessing as mp
+from torch.utils.data.distributed import DistributedSampler
+from torch.nn.parallel import DistributedDataParallel as DDP
+from torch.distributed import init_process_group, destroy_process_group
+
+
+# NEW: function to initialize a distributed process group (1 process / GPU)
+# this allows communication among processes
+def ddp_setup(rank, world_size):
+    """
+    Arguments:
+        rank: a unique process ID
+        world_size: total number of processes in the group
+    """
+    # address of the machine running the rank:0 process
+    # here, we assume all GPUs are on the same machine
+    os.environ["MASTER_ADDR"] = "localhost"
+    # any free port on the machine
+    os.environ["MASTER_PORT"] = "12345"
+
+    # initialize process group
+    # Windows users may have to use "gloo" instead of "nccl" as backend
+    # nccl: NVIDIA Collective Communication Library
+    init_process_group(backend="nccl", rank=rank, world_size=world_size)
+    torch.cuda.set_device(rank)
+
+
+class ToyDataset(Dataset):
+    def __init__(self, X, y):
+        self.features = X
+        self.labels = y
+
+    def __getitem__(self, index):
+        one_x = self.features[index]
+        one_y = self.labels[index]
+        return one_x, one_y
+
+    def __len__(self):
+        return self.labels.shape[0]
+
+
+class NeuralNetwork(torch.nn.Module):
+    def __init__(self, num_inputs, num_outputs):
+        super().__init__()
+
+        self.layers = torch.nn.Sequential(
+            # 1st hidden layer
+            torch.nn.Linear(num_inputs, 30),
+            torch.nn.ReLU(),
+
+            # 2nd hidden layer
+            torch.nn.Linear(30, 20),
+            torch.nn.ReLU(),
+
+            # output layer
+            torch.nn.Linear(20, num_outputs),
+        )
+
+    def forward(self, x):
+        logits = self.layers(x)
+        return logits
+
+
+def prepare_dataset():
+    X_train = torch.tensor([
+        [-1.2, 3.1],
+        [-0.9, 2.9],
+        [-0.5, 2.6],
+        [2.3, -1.1],
+        [2.7, -1.5]
+    ])
+    y_train = torch.tensor([0, 0, 0, 1, 1])
+
+    X_test = torch.tensor([
+        [-0.8, 2.8],
+        [2.6, -1.6],
+    ])
+    y_test = torch.tensor([0, 1])
+
+    train_ds = ToyDataset(X_train, y_train)
+    test_ds = ToyDataset(X_test, y_test)
+
+    train_loader = DataLoader(
+        dataset=train_ds,
+        batch_size=2,
+        shuffle=False, # NEW: False because of DistributedSampler below
+        pin_memory=True,
+        drop_last=True,
+        # NEW: chunk batches across GPUs without overlapping samples:
+        sampler=DistributedSampler(train_ds) # NEW
+    )
+    test_loader = DataLoader(
+        dataset=test_ds,
+        batch_size=2,
+        shuffle=False,
+    )
+    return train_loader, test_loader
+
+
+# NEW: wrapper
+def main(rank, world_size, num_epochs):
+
+    ddp_setup(rank, world_size) # NEW: initialize process groups
+
+    train_loader, test_loader = prepare_dataset()
+    model = NeuralNetwork(num_inputs=2, num_outputs=2)
+    model.to(rank)
+    optimizer = torch.optim.SGD(model.parameters(), lr=0.5)
+
+    model = DDP(model, device_ids=[rank]) # NEW: wrap model with DDP
+    # the core model is now accessible as model.module
+    
+    for epoch in range(num_epochs):
+    
+        model.train()
+        for features, labels in train_loader:
+    
+            features, labels = features.to(rank), labels.to(rank) # New: use rank
+            logits = model(features)
+            loss = F.cross_entropy(logits, labels) # Loss function
+    
+            optimizer.zero_grad()
+            loss.backward()
+            optimizer.step()
+    
+            ### LOGGING
+            print(f"[GPU{rank}] Epoch: {epoch+1:03d}/{num_epochs:03d}"
+                  f" | Batchsize {labels.shape[0]:03d}"
+                  f" | Train/Val Loss: {loss:.2f}")
+    
+    model.eval()
+    train_acc = compute_accuracy(model, train_loader, device=rank)
+    print(f"[GPU{rank}] Training accuracy", train_acc)
+    test_acc = compute_accuracy(model, test_loader, device=rank)
+    print(f"[GPU{rank}] Test accuracy", test_acc)
+
+    destroy_process_group() # NEW: cleanly exit distributed mode
+
+
+def compute_accuracy(model, dataloader, device):
+    model = model.eval()
+    correct = 0.0
+    total_examples = 0
+
+    for idx, (features, labels) in enumerate(dataloader):
+        features, labels = features.to(device), labels.to(device)
+
+        with torch.no_grad():
+            logits = model(features)
+        predictions = torch.argmax(logits, dim=1)
+        compare = labels == predictions
+        correct += torch.sum(compare)
+        total_examples += len(compare)
+    return (correct / total_examples).item()
+
+
+if __name__ == "__main__":
+    print("PyTorch version:", torch.__version__)
+    print("CUDA available:", torch.cuda.is_available())
+    print("Number of GPUs available:", torch.cuda.device_count())
+
+    torch.manual_seed(123)
+
+    # NEW: spawn new processes
+    # note that spawn will automatically pass the rank
+    num_epochs = 3
+    world_size = torch.cuda.device_count()
+    mp.spawn(main, args=(world_size, num_epochs), nprocs=world_size)
+    # nprocs=world_size spawns one process per GPU
+
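
The script above relies on `DistributedSampler` to give each process its own slice of the training set. As a rough illustration of that idea, the plain-Python sketch below shows how sample indices can be divided across ranks when shuffling is turned off. It is an illustration under stated assumptions rather than PyTorch's implementation: the function name `shard_indices` is hypothetical, it assumes at least as many samples as processes, and it assumes the sampler pads the index list by repeating samples from the start so that every rank receives the same number of indices.

```python
import math


def shard_indices(num_samples, world_size, rank):
    """Return the sample indices one process would see when shuffling is off."""
    indices = list(range(num_samples))
    # Pad by repeating samples from the start so every rank gets the same count
    per_rank = math.ceil(num_samples / world_size)
    total_size = per_rank * world_size
    indices += indices[: total_size - num_samples]
    # Each rank takes every world_size-th index, starting at its own rank
    return indices[rank:total_size:world_size]


if __name__ == "__main__":
    # Five training examples split across two processes, as in the toy dataset
    for rank in range(2):
        print(f"rank {rank}: {shard_indices(5, 2, rank)}")
    # rank 0: [0, 2, 4]
    # rank 1: [1, 3, 0]
```

With five samples and two processes, one sample appears on both ranks because of the padding. With shuffling enabled, the indices would be permuted before being split, so the exact assignment would differ.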
@@ -0,0 +1,452 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "O9i6kzBsZVaZ"
+   },
+   "source": [
+    "# Appendix A: Introduction to PyTorch (Part 2)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "ppbG5d-NZezH"
+   },
+   "source": [
+    "## A.9 Optimizing training performance with GPUs"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "6jH0J_DPZhbn"
+   },
+   "source": [
+    "### A.9.1 PyTorch computations on GPU devices"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {
+    "colab": {
+     "base_uri": "https://localhost:8080/"
+    },
+    "id": "RM7kGhwMF_nO",
+    "outputId": "ac60b048-b81f-4bb0-90fa-1ca474f04e9a"
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "2.0.1+cu118\n"
+     ]
+    }
+   ],
+   "source": [
+    "import torch\n",
+    "\n",
+    "print(torch.__version__)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {
+    "colab": {
+     "base_uri": "https://localhost:8080/"
+    },
+    "id": "OXLCKXhiUkZt",
+    "outputId": "39fe5366-287e-47eb-cc34-3508d616c4f9"
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "True\n"
+     ]
+    }
+   ],
+   "source": [
+    "print(torch.cuda.is_available())"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {
+    "colab": {
+     "base_uri": "https://localhost:8080/"
+    },
+    "id": "MTTlfh53Va-T",
+    "outputId": "f31d8bbe-577f-4db4-9939-02e66b9f96d1"
+   },
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "tensor([5., 7., 9.])"
+      ]
+     },
+     "execution_count": 3,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "tensor_1 = torch.tensor([1., 2., 3.])\n",
+    "tensor_2 = torch.tensor([4., 5., 6.])\n",
+    "\n",
+    "print(tensor_1 + tensor_2)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "metadata": {
+    "colab": {
+     "base_uri": "https://localhost:8080/"
+    },
+    "id": "Z4LwTNw7Vmmb",
+    "outputId": "1c025c6a-e3ed-4c7c-f5fd-86c14607036e"
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "tensor([5., 7., 9.], device='cuda:0')\n"
+     ]
+    }
+   ],
+   "source": [
+    "tensor_1 = tensor_1.to(\"cuda\")\n",
+    "tensor_2 = tensor_2.to(\"cuda\")\n",
+    "\n",
+    "print(tensor_1 + tensor_2)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "metadata": {
+    "colab": {
+     "base_uri": "https://localhost:8080/",
+     "height": 184
+    },
+    "id": "tKT6URN1Vuft",
+    "outputId": "e6f01e7f-d9cf-44cb-cc6d-46fc7907d5c0"
+   },
+   "outputs": [
+    {
+     "ename": "RuntimeError",
+     "evalue": "ignored",
+     "output_type": "error",
+     "traceback": [
+      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+      "\u001b[0;31mRuntimeError\u001b[0m                              Traceback (most recent call last)",
+      "\u001b[0;32m<ipython-input-7-4ff3c4d20fc3>\u001b[0m in \u001b[0;36m<cell line: 2>\u001b[0;34m()\u001b[0m\n\u001b[1;32m      1\u001b[0m \u001b[0mtensor_1\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mtensor_1\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mto\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"cpu\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m \u001b[0mprint\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mtensor_1\u001b[0m \u001b[0;34m+\u001b[0m \u001b[0mtensor_2\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
+      "\u001b[0;31mRuntimeError\u001b[0m: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!"
+     ]
+    }
+   ],
+   "source": [
+    "tensor_1 = tensor_1.to(\"cpu\")\n",
+    "print(tensor_1 + tensor_2)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "c8j1cWDcWAMf"
+   },
+   "source": [
+    "### A.9.2 Single-GPU training"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 8,
+   "metadata": {
+    "id": "GyY59cjieitv"
+   },
+   "outputs": [],
+   "source": [
+    "X_train = torch.tensor([\n",
+    "    [-1.2, 3.1],\n",
+    "    [-0.9, 2.9],\n",
+    "    [-0.5, 2.6],\n",
+    "    [2.3, -1.1],\n",
+    "    [2.7, -1.5]\n",
+    "])\n",
+    "\n",
+    "y_train = torch.tensor([0, 0, 0, 1, 1])\n",
+    "\n",
+    "X_test = torch.tensor([\n",
+    "    [-0.8, 2.8],\n",
+    "    [2.6, -1.6],\n",
+    "])\n",
+    "\n",
+    "y_test = torch.tensor([0, 1])"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 9,
+   "metadata": {
+    "id": "v41gKqEJempa"
+   },
+   "outputs": [],
+   "source": [
+    "from torch.utils.data import Dataset\n",
+    "\n",
+    "\n",
+    "class ToyDataset(Dataset):\n",
+    "    def __init__(self, X, y):\n",
+    "        self.features = X\n",
+    "        self.labels = y\n",
+    "\n",
+    "    def __getitem__(self, index):\n",
+    "        one_x = self.features[index]\n",
+    "        one_y = self.labels[index]\n",
+    "        return one_x, one_y\n",
+    "\n",
+    "    def __len__(self):\n",
+    "        return self.labels.shape[0]\n",
+    "\n",
+    "train_ds = ToyDataset(X_train, y_train)\n",
+    "test_ds = ToyDataset(X_test, y_test)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 23,
+   "metadata": {
+    "id": "UPGVRuylep8Y"
+   },
+   "outputs": [],
+   "source": [
+    "from torch.utils.data import DataLoader\n",
+    "\n",
+    "torch.manual_seed(123)\n",
+    "\n",
+    "train_loader = DataLoader(\n",
+    "    dataset=train_ds,\n",
+    "    batch_size=2,\n",
+    "    shuffle=True,\n",
+    "    num_workers=1,\n",
+    "    drop_last=True\n",
+    ")\n",
+    "\n",
+    "test_loader = DataLoader(\n",
+    "    dataset=test_ds,\n",
+    "    batch_size=2,\n",
+    "    shuffle=False,\n",
+    "    num_workers=1\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 24,
+   "metadata": {
+    "id": "drhg6IXofAXh"
+   },
+   "outputs": [],
+   "source": [
+    "class NeuralNetwork(torch.nn.Module):\n",
+    "    def __init__(self, num_inputs, num_outputs):\n",
+    "        super().__init__()\n",
+    "\n",
+    "        self.layers = torch.nn.Sequential(\n",
+    "\n",
+    "            # 1st hidden layer\n",
+    "            torch.nn.Linear(num_inputs, 30),\n",
+    "            torch.nn.ReLU(),\n",
+    "\n",
+    "            # 2nd hidden layer\n",
+    "            torch.nn.Linear(30, 20),\n",
+    "            torch.nn.ReLU(),\n",
+    "\n",
+    "            # output layer\n",
+    "            torch.nn.Linear(20, num_outputs),\n",
+    "        )\n",
+    "\n",
+    "    def forward(self, x):\n",
+    "        logits = self.layers(x)\n",
+    "        return logits"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 25,
+   "metadata": {
+    "colab": {
+     "base_uri": "https://localhost:8080/"
+    },
+    "id": "7jaS5sqPWCY0",
+    "outputId": "84c74615-38f2-48b8-eeda-b5912fed1d3a"
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Epoch: 001/003 | Batch 000/002 | Train/Val Loss: 0.75\n",
+      "Epoch: 001/003 | Batch 001/002 | Train/Val Loss: 0.65\n",
+      "Epoch: 002/003 | Batch 000/002 | Train/Val Loss: 0.44\n",
+      "Epoch: 002/003 | Batch 001/002 | Train/Val Loss: 0.13\n",
+      "Epoch: 003/003 | Batch 000/002 | Train/Val Loss: 0.03\n",
+      "Epoch: 003/003 | Batch 001/002 | Train/Val Loss: 0.00\n"
+     ]
+    }
+   ],
+   "source": [
+    "import torch.nn.functional as F\n",
+    "\n",
+    "\n",
+    "torch.manual_seed(123)\n",
+    "model = NeuralNetwork(num_inputs=2, num_outputs=2)\n",
+    "\n",
+    "device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\") # NEW\n",
+    "model = model.to(device) # NEW\n",
+    "\n",
+    "optimizer = torch.optim.SGD(model.parameters(), lr=0.5)\n",
+    "\n",
+    "num_epochs = 3\n",
+    "\n",
+    "for epoch in range(num_epochs):\n",
+    "\n",
+    "    model.train()\n",
+    "    for batch_idx, (features, labels) in enumerate(train_loader):\n",
+    "\n",
+    "        features, labels = features.to(device), labels.to(device) # NEW\n",
+    "        logits = model(features)\n",
+    "        loss = F.cross_entropy(logits, labels) # Loss function\n",
+    "\n",
+    "        optimizer.zero_grad()\n",
+    "        loss.backward()\n",
+    "        optimizer.step()\n",
+    "\n",
+    "        ### LOGGING\n",
+    "        print(f\"Epoch: {epoch+1:03d}/{num_epochs:03d}\"\n",
+    "              f\" | Batch {batch_idx:03d}/{len(train_loader):03d}\"\n",
+    "              f\" | Train/Val Loss: {loss:.2f}\")\n",
+    "\n",
+    "    model.eval()\n",
+    "    # Optional model evaluation"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 26,
+   "metadata": {
+    "id": "4qrlmnPPe7FO"
+   },
+   "outputs": [],
+   "source": [
+    "def compute_accuracy(model, dataloader, device):\n",
+    "\n",
+    "    model = model.eval()\n",
+    "    correct = 0.0\n",
+    "    total_examples = 0\n",
+    "\n",
+    "    for idx, (features, labels) in enumerate(dataloader):\n",
+    "\n",
+    "        features, labels = features.to(device), labels.to(device) # NEW\n",
+    "\n",
+    "        with torch.no_grad():\n",
+    "            logits = model(features)\n",
+    "\n",
+    "        predictions = torch.argmax(logits, dim=1)\n",
+    "        compare = labels == predictions\n",
+    "        correct += torch.sum(compare)\n",
+    "        total_examples += len(compare)\n",
+    "\n",
+    "    return (correct / total_examples).item()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 27,
+   "metadata": {
+    "colab": {
+     "base_uri": "https://localhost:8080/"
+    },
+    "id": "1_-BfkfEf4HX",
+    "outputId": "473bf21d-5880-4de3-fc8a-051d75315b94"
+   },
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "1.0"
+      ]
+     },
+     "execution_count": 27,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "compute_accuracy(model, train_loader, device=device)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 21,
+   "metadata": {
+    "colab": {
+     "base_uri": "https://localhost:8080/"
+    },
+    "id": "iYtXKBGEgKss",
+    "outputId": "508edd84-3fb7-4d04-cb23-9df0c3d24170"
+   },
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "1.0"
+      ]
+     },
+     "execution_count": 21,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "compute_accuracy(model, test_loader, device=device)"
+   ]
+  }
+ ],
+ "metadata": {
+  "accelerator": "GPU",
+  "colab": {
+   "gpuType": "T4",
+   "provenance": []
+  },
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.10.6"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
@@ -0,0 +1,176 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Exercise A.3"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import torch\n",
+    "\n",
+    "class NeuralNetwork(torch.nn.Module):\n",
+    "    def __init__(self, num_inputs, num_outputs):\n",
+    "        super().__init__()\n",
+    "\n",
+    "        self.layers = torch.nn.Sequential(\n",
+    "                \n",
+    "            # 1st hidden layer\n",
+    "            torch.nn.Linear(num_inputs, 30),\n",
+    "            torch.nn.ReLU(),\n",
+    "\n",
+    "            # 2nd hidden layer\n",
+    "            torch.nn.Linear(30, 20),\n",
+    "            torch.nn.ReLU(),\n",
+    "\n",
+    "            # output layer\n",
+    "            torch.nn.Linear(20, num_outputs),\n",
+    "        )\n",
+    "\n",
+    "    def forward(self, x):\n",
+    "        logits = self.layers(x)\n",
+    "        return logits"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Total number of trainable model parameters: 752\n"
+     ]
+    }
+   ],
+   "source": [
+    "model = NeuralNetwork(2, 2)\n",
+    "\n",
+    "num_params = sum(p.numel() for p in model.parameters() if p.requires_grad)\n",
+    "print(\"Total number of trainable model parameters:\", num_params)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Exercise A.4"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {
+    "id": "qGgnamiyLJxp"
+   },
+   "outputs": [],
+   "source": [
+    "import torch\n",
+    "\n",
+    "a = torch.rand(100, 200)\n",
+    "b = torch.rand(200, 300)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {
+    "colab": {
+     "base_uri": "https://localhost:8080/"
+    },
+    "id": "CvGvIeVkLzXE",
+    "outputId": "44d027be-0787-4348-9c06-4e559d94d0e1"
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "63.8 µs ± 8.7 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)\n"
+     ]
+    }
+   ],
+   "source": [
+    "%timeit a @ b"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {
+    "id": "OmRtZLa9L2ZG"
+   },
+   "outputs": [],
+   "source": [
+    "a, b = a.to(\"cuda\"), b.to(\"cuda\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {
+    "colab": {
+     "base_uri": "https://localhost:8080/"
+    },
+    "id": "duLEhXDPL6k0",
+    "outputId": "3486471d-fd62-446f-9855-2d01f41fd101"
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "13.8 µs ± 425 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)\n"
+     ]
+    }
+   ],
+   "source": [
+    "%timeit a @ b"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "Zqqa-To2L749"
+   },
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "accelerator": "GPU",
+  "colab": {
+   "gpuType": "V100",
+   "machine_shape": "hm",
+   "provenance": []
+  },
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.10.6"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
@@ -0,0 +1,3 @@
+# Chapter 1: Understanding Large Language Models
+
+There is no code in this chapter.
@@ -0,0 +1,3 @@
+# Chapter 1: Understanding Large Language Models
+
+There is no code in this chapter.
@@ -0,0 +1,7 @@
+# Chapter 2: Working with Text Data
+
+- [01_main-chapter-code](01_main-chapter-code) contains the main chapter code and exercise solutions
+  
+- [02_bonus_bytepair-encoder](02_bonus_bytepair-encoder) contains optional code to benchmark different byte pair encoder implementations
+  
+- [03_bonus_embedding-vs-matmul](03_bonus_embedding-vs-matmul) contains optional (bonus) code to explain that embedding layers and fully connected layers applied to one-hot encoded vectors are equivalent.
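The equivalence described for `03_bonus_embedding-vs-matmul` can be illustrated without PyTorch. The following is a minimal sketch in plain Python, not the bonus notebook's code: the weight values and the helper names `embedding_lookup`, `one_hot`, and `matmul_vec` are hypothetical and chosen only for illustration. It shows that multiplying a one-hot vector by a weight matrix selects the same row that an index lookup returns.

```python
# Hypothetical toy weight matrix: 4 "tokens", embedding dimension 3.
weights = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
]


def embedding_lookup(matrix, token_id):
    """Embedding-layer view: return the row indexed by token_id."""
    return list(matrix[token_id])


def one_hot(token_id, num_classes):
    """Build a one-hot row vector with a 1.0 at position token_id."""
    return [1.0 if i == token_id else 0.0 for i in range(num_classes)]


def matmul_vec(vec, matrix):
    """Fully-connected view (no bias): row vector times an n x d matrix."""
    num_cols = len(matrix[0])
    return [
        sum(vec[i] * matrix[i][j] for i in range(len(vec)))
        for j in range(num_cols)
    ]


token_ids = [2, 0, 3]

# Path 1: direct row lookup.
looked_up = [embedding_lookup(weights, t) for t in token_ids]

# Path 2: one-hot encode, then multiply by the same weight matrix.
multiplied = [matmul_vec(one_hot(t, len(weights)), weights) for t in token_ids]

print(looked_up == multiplied)
```

In PyTorch terms, the lookup corresponds to `torch.nn.Embedding` and the product to a bias-free `torch.nn.Linear` applied to one-hot inputs. `Linear` stores its weight as `(out_features, in_features)`, so the two weight matrices are transposes of each other. The lookup skips all the multiplications by zero, which is why an embedding layer is the cheaper of the two.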
@@ -0,0 +1,5 @@
+# Chapter 2: Working with Text Data
+
+- [ch02.ipynb](ch02.ipynb) contains all the code as it appears in the chapter
+- [dataloader.ipynb](dataloader.ipynb) is a minimal notebook with the main data loading pipeline implemented in this chapter
+
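The core of the data loading pipeline in `dataloader.ipynb` is a sliding window over the token IDs, where each target sequence is the input sequence shifted by one position. The sketch below isolates that step in plain Python, assuming a made-up list of token IDs in place of the tokenized text; the function name `sliding_window_pairs` is hypothetical, and the loop bounds mirror the ones used in `GPTDatasetV1`.

```python
def sliding_window_pairs(token_ids, max_length, stride):
    """Split token_ids into (input, target) chunks of length max_length.

    Each target chunk is its input chunk shifted one token to the right,
    so position k of the target is the "next token" for position k of the input.
    """
    inputs, targets = [], []
    for i in range(0, len(token_ids) - max_length, stride):
        inputs.append(token_ids[i:i + max_length])
        targets.append(token_ids[i + 1:i + max_length + 1])
    return inputs, targets


# Hypothetical token IDs standing in for tokenizer.encode(raw_text).
token_ids = list(range(10))

# stride == max_length: consecutive windows do not overlap.
inputs, targets = sliding_window_pairs(token_ids, max_length=4, stride=4)
for x, y in zip(inputs, targets):
    print(x, "->", y)

# stride < max_length: consecutive windows overlap.
overlap_inputs, _ = sliding_window_pairs(token_ids, max_length=4, stride=2)
print(overlap_inputs)
```

With `stride` equal to `max_length`, each token appears in exactly one input chunk. A smaller stride yields more, overlapping training examples from the same text.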
@@ -0,0 +1,179 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "6f678e62-7bcb-4405-86ae-dce94f494303",
+   "metadata": {},
+   "source": [
+    "# The Main Data Loading Pipeline Summarized"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "070000fc-a7b7-4c56-a2c0-a938d413a790",
+   "metadata": {},
+   "source": [
+    "The complete chapter code is located in [ch02.ipynb](./ch02.ipynb).\n",
+    "\n",
+    "This notebook contains the main takeaway: the data loading pipeline without the intermediate steps."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "id": "93804da5-372b-45ff-9ef4-8398ba1dd78e",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "torch version: 2.1.0.post301\n",
+      "tiktoken version: 0.5.2\n"
+     ]
+    }
+   ],
+   "source": [
+    "from importlib.metadata import version\n",
+    "\n",
+    "import tiktoken\n",
+    "import torch\n",
+    "\n",
+    "print(\"torch version:\", version(\"torch\"))\n",
+    "print(\"tiktoken version:\", version(\"tiktoken\"))"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "id": "0ed4b7db-3b47-4fd3-a4a6-5f4ed5dd166e",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import tiktoken\n",
+    "import torch\n",
+    "from torch.utils.data import Dataset, DataLoader\n",
+    "\n",
+    "\n",
+    "class GPTDatasetV1(Dataset):\n",
+    "    def __init__(self, txt, tokenizer, max_length, stride):\n",
+    "        self.tokenizer = tokenizer\n",
+    "        self.input_ids = []\n",
+    "        self.target_ids = []\n",
+    "\n",
+    "        # Tokenize the entire text\n",
+    "        token_ids = tokenizer.encode(txt, allowed_special={'<|endoftext|>'})\n",
+    "\n",
+    "        # Use a sliding window to chunk the book into overlapping sequences of max_length\n",
+    "        for i in range(0, len(token_ids) - max_length, stride):\n",
+    "            input_chunk = token_ids[i:i + max_length]\n",
+    "            target_chunk = token_ids[i + 1: i + max_length + 1]\n",
+    "            self.input_ids.append(torch.tensor(input_chunk))\n",
+    "            self.target_ids.append(torch.tensor(target_chunk))\n",
+    "\n",
+    "    def __len__(self):\n",
+    "        return len(self.input_ids)\n",
+    "\n",
+    "    def __getitem__(self, idx):\n",
+    "        return self.input_ids[idx], self.target_ids[idx]\n",
+    "\n",
+    "\n",
+    "def create_dataloader_v1(txt, batch_size=4, max_length=256, stride=128, shuffle=True):\n",
+    "    # Initialize the tokenizer\n",
+    "    tokenizer = tiktoken.get_encoding(\"gpt2\")\n",
+    "\n",
+    "    # Create dataset\n",
+    "    dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)\n",
+    "\n",
+    "    # Create dataloader\n",
+    "    dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=shuffle)\n",
+    "\n",
+    "    return dataloader\n",
+    "\n",
+    "\n",
+    "\n",
+    "with open(\"the-verdict.txt\", \"r\", encoding=\"utf-8\") as f:\n",
+    "    raw_text = f.read()\n",
+    "\n",
+    "tokenizer = tiktoken.get_encoding(\"gpt2\")\n",
+    "encoded_text = tokenizer.encode(raw_text)\n",
+    "\n",
+    "vocab_size = 50257\n",
+    "output_dim = 256\n",
+    "block_size = 1024\n",
+    "\n",
+    "\n",
+    "token_embedding_layer = torch.nn.Embedding(vocab_size, output_dim)\n",
+    "pos_embedding_layer = torch.nn.Embedding(block_size, output_dim)\n",
+    "\n",
+    "max_length = 4\n",
+    "dataloader = create_dataloader_v1(raw_text, batch_size=8, max_length=max_length, stride=max_length)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "id": "664397bc-6daa-4b88-90aa-e8fc1fbd5846",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "for batch in dataloader:\n",
+    "    x, y = batch\n",
+    "\n",
+    "    token_embeddings = token_embedding_layer(x)\n",
+    "    pos_embeddings = pos_embedding_layer(torch.arange(max_length))\n",
+    "\n",
+    "    input_embeddings = token_embeddings + pos_embeddings\n",
+    "\n",
+    "    break"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "id": "d3664332-e6bb-447e-8b96-203aafde8b24",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "torch.Size([8, 4, 256])\n"
+     ]
+    }
+   ],
+   "source": [
+    "print(input_embeddings.shape)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "34ebdac9-a3ff-4135-8a0f-3ac8ac21af75",
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.11.5"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
@@ -0,0 +1,339 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "ab88d307-61ba-45ef-89bc-e3569443dfca",
+   "metadata": {},
+   "source": [
+    "# Chapter 2 Exercise solutions"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "6f678e62-7bcb-4405-86ae-dce94f494303",
+   "metadata": {},
+   "source": [
+    "# Exercise 2.1"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "id": "7614337f-f639-42c9-a99b-d33f74fa8a03",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import tiktoken\n",
+    "\n",
+    "tokenizer = tiktoken.get_encoding(\"gpt2\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "id": "664397bc-6daa-4b88-90aa-e8fc1fbd5846",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "[33901]"
+      ]
+     },
+     "execution_count": 2,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "tokenizer.encode(\"Ak\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "id": "d3664332-e6bb-447e-8b96-203aafde8b24",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "[86]"
+      ]
+     },
+     "execution_count": 3,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "tokenizer.encode(\"w\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "id": "2773c09d-c136-4372-a2be-04b58d292842",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "[343]"
+      ]
+     },
+     "execution_count": 4,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "tokenizer.encode(\"ir\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "id": "8a6abd32-1e0a-4038-9dd2-673f47bcdeb5",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "[86]"
+      ]
+     },
+     "execution_count": 5,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "tokenizer.encode(\"w\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "id": "26ae940a-9841-4e27-a1df-b83fc8a488b3",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "[220]"
+      ]
+     },
+     "execution_count": 6,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "tokenizer.encode(\" \")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "id": "a606c39a-6747-4cd8-bb38-e3183f80908d",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "[959]"
+      ]
+     },
+     "execution_count": 7,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "tokenizer.encode(\"ier\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 8,
+   "id": "47c7268d-8fdc-4957-bc68-5be6113f45a7",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "'Akwirw ier'"
+      ]
+     },
+     "execution_count": 8,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "tokenizer.decode([33901, 86, 343, 86, 220, 959])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "29e5034a-95ed-46d8-9972-589354dc9fd4",
+   "metadata": {},
+   "source": [
+    "# Exercise 2.2"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 9,
+   "id": "4d50af16-937b-49e0-8ffd-42d30cbb41c9",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import tiktoken\n",
+    "import torch\n",
+    "from torch.utils.data import Dataset, DataLoader\n",
+    "\n",
+    "\n",
+    "class GPTDatasetV1(Dataset):\n",
+    "    def __init__(self, txt, tokenizer, max_length, stride):\n",
+    "        self.tokenizer = tokenizer\n",
+    "        self.input_ids = []\n",
+    "        self.target_ids = []\n",
+    "\n",
+    "        # Tokenize the entire text\n",
+    "        token_ids = tokenizer.encode(txt)\n",
+    "\n",
+    "        # Use a sliding window to chunk the book into overlapping sequences of max_length\n",
+    "        for i in range(0, len(token_ids) - max_length, stride):\n",
+    "            input_chunk = token_ids[i:i + max_length]\n",
+    "            target_chunk = token_ids[i + 1: i + max_length + 1]\n",
+    "            self.input_ids.append(torch.tensor(input_chunk))\n",
+    "            self.target_ids.append(torch.tensor(target_chunk))\n",
+    "\n",
+    "    def __len__(self):\n",
+    "        return len(self.input_ids)\n",
+    "\n",
+    "    def __getitem__(self, idx):\n",
+    "        return self.input_ids[idx], self.target_ids[idx]\n",
+    "\n",
+    "\n",
+    "def create_dataloader(txt, batch_size=4, max_length=256, stride=128):\n",
+    "    # Initialize the tokenizer\n",
+    "    tokenizer = tiktoken.get_encoding(\"gpt2\")\n",
+    "\n",
+    "    # Create dataset\n",
+    "    dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)\n",
+    "\n",
+    "    # Create dataloader\n",
+    "    dataloader = DataLoader(dataset, batch_size=batch_size)\n",
+    "\n",
+    "    return dataloader\n",
+    "\n",
+    "\n",
+    "with open(\"the-verdict.txt\", \"r\", encoding=\"utf-8\") as f:\n",
+    "    raw_text = f.read()\n",
+    "\n",
+    "tokenizer = tiktoken.get_encoding(\"gpt2\")\n",
+    "encoded_text = tokenizer.encode(raw_text)\n",
+    "\n",
+    "vocab_size = 50257\n",
+    "output_dim = 256\n",
+    "max_len = 4\n",
+    "block_size = max_len\n",
+    "\n",
+    "token_embedding_layer = torch.nn.Embedding(vocab_size, output_dim)\n",
+    "pos_embedding_layer = torch.nn.Embedding(block_size, output_dim)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 10,
+   "id": "0128eefa-d7c8-4f76-9851-566dfa7c3745",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "tensor([[  40,  367],\n",
+       "        [2885, 1464],\n",
+       "        [1807, 3619],\n",
+       "        [ 402,  271]])"
+      ]
+     },
+     "execution_count": 10,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "dataloader = create_dataloader(raw_text, batch_size=4, max_length=2, stride=2)\n",
+    "\n",
+    "for batch in dataloader:\n",
+    "    x, y = batch\n",
+    "    break\n",
+    "\n",
+    "x"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 11,
+   "id": "ff5c1e90-c6de-4a87-adf6-7e19f603291c",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "tensor([[   40,   367,  2885,  1464,  1807,  3619,   402,   271],\n",
+       "        [ 2885,  1464,  1807,  3619,   402,   271, 10899,  2138],\n",
+       "        [ 1807,  3619,   402,   271, 10899,  2138,   257,  7026],\n",
+       "        [  402,   271, 10899,  2138,   257,  7026, 15632,   438]])"
+      ]
+     },
+     "execution_count": 11,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "dataloader = create_dataloader(raw_text, batch_size=4, max_length=8, stride=2)\n",
+    "\n",
+    "for batch in dataloader:\n",
+    "    x, y = batch\n",
+    "    break\n",
+    "\n",
+    "x"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "a0e9e342-1499-41ab-bd65-8117f3615fa2",
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.11.5"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
@@ -0,0 +1,5 @@
+# Chapter 2: Working with Text Data
+
+- [ch02.ipynb](ch02.ipynb) contains all the code as it appears in the chapter
+- [dataloader.ipynb](dataloader.ipynb) is a minimal notebook with the main data loading pipeline implemented in this chapter
+
@@ -0,0 +1,179 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "6f678e62-7bcb-4405-86ae-dce94f494303",
+   "metadata": {},
+   "source": [
+    "# The Main Data Loading Pipeline Summarized"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "070000fc-a7b7-4c56-a2c0-a938d413a790",
+   "metadata": {},
+   "source": [
+    "The complete chapter code is located in [ch02.ipynb](./ch02.ipynb).\n",
+    "\n",
+    "This notebook contains the main takeaway: the data loading pipeline without the intermediate steps."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "id": "93804da5-372b-45ff-9ef4-8398ba1dd78e",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "torch version: 2.1.0.post301\n",
+      "tiktoken version: 0.5.2\n"
+     ]
+    }
+   ],
+   "source": [
+    "from importlib.metadata import version\n",
+    "\n",
+    "import tiktoken\n",
+    "import torch\n",
+    "\n",
+    "print(\"torch version:\", version(\"torch\"))\n",
+    "print(\"tiktoken version:\", version(\"tiktoken\"))"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "id": "0ed4b7db-3b47-4fd3-a4a6-5f4ed5dd166e",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import tiktoken\n",
+    "import torch\n",
+    "from torch.utils.data import Dataset, DataLoader\n",
+    "\n",
+    "\n",
+    "class GPTDatasetV1(Dataset):\n",
+    "    def __init__(self, txt, tokenizer, max_length, stride):\n",
+    "        self.tokenizer = tokenizer\n",
+    "        self.input_ids = []\n",
+    "        self.target_ids = []\n",
+    "\n",
+    "        # Tokenize the entire text\n",
+    "        token_ids = tokenizer.encode(txt, allowed_special={'<|endoftext|>'})\n",
+    "\n",
+    "        # Use a sliding window to chunk the book into overlapping sequences of max_length\n",
+    "        for i in range(0, len(token_ids) - max_length, stride):\n",
+    "            input_chunk = token_ids[i:i + max_length]\n",
+    "            target_chunk = token_ids[i + 1: i + max_length + 1]\n",
+    "            self.input_ids.append(torch.tensor(input_chunk))\n",
+    "            self.target_ids.append(torch.tensor(target_chunk))\n",
+    "\n",
+    "    def __len__(self):\n",
+    "        return len(self.input_ids)\n",
+    "\n",
+    "    def __getitem__(self, idx):\n",
+    "        return self.input_ids[idx], self.target_ids[idx]\n",
+    "\n",
+    "\n",
+    "def create_dataloader_v1(txt, batch_size=4, max_length=256, stride=128, shuffle=True):\n",
+    "    # Initialize the tokenizer\n",
+    "    tokenizer = tiktoken.get_encoding(\"gpt2\")\n",
+    "\n",
+    "    # Create dataset\n",
+    "    dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)\n",
+    "\n",
+    "    # Create dataloader\n",
+    "    dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=shuffle)\n",
+    "\n",
+    "    return dataloader\n",
+    "\n",
+    "\n",
+    "\n",
+    "with open(\"the-verdict.txt\", \"r\", encoding=\"utf-8\") as f:\n",
+    "    raw_text = f.read()\n",
+    "\n",
+    "tokenizer = tiktoken.get_encoding(\"gpt2\")\n",
+    "encoded_text = tokenizer.encode(raw_text)\n",
+    "\n",
+    "vocab_size = 50257\n",
+    "output_dim = 256\n",
+    "block_size = 1024\n",
+    "\n",
+    "\n",
+    "token_embedding_layer = torch.nn.Embedding(vocab_size, output_dim)\n",
+    "pos_embedding_layer = torch.nn.Embedding(block_size, output_dim)\n",
+    "\n",
+    "max_length = 4\n",
+    "dataloader = create_dataloader_v1(raw_text, batch_size=8, max_length=max_length, stride=max_length)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "id": "664397bc-6daa-4b88-90aa-e8fc1fbd5846",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "for batch in dataloader:\n",
+    "    x, y = batch\n",
+    "\n",
+    "    token_embeddings = token_embedding_layer(x)\n",
+    "    pos_embeddings = pos_embedding_layer(torch.arange(max_length))\n",
+    "\n",
+    "    input_embeddings = token_embeddings + pos_embeddings\n",
+    "\n",
+    "    break"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "id": "d3664332-e6bb-447e-8b96-203aafde8b24",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "torch.Size([8, 4, 256])\n"
+     ]
+    }
+   ],
+   "source": [
+    "print(input_embeddings.shape)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "34ebdac9-a3ff-4135-8a0f-3ac8ac21af75",
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.11.5"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
@@ -0,0 +1,339 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "ab88d307-61ba-45ef-89bc-e3569443dfca",
+   "metadata": {},
+   "source": [
+    "# Chapter 2 Exercise solutions"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "6f678e62-7bcb-4405-86ae-dce94f494303",
+   "metadata": {},
+   "source": [
+    "# Exercise 2.1"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "id": "7614337f-f639-42c9-a99b-d33f74fa8a03",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import tiktoken\n",
+    "\n",
+    "tokenizer = tiktoken.get_encoding(\"gpt2\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "id": "664397bc-6daa-4b88-90aa-e8fc1fbd5846",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "[33901]"
+      ]
+     },
+     "execution_count": 2,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "tokenizer.encode(\"Ak\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "id": "d3664332-e6bb-447e-8b96-203aafde8b24",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "[86]"
+      ]
+     },
+     "execution_count": 3,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "tokenizer.encode(\"w\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "id": "2773c09d-c136-4372-a2be-04b58d292842",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "[343]"
+      ]
+     },
+     "execution_count": 4,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "tokenizer.encode(\"ir\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "id": "8a6abd32-1e0a-4038-9dd2-673f47bcdeb5",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "[86]"
+      ]
+     },
+     "execution_count": 5,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "tokenizer.encode(\"w\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "id": "26ae940a-9841-4e27-a1df-b83fc8a488b3",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "[220]"
+      ]
+     },
+     "execution_count": 6,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "tokenizer.encode(\" \")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "id": "a606c39a-6747-4cd8-bb38-e3183f80908d",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "[959]"
+      ]
+     },
+     "execution_count": 7,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "tokenizer.encode(\"ier\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 8,
+   "id": "47c7268d-8fdc-4957-bc68-5be6113f45a7",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "'Akwirw ier'"
+      ]
+     },
+     "execution_count": 8,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "tokenizer.decode([33901, 86, 343, 86, 220, 959])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "29e5034a-95ed-46d8-9972-589354dc9fd4",
+   "metadata": {},
+   "source": [
+    "# Exercise 2.2"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 9,
+   "id": "4d50af16-937b-49e0-8ffd-42d30cbb41c9",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import tiktoken\n",
+    "import torch\n",
+    "from torch.utils.data import Dataset, DataLoader\n",
+    "\n",
+    "\n",
+    "class GPTDatasetV1(Dataset):\n",
+    "    def __init__(self, txt, tokenizer, max_length, stride):\n",
+    "        self.tokenizer = tokenizer\n",
+    "        self.input_ids = []\n",
+    "        self.target_ids = []\n",
+    "\n",
+    "        # Tokenize the entire text\n",
+    "        token_ids = tokenizer.encode(txt)\n",
+    "\n",
+    "        # Use a sliding window to chunk the book into overlapping sequences of max_length\n",
+    "        for i in range(0, len(token_ids) - max_length, stride):\n",
+    "            input_chunk = token_ids[i:i + max_length]\n",
+    "            target_chunk = token_ids[i + 1: i + max_length + 1]\n",
+    "            self.input_ids.append(torch.tensor(input_chunk))\n",
+    "            self.target_ids.append(torch.tensor(target_chunk))\n",
+    "\n",
+    "    def __len__(self):\n",
+    "        return len(self.input_ids)\n",
+    "\n",
+    "    def __getitem__(self, idx):\n",
+    "        return self.input_ids[idx], self.target_ids[idx]\n",
+    "\n",
+    "\n",
+    "def create_dataloader(txt, batch_size=4, max_length=256, stride=128):\n",
+    "    # Initialize the tokenizer\n",
+    "    tokenizer = tiktoken.get_encoding(\"gpt2\")\n",
+    "\n",
+    "    # Create dataset\n",
+    "    dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)\n",
+    "\n",
+    "    # Create dataloader\n",
+    "    dataloader = DataLoader(dataset, batch_size=batch_size)\n",
+    "\n",
+    "    return dataloader\n",
+    "\n",
+    "\n",
+    "with open(\"the-verdict.txt\", \"r\", encoding=\"utf-8\") as f:\n",
+    "    raw_text = f.read()\n",
+    "\n",
+    "tokenizer = tiktoken.get_encoding(\"gpt2\")\n",
+    "encoded_text = tokenizer.encode(raw_text)\n",
+    "\n",
+    "vocab_size = 50257\n",
+    "output_dim = 256\n",
+    "max_len = 4\n",
+    "block_size = max_len\n",
+    "\n",
+    "token_embedding_layer = torch.nn.Embedding(vocab_size, output_dim)\n",
+    "pos_embedding_layer = torch.nn.Embedding(block_size, output_dim)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 10,
+   "id": "0128eefa-d7c8-4f76-9851-566dfa7c3745",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "tensor([[  40,  367],\n",
+       "        [2885, 1464],\n",
+       "        [1807, 3619],\n",
+       "        [ 402,  271]])"
+      ]
+     },
+     "execution_count": 10,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "dataloader = create_dataloader(raw_text, batch_size=4, max_length=2, stride=2)\n",
+    "\n",
+    "for batch in dataloader:\n",
+    "    x, y = batch\n",
+    "    break\n",
+    "\n",
+    "x"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 11,
+   "id": "ff5c1e90-c6de-4a87-adf6-7e19f603291c",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "tensor([[   40,   367,  2885,  1464,  1807,  3619,   402,   271],\n",
+       "        [ 2885,  1464,  1807,  3619,   402,   271, 10899,  2138],\n",
+       "        [ 1807,  3619,   402,   271, 10899,  2138,   257,  7026],\n",
+       "        [  402,   271, 10899,  2138,   257,  7026, 15632,   438]])"
+      ]
+     },
+     "execution_count": 11,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "dataloader = create_dataloader(raw_text, batch_size=4, max_length=8, stride=2)\n",
+    "\n",
+    "for batch in dataloader:\n",
+    "    x, y = batch\n",
+    "    break\n",
+    "\n",
+    "x"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "a0e9e342-1499-41ab-bd65-8117f3615fa2",
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.11.5"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
@@ -0,0 +1,165 @@
+I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no great surprise to me to hear that, in the height of his glory, he had dropped his painting, married a rich widow, and established himself in a villa on the Riviera. (Though I rather thought it would have been Rome or Florence.)
+
+"The height of his glory"--that was what the women called it. I can hear Mrs. Gideon Thwing--his last Chicago sitter--deploring his unaccountable abdication. "Of course it's going to send the value of my picture 'way up; but I don't think of that, Mr. Rickham--the loss to Arrt is all I think of." The word, on Mrs. Thwing's lips, multiplied its _rs_ as though they were reflected in an endless vista of mirrors. And it was not only the Mrs. Thwings who mourned. Had not the exquisite Hermia Croft, at the last Grafton Gallery show, stopped me before Gisburn's "Moon-dancers" to say, with tears in her eyes: "We shall not look upon its like again"?
+
+Well!--even through the prism of Hermia's tears I felt able to face the fact with equanimity. Poor Jack Gisburn! The women had made him--it was fitting that they should mourn him. Among his own sex fewer regrets were heard, and in his own trade hardly a murmur. Professional jealousy? Perhaps. If it were, the honour of the craft was vindicated by little Claude Nutley, who, in all good faith, brought out in the Burlington a very handsome "obituary" on Jack--one of those showy articles stocked with random technicalities that I have heard (I won't say by whom) compared to Gisburn's painting. And so--his resolve being apparently irrevocable--the discussion gradually died out, and, as Mrs. Thwing had predicted, the price of "Gisburns" went up.
+
+It was not till three years later that, in the course of a few weeks' idling on the Riviera, it suddenly occurred to me to wonder why Gisburn had given up his painting. On reflection, it really was a tempting problem. To accuse his wife would have been too easy--his fair sitters had been denied the solace of saying that Mrs. Gisburn had "dragged him down." For Mrs. Gisburn--as such--had not existed till nearly a year after Jack's resolve had been taken. It might be that he had married her--since he liked his ease--because he didn't want to go on painting; but it would have been hard to prove that he had given up his painting because he had married her.
+
+Of course, if she had not dragged him down, she had equally, as Miss Croft contended, failed to "lift him up"--she had not led him back to the easel. To put the brush into his hand again--what a vocation for a wife! But Mrs. Gisburn appeared to have disdained it--and I felt it might be interesting to find out why.
+
+The desultory life of the Riviera lends itself to such purely academic speculations; and having, on my way to Monte Carlo, caught a glimpse of Jack's balustraded terraces between the pines, I had myself borne thither the next day.
+
+I found the couple at tea beneath their palm-trees; and Mrs. Gisburn's welcome was so genial that, in the ensuing weeks, I claimed it frequently. It was not that my hostess was "interesting": on that point I could have given Miss Croft the fullest reassurance. It was just because she was _not_ interesting--if I may be pardoned the bull--that I found her so. For Jack, all his life, had been surrounded by interesting women: they had fostered his art, it had been reared in the hot-house of their adulation. And it was therefore instructive to note what effect the "deadening atmosphere of mediocrity" (I quote Miss Croft) was having on him.
+
+I have mentioned that Mrs. Gisburn was rich; and it was immediately perceptible that her husband was extracting from this circumstance a delicate but substantial satisfaction. It is, as a rule, the people who scorn money who get most out of it; and Jack's elegant disdain of his wife's big balance enabled him, with an appearance of perfect good-breeding, to transmute it into objects of art and luxury. To the latter, I must add, he remained relatively indifferent; but he was buying Renaissance bronzes and eighteenth-century pictures with a discrimination that bespoke the amplest resources.
+
+"Money's only excuse is to put beauty into circulation," was one of the axioms he laid down across the Sevres and silver of an exquisitely appointed luncheon-table, when, on a later day, I had again run over from Monte Carlo; and Mrs. Gisburn, beaming on him, added for my enlightenment: "Jack is so morbidly sensitive to every form of beauty."
+
+Poor Jack! It had always been his fate to have women say such things of him: the fact should be set down in extenuation. What struck me now was that, for the first time, he resented the tone. I had seen him, so often, basking under similar tributes--was it the conjugal note that robbed them of their savour? No--for, oddly enough, it became apparent that he was fond of Mrs. Gisburn--fond enough not to see her absurdity. It was his own absurdity he seemed to be wincing under--his own attitude as an object for garlands and incense.
+
+"My dear, since I've chucked painting people don't say that stuff about me--they say it about Victor Grindle," was his only protest, as he rose from the table and strolled out onto the sunlit terrace.
+
+I glanced after him, struck by his last word. Victor Grindle was, in fact, becoming the man of the moment--as Jack himself, one might put it, had been the man of the hour. The younger artist was said to have formed himself at my friend's feet, and I wondered if a tinge of jealousy underlay the latter's mysterious abdication. But no--for it was not till after that event that the _rose Dubarry_ drawing-rooms had begun to display their "Grindles."
+
+I turned to Mrs. Gisburn, who had lingered to give a lump of sugar to her spaniel in the dining-room.
+
+"Why _has_ he chucked painting?" I asked abruptly.
+
+She raised her eyebrows with a hint of good-humoured surprise.
+
+"Oh, he doesn't _have_ to now, you know; and I want him to enjoy himself," she said quite simply.
+
+I looked about the spacious white-panelled room, with its _famille-verte_ vases repeating the tones of the pale damask curtains, and its eighteenth-century pastels in delicate faded frames.
+
+"Has he chucked his pictures too? I haven't seen a single one in the house."
+
+A slight shade of constraint crossed Mrs. Gisburn's open countenance. "It's his ridiculous modesty, you know. He says they're not fit to have about; he's sent them all away except one--my portrait--and that I have to keep upstairs."
+
+His ridiculous modesty--Jack's modesty about his pictures? My curiosity was growing like the bean-stalk. I said persuasively to my hostess: "I must really see your portrait, you know."
+
+She glanced out almost timorously at the terrace where her husband, lounging in a hooded chair, had lit a cigar and drawn the Russian deerhound's head between his knees.
+
+"Well, come while he's not looking," she said, with a laugh that tried to hide her nervousness; and I followed her between the marble Emperors of the hall, and up the wide stairs with terra-cotta nymphs poised among flowers at each landing.
+
+In the dimmest corner of her boudoir, amid a profusion of delicate and distinguished objects, hung one of the familiar oval canvases, in the inevitable garlanded frame. The mere outline of the frame called up all Gisburn's past!
+
+Mrs. Gisburn drew back the window-curtains, moved aside a _jardiniere_ full of pink azaleas, pushed an arm-chair away, and said: "If you stand here you can just manage to see it. I had it over the mantel-piece, but he wouldn't let it stay."
+
+Yes--I could just manage to see it--the first portrait of Jack's I had ever had to strain my eyes over! Usually they had the place of honour--say the central panel in a pale yellow or _rose Dubarry_ drawing-room, or a monumental easel placed so that it took the light through curtains of old Venetian point. The more modest place became the picture better; yet, as my eyes grew accustomed to the half-light, all the characteristic qualities came out--all the hesitations disguised as audacities, the tricks of prestidigitation by which, with such consummate skill, he managed to divert attention from the real business of the picture to some pretty irrelevance of detail. Mrs. Gisburn, presenting a neutral surface to work on--forming, as it were, so inevitably the background of her own picture--had lent herself in an unusual degree to the display of this false virtuosity. The picture was one of Jack's "strongest," as his admirers would have put it--it represented, on his part, a swelling of muscles, a congesting of veins, a balancing, straddling and straining, that reminded one of the circus-clown's ironic efforts to lift a feather. It met, in short, at every point the demand of lovely woman to be painted "strongly" because she was tired of being painted "sweetly"--and yet not to lose an atom of the sweetness.
+
+"It's the last he painted, you know," Mrs. Gisburn said with pardonable pride. "The last but one," she corrected herself--"but the other doesn't count, because he destroyed it."
+
+"Destroyed it?" I was about to follow up this clue when I heard a footstep and saw Jack himself on the threshold.
+
+As he stood there, his hands in the pockets of his velveteen coat, the thin brown waves of hair pushed back from his white forehead, his lean sunburnt cheeks furrowed by a smile that lifted the tips of a self-confident moustache, I felt to what a degree he had the same quality as his pictures--the quality of looking cleverer than he was.
+
+His wife glanced at him deprecatingly, but his eyes travelled past her to the portrait.
+
+"Mr. Rickham wanted to see it," she began, as if excusing herself. He shrugged his shoulders, still smiling.
+
+"Oh, Rickham found me out long ago," he said lightly; then, passing his arm through mine: "Come and see the rest of the house."
+
+He showed it to me with a kind of naive suburban pride: the bath-rooms, the speaking-tubes, the dress-closets, the trouser-presses--all the complex simplifications of the millionaire's domestic economy. And whenever my wonder paid the expected tribute he said, throwing out his chest a little: "Yes, I really don't see how people manage to live without that."
+
+Well--it was just the end one might have foreseen for him. Only he was, through it all and in spite of it all--as he had been through, and in spite of, his pictures--so handsome, so charming, so disarming, that one longed to cry out: "Be dissatisfied with your leisure!" as once one had longed to say: "Be dissatisfied with your work!"
+
+But, with the cry on my lips, my diagnosis suffered an unexpected check.
+
+"This is my own lair," he said, leading me into a dark plain room at the end of the florid vista. It was square and brown and leathery: no "effects"; no bric-a-brac, none of the air of posing for reproduction in a picture weekly--above all, no least sign of ever having been used as a studio.
+
+The fact brought home to me the absolute finality of Jack's break with his old life.
+
+"Don't you ever dabble with paint any more?" I asked, still looking about for a trace of such activity.
+
+"Never," he said briefly.
+
+"Or water-colour--or etching?"
+
+His confident eyes grew dim, and his cheeks paled a little under their handsome sunburn.
+
+"Never think of it, my dear fellow--any more than if I'd never touched a brush."
+
+And his tone told me in a flash that he never thought of anything else.
+
+I moved away, instinctively embarrassed by my unexpected discovery; and as I turned, my eye fell on a small picture above the mantel-piece--the only object breaking the plain oak panelling of the room.
+
+"Oh, by Jove!" I said.
+
+It was a sketch of a donkey--an old tired donkey, standing in the rain under a wall.
+
+"By Jove--a Stroud!" I cried.
+
+He was silent; but I felt him close behind me, breathing a little quickly.
+
+"What a wonder! Made with a dozen lines--but on everlasting foundations. You lucky chap, where did you get it?"
+
+He answered slowly: "Mrs. Stroud gave it to me."
+
+"Ah--I didn't know you even knew the Strouds. He was such an inflexible hermit."
+
+"I didn't--till after. . . . She sent for me to paint him when he was dead."
+
+"When he was dead? You?"
+
+I must have let a little too much amazement escape through my surprise, for he answered with a deprecating laugh: "Yes--she's an awful simpleton, you know, Mrs. Stroud. Her only idea was to have him done by a fashionable painter--ah, poor Stroud! She thought it the surest way of proclaiming his greatness--of forcing it on a purblind public. And at the moment I was _the_ fashionable painter."
+
+"Ah, poor Stroud--as you say. Was _that_ his history?"
+
+"That was his history. She believed in him, gloried in him--or thought she did. But she couldn't bear not to have all the drawing-rooms with her. She couldn't bear the fact that, on varnishing days, one could always get near enough to see his pictures. Poor woman! She's just a fragment groping for other fragments. Stroud is the only whole I ever knew."
+
+"You ever knew? But you just said--"
+
+Gisburn had a curious smile in his eyes.
+
+"Oh, I knew him, and he knew me--only it happened after he was dead."
+
+I dropped my voice instinctively. "When she sent for you?"
+
+"Yes--quite insensible to the irony. She wanted him vindicated--and by me!"
+
+He laughed again, and threw back his head to look up at the sketch of the donkey. "There were days when I couldn't look at that thing--couldn't face it. But I forced myself to put it here; and now it's cured me--cured me. That's the reason why I don't dabble any more, my dear Rickham; or rather Stroud himself is the reason."
+
+For the first time my idle curiosity about my companion turned into a serious desire to understand him better.
+
+"I wish you'd tell me how it happened," I said.
+
+He stood looking up at the sketch, and twirling between his fingers a cigarette he had forgotten to light. Suddenly he turned toward me.
+
+"I'd rather like to tell you--because I've always suspected you of loathing my work."
+
+I made a deprecating gesture, which he negatived with a good-humoured shrug.
+
+"Oh, I didn't care a straw when I believed in myself--and now it's an added tie between us!"
+
+He laughed slightly, without bitterness, and pushed one of the deep arm-chairs forward. "There: make yourself comfortable--and here are the cigars you like."
+
+He placed them at my elbow and continued to wander up and down the room, stopping now and then beneath the picture.
+
+"How it happened? I can tell you in five minutes--and it didn't take much longer to happen. . . . I can remember now how surprised and pleased I was when I got Mrs. Stroud's note. Of course, deep down, I had always _felt_ there was no one like him--only I had gone with the stream, echoed the usual platitudes about him, till I half got to think he was a failure, one of the kind that are left behind. By Jove, and he _was_ left behind--because he had come to stay! The rest of us had to let ourselves be swept along or go under, but he was high above the current--on everlasting foundations, as you say.
+
+"Well, I went off to the house in my most egregious mood--rather moved, Lord forgive me, at the pathos of poor Stroud's career of failure being crowned by the glory of my painting him! Of course I meant to do the picture for nothing--I told Mrs. Stroud so when she began to stammer something about her poverty. I remember getting off a prodigious phrase about the honour being _mine_--oh, I was princely, my dear Rickham! I was posing to myself like one of my own sitters.
+
+"Then I was taken up and left alone with him. I had sent all my traps in advance, and I had only to set up the easel and get to work. He had been dead only twenty-four hours, and he died suddenly, of heart disease, so that there had been no preliminary work of destruction--his face was clear and untouched. I had met him once or twice, years before, and thought him insignificant and dingy. Now I saw that he was superb.
+
+"I was glad at first, with a merely aesthetic satisfaction: glad to have my hand on such a 'subject.' Then his strange life-likeness began to affect me queerly--as I blocked the head in I felt as if he were watching me do it. The sensation was followed by the thought: if he _were_ watching me, what would he say to my way of working? My strokes began to go a little wild--I felt nervous and uncertain.
+
+"Once, when I looked up, I seemed to see a smile behind his close grayish beard--as if he had the secret, and were amusing himself by holding it back from me. That exasperated me still more. The secret? Why, I had a secret worth twenty of his! I dashed at the canvas furiously, and tried some of my bravura tricks. But they failed me, they crumbled. I saw that he wasn't watching the showy bits--I couldn't distract his attention; he just kept his eyes on the hard passages between. Those were the ones I had always shirked, or covered up with some lying paint. And how he saw through my lies!
+
+"I looked up again, and caught sight of that sketch of the donkey hanging on the wall near his bed. His wife told me afterward it was the last thing he had done--just a note taken with a shaking hand, when he was down in Devonshire recovering from a previous heart attack. Just a note! But it tells his whole history. There are years of patient scornful persistence in every line. A man who had swum with the current could never have learned that mighty up-stream stroke. . . .
+
+"I turned back to my work, and went on groping and muddling; then I looked at the donkey again. I saw that, when Stroud laid in the first stroke, he knew just what the end would be. He had possessed his subject, absorbed it, recreated it. When had I done that with any of my things? They hadn't been born of me--I had just adopted them. . . .
+
+"Hang it, Rickham, with that face watching me I couldn't do another stroke. The plain truth was, I didn't know where to put it--_I had never known_. Only, with my sitters and my public, a showy splash of colour covered up the fact--I just threw paint into their faces. . . . Well, paint was the one medium those dead eyes could see through--see straight to the tottering foundations underneath. Don't you know how, in talking a foreign language, even fluently, one says half the time not what one wants to but what one can? Well--that was the way I painted; and as he lay there and watched me, the thing they called my 'technique' collapsed like a house of cards. He didn't sneer, you understand, poor Stroud--he just lay there quietly watching, and on his lips, through the gray beard, I seemed to hear the question: 'Are you sure you know where you're coming out?'
+
+"If I could have painted that face, with that question on it, I should have done a great thing. The next greatest thing was to see that I couldn't--and that grace was given me. But, oh, at that minute, Rickham, was there anything on earth I wouldn't have given to have Stroud alive before me, and to hear him say: 'It's not too late--I'll show you how'?
+
+"It _was_ too late--it would have been, even if he'd been alive. I packed up my traps, and went down and told Mrs. Stroud. Of course I didn't tell her _that_--it would have been Greek to her. I simply said I couldn't paint him, that I was too moved. She rather liked the idea--she's so romantic! It was that that made her give me the donkey. But she was terribly upset at not getting the portrait--she did so want him 'done' by some one showy! At first I was afraid she wouldn't let me off--and at my wits' end I suggested Grindle. Yes, it was I who started Grindle: I told Mrs. Stroud he was the 'coming' man, and she told somebody else, and so it got to be true. . . . And he painted Stroud without wincing; and she hung the picture among her husband's things. . . ."
+
+He flung himself down in the arm-chair near mine, laid back his head, and clasping his arms beneath it, looked up at the picture above the chimney-piece.
+
+"I like to fancy that Stroud himself would have given it to me, if he'd been able to say what he thought that day."
+
+And, in answer to a question I put half-mechanically--"Begin again?" he flashed out. "When the one thing that brings me anywhere near him is that I knew enough to leave off?"
+
+He stood up and laid his hand on my shoulder with a laugh. "Only the irony of it is that I _am_ still painting--since Grindle's doing it for me! The Strouds stand alone, and happen once--but there's no exterminating our kind of art."
@@ -0,0 +1,174 @@
+"""
+Byte pair encoding utilities
+
+Code from https://github.com/openai/gpt-2/blob/master/src/encoder.py
+
+And modified code (download_vocab) from
+https://github.com/openai/gpt-2/blob/master/download_model.py
+
+Modified MIT License
+
+Software Copyright (c) 2019 OpenAI
+
+We don’t claim ownership of the content you create with GPT-2, so it is yours to do with as you please.
+We only ask that you use GPT-2 responsibly and clearly indicate your content was created using GPT-2.
+
+Permission is hereby granted, free of charge, to any person obtaining a copy of this software and
+associated documentation files (the "Software"), to deal in the Software without restriction,
+including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense,
+and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so,
+subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included
+in all copies or substantial portions of the Software.
+The above copyright notice and this permission notice need not be included
+with content created by the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED,
+INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
+TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE
+OR OTHER DEALINGS IN THE SOFTWARE.
+
+
+"""
+
+import os
+import json
+import regex as re
+import requests
+from tqdm import tqdm
+from functools import lru_cache
+
+@lru_cache()
+def bytes_to_unicode():
+    """
+    Returns a dict mapping each utf-8 byte to a corresponding unicode string.
+    The reversible bpe codes work on unicode strings.
+    This means you need a large # of unicode characters in your vocab if you want to avoid UNKs.
+    When you're at something like a 10B token dataset you end up needing around 5K for decent coverage.
+    This is a significant percentage of your normal, say, 32K bpe vocab.
+    To avoid that, we want lookup tables between utf-8 bytes and unicode strings.
+    The mapping also avoids whitespace/control characters that the bpe code barfs on.
+    """
+    bs = list(range(ord("!"), ord("~")+1))+list(range(ord("¡"), ord("¬")+1))+list(range(ord("®"), ord("ÿ")+1))
+    cs = bs[:]
+    n = 0
+    for b in range(2**8):
+        if b not in bs:
+            bs.append(b)
+            cs.append(2**8+n)
+            n += 1
+    cs = [chr(n) for n in cs]
+    return dict(zip(bs, cs))
+
+def get_pairs(word):
+    """Return set of symbol pairs in a word.
+
+    Word is represented as tuple of symbols (symbols being variable-length strings).
+    """
+    pairs = set()
+    prev_char = word[0]
+    for char in word[1:]:
+        pairs.add((prev_char, char))
+        prev_char = char
+    return pairs
+
+class Encoder:
+    def __init__(self, encoder, bpe_merges, errors='replace'):
+        self.encoder = encoder
+        self.decoder = {v:k for k,v in self.encoder.items()}
+        self.errors = errors # how to handle errors in decoding
+        self.byte_encoder = bytes_to_unicode()
+        self.byte_decoder = {v:k for k, v in self.byte_encoder.items()}
+        self.bpe_ranks = dict(zip(bpe_merges, range(len(bpe_merges))))
+        self.cache = {}
+
+        # Should have added re.IGNORECASE so BPE merges can happen for capitalized versions of contractions
+        self.pat = re.compile(r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""")
+
+    def bpe(self, token):
+        if token in self.cache:
+            return self.cache[token]
+        word = tuple(token)
+        pairs = get_pairs(word)
+
+        if not pairs:
+            return token
+
+        while True:
+            bigram = min(pairs, key = lambda pair: self.bpe_ranks.get(pair, float('inf')))
+            if bigram not in self.bpe_ranks:
+                break
+            first, second = bigram
+            new_word = []
+            i = 0
+            while i < len(word):
+                try:
+                    j = word.index(first, i)
+                    new_word.extend(word[i:j])
+                    i = j
+                except ValueError:  # `first` does not occur in word[i:]
+                    new_word.extend(word[i:])
+                    break
+
+                if word[i] == first and i < len(word)-1 and word[i+1] == second:
+                    new_word.append(first+second)
+                    i += 2
+                else:
+                    new_word.append(word[i])
+                    i += 1
+            new_word = tuple(new_word)
+            word = new_word
+            if len(word) == 1:
+                break
+            else:
+                pairs = get_pairs(word)
+        word = ' '.join(word)
+        self.cache[token] = word
+        return word
+
+    def encode(self, text):
+        bpe_tokens = []
+        for token in re.findall(self.pat, text):
+            token = ''.join(self.byte_encoder[b] for b in token.encode('utf-8'))
+            bpe_tokens.extend(self.encoder[bpe_token] for bpe_token in self.bpe(token).split(' '))
+        return bpe_tokens
+
+    def decode(self, tokens):
+        text = ''.join([self.decoder[token] for token in tokens])
+        text = bytearray([self.byte_decoder[c] for c in text]).decode('utf-8', errors=self.errors)
+        return text
+
+def get_encoder(model_name, models_dir):
+    with open(os.path.join(models_dir, model_name, 'encoder.json'), 'r', encoding="utf-8") as f:
+        encoder = json.load(f)
+    with open(os.path.join(models_dir, model_name, 'vocab.bpe'), 'r', encoding="utf-8") as f:
+        bpe_data = f.read()
+    bpe_merges = [tuple(merge_str.split()) for merge_str in bpe_data.split('\n')[1:-1]]
+    return Encoder(
+        encoder=encoder,
+        bpe_merges=bpe_merges,
+    )
+
+
+def download_vocab():
+    # Modified code from https://github.com/openai/gpt-2/blob/master/download_model.py
+    subdir = 'gpt2_model'
+    if not os.path.exists(subdir):
+        os.makedirs(subdir)
+    subdir = subdir.replace('\\','/') # needed for Windows
+
+    for filename in ['encoder.json', 'vocab.bpe']:
+
+        r = requests.get("https://openaipublic.blob.core.windows.net/gpt-2/models/117M" + "/" + filename, stream=True)
+
+        with open(os.path.join(subdir, filename), 'wb') as f:
+            file_size = int(r.headers["content-length"])
+            chunk_size = 1000
+            with tqdm(ncols=100, desc="Fetching " + filename, total=file_size, unit_scale=True) as pbar:
+                # 1k for chunk_size, since Ethernet packet size is around 1500 bytes
+                for chunk in r.iter_content(chunk_size=chunk_size):
+                    f.write(chunk)
+                    pbar.update(len(chunk))  # the final chunk may be shorter than chunk_size
@@ -0,0 +1,442 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "a9adc3bf-353c-411e-a471-0e92786e7103",
+   "metadata": {},
+   "source": [
+    "# Using BytePair encoding from `tiktoken`"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "id": "4036ffa3-0e5c-433a-a997-4ed7d33de0b2",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# !pip install tiktoken"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "id": "1c490fca-a48a-47fa-a299-322d1a08ad17",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "tiktoken version: 0.5.2\n"
+     ]
+    }
+   ],
+   "source": [
+    "import importlib.metadata\n",
+    "\n",
+    "print(\"tiktoken version:\", importlib.metadata.version(\"tiktoken\"))"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "id": "0952667c-ce84-4f21-87db-59f52b44cec4",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import tiktoken\n",
+    "\n",
+    "tik_tokenizer = tiktoken.get_encoding(\"gpt2\")\n",
+    "\n",
+    "text = \"Hello, world. Is this-- a test?\""
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "id": "b039c350-18ad-48fb-8e6a-085702dfc330",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "[15496, 11, 995, 13, 1148, 428, 438, 257, 1332, 30]\n"
+     ]
+    }
+   ],
+   "source": [
+    "integers = tik_tokenizer.encode(text, allowed_special={\"<|endoftext|>\"})\n",
+    "\n",
+    "print(integers)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "id": "7b152ba4-04d3-41cc-849f-adedcfb8cabb",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Hello, world. Is this-- a test?\n"
+     ]
+    }
+   ],
+   "source": [
+    "strings = tik_tokenizer.decode(integers)\n",
+    "\n",
+    "print(strings)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 8,
+   "id": "cf148a1a-316b-43ec-b7ba-1b6d409ce837",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "50257\n"
+     ]
+    }
+   ],
+   "source": [
+    "print(tik_tokenizer.n_vocab)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "6a0b5d4f-2af9-40de-828c-063c4243e771",
+   "metadata": {},
+   "source": [
+    "# Using the original Byte-pair encoding implementation used in GPT-2"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 9,
+   "id": "0903108c-65cb-4ae1-967a-2155e25349c2",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from bpe_openai_gpt2 import get_encoder, download_vocab"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 10,
+   "id": "35dd8d7c-8c12-4b68-941a-0fd05882dd45",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "Fetching encoder.json: 1.04Mit [00:28, 36.8kit/s]                                                   \n",
+      "Fetching vocab.bpe: 457kit [00:00, 458kit/s]                                                        \n"
+     ]
+    }
+   ],
+   "source": [
+    "download_vocab()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 14,
+   "id": "1888a7a9-9c40-4fe0-99b4-ebd20aa1ffd0",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "orig_tokenizer = get_encoder(model_name=\"gpt2_model\", models_dir=\".\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 15,
+   "id": "2740510c-a78a-4fba-ae18-2b156ba2dfef",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "[15496, 11, 995, 13, 1148, 428, 438, 257, 1332, 30]\n"
+     ]
+    }
+   ],
+   "source": [
+    "integers = orig_tokenizer.encode(text)\n",
+    "\n",
+    "print(integers)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 16,
+   "id": "434d115e-990d-42ad-88dd-31323a96e10f",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Hello, world. Is this-- a test?\n"
+     ]
+    }
+   ],
+   "source": [
+    "strings = orig_tokenizer.decode(integers)\n",
+    "\n",
+    "print(strings)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "4f63e8c6-707c-4d66-bcf8-dd790647cc86",
+   "metadata": {},
+   "source": [
+    "# Using the BytePair Tokenizer in HuggingFace transformers"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 12,
+   "id": "5bfff386-f725-4137-9c50-e5da0c38bea0",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# pip install transformers"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 13,
+   "id": "e9077bf4-f91f-42ad-ab76-f3d89128510e",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "'4.30.2'"
+      ]
+     },
+     "execution_count": 13,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "import transformers\n",
+    "\n",
+    "transformers.__version__"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "16e06ee5-c4ca-4211-8e24-dbfd84b1d85b",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Use a Hugging Face mirror endpoint that is reachable from mainland China"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "id": "3e07ddc9-187e-4482-a7b5-7e4e9381d805",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "env: HF_ENDPOINT=https://hf-mirror.com\n"
+     ]
+    }
+   ],
+   "source": [
+    "%env HF_ENDPOINT=https://hf-mirror.com"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "id": "a9839137-b8ea-4a2c-85fc-9a63064cf8c8",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "application/vnd.jupyter.widget-view+json": {
+       "model_id": "afc151b540664287aa60a4cbe90cdfeb",
+       "version_major": 2,
+       "version_minor": 0
+      },
+      "text/plain": [
+       "vocab.json: 0.00B [00:00, ?B/s]"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "application/vnd.jupyter.widget-view+json": {
+       "model_id": "9a5d584e4adf40bca215b409b693dc02",
+       "version_major": 2,
+       "version_minor": 0
+      },
+      "text/plain": [
+       "merges.txt: 0.00B [00:00, ?B/s]"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "application/vnd.jupyter.widget-view+json": {
+       "model_id": "a126ee77a9f94e58b1dcccd68e6d5bb1",
+       "version_major": 2,
+       "version_minor": 0
+      },
+      "text/plain": [
+       "config.json:   0%|          | 0.00/367 [00:00<?, ?B/s]"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    }
+   ],
+   "source": [
+    "from transformers import GPT2Tokenizer\n",
+    "\n",
+    "hf_tokenizer = GPT2Tokenizer.from_pretrained(\"gpt2\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 11,
+   "id": "222cbd69-6a3d-4868-9c1f-421ffc9d5fe1",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "[15496, 11, 995, 13, 1148, 428, 438, 257, 1332, 30]"
+      ]
+     },
+     "execution_count": 11,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "hf_tokenizer(strings)[\"input_ids\"]"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "907a1ade-3401-4f2e-9017-7f58a60cbd98",
+   "metadata": {},
+   "source": [
+    "# A quick performance benchmark"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 17,
+   "id": "a61bb445-b151-4a2f-8180-d4004c503754",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "with open('../01_main-chapter-code/the-verdict.txt', 'r', encoding='utf-8') as f:\n",
+    "    raw_text = f.read()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 18,
+   "id": "57f7c0a3-c1fd-4313-af34-68e78eb33653",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "9.14 ms ± 74.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)\n"
+     ]
+    }
+   ],
+   "source": [
+    "%timeit orig_tokenizer.encode(raw_text)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "036dd628-3591-46c9-a5ce-b20b105a8062",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%timeit tik_tokenizer.encode(raw_text)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "b9c85b58-bfbc-465e-9a7e-477e53d55c90",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%timeit hf_tokenizer(raw_text)[\"input_ids\"]"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "7117107f-22a6-46b4-a442-712d50b3ac7a",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%timeit hf_tokenizer(raw_text, max_length=5145, truncation=True)[\"input_ids\"]"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "d81eaf6d-554b-44e3-aa19-2c3ae0030762",
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.11.5"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
@@ -0,0 +1,7 @@
+# Chapter 2: Working with Text Data
+
+
+
+- [compare-bpe-tiktoken.ipynb](compare-bpe-tiktoken.ipynb) benchmarks several byte pair encoding implementations
+- [bpe_openai_gpt2.py](bpe_openai_gpt2.py) is the original byte pair encoder code used by OpenAI
+
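To make the merge loop in `Encoder.bpe` from `bpe_openai_gpt2.py` easier to follow, here is a minimal, self-contained sketch of the same greedy idea: repeatedly merge the adjacent pair with the lowest rank until no known pair is left. The merge table `toy_ranks` and the helper name `toy_bpe` are made up for illustration and are not part of the GPT-2 vocabulary files.

```python
def get_pairs(word):
    # Set of adjacent symbol pairs in a tuple of symbols.
    return set(zip(word, word[1:]))


def toy_bpe(token, ranks):
    # Start from single characters and greedily apply the best-ranked merge.
    word = tuple(token)
    while len(word) > 1:
        pairs = get_pairs(word)
        best = min(pairs, key=lambda pair: ranks.get(pair, float("inf")))
        if best not in ranks:
            break  # no remaining pair is in the merge table
        first, second = best
        merged = []
        i = 0
        while i < len(word):
            if i < len(word) - 1 and word[i] == first and word[i + 1] == second:
                merged.append(first + second)
                i += 2
            else:
                merged.append(word[i])
                i += 1
        word = tuple(merged)
    return word


# Hypothetical merge table: a lower rank means the merge is applied earlier.
toy_ranks = {("l", "o"): 0, ("lo", "w"): 1, ("e", "r"): 2}

print(toy_bpe("lower", toy_ranks))  # ('low', 'er')
print(toy_bpe("low", toy_ranks))    # ('low',)
```

The real encoder additionally caches results per token and looks up each merged symbol in `encoder.json` to obtain integer IDs; this sketch stops at the symbol level.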
@@ -0,0 +1,174 @@
+"""
+Byte pair encoding utilities
+
+Code from https://github.com/openai/gpt-2/blob/master/src/encoder.py
+
+And modified code (download_vocab) from
+https://github.com/openai/gpt-2/blob/master/download_model.py
+
+Modified MIT License
+
+Software Copyright (c) 2019 OpenAI
+
+We don’t claim ownership of the content you create with GPT-2, so it is yours to do with as you please.
+We only ask that you use GPT-2 responsibly and clearly indicate your content was created using GPT-2.
+
+Permission is hereby granted, free of charge, to any person obtaining a copy of this software and
+associated documentation files (the "Software"), to deal in the Software without restriction,
+including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense,
+and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so,
+subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included
+in all copies or substantial portions of the Software.
+The above copyright notice and this permission notice need not be included
+with content created by the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED,
+INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
+TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE
+OR OTHER DEALINGS IN THE SOFTWARE.
+
+
+"""
+
+import os
+import json
+import regex as re
+import requests
+from tqdm import tqdm
+from functools import lru_cache
+
+@lru_cache()
+def bytes_to_unicode():
+    """
+    Returns a dict that maps each utf-8 byte to a corresponding unicode string.
+    The reversible bpe codes work on unicode strings.
+    This means you need a large # of unicode characters in your vocab if you want to avoid UNKs.
+    When you're at something like a 10B token dataset you end up needing around 5K for decent coverage.
+    This is a significant percentage of your normal, say, 32K bpe vocab.
+    To avoid that, we want lookup tables between utf-8 bytes and unicode strings.
+    This also avoids mapping to whitespace/control characters the bpe code barfs on.
+    """
+    bs = list(range(ord("!"), ord("~")+1))+list(range(ord("¡"), ord("¬")+1))+list(range(ord("®"), ord("ÿ")+1))
+    cs = bs[:]
+    n = 0
+    for b in range(2**8):
+        if b not in bs:
+            bs.append(b)
+            cs.append(2**8+n)
+            n += 1
+    cs = [chr(n) for n in cs]
+    return dict(zip(bs, cs))
+
+def get_pairs(word):
+    """Return set of symbol pairs in a word.
+
+    Word is represented as tuple of symbols (symbols being variable-length strings).
+    """
+    pairs = set()
+    prev_char = word[0]
+    for char in word[1:]:
+        pairs.add((prev_char, char))
+        prev_char = char
+    return pairs
+
+class Encoder:
+    def __init__(self, encoder, bpe_merges, errors='replace'):
+        self.encoder = encoder
+        self.decoder = {v:k for k,v in self.encoder.items()}
+        self.errors = errors # how to handle errors in decoding
+        self.byte_encoder = bytes_to_unicode()
+        self.byte_decoder = {v:k for k, v in self.byte_encoder.items()}
+        self.bpe_ranks = dict(zip(bpe_merges, range(len(bpe_merges))))
+        self.cache = {}
+
+        # Should have added re.IGNORECASE so BPE merges can happen for capitalized versions of contractions
+        self.pat = re.compile(r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""")
+
+    def bpe(self, token):
+        if token in self.cache:
+            return self.cache[token]
+        word = tuple(token)
+        pairs = get_pairs(word)
+
+        if not pairs:
+            return token
+
+        while True:
+            bigram = min(pairs, key = lambda pair: self.bpe_ranks.get(pair, float('inf')))
+            if bigram not in self.bpe_ranks:
+                break
+            first, second = bigram
+            new_word = []
+            i = 0
+            while i < len(word):
+                try:
+                    j = word.index(first, i)
+                    new_word.extend(word[i:j])
+                    i = j
+                except ValueError:  # raised by tuple.index when `first` no longer occurs
+                    new_word.extend(word[i:])
+                    break
+
+                if word[i] == first and i < len(word)-1 and word[i+1] == second:
+                    new_word.append(first+second)
+                    i += 2
+                else:
+                    new_word.append(word[i])
+                    i += 1
+            new_word = tuple(new_word)
+            word = new_word
+            if len(word) == 1:
+                break
+            else:
+                pairs = get_pairs(word)
+        word = ' '.join(word)
+        self.cache[token] = word
+        return word
+
+    def encode(self, text):
+        bpe_tokens = []
+        for token in re.findall(self.pat, text):
+            token = ''.join(self.byte_encoder[b] for b in token.encode('utf-8'))
+            bpe_tokens.extend(self.encoder[bpe_token] for bpe_token in self.bpe(token).split(' '))
+        return bpe_tokens
+
+    def decode(self, tokens):
+        text = ''.join([self.decoder[token] for token in tokens])
+        text = bytearray([self.byte_decoder[c] for c in text]).decode('utf-8', errors=self.errors)
+        return text
+
+def get_encoder(model_name, models_dir):
+    with open(os.path.join(models_dir, model_name, 'encoder.json'), 'r', encoding="utf-8") as f:
+        encoder = json.load(f)
+    with open(os.path.join(models_dir, model_name, 'vocab.bpe'), 'r', encoding="utf-8") as f:
+        bpe_data = f.read()
+    bpe_merges = [tuple(merge_str.split()) for merge_str in bpe_data.split('\n')[1:-1]]
+    return Encoder(
+        encoder=encoder,
+        bpe_merges=bpe_merges,
+    )
+
+
+def download_vocab():
+    # Modified from download_model.py in the openai/gpt-2 repository (see the module docstring)
+    subdir = 'gpt2_model'
+    if not os.path.exists(subdir):
+        os.makedirs(subdir)
+    subdir = subdir.replace('\\','/') # needed for Windows
+
+    for filename in ['encoder.json', 'vocab.bpe']:
+
+        r = requests.get("https://openaipublic.blob.core.windows.net/gpt-2/models/117M" + "/" + filename, stream=True)
+
+        with open(os.path.join(subdir, filename), 'wb') as f:
+            file_size = int(r.headers["content-length"])
+            chunk_size = 1000
+            with tqdm(ncols=100, desc="Fetching " + filename, total=file_size, unit_scale=True) as pbar:
+                # 1k for chunk_size, since Ethernet packet size is around 1500 bytes
+                for chunk in r.iter_content(chunk_size=chunk_size):
+                    f.write(chunk)
+                    pbar.update(len(chunk))  # the last chunk may be shorter than chunk_size
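The docstring of `bytes_to_unicode` above says the mapping is reversible. The following sketch checks that claim on a small string. The table construction mirrors the function above; the sample text and the variable names `mapped` and `restored` are only illustrative.

```python
def bytes_to_unicode():
    # Printable bytes map to themselves; all remaining bytes are shifted
    # to code points from 256 upward so none becomes whitespace/control.
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(2**8):
        if b not in bs:
            bs.append(b)
            cs.append(2**8 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))


byte_encoder = bytes_to_unicode()
byte_decoder = {v: k for k, v in byte_encoder.items()}

text = "Hello, wörld"
# Forward: UTF-8 bytes -> printable unicode characters.
mapped = "".join(byte_encoder[b] for b in text.encode("utf-8"))
# Backward: characters -> bytes -> original string.
restored = bytearray(byte_decoder[c] for c in mapped).decode("utf-8")

print(len(byte_encoder))   # 256
print(restored == text)    # True
```

Because every one of the 256 byte values has its own character, any UTF-8 input can be tokenized without an unknown-token fallback. The space byte, for example, maps to `Ġ`, which is why that character shows up at the start of many GPT-2 vocabulary entries.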
@@ -0,0 +1,473 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "a9adc3bf-353c-411e-a471-0e92786e7103",
+   "metadata": {},
+   "source": [
+    "# Using BytePair encoding from `tiktoken`"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "id": "4036ffa3-0e5c-433a-a997-4ed7d33de0b2",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# !pip install tiktoken"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "id": "1c490fca-a48a-47fa-a299-322d1a08ad17",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "tiktoken version: 0.5.2\n"
+     ]
+    }
+   ],
+   "source": [
+    "import importlib.metadata\n",
+    "\n",
+    "print(\"tiktoken version:\", importlib.metadata.version(\"tiktoken\"))"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "id": "0952667c-ce84-4f21-87db-59f52b44cec4",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import tiktoken\n",
+    "\n",
+    "tik_tokenizer = tiktoken.get_encoding(\"gpt2\")\n",
+    "\n",
+    "text = \"Hello, world. Is this-- a test?\""
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "id": "b039c350-18ad-48fb-8e6a-085702dfc330",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "[15496, 11, 995, 13, 1148, 428, 438, 257, 1332, 30]\n"
+     ]
+    }
+   ],
+   "source": [
+    "integers = tik_tokenizer.encode(text, allowed_special={\"<|endoftext|>\"})\n",
+    "\n",
+    "print(integers)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "id": "7b152ba4-04d3-41cc-849f-adedcfb8cabb",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Hello, world. Is this-- a test?\n"
+     ]
+    }
+   ],
+   "source": [
+    "strings = tik_tokenizer.decode(integers)\n",
+    "\n",
+    "print(strings)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 8,
+   "id": "cf148a1a-316b-43ec-b7ba-1b6d409ce837",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "50257\n"
+     ]
+    }
+   ],
+   "source": [
+    "print(tik_tokenizer.n_vocab)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "6a0b5d4f-2af9-40de-828c-063c4243e771",
+   "metadata": {},
+   "source": [
+    "# Using the original Byte-pair encoding implementation used in GPT-2"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 9,
+   "id": "0903108c-65cb-4ae1-967a-2155e25349c2",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from bpe_openai_gpt2 import get_encoder, download_vocab"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 10,
+   "id": "35dd8d7c-8c12-4b68-941a-0fd05882dd45",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "Fetching encoder.json: 1.04Mit [00:28, 36.8kit/s]                                                   \n",
+      "Fetching vocab.bpe: 457kit [00:00, 458kit/s]                                                        \n"
+     ]
+    }
+   ],
+   "source": [
+    "download_vocab()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 14,
+   "id": "1888a7a9-9c40-4fe0-99b4-ebd20aa1ffd0",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "orig_tokenizer = get_encoder(model_name=\"gpt2_model\", models_dir=\".\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 15,
+   "id": "2740510c-a78a-4fba-ae18-2b156ba2dfef",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "[15496, 11, 995, 13, 1148, 428, 438, 257, 1332, 30]\n"
+     ]
+    }
+   ],
+   "source": [
+    "integers = orig_tokenizer.encode(text)\n",
+    "\n",
+    "print(integers)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 16,
+   "id": "434d115e-990d-42ad-88dd-31323a96e10f",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Hello, world. Is this-- a test?\n"
+     ]
+    }
+   ],
+   "source": [
+    "strings = orig_tokenizer.decode(integers)\n",
+    "\n",
+    "print(strings)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "4f63e8c6-707c-4d66-bcf8-dd790647cc86",
+   "metadata": {},
+   "source": [
+    "# Using the BytePair Tokenizer in HuggingFace transformers"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 12,
+   "id": "5bfff386-f725-4137-9c50-e5da0c38bea0",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# pip install transformers"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 13,
+   "id": "e9077bf4-f91f-42ad-ab76-f3d89128510e",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "'4.30.2'"
+      ]
+     },
+     "execution_count": 13,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "import transformers\n",
+    "\n",
+    "transformers.__version__"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "16e06ee5-c4ca-4211-8e24-dbfd84b1d85b",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Use a Hugging Face mirror endpoint that is reachable from mainland China"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "id": "3e07ddc9-187e-4482-a7b5-7e4e9381d805",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "env: HF_ENDPOINT=https://hf-mirror.com\n"
+     ]
+    }
+   ],
+   "source": [
+    "%env HF_ENDPOINT=https://hf-mirror.com"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "id": "a9839137-b8ea-4a2c-85fc-9a63064cf8c8",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "application/vnd.jupyter.widget-view+json": {
+       "model_id": "afc151b540664287aa60a4cbe90cdfeb",
+       "version_major": 2,
+       "version_minor": 0
+      },
+      "text/plain": [
+       "vocab.json: 0.00B [00:00, ?B/s]"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "application/vnd.jupyter.widget-view+json": {
+       "model_id": "9a5d584e4adf40bca215b409b693dc02",
+       "version_major": 2,
+       "version_minor": 0
+      },
+      "text/plain": [
+       "merges.txt: 0.00B [00:00, ?B/s]"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "application/vnd.jupyter.widget-view+json": {
+       "model_id": "a126ee77a9f94e58b1dcccd68e6d5bb1",
+       "version_major": 2,
+       "version_minor": 0
+      },
+      "text/plain": [
+       "config.json:   0%|          | 0.00/367 [00:00<?, ?B/s]"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    }
+   ],
+   "source": [
+    "from transformers import GPT2Tokenizer\n",
+    "\n",
+    "hf_tokenizer = GPT2Tokenizer.from_pretrained(\"gpt2\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 11,
+   "id": "222cbd69-6a3d-4868-9c1f-421ffc9d5fe1",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "[15496, 11, 995, 13, 1148, 428, 438, 257, 1332, 30]"
+      ]
+     },
+     "execution_count": 11,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "hf_tokenizer(strings)[\"input_ids\"]"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "907a1ade-3401-4f2e-9017-7f58a60cbd98",
+   "metadata": {},
+   "source": [
+    "# A quick performance benchmark"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 17,
+   "id": "a61bb445-b151-4a2f-8180-d4004c503754",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "with open('../01_main-chapter-code/the-verdict.txt', 'r', encoding='utf-8') as f:\n",
+    "    raw_text = f.read()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 18,
+   "id": "57f7c0a3-c1fd-4313-af34-68e78eb33653",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "9.14 ms ± 74.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)\n"
+     ]
+    }
+   ],
+   "source": [
+    "%timeit orig_tokenizer.encode(raw_text)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 19,
+   "id": "036dd628-3591-46c9-a5ce-b20b105a8062",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "3.28 ms ± 2.66 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)\n"
+     ]
+    }
+   ],
+   "source": [
+    "%timeit tik_tokenizer.encode(raw_text)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 20,
+   "id": "b9c85b58-bfbc-465e-9a7e-477e53d55c90",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "Token indices sequence length is longer than the specified maximum sequence length for this model (5145 > 1024). Running this sequence through the model will result in indexing errors\n"
+     ]
+    },
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "19.1 ms ± 2.43 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)\n"
+     ]
+    }
+   ],
+   "source": [
+    "%timeit hf_tokenizer(raw_text)[\"input_ids\"]"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 21,
+   "id": "7117107f-22a6-46b4-a442-712d50b3ac7a",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "18.8 ms ± 2.41 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)\n"
+     ]
+    }
+   ],
+   "source": [
+    "%timeit hf_tokenizer(raw_text, max_length=5145, truncation=True)[\"input_ids\"]"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "d81eaf6d-554b-44e3-aa19-2c3ae0030762",
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.11.5"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
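The `%timeit` cells in the notebook above only work inside IPython or Jupyter. As a rough sketch of the same measurement in a plain Python script, the standard-library `timeit` module can time any `encode`-style callable. The `whitespace_encode` function is a hypothetical stand-in so the example runs without downloads; it is not one of the tokenizers compared in the notebook, and the `benchmark` helper is likewise an assumption of this sketch.

```python
import timeit


def whitespace_encode(text):
    # Hypothetical stand-in tokenizer: assigns each whitespace-separated
    # word an integer ID in order of first appearance.
    vocab = {}
    return [vocab.setdefault(word, len(vocab)) for word in text.split()]


def benchmark(encode_fn, text, repeat=3, number=10):
    # Best average wall-clock time per call, in seconds.
    timings = timeit.repeat(lambda: encode_fn(text), repeat=repeat, number=number)
    return min(timings) / number


sample_text = "Hello, world. Is this-- a test? " * 1000
seconds = benchmark(whitespace_encode, sample_text)
print(f"{seconds * 1e3:.3f} ms per call")
```

To compare real tokenizers, one could pass `tik_tokenizer.encode` or `orig_tokenizer.encode` as `encode_fn` and the contents of `the-verdict.txt` as `text`. Taking the minimum over repeats is a common choice because it is less sensitive to background load than the mean that `%timeit` reports.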
@@ -0,0 +1,486 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "063850ab-22b0-4838-b53a-9bb11757d9d0",
+   "metadata": {},
+   "source": [
+    "# Embedding Layers and Linear Layers"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "0315c598-701f-46ff-8806-15813cad0e51",
+   "metadata": {},
+   "source": [
+    "- Embedding layers in PyTorch accomplish the same as linear layers that perform matrix multiplications; the reason we use embedding layers is computational efficiency\n",
+    "- We will take a look at this relationship step by step using code examples in PyTorch"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "id": "061720f4-f025-4640-82a0-15098fa94cf9",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "PyTorch version: 2.1.0.post301\n"
+     ]
+    }
+   ],
+   "source": [
+    "import torch\n",
+    "\n",
+    "print(\"PyTorch version:\", torch.__version__)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "a7895a66-7f69-4f62-9361-5c9da2eb76ef",
+   "metadata": {},
+   "source": [
+    "## Using nn.Embedding"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "id": "cc489ea5-73db-40b9-959e-0d70cae25f40",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Suppose we have the following 3 training examples,\n",
+    "# which may represent token IDs in an LLM context\n",
+    "idx = torch.tensor([2, 3, 1])\n",
+    "\n",
+    "# The number of rows in the embedding matrix can be determined\n",
+    "# by obtaining the largest token ID + 1.\n",
+    "# If the highest token ID is 3, then we want 4 rows, for the possible\n",
+    "# token IDs 0, 1, 2, 3\n",
+    "num_idx = max(idx)+1\n",
+    "\n",
+    "# The desired embedding dimension is a hyperparameter\n",
+    "out_dim = 5"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "93d83a6e-8543-40af-b253-fe647640bf36",
+   "metadata": {},
+   "source": [
+    "- Let's implement a simple embedding layer:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "id": "60a7c104-36e1-4b28-bd02-a24a1099dc66",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# We use the random seed for reproducibility since\n",
+    "# weights in the embedding layer are initialized with\n",
+    "# small random values\n",
+    "torch.manual_seed(123)\n",
+    "\n",
+    "embedding = torch.nn.Embedding(num_idx, out_dim)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "dd96c00a-3297-4a50-8bfc-247aaea7e761",
+   "metadata": {},
+   "source": [
+    "We can optionally take a look at the embedding weights:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "id": "595f603e-8d2a-4171-8f94-eac8106b2e57",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "Parameter containing:\n",
+       "tensor([[ 0.3374, -0.1778, -0.3035, -0.5880,  1.5810],\n",
+       "        [ 1.3010,  1.2753, -0.2010, -0.1606, -0.4015],\n",
+       "        [ 0.6957, -1.8061, -1.1589,  0.3255, -0.6315],\n",
+       "        [-2.8400, -0.7849, -1.4096, -0.4076,  0.7953]], requires_grad=True)"
+      ]
+     },
+     "execution_count": 4,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "embedding.weight"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "c86eb562-61e2-4171-ab6e-b410a1fd5c18",
+   "metadata": {},
+   "source": [
+    "- We can then use the embedding layer to obtain the vector representation of a training example with ID 1:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "id": "8bbc0255-4805-4be9-9f4c-1d0d967ef9d5",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "tensor([[ 1.3010,  1.2753, -0.2010, -0.1606, -0.4015]],\n",
+       "       grad_fn=<EmbeddingBackward0>)"
+      ]
+     },
+     "execution_count": 5,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "embedding(torch.tensor([1]))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "6a4d47f2-4691-47b8-9855-2528b6c285c9",
+   "metadata": {},
+   "source": [
+    "- Below is a visualization of what happens under the hood:"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "12ffd155-7190-44b1-b6b6-45b11d6fe83b",
+   "metadata": {},
+   "source": [
+    "<img src=\"images/1.png\" width=\"400px\">"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "87d1311b-cfb2-4afc-9e25-e4ecf35370d9",
+   "metadata": {},
+   "source": [
+    "- Similarly, we can use the embedding layer to obtain the vector representation of a training example with ID 2:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "id": "c309266a-c601-4633-9404-2e10b1cdde8c",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "tensor([[ 0.6957, -1.8061, -1.1589,  0.3255, -0.6315]],\n",
+       "       grad_fn=<EmbeddingBackward0>)"
+      ]
+     },
+     "execution_count": 6,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "embedding(torch.tensor([2]))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "7ad3b601-f68c-41b1-a28d-b624b94ef383",
+   "metadata": {},
+   "source": [
+    "<img src=\"images/2.png\" width=\"400px\">"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "27dd54bd-85ae-4887-9c5e-3139da361cf4",
+   "metadata": {},
+   "source": [
+    "- Now, let's convert all the training examples we have defined previously:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "id": "0191aa4b-f6a8-4b0d-9c36-65e82b81d071",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "tensor([[ 0.6957, -1.8061, -1.1589,  0.3255, -0.6315],\n",
+       "        [-2.8400, -0.7849, -1.4096, -0.4076,  0.7953],\n",
+       "        [ 1.3010,  1.2753, -0.2010, -0.1606, -0.4015]],\n",
+       "       grad_fn=<EmbeddingBackward0>)"
+      ]
+     },
+     "execution_count": 7,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "idx = torch.tensor([2, 3, 1])\n",
+    "embedding(idx)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "146cf8eb-c517-4cd4-aa91-0e818fed7651",
+   "metadata": {},
+   "source": [
+    "- Under the hood, it's still the same look-up concept:"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "b392eb43-0bda-4821-b446-b8dcbee8ae00",
+   "metadata": {},
+   "source": [
+    "<img src=\"images/3.png\" width=\"450px\">"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "f0fe863b-d6a3-48f3-ace5-09ecd0eb7b59",
+   "metadata": {},
+   "source": [
+    "## Using nn.Linear"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "138de6a4-2689-4c1f-96af-7899b2d82a4e",
+   "metadata": {},
+   "source": [
+    "- Now, we will demonstrate that the embedding layer above accomplishes exactly the same as an `nn.Linear` layer applied to a one-hot encoded representation in PyTorch\n",
+    "- First, let's convert the token IDs into a one-hot representation:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 8,
+   "id": "b5bb56cf-bc73-41ab-b107-91a43f77bdba",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "tensor([[0, 0, 1, 0],\n",
+       "        [0, 0, 0, 1],\n",
+       "        [0, 1, 0, 0]])"
+      ]
+     },
+     "execution_count": 8,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "onehot = torch.nn.functional.one_hot(idx)\n",
+    "onehot"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "aa45dfdf-fb26-4514-a176-75224f5f179b",
+   "metadata": {},
+   "source": [
+    "- Next, we initialize a `Linear` layer, which carries out a matrix multiplication $X W^\\top$:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 9,
+   "id": "ae04c1ed-242e-4dd7-b8f7-4b7e4caae383",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "torch.manual_seed(123)\n",
+    "linear = torch.nn.Linear(num_idx, out_dim, bias=False)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "63efb98e-5cc4-4e8d-9fe6-ef0ad29ae2d7",
+   "metadata": {},
+   "source": [
+    "- Note that the linear layer in PyTorch is also initialized with small random weights; to directly compare it to the `Embedding` layer above, we have to use the same weights, which is why we reassign them here (transposed, because `nn.Linear` stores its weight matrix with shape `(out_dim, num_idx)`):"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 10,
+   "id": "a3b90d69-761c-486e-bd19-b38a2988fe62",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "linear.weight = torch.nn.Parameter(embedding.weight.T.detach())"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "9116482d-f1f9-45e2-9bf3-7ef5e9003898",
+   "metadata": {},
+   "source": [
+    "- Now we can use the linear layer on the one-hot encoded representation of the inputs:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 11,
+   "id": "90d2b0dd-9f1d-4c0f-bb16-1f6ce6b8ac2c",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "tensor([[ 0.6957, -1.8061, -1.1589,  0.3255, -0.6315],\n",
+       "        [-2.8400, -0.7849, -1.4096, -0.4076,  0.7953],\n",
+       "        [ 1.3010,  1.2753, -0.2010, -0.1606, -0.4015]], grad_fn=<MmBackward0>)"
+      ]
+     },
+     "execution_count": 11,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "linear(onehot.float())"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "f6204bc8-92e2-4546-9cda-574fe1360fa2",
+   "metadata": {},
+   "source": [
+    "As we can see, this is exactly the same as what we got when we used the embedding layer:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 12,
+   "id": "2b057649-3176-4a54-b58c-fd8fbf818c61",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "tensor([[ 0.6957, -1.8061, -1.1589,  0.3255, -0.6315],\n",
+       "        [-2.8400, -0.7849, -1.4096, -0.4076,  0.7953],\n",
+       "        [ 1.3010,  1.2753, -0.2010, -0.1606, -0.4015]],\n",
+       "       grad_fn=<EmbeddingBackward0>)"
+      ]
+     },
+     "execution_count": 12,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "embedding(idx)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "0e447639-8952-460e-8c8f-cf9e23c368c9",
+   "metadata": {},
+   "source": [
+    "- What happens under the hood is the following computation for the first training example's token ID:"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "1830eccf-a707-4753-a24a-9b103f55594a",
+   "metadata": {},
+   "source": [
+    "<img src=\"images/4.png\" width=\"450px\">"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "9ce5211a-14e6-46aa-a3a8-14609f086e97",
+   "metadata": {},
+   "source": [
+    "- And for the second training example's token ID:"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "173f6026-a461-44da-b9c5-f571f8ec8bf3",
+   "metadata": {},
+   "source": [
+    "<img src=\"images/5.png\" width=\"450px\">"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "e2608049-f5d1-49a9-a14b-82695fc32e6a",
+   "metadata": {},
+   "source": [
+    "- Since all but one entry in each one-hot encoded row are 0 (by design), this matrix multiplication is essentially the same as looking up the row of the weight matrix selected by the single nonzero entry\n",
+    "- This use of the matrix multiplication on one-hot encodings is equivalent to the embedding layer look-up but can be inefficient if we work with large embedding matrices, because most of the work consists of wasteful multiplications by zero"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "5eacc005-86fc-490c-8f6a-dc37d8a0df7c",
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "a1f63c81-1ee3-40a1-9ef2-14ff18fb4f05",
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "c71959bb-facf-44fd-8edb-b67f7752f034",
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.11.5"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
@@ -0,0 +1,3 @@
+# Chapter 2: Working with Text Data
+
+- [embeddings-and-linear-layers.ipynb](embeddings-and-linear-layers.ipynb) contains optional (bonus) code to explain that embedding layers and fully connected layers applied to one-hot encoded vectors are equivalent.
@@ -0,0 +1,486 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "063850ab-22b0-4838-b53a-9bb11757d9d0",
+   "metadata": {},
+   "source": [
+    "# Embedding Layers and Linear Layers"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "0315c598-701f-46ff-8806-15813cad0e51",
+   "metadata": {},
+   "source": [
+    "- Embedding layers in PyTorch accomplish the same as linear layers that perform matrix multiplications on one-hot encoded inputs; the reason we use embedding layers is computational efficiency\n",
+    "- We will take a look at this relationship step by step using code examples in PyTorch"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "id": "061720f4-f025-4640-82a0-15098fa94cf9",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "PyTorch version: 2.1.0.post301\n"
+     ]
+    }
+   ],
+   "source": [
+    "import torch\n",
+    "\n",
+    "print(\"PyTorch version:\", torch.__version__)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "a7895a66-7f69-4f62-9361-5c9da2eb76ef",
+   "metadata": {},
+   "source": [
+    "## Using nn.Embedding"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "id": "cc489ea5-73db-40b9-959e-0d70cae25f40",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Suppose we have the following 3 training examples,\n",
+    "# which may represent token IDs in an LLM context\n",
+    "idx = torch.tensor([2, 3, 1])\n",
+    "\n",
+    "# The number of rows in the embedding matrix can be determined\n",
+    "# by obtaining the largest token ID + 1.\n",
+    "# If the highest token ID is 3, then we want 4 rows, for the possible\n",
+    "# token IDs 0, 1, 2, 3\n",
+    "num_idx = int(idx.max()) + 1\n",
+    "\n",
+    "# The desired embedding dimension is a hyperparameter\n",
+    "out_dim = 5"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "93d83a6e-8543-40af-b253-fe647640bf36",
+   "metadata": {},
+   "source": [
+    "- Let's implement a simple embedding layer:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "id": "60a7c104-36e1-4b28-bd02-a24a1099dc66",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# We use the random seed for reproducibility since\n",
+    "# weights in the embedding layer are initialized with\n",
+    "# small random values\n",
+    "torch.manual_seed(123)\n",
+    "\n",
+    "embedding = torch.nn.Embedding(num_idx, out_dim)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "dd96c00a-3297-4a50-8bfc-247aaea7e761",
+   "metadata": {},
+   "source": [
+    "We can optionally take a look at the embedding weights:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "id": "595f603e-8d2a-4171-8f94-eac8106b2e57",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "Parameter containing:\n",
+       "tensor([[ 0.3374, -0.1778, -0.3035, -0.5880,  1.5810],\n",
+       "        [ 1.3010,  1.2753, -0.2010, -0.1606, -0.4015],\n",
+       "        [ 0.6957, -1.8061, -1.1589,  0.3255, -0.6315],\n",
+       "        [-2.8400, -0.7849, -1.4096, -0.4076,  0.7953]], requires_grad=True)"
+      ]
+     },
+     "execution_count": 4,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "embedding.weight"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "c86eb562-61e2-4171-ab6e-b410a1fd5c18",
+   "metadata": {},
+   "source": [
+    "- We can then use the embedding layer to obtain the vector representation of a training example with ID 1:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "id": "8bbc0255-4805-4be9-9f4c-1d0d967ef9d5",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "tensor([[ 1.3010,  1.2753, -0.2010, -0.1606, -0.4015]],\n",
+       "       grad_fn=<EmbeddingBackward0>)"
+      ]
+     },
+     "execution_count": 5,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "embedding(torch.tensor([1]))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "6a4d47f2-4691-47b8-9855-2528b6c285c9",
+   "metadata": {},
+   "source": [
+    "- Below is a visualization of what happens under the hood:"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "12ffd155-7190-44b1-b6b6-45b11d6fe83b",
+   "metadata": {},
+   "source": [
+    "<img src=\"images/1.png\" width=\"400px\">"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "87d1311b-cfb2-4afc-9e25-e4ecf35370d9",
+   "metadata": {},
+   "source": [
+    "- Similarly, we can use the embedding layer to obtain the vector representation of a training example with ID 2:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "id": "c309266a-c601-4633-9404-2e10b1cdde8c",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "tensor([[ 0.6957, -1.8061, -1.1589,  0.3255, -0.6315]],\n",
+       "       grad_fn=<EmbeddingBackward0>)"
+      ]
+     },
+     "execution_count": 6,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "embedding(torch.tensor([2]))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "7ad3b601-f68c-41b1-a28d-b624b94ef383",
+   "metadata": {},
+   "source": [
+    "<img src=\"images/2.png\" width=\"400px\">"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "27dd54bd-85ae-4887-9c5e-3139da361cf4",
+   "metadata": {},
+   "source": [
+    "- Now, let's convert all the training examples we have defined previously:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "id": "0191aa4b-f6a8-4b0d-9c36-65e82b81d071",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "tensor([[ 0.6957, -1.8061, -1.1589,  0.3255, -0.6315],\n",
+       "        [-2.8400, -0.7849, -1.4096, -0.4076,  0.7953],\n",
+       "        [ 1.3010,  1.2753, -0.2010, -0.1606, -0.4015]],\n",
+       "       grad_fn=<EmbeddingBackward0>)"
+      ]
+     },
+     "execution_count": 7,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "idx = torch.tensor([2, 3, 1])\n",
+    "embedding(idx)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "146cf8eb-c517-4cd4-aa91-0e818fed7651",
+   "metadata": {},
+   "source": [
+    "- Under the hood, it's still the same look-up concept:"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "b392eb43-0bda-4821-b446-b8dcbee8ae00",
+   "metadata": {},
+   "source": [
+    "<img src=\"images/3.png\" width=\"450px\">"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "f0fe863b-d6a3-48f3-ace5-09ecd0eb7b59",
+   "metadata": {},
+   "source": [
+    "## Using nn.Linear"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "138de6a4-2689-4c1f-96af-7899b2d82a4e",
+   "metadata": {},
+   "source": [
+    "- Now, we will demonstrate that the embedding layer above accomplishes exactly the same as an `nn.Linear` layer applied to a one-hot encoded representation in PyTorch\n",
+    "- First, let's convert the token IDs into a one-hot representation:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 8,
+   "id": "b5bb56cf-bc73-41ab-b107-91a43f77bdba",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "tensor([[0, 0, 1, 0],\n",
+       "        [0, 0, 0, 1],\n",
+       "        [0, 1, 0, 0]])"
+      ]
+     },
+     "execution_count": 8,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "onehot = torch.nn.functional.one_hot(idx)\n",
+    "onehot"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "aa45dfdf-fb26-4514-a176-75224f5f179b",
+   "metadata": {},
+   "source": [
+    "- Next, we initialize a `Linear` layer, which carries out a matrix multiplication $X W^\\top$:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 9,
+   "id": "ae04c1ed-242e-4dd7-b8f7-4b7e4caae383",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "torch.manual_seed(123)\n",
+    "linear = torch.nn.Linear(num_idx, out_dim, bias=False)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "63efb98e-5cc4-4e8d-9fe6-ef0ad29ae2d7",
+   "metadata": {},
+   "source": [
+    "- Note that the linear layer in PyTorch is also initialized with small random weights; to directly compare it to the `Embedding` layer above, we have to use the same weights, which is why we reassign them here (transposed, because `nn.Linear` stores its weight matrix with shape `(out_dim, num_idx)`):"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 10,
+   "id": "a3b90d69-761c-486e-bd19-b38a2988fe62",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "linear.weight = torch.nn.Parameter(embedding.weight.T.detach())"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "9116482d-f1f9-45e2-9bf3-7ef5e9003898",
+   "metadata": {},
+   "source": [
+    "- Now we can use the linear layer on the one-hot encoded representation of the inputs:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 11,
+   "id": "90d2b0dd-9f1d-4c0f-bb16-1f6ce6b8ac2c",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "tensor([[ 0.6957, -1.8061, -1.1589,  0.3255, -0.6315],\n",
+       "        [-2.8400, -0.7849, -1.4096, -0.4076,  0.7953],\n",
+       "        [ 1.3010,  1.2753, -0.2010, -0.1606, -0.4015]], grad_fn=<MmBackward0>)"
+      ]
+     },
+     "execution_count": 11,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "linear(onehot.float())"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "f6204bc8-92e2-4546-9cda-574fe1360fa2",
+   "metadata": {},
+   "source": [
+    "As we can see, this is exactly the same as what we got when we used the embedding layer:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 12,
+   "id": "2b057649-3176-4a54-b58c-fd8fbf818c61",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "tensor([[ 0.6957, -1.8061, -1.1589,  0.3255, -0.6315],\n",
+       "        [-2.8400, -0.7849, -1.4096, -0.4076,  0.7953],\n",
+       "        [ 1.3010,  1.2753, -0.2010, -0.1606, -0.4015]],\n",
+       "       grad_fn=<EmbeddingBackward0>)"
+      ]
+     },
+     "execution_count": 12,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "embedding(idx)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "0e447639-8952-460e-8c8f-cf9e23c368c9",
+   "metadata": {},
+   "source": [
+    "- What happens under the hood is the following computation for the first training example's token ID:"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "1830eccf-a707-4753-a24a-9b103f55594a",
+   "metadata": {},
+   "source": [
+    "<img src=\"images/4.png\" width=\"450px\">"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "9ce5211a-14e6-46aa-a3a8-14609f086e97",
+   "metadata": {},
+   "source": [
+    "- And for the second training example's token ID:"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "173f6026-a461-44da-b9c5-f571f8ec8bf3",
+   "metadata": {},
+   "source": [
+    "<img src=\"images/5.png\" width=\"450px\">"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "e2608049-f5d1-49a9-a14b-82695fc32e6a",
+   "metadata": {},
+   "source": [
+    "- Since all but one entry in each one-hot encoded row are 0 (by design), this matrix multiplication is essentially the same as looking up the row of the weight matrix selected by the single nonzero entry\n",
+    "- This use of the matrix multiplication on one-hot encodings is equivalent to the embedding layer look-up but can be inefficient if we work with large embedding matrices, because most of the work consists of wasteful multiplications by zero"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "5eacc005-86fc-490c-8f6a-dc37d8a0df7c",
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "a1f63c81-1ee3-40a1-9ef2-14ff18fb4f05",
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "c71959bb-facf-44fd-8edb-b67f7752f034",
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.11.5"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
@@ -0,0 +1,7 @@
+# Chapter 2: Working with Text Data
+
+- [01_main-chapter-code](01_main-chapter-code) contains the main chapter code and exercise solutions
+  
+- [02_bonus_bytepair-encoder](02_bonus_bytepair-encoder) contains optional code to benchmark different byte pair encoder implementations
+  
+- [03_bonus_embedding-vs-matmul](03_bonus_embedding-vs-matmul) contains optional (bonus) code to explain that embedding layers and fully connected layers applied to one-hot encoded vectors are equivalent.
@@ -0,0 +1,5 @@
+# Chapter 3: Coding Attention Mechanisms
+
+- [ch03.ipynb](ch03.ipynb) contains all the code as it appears in the chapter
+- [multihead-attention.ipynb](multihead-attention.ipynb) is a minimal notebook with the multi-head attention implementation from this chapter, plus the data loading pipeline from chapter 2
+
@@ -0,0 +1,308 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "51c9672d-8d0c-470d-ac2d-1271f8ec3f14",
+   "metadata": {},
+   "source": [
+    "# Chapter 3 Exercise solutions"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "33dfa199-9aee-41d4-a64b-7e3811b9a616",
+   "metadata": {},
+   "source": [
+    "# Exercise 3.1"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "id": "5fee2cf5-61c3-4167-81b5-44ea155bbaf2",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import torch\n",
+    "\n",
+    "inputs = torch.tensor(\n",
+    "  [[0.43, 0.15, 0.89], # Your     (x^1)\n",
+    "   [0.55, 0.87, 0.66], # journey  (x^2)\n",
+    "   [0.57, 0.85, 0.64], # starts   (x^3)\n",
+    "   [0.22, 0.58, 0.33], # with     (x^4)\n",
+    "   [0.77, 0.25, 0.10], # one      (x^5)\n",
+    "   [0.05, 0.80, 0.55]] # step     (x^6)\n",
+    ")\n",
+    "\n",
+    "d_in, d_out = 3, 2"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 58,
+   "id": "62ea289c-41cd-4416-89dd-dde6383a6f70",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import torch.nn as nn\n",
+    "\n",
+    "class SelfAttention_v1(nn.Module):\n",
+    "\n",
+    "    def __init__(self, d_in, d_out):\n",
+    "        super().__init__()\n",
+    "        self.d_out = d_out\n",
+    "        self.W_query = nn.Parameter(torch.rand(d_in, d_out))\n",
+    "        self.W_key   = nn.Parameter(torch.rand(d_in, d_out))\n",
+    "        self.W_value = nn.Parameter(torch.rand(d_in, d_out))\n",
+    "\n",
+    "    def forward(self, x):\n",
+    "        keys = x @ self.W_key\n",
+    "        queries = x @ self.W_query\n",
+    "        values = x @ self.W_value\n",
+    "        \n",
+    "        attn_scores = queries @ keys.T # omega\n",
+    "        attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)\n",
+    "\n",
+    "        context_vec = attn_weights @ values\n",
+    "        return context_vec\n",
+    "\n",
+    "torch.manual_seed(123)\n",
+    "sa_v1 = SelfAttention_v1(d_in, d_out)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 59,
+   "id": "7b035143-f4e8-45fb-b398-dec1bd5153d4",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "class SelfAttention_v2(nn.Module):\n",
+    "\n",
+    "    def __init__(self, d_in, d_out):\n",
+    "        super().__init__()\n",
+    "        self.d_out = d_out\n",
+    "        self.W_query = nn.Linear(d_in, d_out, bias=False)\n",
+    "        self.W_key   = nn.Linear(d_in, d_out, bias=False)\n",
+    "        self.W_value = nn.Linear(d_in, d_out, bias=False)\n",
+    "\n",
+    "    def forward(self, x):\n",
+    "        keys = self.W_key(x)\n",
+    "        queries = self.W_query(x)\n",
+    "        values = self.W_value(x)\n",
+    "        \n",
+    "        attn_scores = queries @ keys.T\n",
+    "        attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=1)\n",
+    "\n",
+    "        context_vec = attn_weights @ values\n",
+    "        return context_vec\n",
+    "\n",
+    "torch.manual_seed(123)\n",
+    "sa_v2 = SelfAttention_v2(d_in, d_out)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 60,
+   "id": "7591d79c-c30e-406d-adfd-20c12eb448f6",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "sa_v1.W_query = torch.nn.Parameter(sa_v2.W_query.weight.T)\n",
+    "sa_v1.W_key = torch.nn.Parameter(sa_v2.W_key.weight.T)\n",
+    "sa_v1.W_value = torch.nn.Parameter(sa_v2.W_value.weight.T)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 61,
+   "id": "ddd0f54f-6bce-46cc-a428-17c2a56557d0",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "tensor([[-0.5337, -0.1051],\n",
+       "        [-0.5323, -0.1080],\n",
+       "        [-0.5323, -0.1079],\n",
+       "        [-0.5297, -0.1076],\n",
+       "        [-0.5311, -0.1066],\n",
+       "        [-0.5299, -0.1081]], grad_fn=<MmBackward0>)"
+      ]
+     },
+     "execution_count": 61,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "sa_v1(inputs)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 62,
+   "id": "340908f8-1144-4ddd-a9e1-a1c5c3d592f5",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "tensor([[-0.5337, -0.1051],\n",
+       "        [-0.5323, -0.1080],\n",
+       "        [-0.5323, -0.1079],\n",
+       "        [-0.5297, -0.1076],\n",
+       "        [-0.5311, -0.1066],\n",
+       "        [-0.5299, -0.1081]], grad_fn=<MmBackward0>)"
+      ]
+     },
+     "execution_count": 62,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "sa_v2(inputs)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "33543edb-46b5-4b01-8704-f7f101230544",
+   "metadata": {},
+   "source": [
+    "# Exercise 3.2"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "0588e209-1644-496a-8dae-7630b4ef9083",
+   "metadata": {},
+   "source": [
+    "If we want to have an output dimension of 2, as earlier in single-head attention, we have to change the projection dimension `d_out` to 1:"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "18e748ef-3106-4e11-a781-b230b74a0cef",
+   "metadata": {},
+   "source": [
+    "```python\n",
+    "torch.manual_seed(123)\n",
+    "\n",
+    "d_out = 1\n",
+    "mha = MultiHeadAttentionWrapper(d_in, d_out, block_size, 0.0, num_heads=2)\n",
+    "\n",
+    "context_vecs = mha(batch)\n",
+    "\n",
+    "print(context_vecs)\n",
+    "print(\"context_vecs.shape:\", context_vecs.shape)\n",
+    "```"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "78234544-d989-4f71-ac28-85a7ec1e6b7b",
+   "metadata": {},
+   "source": [
+    "```\n",
+    "tensor([[[-9.1476e-02,  3.4164e-02],\n",
+    "         [-2.6796e-01, -1.3427e-03],\n",
+    "         [-4.8421e-01, -4.8909e-02],\n",
+    "         [-6.4808e-01, -1.0625e-01],\n",
+    "         [-8.8380e-01, -1.7140e-01],\n",
+    "         [-1.4744e+00, -3.4327e-01]],\n",
+    "\n",
+    "        [[-9.1476e-02,  3.4164e-02],\n",
+    "         [-2.6796e-01, -1.3427e-03],\n",
+    "         [-4.8421e-01, -4.8909e-02],\n",
+    "         [-6.4808e-01, -1.0625e-01],\n",
+    "         [-8.8380e-01, -1.7140e-01],\n",
+    "         [-1.4744e+00, -3.4327e-01]]], grad_fn=<CatBackward0>)\n",
+    "context_vecs.shape: torch.Size([2, 6, 2])\n",
+    "```"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "92bdabcb-06cf-4576-b810-d883bbd313ba",
+   "metadata": {},
+   "source": [
+    "# Exercise 3.3"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "84c9b963-d01f-46e6-96bf-8eb2a54c5e42",
+   "metadata": {},
+   "source": [
+    "```python\n",
+    "block_size = 1024\n",
+    "d_in, d_out = 768, 768\n",
+    "num_heads = 12\n",
+    "\n",
+    "mha = MultiHeadAttention(d_in, d_out, block_size, 0.0, num_heads)\n",
+    "```"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "375d5290-8e8b-4149-958e-1efb58a69191",
+   "metadata": {},
+   "source": [
+    "Optionally, we can count the number of trainable parameters as follows:"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "6d7e603c-1658-4da9-9c0b-ef4bc72832b4",
+   "metadata": {},
+   "source": [
+    "```python\n",
+    "def count_parameters(model):\n",
+    "    return sum(p.numel() for p in model.parameters() if p.requires_grad)\n",
+    "\n",
+    "count_parameters(mha)\n",
+    "```"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "51ba00bd-feb0-4424-84cb-7c2b1f908779",
+   "metadata": {},
+   "source": [
+    "```\n",
+    "2360064  # (2.36 M)\n",
+    "```"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "a56c1d47-9b95-4bd1-a517-580a6f779c52",
+   "metadata": {},
+   "source": [
+    "The GPT-2 model has 117M parameters in total, but as we can see, most of its parameters are not in the multi-head attention module itself."
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.10.12"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
@@ -0,0 +1,358 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "6f678e62-7bcb-4405-86ae-dce94f494303",
+   "metadata": {},
+   "source": [
+    "# Multi-head Attention Plus Data Loading"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "070000fc-a7b7-4c56-a2c0-a938d413a790",
+   "metadata": {},
+   "source": [
+    "The complete chapter code is located in [ch03.ipynb](./ch03.ipynb).\n",
+    "\n",
+    "This notebook contains the main takeaway, the multi-head attention implementation (plus the data loading pipeline from chapter 2)."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "3f60dc93-281d-447e-941f-aede0c7ff7fc",
+   "metadata": {},
+   "source": [
+    "## Data Loader from Chapter 2"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "id": "0ed4b7db-3b47-4fd3-a4a6-5f4ed5dd166e",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import tiktoken\n",
+    "import torch\n",
+    "import torch.nn as nn\n",
+    "from torch.utils.data import Dataset, DataLoader\n",
+    "\n",
+    "\n",
+    "class GPTDatasetV1(Dataset):\n",
+    "    def __init__(self, txt, tokenizer, max_length, stride):\n",
+    "        self.tokenizer = tokenizer\n",
+    "        self.input_ids = []\n",
+    "        self.target_ids = []\n",
+    "\n",
+    "        # Tokenize the entire text\n",
+    "        token_ids = tokenizer.encode(txt, allowed_special={'<|endoftext|>'})\n",
+    "\n",
+    "        # Use a sliding window to chunk the book into overlapping sequences of max_length\n",
+    "        for i in range(0, len(token_ids) - max_length, stride):\n",
+    "            input_chunk = token_ids[i:i + max_length]\n",
+    "            target_chunk = token_ids[i + 1: i + max_length + 1]\n",
+    "            self.input_ids.append(torch.tensor(input_chunk))\n",
+    "            self.target_ids.append(torch.tensor(target_chunk))\n",
+    "\n",
+    "    def __len__(self):\n",
+    "        return len(self.input_ids)\n",
+    "\n",
+    "    def __getitem__(self, idx):\n",
+    "        return self.input_ids[idx], self.target_ids[idx]\n",
+    "\n",
+    "\n",
+    "def create_dataloader(txt, batch_size=4, max_length=256, stride=128, shuffle=True):\n",
+    "    # Initialize the tokenizer\n",
+    "    tokenizer = tiktoken.get_encoding(\"gpt2\")\n",
+    "\n",
+    "    # Create dataset\n",
+    "    dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)\n",
+    "\n",
+    "    # Create dataloader\n",
+    "    dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=shuffle)\n",
+    "\n",
+    "    return dataloader\n",
+    "\n",
+    "\n",
+    "with open(\"small-text-sample.txt\", \"r\", encoding=\"utf-8\") as f:\n",
+    "    raw_text = f.read()\n",
+    "\n",
+    "tokenizer = tiktoken.get_encoding(\"gpt2\")\n",
+    "encoded_text = tokenizer.encode(raw_text)\n",
+    "\n",
+    "vocab_size = 50257\n",
+    "output_dim = 256\n",
+    "max_len = 1024\n",
+    "block_size = max_len\n",
+    "\n",
+    "\n",
+    "token_embedding_layer = nn.Embedding(vocab_size, output_dim)\n",
+    "pos_embedding_layer = torch.nn.Embedding(block_size, output_dim)\n",
+    "\n",
+    "max_length = 4\n",
+    "dataloader = create_dataloader(raw_text, batch_size=8, max_length=max_length, stride=5)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "id": "664397bc-6daa-4b88-90aa-e8fc1fbd5846",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "for batch in dataloader:\n",
+    "    x, y = batch\n",
+    "\n",
+    "    token_embeddings = token_embedding_layer(x)\n",
+    "    pos_embeddings = pos_embedding_layer(torch.arange(max_length))\n",
+    "\n",
+    "    input_embeddings = token_embeddings + pos_embeddings\n",
+    "\n",
+    "    break"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "id": "d3664332-e6bb-447e-8b96-203aafde8b24",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "torch.Size([8, 4, 256])\n"
+     ]
+    }
+   ],
+   "source": [
+    "print(input_embeddings.shape)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "bd298bf4-e320-40c1-9084-6526d07e6d5d",
+   "metadata": {},
+   "source": [
+    "# Multi-head Attention from Chapter 3"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "58b2297b-a001-49fd-994c-b1700866cd01",
+   "metadata": {},
+   "source": [
+    "## Variant A: Simple implementation"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "id": "a44e682d-1c3c-445d-85fa-b142f89f8503",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "class CausalSelfAttention(nn.Module):\n",
+    "\n",
+    "    def __init__(self, d_in, d_out, block_size, dropout, qkv_bias=False):\n",
+    "        super().__init__()\n",
+    "        self.d_out = d_out\n",
+    "        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)\n",
+    "        self.W_key   = nn.Linear(d_in, d_out, bias=qkv_bias)\n",
+    "        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)\n",
+    "        self.dropout = nn.Dropout(dropout) # New\n",
+    "        self.register_buffer('mask', torch.triu(torch.ones(block_size, block_size), diagonal=1)) # New\n",
+    "\n",
+    "    def forward(self, x):\n",
+    "        b, n_tokens, d_in = x.shape # New batch dimension b\n",
+    "        keys = self.W_key(x)\n",
+    "        queries = self.W_query(x)\n",
+    "        values = self.W_value(x)\n",
+    "\n",
+    "        attn_scores = queries @ keys.transpose(1, 2) # Changed transpose\n",
+    "        attn_scores.masked_fill_(  # New, _ ops are in-place\n",
+    "            self.mask.bool()[:n_tokens, :n_tokens], -torch.inf) \n",
+    "        attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)\n",
+    "        attn_weights = self.dropout(attn_weights) # New\n",
+    "\n",
+    "        context_vec = attn_weights @ values\n",
+    "        return context_vec\n",
+    "\n",
+    "\n",
+    "class MultiHeadAttentionWrapper(nn.Module):\n",
+    "    def __init__(self, d_in, d_out, block_size, dropout, num_heads, qkv_bias=False):\n",
+    "        super().__init__()\n",
+    "        self.heads = nn.ModuleList(\n",
+    "            [CausalSelfAttention(d_in, d_out, block_size, dropout, qkv_bias) \n",
+    "             for _ in range(num_heads)]\n",
+    "        )\n",
+    "        self.out_proj = nn.Linear(d_out*num_heads, d_out*num_heads)\n",
+    "\n",
+    "    def forward(self, x):\n",
+    "        context_vec = torch.cat([head(x) for head in self.heads], dim=-1)\n",
+    "        return self.out_proj(context_vec)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "id": "7898551e-f582-48ac-9f66-3632abe2a93f",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "context_vecs.shape: torch.Size([8, 4, 256])\n"
+     ]
+    }
+   ],
+   "source": [
+    "torch.manual_seed(123)\n",
+    "\n",
+    "block_size = max_length\n",
+    "d_in = output_dim\n",
+    "\n",
+    "num_heads=2\n",
+    "d_out = d_in // num_heads\n",
+    "\n",
+    "mha = MultiHeadAttentionWrapper(d_in, d_out, block_size, 0.0, num_heads)\n",
+    "\n",
+    "batch = input_embeddings\n",
+    "context_vecs = mha(batch)\n",
+    "\n",
+    "print(\"context_vecs.shape:\", context_vecs.shape)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "1e288239-5146-424d-97fe-74024ae711b9",
+   "metadata": {},
+   "source": [
+    "## Variant B: Alternative implementation"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "id": "2773c09d-c136-4372-a2be-04b58d292842",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "class MultiHeadAttention(nn.Module):\n",
+    "    def __init__(self, d_in, d_out, block_size, dropout, num_heads, qkv_bias=False):\n",
+    "        super().__init__()\n",
+    "        assert d_out % num_heads == 0, \"d_out must be divisible by num_heads\"\n",
+    "\n",
+    "        self.d_out = d_out\n",
+    "        self.num_heads = num_heads\n",
+    "        self.head_dim = d_out // num_heads # Reduce the projection dim to match desired output dim\n",
+    "\n",
+    "        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)\n",
+    "        self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)\n",
+    "        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)\n",
+    "        self.out_proj = nn.Linear(d_out, d_out)  # Linear layer to combine head outputs\n",
+    "        self.dropout = nn.Dropout(dropout)\n",
+    "        self.register_buffer('mask', torch.triu(torch.ones(block_size, block_size), diagonal=1))\n",
+    "\n",
+    "    def forward(self, x):\n",
+    "        b, num_tokens, d_in = x.shape\n",
+    "\n",
+    "        keys = self.W_key(x) # Shape: (b, num_tokens, d_out)\n",
+    "        queries = self.W_query(x)\n",
+    "        values = self.W_value(x)\n",
+    "\n",
+    "        # We implicitly split the matrix by adding a `num_heads` dimension\n",
+    "        # Unroll last dim: (b, num_tokens, d_out) -> (b, num_tokens, num_heads, head_dim)\n",
+    "        keys = keys.view(b, num_tokens, self.num_heads, self.head_dim) \n",
+    "        values = values.view(b, num_tokens, self.num_heads, self.head_dim)\n",
+    "        queries = queries.view(b, num_tokens, self.num_heads, self.head_dim)\n",
+    "\n",
+    "        # Transpose: (b, num_tokens, num_heads, head_dim) -> (b, num_heads, num_tokens, head_dim)\n",
+    "        keys = keys.transpose(1, 2)\n",
+    "        queries = queries.transpose(1, 2)\n",
+    "        values = values.transpose(1, 2)\n",
+    "\n",
+    "        # Compute scaled dot-product attention (aka self-attention) with a causal mask\n",
+    "        attn_scores = queries @ keys.transpose(2, 3)  # Dot product for each head\n",
+    "        # Original mask truncated to the number of tokens and converted to boolean\n",
+    "        mask_bool = self.mask.bool()[:num_tokens, :num_tokens]\n",
+    "        # Unsqueeze the mask twice to match dimensions\n",
+    "        mask_unsqueezed = mask_bool.unsqueeze(0).unsqueeze(0)\n",
+    "        # Use the unsqueezed mask to fill attention scores\n",
+    "        attn_scores.masked_fill_(mask_unsqueezed, -torch.inf)\n",
+    "        \n",
+    "        attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)\n",
+    "        attn_weights = self.dropout(attn_weights)\n",
+    "\n",
+    "        # Shape: (b, num_tokens, num_heads, head_dim)\n",
+    "        context_vec = (attn_weights @ values).transpose(1, 2) \n",
+    "        \n",
+    "        # Combine heads, where self.d_out = self.num_heads * self.head_dim\n",
+    "        context_vec = context_vec.contiguous().view(b, num_tokens, self.d_out)\n",
+    "        context_vec = self.out_proj(context_vec) # optional projection\n",
+    "\n",
+    "        return context_vec"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "id": "779fdd04-0152-4308-af08-840800a7f395",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "context_vecs.shape: torch.Size([8, 4, 256])\n"
+     ]
+    }
+   ],
+   "source": [
+    "torch.manual_seed(123)\n",
+    "\n",
+    "block_size = max_length\n",
+    "d_in = output_dim\n",
+    "d_out = d_in\n",
+    "\n",
+    "mha = MultiHeadAttention(d_in, d_out, block_size, 0.0, num_heads=2)\n",
+    "\n",
+    "batch = input_embeddings\n",
+    "context_vecs = mha(batch)\n",
+    "\n",
+    "print(\"context_vecs.shape:\", context_vecs.shape)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "f8d4be84-28bb-41d5-996c-4936acffd411",
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.10.12"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
@@ -0,0 +1,9 @@
+Once upon a time in a quiet village nestled among rolling hills and whispering forests, there lived a young girl named Elara. Elara was known for her boundless curiosity and her love for the stars. Every night, she would climb to the highest hill near her home to gaze at the glittering sky, dreaming of distant worlds and galaxies.
+
+In the heart of the village, there was an ancient library, tended by an old, wise librarian named Mr. Bramwell. This library was a treasure trove of books on every subject, but most importantly, it housed a collection of old star maps and celestial guides. Elara, fascinated by these books, spent countless hours with Mr. Bramwell, learning about constellations, planets, and the mysteries of the universe.
+
+One evening, while studying an old star map, Elara noticed a small, uncharted star that twinkled differently. She shared this discovery with Mr. Bramwell, who was equally intrigued. They decided to observe this star every night, noting its unique patterns and movements. This small, mysterious star, which they named "Elara's Star," became the center of their nightly adventures.
+
+As days turned into weeks, the villagers began to take notice of Elara's star. The uncharted star brought the community together, with people of all ages joining Elara and Mr. Bramwell on the hill each night to gaze at the sky. The nightly gatherings turned into a festival of stars, where stories were shared, friendships were formed, and the mysteries of the cosmos were contemplated.
+
+The story of Elara and her star spread far and wide, attracting astronomers and dreamers from distant lands. The once quiet village became a beacon of wonder, a place where the sky seemed a little closer and the stars a bit friendlier. Elara's curiosity had not only unveiled a hidden star but had also brought her community together, reminding everyone that sometimes, the most extraordinary discoveries are waiting just above us, in the starlit sky.
@@ -0,0 +1,5 @@
+# Chapter 3: Coding Attention Mechanisms
+
+- [ch03.ipynb](ch03.ipynb) contains all the code as it appears in the chapter
+- [multihead-attention.ipynb](multihead-attention.ipynb) is a minimal notebook with the main multi-head attention implementation from this chapter (plus the data loading pipeline from chapter 2)
+
@@ -0,0 +1,308 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "51c9672d-8d0c-470d-ac2d-1271f8ec3f14",
+   "metadata": {},
+   "source": [
+    "# Chapter 3 Exercise solutions"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "33dfa199-9aee-41d4-a64b-7e3811b9a616",
+   "metadata": {},
+   "source": [
+    "# Exercise 3.1"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "id": "5fee2cf5-61c3-4167-81b5-44ea155bbaf2",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import torch\n",
+    "\n",
+    "inputs = torch.tensor(\n",
+    "  [[0.43, 0.15, 0.89], # Your     (x^1)\n",
+    "   [0.55, 0.87, 0.66], # journey  (x^2)\n",
+    "   [0.57, 0.85, 0.64], # starts   (x^3)\n",
+    "   [0.22, 0.58, 0.33], # with     (x^4)\n",
+    "   [0.77, 0.25, 0.10], # one      (x^5)\n",
+    "   [0.05, 0.80, 0.55]] # step     (x^6)\n",
+    ")\n",
+    "\n",
+    "d_in, d_out = 3, 2"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 58,
+   "id": "62ea289c-41cd-4416-89dd-dde6383a6f70",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import torch.nn as nn\n",
+    "\n",
+    "class SelfAttention_v1(nn.Module):\n",
+    "\n",
+    "    def __init__(self, d_in, d_out):\n",
+    "        super().__init__()\n",
+    "        self.d_out = d_out\n",
+    "        self.W_query = nn.Parameter(torch.rand(d_in, d_out))\n",
+    "        self.W_key   = nn.Parameter(torch.rand(d_in, d_out))\n",
+    "        self.W_value = nn.Parameter(torch.rand(d_in, d_out))\n",
+    "\n",
+    "    def forward(self, x):\n",
+    "        keys = x @ self.W_key\n",
+    "        queries = x @ self.W_query\n",
+    "        values = x @ self.W_value\n",
+    "        \n",
+    "        attn_scores = queries @ keys.T # omega\n",
+    "        attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)\n",
+    "\n",
+    "        context_vec = attn_weights @ values\n",
+    "        return context_vec\n",
+    "\n",
+    "torch.manual_seed(123)\n",
+    "sa_v1 = SelfAttention_v1(d_in, d_out)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 59,
+   "id": "7b035143-f4e8-45fb-b398-dec1bd5153d4",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "class SelfAttention_v2(nn.Module):\n",
+    "\n",
+    "    def __init__(self, d_in, d_out):\n",
+    "        super().__init__()\n",
+    "        self.d_out = d_out\n",
+    "        self.W_query = nn.Linear(d_in, d_out, bias=False)\n",
+    "        self.W_key   = nn.Linear(d_in, d_out, bias=False)\n",
+    "        self.W_value = nn.Linear(d_in, d_out, bias=False)\n",
+    "\n",
+    "    def forward(self, x):\n",
+    "        keys = self.W_key(x)\n",
+    "        queries = self.W_query(x)\n",
+    "        values = self.W_value(x)\n",
+    "        \n",
+    "        attn_scores = queries @ keys.T\n",
+    "        attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=1)\n",
+    "\n",
+    "        context_vec = attn_weights @ values\n",
+    "        return context_vec\n",
+    "\n",
+    "torch.manual_seed(123)\n",
+    "sa_v2 = SelfAttention_v2(d_in, d_out)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 60,
+   "id": "7591d79c-c30e-406d-adfd-20c12eb448f6",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "sa_v1.W_query = torch.nn.Parameter(sa_v2.W_query.weight.T)\n",
+    "sa_v1.W_key = torch.nn.Parameter(sa_v2.W_key.weight.T)\n",
+    "sa_v1.W_value = torch.nn.Parameter(sa_v2.W_value.weight.T)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 61,
+   "id": "ddd0f54f-6bce-46cc-a428-17c2a56557d0",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "tensor([[-0.5337, -0.1051],\n",
+       "        [-0.5323, -0.1080],\n",
+       "        [-0.5323, -0.1079],\n",
+       "        [-0.5297, -0.1076],\n",
+       "        [-0.5311, -0.1066],\n",
+       "        [-0.5299, -0.1081]], grad_fn=<MmBackward0>)"
+      ]
+     },
+     "execution_count": 61,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "sa_v1(inputs)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 62,
+   "id": "340908f8-1144-4ddd-a9e1-a1c5c3d592f5",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "tensor([[-0.5337, -0.1051],\n",
+       "        [-0.5323, -0.1080],\n",
+       "        [-0.5323, -0.1079],\n",
+       "        [-0.5297, -0.1076],\n",
+       "        [-0.5311, -0.1066],\n",
+       "        [-0.5299, -0.1081]], grad_fn=<MmBackward0>)"
+      ]
+     },
+     "execution_count": 62,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "sa_v2(inputs)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "33543edb-46b5-4b01-8704-f7f101230544",
+   "metadata": {},
+   "source": [
+    "# Exercise 3.2"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "0588e209-1644-496a-8dae-7630b4ef9083",
+   "metadata": {},
+   "source": [
+    "If we want an output dimension of 2, as earlier in single-head attention, we have to change the projection dimension `d_out` to 1, because the wrapper concatenates the outputs of both heads:"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "18e748ef-3106-4e11-a781-b230b74a0cef",
+   "metadata": {},
+   "source": [
+    "```python\n",
+    "torch.manual_seed(123)\n",
+    "\n",
+    "d_out = 1\n",
+    "mha = MultiHeadAttentionWrapper(d_in, d_out, block_size, 0.0, num_heads=2)\n",
+    "\n",
+    "context_vecs = mha(batch)\n",
+    "\n",
+    "print(context_vecs)\n",
+    "print(\"context_vecs.shape:\", context_vecs.shape)\n",
+    "```"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "78234544-d989-4f71-ac28-85a7ec1e6b7b",
+   "metadata": {},
+   "source": [
+    "```\n",
+    "tensor([[[-9.1476e-02,  3.4164e-02],\n",
+    "         [-2.6796e-01, -1.3427e-03],\n",
+    "         [-4.8421e-01, -4.8909e-02],\n",
+    "         [-6.4808e-01, -1.0625e-01],\n",
+    "         [-8.8380e-01, -1.7140e-01],\n",
+    "         [-1.4744e+00, -3.4327e-01]],\n",
+    "\n",
+    "        [[-9.1476e-02,  3.4164e-02],\n",
+    "         [-2.6796e-01, -1.3427e-03],\n",
+    "         [-4.8421e-01, -4.8909e-02],\n",
+    "         [-6.4808e-01, -1.0625e-01],\n",
+    "         [-8.8380e-01, -1.7140e-01],\n",
+    "         [-1.4744e+00, -3.4327e-01]]], grad_fn=<CatBackward0>)\n",
+    "context_vecs.shape: torch.Size([2, 6, 2])\n",
+    "```"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "92bdabcb-06cf-4576-b810-d883bbd313ba",
+   "metadata": {},
+   "source": [
+    "# Exercise 3.3"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "84c9b963-d01f-46e6-96bf-8eb2a54c5e42",
+   "metadata": {},
+   "source": [
+    "```python\n",
+    "block_size = 1024\n",
+    "d_in, d_out = 768, 768\n",
+    "num_heads = 12\n",
+    "\n",
+    "mha = MultiHeadAttention(d_in, d_out, block_size, 0.0, num_heads)\n",
+    "```"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "375d5290-8e8b-4149-958e-1efb58a69191",
+   "metadata": {},
+   "source": [
+    "Optionally, the number of parameters is as follows:"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "6d7e603c-1658-4da9-9c0b-ef4bc72832b4",
+   "metadata": {},
+   "source": [
+    "```python\n",
+    "def count_parameters(model):\n",
+    "    return sum(p.numel() for p in model.parameters() if p.requires_grad)\n",
+    "\n",
+    "count_parameters(mha)\n",
+    "```"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "51ba00bd-feb0-4424-84cb-7c2b1f908779",
+   "metadata": {},
+   "source": [
+    "```\n",
+    "2360064  # (2.36 M)\n",
+    "```"
+   ]
+  },
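+  {
+   "cell_type": "markdown",
+   "id": "added-param-count-sketch",
+   "metadata": {},
+   "source": [
+    "As a sanity check, the count above can be reproduced by hand. The sketch below assumes the `MultiHeadAttention` layout used in this chapter: three bias-free query, key, and value projections plus an output projection that does have a bias. The helper name `mha_param_count` is hypothetical and only for illustration. The causal mask is a registered buffer, not a parameter, so it does not contribute.\n",
+    "\n",
+    "```python\n",
+    "def mha_param_count(d_in, d_out, qkv_bias=False):\n",
+    "    # Three projections (query, key, value), each with d_in * d_out weights\n",
+    "    qkv = 3 * d_in * d_out\n",
+    "    if qkv_bias:\n",
+    "        qkv += 3 * d_out  # one bias vector per projection\n",
+    "    # Output projection: d_out * d_out weights plus d_out bias terms\n",
+    "    out_proj = d_out * d_out + d_out\n",
+    "    return qkv + out_proj\n",
+    "\n",
+    "\n",
+    "total = mha_param_count(768, 768)\n",
+    "print(total)  # 3 * 589824 + 589824 + 768 = 2360064\n",
+    "```\n",
+    "\n",
+    "Note that the number of heads does not appear in the formula: splitting `d_out` across heads changes how the projections are reshaped, not how many weights they hold."
+   ]
+  },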
+  {
+   "cell_type": "markdown",
+   "id": "a56c1d47-9b95-4bd1-a517-580a6f779c52",
+   "metadata": {},
+   "source": [
+    "The GPT-2 model has 117M parameters in total, so a single multi-head attention module (about 2.36M parameters) accounts for only a small fraction of them."
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.11.5"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
@@ -0,0 +1,358 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "6f678e62-7bcb-4405-86ae-dce94f494303",
+   "metadata": {},
+   "source": [
+    "# Multi-head Attention Plus Data Loading"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "070000fc-a7b7-4c56-a2c0-a938d413a790",
+   "metadata": {},
+   "source": [
+    "The complete chapter code is located in [ch03.ipynb](./ch03.ipynb).\n",
+    "\n",
+    "This notebook contains the main takeaway, the multi-head attention implementation (plus the data loading pipeline from chapter 2)."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "3f60dc93-281d-447e-941f-aede0c7ff7fc",
+   "metadata": {},
+   "source": [
+    "## Data Loader from Chapter 2"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "id": "0ed4b7db-3b47-4fd3-a4a6-5f4ed5dd166e",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import tiktoken\n",
+    "import torch\n",
+    "import torch.nn as nn\n",
+    "from torch.utils.data import Dataset, DataLoader\n",
+    "\n",
+    "\n",
+    "class GPTDatasetV1(Dataset):\n",
+    "    def __init__(self, txt, tokenizer, max_length, stride):\n",
+    "        self.tokenizer = tokenizer\n",
+    "        self.input_ids = []\n",
+    "        self.target_ids = []\n",
+    "\n",
+    "        # Tokenize the entire text\n",
+    "        token_ids = tokenizer.encode(txt, allowed_special={'<|endoftext|>'})\n",
+    "\n",
+    "        # Use a sliding window to chunk the book into overlapping sequences of max_length\n",
+    "        for i in range(0, len(token_ids) - max_length, stride):\n",
+    "            input_chunk = token_ids[i:i + max_length]\n",
+    "            target_chunk = token_ids[i + 1: i + max_length + 1]\n",
+    "            self.input_ids.append(torch.tensor(input_chunk))\n",
+    "            self.target_ids.append(torch.tensor(target_chunk))\n",
+    "\n",
+    "    def __len__(self):\n",
+    "        return len(self.input_ids)\n",
+    "\n",
+    "    def __getitem__(self, idx):\n",
+    "        return self.input_ids[idx], self.target_ids[idx]\n",
+    "\n",
+    "\n",
+    "def create_dataloader(txt, batch_size=4, max_length=256, stride=128, shuffle=True):\n",
+    "    # Initialize the tokenizer\n",
+    "    tokenizer = tiktoken.get_encoding(\"gpt2\")\n",
+    "\n",
+    "    # Create dataset\n",
+    "    dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)\n",
+    "\n",
+    "    # Create dataloader\n",
+    "    dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=shuffle)\n",
+    "\n",
+    "    return dataloader\n",
+    "\n",
+    "\n",
+    "with open(\"small-text-sample.txt\", \"r\", encoding=\"utf-8\") as f:\n",
+    "    raw_text = f.read()\n",
+    "\n",
+    "tokenizer = tiktoken.get_encoding(\"gpt2\")\n",
+    "encoded_text = tokenizer.encode(raw_text)\n",
+    "\n",
+    "vocab_size = 50257\n",
+    "output_dim = 256\n",
+    "max_len = 1024\n",
+    "block_size = max_len\n",
+    "\n",
+    "\n",
+    "token_embedding_layer = nn.Embedding(vocab_size, output_dim)\n",
+    "pos_embedding_layer = torch.nn.Embedding(block_size, output_dim)\n",
+    "\n",
+    "max_length = 4\n",
+    "dataloader = create_dataloader(raw_text, batch_size=8, max_length=max_length, stride=5)"
+   ]
+  },
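+  {
+   "cell_type": "markdown",
+   "id": "added-sliding-window-sketch",
+   "metadata": {},
+   "source": [
+    "To make the sliding window in `GPTDatasetV1` concrete, here is a minimal sketch on a toy token list. The helper name `sliding_windows` and the toy ids are assumptions for illustration only; the loop bounds mirror the dataset class above, but plain lists stand in for tensors.\n",
+    "\n",
+    "```python\n",
+    "def sliding_windows(token_ids, max_length, stride):\n",
+    "    # Hypothetical helper: same loop as GPTDatasetV1, but on plain lists\n",
+    "    pairs = []\n",
+    "    for i in range(0, len(token_ids) - max_length, stride):\n",
+    "        input_chunk = token_ids[i:i + max_length]\n",
+    "        target_chunk = token_ids[i + 1:i + max_length + 1]  # shifted by one\n",
+    "        pairs.append((input_chunk, target_chunk))\n",
+    "    return pairs\n",
+    "\n",
+    "\n",
+    "toy_ids = list(range(10))  # stand-in for real token ids\n",
+    "windows = sliding_windows(toy_ids, max_length=4, stride=5)\n",
+    "for inp, tgt in windows:\n",
+    "    print(inp, '->', tgt)\n",
+    "```\n",
+    "\n",
+    "Each target is its input shifted one position to the right. Because the stride (5) exceeds `max_length` (4), as in the `create_dataloader` call above, consecutive windows do not overlap and the token between them only ever appears as a target."
+   ]
+  },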
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "id": "664397bc-6daa-4b88-90aa-e8fc1fbd5846",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "for batch in dataloader:\n",
+    "    x, y = batch\n",
+    "\n",
+    "    token_embeddings = token_embedding_layer(x)\n",
+    "    pos_embeddings = pos_embedding_layer(torch.arange(max_length))\n",
+    "\n",
+    "    input_embeddings = token_embeddings + pos_embeddings\n",
+    "\n",
+    "    break"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "id": "d3664332-e6bb-447e-8b96-203aafde8b24",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "torch.Size([8, 4, 256])\n"
+     ]
+    }
+   ],
+   "source": [
+    "print(input_embeddings.shape)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "bd298bf4-e320-40c1-9084-6526d07e6d5d",
+   "metadata": {},
+   "source": [
+    "# Multi-head Attention from Chapter 3"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "58b2297b-a001-49fd-994c-b1700866cd01",
+   "metadata": {},
+   "source": [
+    "## Variant A: Simple implementation"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "id": "a44e682d-1c3c-445d-85fa-b142f89f8503",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "class CausalSelfAttention(nn.Module):\n",
+    "\n",
+    "    def __init__(self, d_in, d_out, block_size, dropout, qkv_bias=False):\n",
+    "        super().__init__()\n",
+    "        self.d_out = d_out\n",
+    "        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)\n",
+    "        self.W_key   = nn.Linear(d_in, d_out, bias=qkv_bias)\n",
+    "        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)\n",
+    "        self.dropout = nn.Dropout(dropout) # New\n",
+    "        self.register_buffer('mask', torch.triu(torch.ones(block_size, block_size), diagonal=1)) # New\n",
+    "\n",
+    "    def forward(self, x):\n",
+    "        b, n_tokens, d_in = x.shape # New batch dimension b\n",
+    "        keys = self.W_key(x)\n",
+    "        queries = self.W_query(x)\n",
+    "        values = self.W_value(x)\n",
+    "\n",
+    "        attn_scores = queries @ keys.transpose(1, 2) # Changed transpose\n",
+    "        attn_scores.masked_fill_(  # New, _ ops are in-place\n",
+    "            self.mask.bool()[:n_tokens, :n_tokens], -torch.inf) \n",
+    "        attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)\n",
+    "        attn_weights = self.dropout(attn_weights) # New\n",
+    "\n",
+    "        context_vec = attn_weights @ values\n",
+    "        return context_vec\n",
+    "\n",
+    "\n",
+    "class MultiHeadAttentionWrapper(nn.Module):\n",
+    "    def __init__(self, d_in, d_out, block_size, dropout, num_heads, qkv_bias=False):\n",
+    "        super().__init__()\n",
+    "        self.heads = nn.ModuleList(\n",
+    "            [CausalSelfAttention(d_in, d_out, block_size, dropout, qkv_bias) \n",
+    "             for _ in range(num_heads)]\n",
+    "        )\n",
+    "        self.out_proj = nn.Linear(d_out*num_heads, d_out*num_heads)\n",
+    "\n",
+    "    def forward(self, x):\n",
+    "        context_vec = torch.cat([head(x) for head in self.heads], dim=-1)\n",
+    "        return self.out_proj(context_vec)"
+   ]
+  },
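+  {
+   "cell_type": "markdown",
+   "id": "added-causal-softmax-sketch",
+   "metadata": {},
+   "source": [
+    "The effect of the causal mask can be checked by hand on a tiny example. The sketch below is an illustration under stated assumptions, not the chapter's implementation: it uses plain Python instead of tensors, omits the scaling by the square root of the key dimension, and the name `causal_softmax` and the toy scores are made up. Each row holds one query's scores against all keys, so the softmax is taken across each row.\n",
+    "\n",
+    "```python\n",
+    "import math\n",
+    "\n",
+    "\n",
+    "def causal_softmax(scores):\n",
+    "    # scores: square list of lists, one row of raw attention scores per query\n",
+    "    weights = []\n",
+    "    for i, row in enumerate(scores):\n",
+    "        # Mask future positions (j > i) with -inf, like the upper-triangular mask\n",
+    "        masked = [s if j <= i else -math.inf for j, s in enumerate(row)]\n",
+    "        exps = [math.exp(s) for s in masked]  # exp(-inf) evaluates to 0.0\n",
+    "        total = sum(exps)\n",
+    "        weights.append([e / total for e in exps])\n",
+    "    return weights\n",
+    "\n",
+    "\n",
+    "toy_scores = [[0.1, 0.2, 0.3],\n",
+    "              [0.4, 0.5, 0.6],\n",
+    "              [0.7, 0.8, 0.9]]\n",
+    "weights = causal_softmax(toy_scores)\n",
+    "for row in weights:\n",
+    "    print([round(w, 3) for w in row], 'sum =', round(sum(row), 3))\n",
+    "```\n",
+    "\n",
+    "Every row sums to 1 and all entries above the diagonal are exactly 0, so a token's context vector mixes only the values of itself and earlier tokens."
+   ]
+  },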
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "id": "7898551e-f582-48ac-9f66-3632abe2a93f",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "context_vecs.shape: torch.Size([8, 4, 256])\n"
+     ]
+    }
+   ],
+   "source": [
+    "torch.manual_seed(123)\n",
+    "\n",
+    "block_size = max_length\n",
+    "d_in = output_dim\n",
+    "\n",
+    "num_heads=2\n",
+    "d_out = d_in // num_heads\n",
+    "\n",
+    "mha = MultiHeadAttentionWrapper(d_in, d_out, block_size, 0.0, num_heads)\n",
+    "\n",
+    "batch = input_embeddings\n",
+    "context_vecs = mha(batch)\n",
+    "\n",
+    "print(\"context_vecs.shape:\", context_vecs.shape)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "1e288239-5146-424d-97fe-74024ae711b9",
+   "metadata": {},
+   "source": [
+    "## Variant B: Alternative implementation"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "id": "2773c09d-c136-4372-a2be-04b58d292842",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "class MultiHeadAttention(nn.Module):\n",
+    "    def __init__(self, d_in, d_out, block_size, dropout, num_heads, qkv_bias=False):\n",
+    "        super().__init__()\n",
+    "        assert d_out % num_heads == 0, \"d_out must be divisible by num_heads\"\n",
+    "\n",
+    "        self.d_out = d_out\n",
+    "        self.num_heads = num_heads\n",
+    "        self.head_dim = d_out // num_heads # Reduce the projection dim to match desired output dim\n",
+    "\n",
+    "        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)\n",
+    "        self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)\n",
+    "        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)\n",
+    "        self.out_proj = nn.Linear(d_out, d_out)  # Linear layer to combine head outputs\n",
+    "        self.dropout = nn.Dropout(dropout)\n",
+    "        self.register_buffer('mask', torch.triu(torch.ones(block_size, block_size), diagonal=1))\n",
+    "\n",
+    "    def forward(self, x):\n",
+    "        b, num_tokens, d_in = x.shape\n",
+    "\n",
+    "        keys = self.W_key(x) # Shape: (b, num_tokens, d_out)\n",
+    "        queries = self.W_query(x)\n",
+    "        values = self.W_value(x)\n",
+    "\n",
+    "        # We implicitly split the matrix by adding a `num_heads` dimension\n",
+    "        # Unroll last dim: (b, num_tokens, d_out) -> (b, num_tokens, num_heads, head_dim)\n",
+    "        keys = keys.view(b, num_tokens, self.num_heads, self.head_dim) \n",
+    "        values = values.view(b, num_tokens, self.num_heads, self.head_dim)\n",
+    "        queries = queries.view(b, num_tokens, self.num_heads, self.head_dim)\n",
+    "\n",
+    "        # Transpose: (b, num_tokens, num_heads, head_dim) -> (b, num_heads, num_tokens, head_dim)\n",
+    "        keys = keys.transpose(1, 2)\n",
+    "        queries = queries.transpose(1, 2)\n",
+    "        values = values.transpose(1, 2)\n",
+    "\n",
+    "        # Compute scaled dot-product attention (aka self-attention) with a causal mask\n",
+    "        attn_scores = queries @ keys.transpose(2, 3)  # Dot product for each head\n",
+    "        # Original mask truncated to the number of tokens and converted to boolean\n",
+    "        mask_bool = self.mask.bool()[:num_tokens, :num_tokens]\n",
+    "        # Unsqueeze the mask twice to match dimensions\n",
+    "        mask_unsqueezed = mask_bool.unsqueeze(0).unsqueeze(0)\n",
+    "        # Use the unsqueezed mask to fill attention scores\n",
+    "        attn_scores.masked_fill_(mask_unsqueezed, -torch.inf)\n",
+    "        \n",
+    "        attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)\n",
+    "        attn_weights = self.dropout(attn_weights)\n",
+    "\n",
+    "        # Shape: (b, num_tokens, num_heads, head_dim)\n",
+    "        context_vec = (attn_weights @ values).transpose(1, 2) \n",
+    "        \n",
+    "        # Combine heads, where self.d_out = self.num_heads * self.head_dim\n",
+    "        context_vec = context_vec.contiguous().view(b, num_tokens, self.d_out)\n",
+    "        context_vec = self.out_proj(context_vec) # optional projection\n",
+    "\n",
+    "        return context_vec"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "id": "779fdd04-0152-4308-af08-840800a7f395",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "context_vecs.shape: torch.Size([8, 4, 256])\n"
+     ]
+    }
+   ],
+   "source": [
+    "torch.manual_seed(123)\n",
+    "\n",
+    "block_size = max_length\n",
+    "d_in = output_dim\n",
+    "d_out = d_in\n",
+    "\n",
+    "mha = MultiHeadAttention(d_in, d_out, block_size, 0.0, num_heads=2)\n",
+    "\n",
+    "batch = input_embeddings\n",
+    "context_vecs = mha(batch)\n",
+    "\n",
+    "print(\"context_vecs.shape:\", context_vecs.shape)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "f8d4be84-28bb-41d5-996c-4936acffd411",
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.11.5"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
@@ -0,0 +1,9 @@
+Once upon a time in a quiet village nestled among rolling hills and whispering forests, there lived a young girl named Elara. Elara was known for her boundless curiosity and her love for the stars. Every night, she would climb to the highest hill near her home to gaze at the glittering sky, dreaming of distant worlds and galaxies.
+
+In the heart of the village, there was an ancient library, tended by an old, wise librarian named Mr. Bramwell. This library was a treasure trove of books on every subject, but most importantly, it housed a collection of old star maps and celestial guides. Elara, fascinated by these books, spent countless hours with Mr. Bramwell, learning about constellations, planets, and the mysteries of the universe.
+
+One evening, while studying an old star map, Elara noticed a small, uncharted star that twinkled differently. She shared this discovery with Mr. Bramwell, who was equally intrigued. They decided to observe this star every night, noting its unique patterns and movements. This small, mysterious star, which they named "Elara's Star," became the center of their nightly adventures.
+
+As days turned into weeks, the villagers began to take notice of Elara's star. The uncharted star brought the community together, with people of all ages joining Elara and Mr. Bramwell on the hill each night to gaze at the sky. The nightly gatherings turned into a festival of stars, where stories were shared, friendships were formed, and the mysteries of the cosmos were contemplated.
+
+The story of Elara and her star spread far and wide, attracting astronomers and dreamers from distant lands. The once quiet village became a beacon of wonder, a place where the sky seemed a little closer and the stars a bit friendlier. Elara's curiosity had not only unveiled a hidden star but had also brought her community together, reminding everyone that sometimes, the most extraordinary discoveries are waiting just above us, in the starlit sky.
@@ -0,0 +1,3 @@
+# Chapter 3: Coding Attention Mechanisms
+
+- [01_main-chapter-code](01_main-chapter-code) contains the main chapter code.
@@ -0,0 +1,380 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "51c9672d-8d0c-470d-ac2d-1271f8ec3f14",
+   "metadata": {},
+   "source": [
+    "# Chapter 4 Exercise solutions"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "5fea8be3-30a1-4623-a6d7-b095c6c1092e",
+   "metadata": {},
+   "source": [
+    "# Exercise 4.1: Parameters in the feed forward versus attention module"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "id": "2751b0e5-ffd3-4be2-8db3-e20dd4d61d69",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from gpt import TransformerBlock\n",
+    "\n",
+    "GPT_CONFIG_124M = {\n",
+    "    \"vocab_size\": 50257,\n",
+    "    \"ctx_len\": 1024,\n",
+    "    \"emb_dim\": 768,\n",
+    "    \"n_heads\": 12,\n",
+    "    \"n_layers\": 12,\n",
+    "    \"drop_rate\": 0.1,\n",
+    "    \"qkv_bias\": False\n",
+    "}\n",
+    "\n",
+    "block = TransformerBlock(GPT_CONFIG_124M)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "id": "1bcaffd1-0cf6-4f8f-bd53-ab88a37f443e",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Total number of parameters in feed forward module: 4,722,432\n"
+     ]
+    }
+   ],
+   "source": [
+    "total_params = sum(p.numel() for p in block.ff.parameters())\n",
+    "print(f\"Total number of parameters in feed forward module: {total_params:,}\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "id": "c1dd06c1-ab6c-4df7-ba73-f9cd54b31138",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Total number of parameters in attention module: 2,360,064\n"
+     ]
+    }
+   ],
+   "source": [
+    "total_params = sum(p.numel() for p in block.att.parameters())\n",
+    "print(f\"Total number of parameters in attention module: {total_params:,}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "15463dec-520a-47b4-b3ad-e180394fd076",
+   "metadata": {},
+   "source": [
+    "- The results above are for a single transformer block\n",
+    "- Optionally multiply by 12 to capture all transformer blocks in the 124M GPT model"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "0f7b7c7f-0fa1-4d30-ab44-e499edd55b6d",
+   "metadata": {},
+   "source": [
+    "# Exercise 4.2: Initialize larger GPT models"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "310b2e05-3ec8-47fc-afd9-83bf03d4aad8",
+   "metadata": {},
+   "source": [
+    "- **GPT2-small** (the 124M configuration we already implemented):\n",
+    "    - \"emb_dim\" = 768\n",
+    "    - \"n_layers\" = 12\n",
+    "    - \"n_heads\" = 12\n",
+    "\n",
+    "- **GPT2-medium:**\n",
+    "    - \"emb_dim\" = 1024\n",
+    "    - \"n_layers\" = 24\n",
+    "    - \"n_heads\" = 16\n",
+    "\n",
+    "- **GPT2-large:**\n",
+    "    - \"emb_dim\" = 1280\n",
+    "    - \"n_layers\" = 36\n",
+    "    - \"n_heads\" = 20\n",
+    "\n",
+    "- **GPT2-XL:**\n",
+    "    - \"emb_dim\" = 1600\n",
+    "    - \"n_layers\" = 48\n",
+    "    - \"n_heads\" = 25"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "id": "90185dea-81ca-4cdc-aef7-4aaf95cba946",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "GPT_CONFIG_124M = {\n",
+    "    \"vocab_size\": 50257,\n",
+    "    \"ctx_len\": 1024,\n",
+    "    \"emb_dim\": 768,\n",
+    "    \"n_heads\": 12,\n",
+    "    \"n_layers\": 12,\n",
+    "    \"drop_rate\": 0.1,\n",
+    "    \"qkv_bias\": False\n",
+    "}\n",
+    "\n",
+    "\n",
+    "def get_config(base_config, model_name=\"gpt2-small\"):\n",
+    "    GPT_CONFIG = base_config.copy()\n",
+    "\n",
+    "    if model_name == \"gpt2-small\":\n",
+    "        GPT_CONFIG[\"emb_dim\"] = 768\n",
+    "        GPT_CONFIG[\"n_layers\"] = 12\n",
+    "        GPT_CONFIG[\"n_heads\"] = 12\n",
+    "\n",
+    "    elif model_name == \"gpt2-medium\":\n",
+    "        GPT_CONFIG[\"emb_dim\"] = 1024\n",
+    "        GPT_CONFIG[\"n_layers\"] = 24\n",
+    "        GPT_CONFIG[\"n_heads\"] = 16\n",
+    "\n",
+    "    elif model_name == \"gpt2-large\":\n",
+    "        GPT_CONFIG[\"emb_dim\"] = 1280\n",
+    "        GPT_CONFIG[\"n_layers\"] = 36\n",
+    "        GPT_CONFIG[\"n_heads\"] = 20\n",
+    "\n",
+    "    elif model_name == \"gpt2-xl\":\n",
+    "        GPT_CONFIG[\"emb_dim\"] = 1600\n",
+    "        GPT_CONFIG[\"n_layers\"] = 48\n",
+    "        GPT_CONFIG[\"n_heads\"] = 25\n",
+    "\n",
+    "    else:\n",
+    "        raise ValueError(f\"Incorrect model name {model_name}\")\n",
+    "\n",
+    "    return GPT_CONFIG\n",
+    "\n",
+    "\n",
+    "def calculate_size(model): # based on chapter code\n",
+    "    \n",
+    "    total_params = sum(p.numel() for p in model.parameters())\n",
+    "    print(f\"Total number of parameters: {total_params:,}\")\n",
+    "\n",
+    "    total_params_gpt2 = total_params - sum(p.numel() for p in model.out_head.parameters())\n",
+    "    print(f\"Number of trainable parameters considering weight tying: {total_params_gpt2:,}\")\n",
+    "    \n",
+    "    # Calculate the total size in bytes (assuming float32, 4 bytes per parameter)\n",
+    "    total_size_bytes = total_params * 4\n",
+    "    \n",
+    "    # Convert to megabytes\n",
+    "    total_size_mb = total_size_bytes / (1024 * 1024)\n",
+    "    \n",
+    "    print(f\"Total size of the model: {total_size_mb:.2f} MB\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "id": "2587e011-78a4-479c-a8fd-961cc40a5fd4",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "\n",
+      "\n",
+      "gpt2-small:\n",
+      "Total number of parameters: 163,009,536\n",
+      "Number of trainable parameters considering weight tying: 124,412,160\n",
+      "Total size of the model: 621.83 MB\n",
+      "\n",
+      "\n",
+      "gpt2-medium:\n",
+      "Total number of parameters: 406,212,608\n",
+      "Number of trainable parameters considering weight tying: 354,749,440\n",
+      "Total size of the model: 1549.58 MB\n",
+      "\n",
+      "\n",
+      "gpt2-large:\n",
+      "Total number of parameters: 838,220,800\n",
+      "Number of trainable parameters considering weight tying: 773,891,840\n",
+      "Total size of the model: 3197.56 MB\n",
+      "\n",
+      "\n",
+      "gpt2-xl:\n",
+      "Total number of parameters: 1,637,792,000\n",
+      "Number of trainable parameters considering weight tying: 1,557,380,800\n",
+      "Total size of the model: 6247.68 MB\n"
+     ]
+    }
+   ],
+   "source": [
+    "from gpt import GPTModel\n",
+    "\n",
+    "\n",
+    "for model_abbrev in (\"small\", \"medium\", \"large\", \"xl\"):\n",
+    "    model_name = f\"gpt2-{model_abbrev}\"\n",
+    "    CONFIG = get_config(GPT_CONFIG_124M, model_name=model_name)\n",
+    "    model = GPTModel(CONFIG)\n",
+    "    print(f\"\\n\\n{model_name}:\")\n",
+    "    calculate_size(model)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "f5f2306e-5dc8-498e-92ee-70ae7ec37ac1",
+   "metadata": {},
+   "source": [
+    "# Exercise 4.3: Using separate dropout parameters"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "id": "5fee2cf5-61c3-4167-81b5-44ea155bbaf2",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "GPT_CONFIG_124M = {\n",
+    "    \"vocab_size\": 50257,\n",
+    "    \"ctx_len\": 1024,\n",
+    "    \"emb_dim\": 768,\n",
+    "    \"n_heads\": 12,\n",
+    "    \"n_layers\": 12,\n",
+    "    \"drop_rate_emb\": 0.1,    # NEW: dropout for embedding layers\n",
+    "    \"drop_rate_ffn\": 0.1,    # NEW: dropout for feed forward module\n",
+    "    \"drop_rate_attn\": 0.1,   # NEW: dropout for multi-head attention  \n",
+    "    \"drop_rate_resid\": 0.1,   # NEW: dropout for residual connections  \n",
+    "    \"qkv_bias\": False\n",
+    "}"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "id": "5aa1b0c1-d78a-48fc-ad08-4802458b43f7",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import torch\n",
+    "import torch.nn as nn\n",
+    "from gpt import MultiHeadAttention, LayerNorm, GELU\n",
+    "\n",
+    "class FeedForward(nn.Module):\n",
+    "    def __init__(self, cfg):\n",
+    "        super().__init__()\n",
+    "        self.layers = nn.Sequential(\n",
+    "            nn.Linear(cfg[\"emb_dim\"], 4 * cfg[\"emb_dim\"]),\n",
+    "            GELU(),\n",
+    "            nn.Linear(4 * cfg[\"emb_dim\"], cfg[\"emb_dim\"]),\n",
+    "            nn.Dropout(cfg[\"drop_rate_ffn\"]) # NEW: dropout for feed forward module\n",
+    "        )\n",
+    "\n",
+    "    def forward(self, x):\n",
+    "        return self.layers(x)\n",
+    "\n",
+    "\n",
+    "class TransformerBlock(nn.Module):\n",
+    "    def __init__(self, cfg):\n",
+    "        super().__init__()\n",
+    "        self.att = MultiHeadAttention(\n",
+    "            d_in=cfg[\"emb_dim\"],\n",
+    "            d_out=cfg[\"emb_dim\"],\n",
+    "            block_size=cfg[\"ctx_len\"],\n",
+    "            num_heads=cfg[\"n_heads\"], \n",
+    "            dropout=cfg[\"drop_rate_attn\"], # NEW: dropout for multi-head attention\n",
+    "            qkv_bias=cfg[\"qkv_bias\"])\n",
+    "        self.ff = FeedForward(cfg)\n",
+    "        self.norm1 = LayerNorm(cfg[\"emb_dim\"])\n",
+    "        self.norm2 = LayerNorm(cfg[\"emb_dim\"])\n",
+    "        self.drop_resid = nn.Dropout(cfg[\"drop_rate_resid\"])\n",
+    "\n",
+    "    def forward(self, x):\n",
+    "        # Shortcut connection for attention block\n",
+    "        shortcut = x\n",
+    "        x = self.norm1(x)\n",
+    "        x = self.att(x)  # Shape [batch_size, num_tokens, emb_size]\n",
+    "        x = self.drop_resid(x)\n",
+    "        x = x + shortcut  # Add the original input back\n",
+    "\n",
+    "        # Shortcut connection for feed-forward block\n",
+    "        shortcut = x\n",
+    "        x = self.norm2(x)\n",
+    "        x = self.ff(x)\n",
+    "        x = self.drop_resid(x)\n",
+    "        x = x + shortcut  # Add the original input back\n",
+    "\n",
+    "        return x\n",
+    "\n",
+    "\n",
+    "class GPTModel(nn.Module):\n",
+    "    def __init__(self, cfg):\n",
+    "        super().__init__()\n",
+    "        self.tok_emb = nn.Embedding(cfg[\"vocab_size\"], cfg[\"emb_dim\"])\n",
+    "        self.pos_emb = nn.Embedding(cfg[\"ctx_len\"], cfg[\"emb_dim\"])\n",
+    "        self.drop_emb = nn.Dropout(cfg[\"drop_rate_emb\"]) # NEW: dropout for embedding layers\n",
+    "\n",
+    "        self.trf_blocks = nn.Sequential(\n",
+    "            *[TransformerBlock(cfg) for _ in range(cfg[\"n_layers\"])])\n",
+    "\n",
+    "        self.final_norm = LayerNorm(cfg[\"emb_dim\"])\n",
+    "        self.out_head = nn.Linear(cfg[\"emb_dim\"], cfg[\"vocab_size\"], bias=False)\n",
+    "\n",
+    "    def forward(self, in_idx):\n",
+    "        batch_size, seq_len = in_idx.shape\n",
+    "        tok_embeds = self.tok_emb(in_idx)\n",
+    "        pos_embeds = self.pos_emb(torch.arange(seq_len, device=in_idx.device))\n",
+    "        x = tok_embeds + pos_embeds  # Shape [batch_size, num_tokens, emb_size]\n",
+    "        x = self.drop_emb(x)  # Apply dropout to the summed embeddings\n",
+    "        x = self.trf_blocks(x)\n",
+    "        x = self.final_norm(x)\n",
+    "        logits = self.out_head(x)\n",
+    "        return logits"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 8,
+   "id": "1d013d32-c275-4f42-be21-9010f1537227",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import torch\n",
+    "\n",
+    "torch.manual_seed(123)\n",
+    "model = GPTModel(GPT_CONFIG_124M)"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.10.12"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
@@ -0,0 +1,273 @@
+# This file collects all the relevant code that we covered thus far
+# throughout Chapters 2-4
+# This file can be run as a standalone script
+
+import tiktoken
+import torch
+import torch.nn as nn
+from torch.utils.data import Dataset, DataLoader
+
+#####################################
+# Chapter 2
+#####################################
+
+
+class GPTDatasetV1(Dataset):
+    def __init__(self, txt, tokenizer, max_length, stride):
+        self.tokenizer = tokenizer
+        self.input_ids = []
+        self.target_ids = []
+
+        # Tokenize the entire text
+        token_ids = tokenizer.encode(txt)
+
+        # Use a sliding window to chunk the book into overlapping sequences of max_length
+        for i in range(0, len(token_ids) - max_length, stride):
+            input_chunk = token_ids[i:i + max_length]
+            target_chunk = token_ids[i + 1: i + max_length + 1]
+            self.input_ids.append(torch.tensor(input_chunk))
+            self.target_ids.append(torch.tensor(target_chunk))
+
+    def __len__(self):
+        return len(self.input_ids)
+
+    def __getitem__(self, idx):
+        return self.input_ids[idx], self.target_ids[idx]
+
+
+def create_dataloader(txt, batch_size=4, max_length=256, stride=128, shuffle=True):
+    # Initialize the tokenizer
+    tokenizer = tiktoken.get_encoding("gpt2")
+
+    # Create dataset
+    dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)
+
+    # Create dataloader
+    dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=shuffle)
+
+    return dataloader
+
+
+#####################################
+# Chapter 3
+#####################################
+class MultiHeadAttention(nn.Module):
+    def __init__(self, d_in, d_out, block_size, dropout, num_heads, qkv_bias=False):
+        super().__init__()
+        assert d_out % num_heads == 0, "d_out must be divisible by num_heads"
+
+        self.d_out = d_out
+        self.num_heads = num_heads
+        self.head_dim = d_out // num_heads  # Reduce the projection dim to match desired output dim
+
+        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
+        self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
+        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
+        self.out_proj = nn.Linear(d_out, d_out)  # Linear layer to combine head outputs
+        self.dropout = nn.Dropout(dropout)
+        self.register_buffer('mask', torch.triu(torch.ones(block_size, block_size), diagonal=1))
+
+    def forward(self, x):
+        b, num_tokens, d_in = x.shape
+
+        keys = self.W_key(x)  # Shape: (b, num_tokens, d_out)
+        queries = self.W_query(x)
+        values = self.W_value(x)
+
+        # We implicitly split the matrix by adding a `num_heads` dimension
+        # Unroll last dim: (b, num_tokens, d_out) -> (b, num_tokens, num_heads, head_dim)
+        keys = keys.view(b, num_tokens, self.num_heads, self.head_dim) 
+        values = values.view(b, num_tokens, self.num_heads, self.head_dim)
+        queries = queries.view(b, num_tokens, self.num_heads, self.head_dim)
+
+        # Transpose: (b, num_tokens, num_heads, head_dim) -> (b, num_heads, num_tokens, head_dim)
+        keys = keys.transpose(1, 2)
+        queries = queries.transpose(1, 2)
+        values = values.transpose(1, 2)
+
+        # Compute scaled dot-product attention (aka self-attention) with a causal mask
+        attn_scores = queries @ keys.transpose(2, 3)  # Dot product for each head
+        # Original mask truncated to the number of tokens and converted to boolean
+        mask_bool = self.mask.bool()[:num_tokens, :num_tokens]
+        # Unsqueeze the mask twice to match dimensions
+        mask_unsqueezed = mask_bool.unsqueeze(0).unsqueeze(0)
+        # Use the unsqueezed mask to fill attention scores
+        attn_scores.masked_fill_(mask_unsqueezed, -torch.inf)
+
+        attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)
+        attn_weights = self.dropout(attn_weights)
+
+        # Shape: (b, num_tokens, num_heads, head_dim)
+        context_vec = (attn_weights @ values).transpose(1, 2) 
+
+        # Combine heads, where self.d_out = self.num_heads * self.head_dim
+        context_vec = context_vec.contiguous().view(b, num_tokens, self.d_out)
+        context_vec = self.out_proj(context_vec)  # optional projection
+
+        return context_vec
+
+
+#####################################
+# Chapter 4
+#####################################
+class LayerNorm(nn.Module):
+    def __init__(self, emb_dim):
+        super().__init__()
+        self.eps = 1e-5
+        self.scale = nn.Parameter(torch.ones(emb_dim))
+        self.shift = nn.Parameter(torch.zeros(emb_dim))
+
+    def forward(self, x):
+        mean = x.mean(dim=-1, keepdim=True)
+        var = x.var(dim=-1, keepdim=True, unbiased=False)
+        norm_x = (x - mean) / torch.sqrt(var + self.eps)
+        return self.scale * norm_x + self.shift
+
+
+class GELU(nn.Module):
+    def __init__(self):
+        super().__init__()
+
+    def forward(self, x):
+        return 0.5 * x * (1 + torch.tanh(
+            torch.sqrt(torch.tensor(2.0 / torch.pi)) * 
+            (x + 0.044715 * torch.pow(x, 3))
+        ))
+
+
+class FeedForward(nn.Module):
+    def __init__(self, cfg):
+        super().__init__()
+        self.layers = nn.Sequential(
+            nn.Linear(cfg["emb_dim"], 4 * cfg["emb_dim"]),
+            GELU(),
+            nn.Linear(4 * cfg["emb_dim"], cfg["emb_dim"]),
+            nn.Dropout(cfg["drop_rate"])
+        )
+
+    def forward(self, x):
+        return self.layers(x)
+
+
+class TransformerBlock(nn.Module):
+    def __init__(self, cfg):
+        super().__init__()
+        self.att = MultiHeadAttention(
+            d_in=cfg["emb_dim"],
+            d_out=cfg["emb_dim"],
+            block_size=cfg["ctx_len"],
+            num_heads=cfg["n_heads"], 
+            dropout=cfg["drop_rate"],
+            qkv_bias=cfg["qkv_bias"])
+        self.ff = FeedForward(cfg)
+        self.norm1 = LayerNorm(cfg["emb_dim"])
+        self.norm2 = LayerNorm(cfg["emb_dim"])
+        self.drop_resid = nn.Dropout(cfg["drop_rate"])
+
+    def forward(self, x):
+        # Shortcut connection for attention block
+        shortcut = x
+        x = self.norm1(x)
+        x = self.att(x)   # Shape [batch_size, num_tokens, emb_size]
+        x = self.drop_resid(x)
+        x = x + shortcut  # Add the original input back
+
+        # Shortcut connection for feed-forward block
+        shortcut = x
+        x = self.norm2(x)
+        x = self.ff(x)
+        x = self.drop_resid(x)
+        x = x + shortcut  # Add the original input back
+
+        return x
+
+
+class GPTModel(nn.Module):
+    def __init__(self, cfg):
+        super().__init__()
+        self.tok_emb = nn.Embedding(cfg["vocab_size"], cfg["emb_dim"])
+        self.pos_emb = nn.Embedding(cfg["ctx_len"], cfg["emb_dim"])
+        self.drop_emb = nn.Dropout(cfg["drop_rate"])
+
+        self.trf_blocks = nn.Sequential(
+            *[TransformerBlock(cfg) for _ in range(cfg["n_layers"])])
+
+        self.final_norm = LayerNorm(cfg["emb_dim"])
+        self.out_head = nn.Linear(cfg["emb_dim"], cfg["vocab_size"], bias=False)
+
+    def forward(self, in_idx):
+        batch_size, seq_len = in_idx.shape
+        tok_embeds = self.tok_emb(in_idx)
+        pos_embeds = self.pos_emb(torch.arange(seq_len, device=in_idx.device))
+        x = tok_embeds + pos_embeds  # Shape [batch_size, num_tokens, emb_size]
+        x = self.drop_emb(x)  # Apply dropout to the summed embeddings
+        x = self.trf_blocks(x)
+        x = self.final_norm(x)
+        logits = self.out_head(x)
+        return logits
+
+
+def generate_text_simple(model, idx, max_new_tokens, context_size):
+    # idx is (B, T) array of indices in the current context
+    for _ in range(max_new_tokens):
+
+        # Crop current context if it exceeds the supported context size
+        # E.g., if the LLM supports only 5 tokens and the current context holds 10 tokens,
+        # then only the last 5 tokens are used as context
+        idx_cond = idx[:, -context_size:]
+
+        # Get the predictions
+        with torch.no_grad():
+            logits = model(idx_cond)
+
+        # Focus only on the last time step
+        # (batch, n_token, vocab_size) becomes (batch, vocab_size)
+        logits = logits[:, -1, :]  
+
+        # Get the idx of the vocab entry with the highest logit value (greedy decoding)
+        idx_next = torch.argmax(logits, dim=-1, keepdim=True)  # (batch, 1)
+
+        # Append the selected index to the running sequence
+        idx = torch.cat((idx, idx_next), dim=1)  # (batch, n_tokens+1)
+
+    return idx
+
+
+if __name__ == "__main__":
+
+    GPT_CONFIG_124M = {
+        "vocab_size": 50257,  # Vocabulary size
+        "ctx_len": 1024,      # Context length
+        "emb_dim": 768,       # Embedding dimension
+        "n_heads": 12,        # Number of attention heads
+        "n_layers": 12,       # Number of layers
+        "drop_rate": 0.1,     # Dropout rate
+        "qkv_bias": False     # Query-Key-Value bias
+    }
+
+    torch.manual_seed(123)
+    model = GPTModel(GPT_CONFIG_124M)
+    model.eval()  # disable dropout
+
+    start_context = "Hello, I am"
+
+    tokenizer = tiktoken.get_encoding("gpt2")
+    encoded = tokenizer.encode(start_context)
+    encoded_tensor = torch.tensor(encoded).unsqueeze(0)
+
+    print(f"\n{50*'='}\n{22*' '}IN\n{50*'='}")
+    print("\nInput text:", start_context)
+    print("Encoded input text:", encoded)
+    print("encoded_tensor.shape:", encoded_tensor.shape)
+
+    out = generate_text_simple(
+        model=model,
+        idx=encoded_tensor,
+        max_new_tokens=10,
+        context_size=GPT_CONFIG_124M["ctx_len"]
+    )
+    decoded_text = tokenizer.decode(out.squeeze(0).tolist())
+
+    print(f"\n\n{50*'='}\n{22*' '}OUT\n{50*'='}")
+    print("\nOutput:", out)
+    print("Output length:", len(out[0]))
+    print("Output text:", decoded_text)
@@ -0,0 +1,6 @@
+# Chapter 4: Implementing a GPT Model from Scratch To Generate Text
+
+- [ch04.ipynb](ch04.ipynb) contains all the code as it appears in the chapter
+- [previous_chapters.py](previous_chapters.py) is a Python module that contains the `MultiHeadAttention` module from the previous chapter, which we import in [ch04.ipynb](ch04.ipynb) to create the GPT model
+- [gpt.py](gpt.py) is a standalone Python script file with the code that we implemented thus far, including the GPT model we coded in this chapter
+
@@ -0,0 +1,380 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "51c9672d-8d0c-470d-ac2d-1271f8ec3f14",
+   "metadata": {},
+   "source": [
+    "# Chapter 4 Exercise solutions"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "5fea8be3-30a1-4623-a6d7-b095c6c1092e",
+   "metadata": {},
+   "source": [
+    "# Exercise 4.1: Parameters in the feed forward versus attention module"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "id": "2751b0e5-ffd3-4be2-8db3-e20dd4d61d69",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from gpt import TransformerBlock\n",
+    "\n",
+    "GPT_CONFIG_124M = {\n",
+    "    \"vocab_size\": 50257,\n",
+    "    \"ctx_len\": 1024,\n",
+    "    \"emb_dim\": 768,\n",
+    "    \"n_heads\": 12,\n",
+    "    \"n_layers\": 12,\n",
+    "    \"drop_rate\": 0.1,\n",
+    "    \"qkv_bias\": False\n",
+    "}\n",
+    "\n",
+    "block = TransformerBlock(GPT_CONFIG_124M)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "id": "1bcaffd1-0cf6-4f8f-bd53-ab88a37f443e",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Total number of parameters in feed forward module: 4,722,432\n"
+     ]
+    }
+   ],
+   "source": [
+    "total_params = sum(p.numel() for p in block.ff.parameters())\n",
+    "print(f\"Total number of parameters in feed forward module: {total_params:,}\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "id": "c1dd06c1-ab6c-4df7-ba73-f9cd54b31138",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Total number of parameters in attention module: 2,360,064\n"
+     ]
+    }
+   ],
+   "source": [
+    "total_params = sum(p.numel() for p in block.att.parameters())\n",
+    "print(f\"Total number of parameters in attention module: {total_params:,}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "15463dec-520a-47b4-b3ad-e180394fd076",
+   "metadata": {},
+   "source": [
+    "- The results above are for a single transformer block\n",
+    "- Optionally multiply by 12 to capture all transformer blocks in the 124M GPT model"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "0f7b7c7f-0fa1-4d30-ab44-e499edd55b6d",
+   "metadata": {},
+   "source": [
+    "# Exercise 4.2: Initialize larger GPT models"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "310b2e05-3ec8-47fc-afd9-83bf03d4aad8",
+   "metadata": {},
+   "source": [
+    "- **GPT2-small** (the 124M configuration we already implemented):\n",
+    "    - \"emb_dim\" = 768\n",
+    "    - \"n_layers\" = 12\n",
+    "    - \"n_heads\" = 12\n",
+    "\n",
+    "- **GPT2-medium:**\n",
+    "    - \"emb_dim\" = 1024\n",
+    "    - \"n_layers\" = 24\n",
+    "    - \"n_heads\" = 16\n",
+    "\n",
+    "- **GPT2-large:**\n",
+    "    - \"emb_dim\" = 1280\n",
+    "    - \"n_layers\" = 36\n",
+    "    - \"n_heads\" = 20\n",
+    "\n",
+    "- **GPT2-XL:**\n",
+    "    - \"emb_dim\" = 1600\n",
+    "    - \"n_layers\" = 48\n",
+    "    - \"n_heads\" = 25"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "id": "90185dea-81ca-4cdc-aef7-4aaf95cba946",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "GPT_CONFIG_124M = {\n",
+    "    \"vocab_size\": 50257,\n",
+    "    \"ctx_len\": 1024,\n",
+    "    \"emb_dim\": 768,\n",
+    "    \"n_heads\": 12,\n",
+    "    \"n_layers\": 12,\n",
+    "    \"drop_rate\": 0.1,\n",
+    "    \"qkv_bias\": False\n",
+    "}\n",
+    "\n",
+    "\n",
+    "def get_config(base_config, model_name=\"gpt2-small\"):\n",
+    "    GPT_CONFIG = base_config.copy()\n",
+    "\n",
+    "    if model_name == \"gpt2-small\":\n",
+    "        GPT_CONFIG[\"emb_dim\"] = 768\n",
+    "        GPT_CONFIG[\"n_layers\"] = 12\n",
+    "        GPT_CONFIG[\"n_heads\"] = 12\n",
+    "\n",
+    "    elif model_name == \"gpt2-medium\":\n",
+    "        GPT_CONFIG[\"emb_dim\"] = 1024\n",
+    "        GPT_CONFIG[\"n_layers\"] = 24\n",
+    "        GPT_CONFIG[\"n_heads\"] = 16\n",
+    "\n",
+    "    elif model_name == \"gpt2-large\":\n",
+    "        GPT_CONFIG[\"emb_dim\"] = 1280\n",
+    "        GPT_CONFIG[\"n_layers\"] = 36\n",
+    "        GPT_CONFIG[\"n_heads\"] = 20\n",
+    "\n",
+    "    elif model_name == \"gpt2-xl\":\n",
+    "        GPT_CONFIG[\"emb_dim\"] = 1600\n",
+    "        GPT_CONFIG[\"n_layers\"] = 48\n",
+    "        GPT_CONFIG[\"n_heads\"] = 25\n",
+    "\n",
+    "    else:\n",
+    "        raise ValueError(f\"Incorrect model name {model_name}\")\n",
+    "\n",
+    "    return GPT_CONFIG\n",
+    "\n",
+    "\n",
+    "def calculate_size(model): # based on chapter code\n",
+    "    \n",
+    "    total_params = sum(p.numel() for p in model.parameters())\n",
+    "    print(f\"Total number of parameters: {total_params:,}\")\n",
+    "\n",
+    "    total_params_gpt2 = total_params - sum(p.numel() for p in model.out_head.parameters())\n",
+    "    print(f\"Number of trainable parameters considering weight tying: {total_params_gpt2:,}\")\n",
+    "    \n",
+    "    # Calculate the total size in bytes (assuming float32, 4 bytes per parameter)\n",
+    "    total_size_bytes = total_params * 4\n",
+    "    \n",
+    "    # Convert to megabytes\n",
+    "    total_size_mb = total_size_bytes / (1024 * 1024)\n",
+    "    \n",
+    "    print(f\"Total size of the model: {total_size_mb:.2f} MB\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "id": "2587e011-78a4-479c-a8fd-961cc40a5fd4",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "\n",
+      "\n",
+      "gpt2-small:\n",
+      "Total number of parameters: 163,009,536\n",
+      "Number of trainable parameters considering weight tying: 124,412,160\n",
+      "Total size of the model: 621.83 MB\n",
+      "\n",
+      "\n",
+      "gpt2-medium:\n",
+      "Total number of parameters: 406,212,608\n",
+      "Number of trainable parameters considering weight tying: 354,749,440\n",
+      "Total size of the model: 1549.58 MB\n",
+      "\n",
+      "\n",
+      "gpt2-large:\n",
+      "Total number of parameters: 838,220,800\n",
+      "Number of trainable parameters considering weight tying: 773,891,840\n",
+      "Total size of the model: 3197.56 MB\n",
+      "\n",
+      "\n",
+      "gpt2-xl:\n",
+      "Total number of parameters: 1,637,792,000\n",
+      "Number of trainable parameters considering weight tying: 1,557,380,800\n",
+      "Total size of the model: 6247.68 MB\n"
+     ]
+    }
+   ],
+   "source": [
+    "from gpt import GPTModel\n",
+    "\n",
+    "\n",
+    "for model_abbrev in (\"small\", \"medium\", \"large\", \"xl\"):\n",
+    "    model_name = f\"gpt2-{model_abbrev}\"\n",
+    "    CONFIG = get_config(GPT_CONFIG_124M, model_name=model_name)\n",
+    "    model = GPTModel(CONFIG)\n",
+    "    print(f\"\\n\\n{model_name}:\")\n",
+    "    calculate_size(model)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "f5f2306e-5dc8-498e-92ee-70ae7ec37ac1",
+   "metadata": {},
+   "source": [
+    "# Exercise 4.3: Using separate dropout parameters"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "id": "5fee2cf5-61c3-4167-81b5-44ea155bbaf2",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "GPT_CONFIG_124M = {\n",
+    "    \"vocab_size\": 50257,\n",
+    "    \"ctx_len\": 1024,\n",
+    "    \"emb_dim\": 768,\n",
+    "    \"n_heads\": 12,\n",
+    "    \"n_layers\": 12,\n",
+    "    \"drop_rate_emb\": 0.1,    # NEW: dropout for embedding layers\n",
+    "    \"drop_rate_ffn\": 0.1,    # NEW: dropout for feed forward module\n",
+    "    \"drop_rate_attn\": 0.1,   # NEW: dropout for multi-head attention  \n",
+    "    \"drop_rate_resid\": 0.1,   # NEW: dropout for residual connections  \n",
+    "    \"qkv_bias\": False\n",
+    "}"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "id": "5aa1b0c1-d78a-48fc-ad08-4802458b43f7",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import torch\n",
+    "import torch.nn as nn\n",
+    "from gpt import MultiHeadAttention, LayerNorm, GELU\n",
+    "\n",
+    "class FeedForward(nn.Module):\n",
+    "    def __init__(self, cfg):\n",
+    "        super().__init__()\n",
+    "        self.layers = nn.Sequential(\n",
+    "            nn.Linear(cfg[\"emb_dim\"], 4 * cfg[\"emb_dim\"]),\n",
+    "            GELU(),\n",
+    "            nn.Linear(4 * cfg[\"emb_dim\"], cfg[\"emb_dim\"]),\n",
+    "            nn.Dropout(cfg[\"drop_rate_ffn\"]) # NEW: dropout for feed forward module\n",
+    "        )\n",
+    "\n",
+    "    def forward(self, x):\n",
+    "        return self.layers(x)\n",
+    "\n",
+    "\n",
+    "class TransformerBlock(nn.Module):\n",
+    "    def __init__(self, cfg):\n",
+    "        super().__init__()\n",
+    "        self.att = MultiHeadAttention(\n",
+    "            d_in=cfg[\"emb_dim\"],\n",
+    "            d_out=cfg[\"emb_dim\"],\n",
+    "            block_size=cfg[\"ctx_len\"],\n",
+    "            num_heads=cfg[\"n_heads\"],\n",
+    "            dropout=cfg[\"drop_rate_attn\"],  # NEW: dropout for multi-head attention\n",
+    "            qkv_bias=cfg[\"qkv_bias\"])\n",
+    "        self.ff = FeedForward(cfg)\n",
+    "        self.norm1 = LayerNorm(cfg[\"emb_dim\"])\n",
+    "        self.norm2 = LayerNorm(cfg[\"emb_dim\"])\n",
+    "        self.drop_resid = nn.Dropout(cfg[\"drop_rate_resid\"])  # NEW: dropout for residual connections\n",
+    "\n",
+    "    def forward(self, x):\n",
+    "        # Shortcut connection for attention block\n",
+    "        shortcut = x\n",
+    "        x = self.norm1(x)\n",
+    "        x = self.att(x)  # Shape [batch_size, num_tokens, emb_size]\n",
+    "        x = self.drop_resid(x)\n",
+    "        x = x + shortcut  # Add the original input back\n",
+    "\n",
+    "        # Shortcut connection for feed-forward block\n",
+    "        shortcut = x\n",
+    "        x = self.norm2(x)\n",
+    "        x = self.ff(x)\n",
+    "        x = self.drop_resid(x)\n",
+    "        x = x + shortcut  # Add the original input back\n",
+    "\n",
+    "        return x\n",
+    "\n",
+    "\n",
+    "class GPTModel(nn.Module):\n",
+    "    def __init__(self, cfg):\n",
+    "        super().__init__()\n",
+    "        self.tok_emb = nn.Embedding(cfg[\"vocab_size\"], cfg[\"emb_dim\"])\n",
+    "        self.pos_emb = nn.Embedding(cfg[\"ctx_len\"], cfg[\"emb_dim\"])\n",
+    "        self.drop_emb = nn.Dropout(cfg[\"drop_rate_emb\"]) # NEW: dropout for embedding layers\n",
+    "\n",
+    "        self.trf_blocks = nn.Sequential(\n",
+    "            *[TransformerBlock(cfg) for _ in range(cfg[\"n_layers\"])])\n",
+    "\n",
+    "        self.final_norm = LayerNorm(cfg[\"emb_dim\"])\n",
+    "        self.out_head = nn.Linear(cfg[\"emb_dim\"], cfg[\"vocab_size\"], bias=False)\n",
+    "\n",
+    "    def forward(self, in_idx):\n",
+    "        batch_size, seq_len = in_idx.shape\n",
+    "        tok_embeds = self.tok_emb(in_idx)\n",
+    "        pos_embeds = self.pos_emb(torch.arange(seq_len, device=in_idx.device))\n",
+    "        x = tok_embeds + pos_embeds  # Shape [batch_size, num_tokens, emb_size]\n",
+    "        x = self.drop_emb(x)  # Apply the embedding dropout defined in __init__\n",
+    "        x = self.trf_blocks(x)\n",
+    "        x = self.final_norm(x)\n",
+    "        logits = self.out_head(x)\n",
+    "        return logits"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 8,
+   "id": "1d013d32-c275-4f42-be21-9010f1537227",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import torch\n",
+    "\n",
+    "torch.manual_seed(123)\n",
+    "model = GPTModel(GPT_CONFIG_124M)"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.11.5"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
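The parameter counts and model sizes printed for gpt2-medium, gpt2-large and gpt2-xl earlier in this notebook can be cross-checked with plain arithmetic. The sketch below is not the notebook's `calculate_size` helper (which is not shown here). It assumes float32 weights (4 bytes per parameter), the standard GPT-2 embedding widths of 1024, 1280 and 1600, and that weight tying removes exactly one `vocab_size * emb_dim` output-head matrix. The names `tied_param_count` and `size_in_mb` are hypothetical and used only for illustration.

```python
# Assumed GPT-2 embedding widths (not shown in this part of the notebook).
VOCAB_SIZE = 50257  # matches "vocab_size" in GPT_CONFIG_124M
EMB_DIMS = {"gpt2-medium": 1024, "gpt2-large": 1280, "gpt2-xl": 1600}

# Total parameter counts copied from the printed notebook output.
TOTAL_PARAMS = {
    "gpt2-medium": 406_212_608,
    "gpt2-large": 838_220_800,
    "gpt2-xl": 1_637_792_000,
}

BYTES_PER_PARAM = 4  # assumption: float32 weights


def tied_param_count(total_params, emb_dim, vocab_size=VOCAB_SIZE):
    """Subtract the output head, assumed to share weights with the token embedding."""
    return total_params - vocab_size * emb_dim


def size_in_mb(total_params, bytes_per_param=BYTES_PER_PARAM):
    """Convert a parameter count to megabytes (1 MB = 1024 * 1024 bytes)."""
    return total_params * bytes_per_param / (1024 * 1024)


for name, total in TOTAL_PARAMS.items():
    tied = tied_param_count(total, EMB_DIMS[name])
    print(f"{name}: {tied:,} tied parameters, {size_in_mb(total):.2f} MB")
```

Under these assumptions the loop reproduces the figures in the notebook output, for example 354,749,440 tied parameters and 1549.58 MB for gpt2-medium.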