mirror of https://github.com/datawhalechina/llms-from-scratch-cn.git
synced 2026-01-13 16:57:18 +08:00

add the code of ch07

This commit is contained in:
parent 3076c492ba
commit 8e5301e4ed

76 Codes/ch07/01_main-chapter-code/README.md (Normal file)
@@ -0,0 +1,76 @@
# Chapter 7: Finetuning to Follow Instructions

### Main Chapter Code

- [ch07.ipynb](ch07.ipynb) contains all the code as it appears in the chapter
- [previous_chapters.py](previous_chapters.py) is a Python module that contains the GPT model we coded and trained in previous chapters, alongside many utility functions, which we reuse in this chapter (see the import sketch after this list)
- [gpt_download.py](gpt_download.py) contains the utility functions for downloading the pretrained GPT model weights
- [exercise-solutions.ipynb](exercise-solutions.ipynb) contains the exercise solutions for this chapter
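For orientation, the snippet below is a minimal sketch of how these pieces fit together. The imported names and the `download_and_load_gpt2` call match the ones used by the scripts in this folder; the prompt string and generation length are illustrative assumptions only:

```python
# Minimal sketch (not from the chapter): load the pretrained GPT-2 weights with
# gpt_download.py and run a single generation with the previous_chapters.py utilities.
import tiktoken

from gpt_download import download_and_load_gpt2
from previous_chapters import (
    GPTModel,
    load_weights_into_gpt,
    generate,
    text_to_token_ids,
    token_ids_to_text,
)

BASE_CONFIG = {
    "vocab_size": 50257, "context_length": 1024, "drop_rate": 0.0, "qkv_bias": True,
    "emb_dim": 1024, "n_layers": 24, "n_heads": 16,  # gpt2-medium (355M)
}

settings, params = download_and_load_gpt2(model_size="355M", models_dir="gpt2")
model = GPTModel(BASE_CONFIG)
load_weights_into_gpt(model, params)
model.eval()

tokenizer = tiktoken.get_encoding("gpt2")
token_ids = generate(
    model=model,
    idx=text_to_token_ids("Every effort moves you", tokenizer),  # illustrative prompt
    max_new_tokens=25,
    context_size=BASE_CONFIG["context_length"],
    eos_id=50256,
)
print(token_ids_to_text(token_ids, tokenizer))
```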
### Optional Code

- [load-finetuned-model.ipynb](load-finetuned-model.ipynb) is a standalone Jupyter notebook to load the instruction finetuned model we created in this chapter; a minimal loading sketch is shown below
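For reference, a minimal loading sketch. It assumes the checkpoint written by `gpt_instruction_finetuning.py` below (`gpt2-medium355M-sft-standalone.pth`) and the gpt2-medium (355M) configuration used in this chapter; the notebook itself may differ in details:

```python
# Minimal sketch (assumption: the checkpoint was produced by gpt_instruction_finetuning.py,
# which saves model.state_dict() after finetuning).
import torch
from previous_chapters import GPTModel

BASE_CONFIG = {
    "vocab_size": 50257, "context_length": 1024, "drop_rate": 0.0, "qkv_bias": True,
    "emb_dim": 1024, "n_layers": 24, "n_heads": 16,  # gpt2-medium (355M)
}

model = GPTModel(BASE_CONFIG)
model.load_state_dict(torch.load("gpt2-medium355M-sft-standalone.pth", map_location="cpu"))
model.eval()
```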
- [gpt_instruction_finetuning.py](gpt_instruction_finetuning.py) is a standalone Python script to instruction finetune the model as described in the main chapter (think of it as a chapter summary focused on the finetuning parts)

Usage:

```bash
python gpt_instruction_finetuning.py
```

```
matplotlib version: 3.9.0
tiktoken version: 0.7.0
torch version: 2.3.1
tqdm version: 4.66.4
tensorflow version: 2.16.1
--------------------------------------------------
Training set length: 935
Validation set length: 55
Test set length: 110
--------------------------------------------------
Device: cpu
--------------------------------------------------
File already exists and is up-to-date: gpt2/355M/checkpoint
File already exists and is up-to-date: gpt2/355M/encoder.json
File already exists and is up-to-date: gpt2/355M/hparams.json
File already exists and is up-to-date: gpt2/355M/model.ckpt.data-00000-of-00001
File already exists and is up-to-date: gpt2/355M/model.ckpt.index
File already exists and is up-to-date: gpt2/355M/model.ckpt.meta
File already exists and is up-to-date: gpt2/355M/vocab.bpe
Loaded model: gpt2-medium (355M)
--------------------------------------------------
Initial losses
   Training loss: 3.839039182662964
   Validation loss: 3.7619192123413088
Ep 1 (Step 000000): Train loss 2.611, Val loss 2.668
Ep 1 (Step 000005): Train loss 1.161, Val loss 1.131
Ep 1 (Step 000010): Train loss 0.939, Val loss 0.973
...
Training completed in 15.66 minutes.
Plot saved as loss-plot-standalone.pdf
--------------------------------------------------
Generating responses
100%|█████████████████████████████████████████████████████████| 110/110 [06:57<00:00, 3.80s/it]
Responses saved as instruction-data-with-response-standalone.json
Model saved as gpt2-medium355M-sft-standalone.pth
```

- [ollama_evaluate.py](ollama_evaluate.py) is a standalone Python script to evaluate the responses of the finetuned model as described in the main chapter (think of it as a chapter summary focused on the evaluation parts)

Usage:

```bash
python ollama_evaluate.py --file_path instruction-data-with-response-standalone.json
```

```
Ollama running: True
Scoring entries: 100%|███████████████████████████████████████| 110/110 [01:08<00:00, 1.62it/s]
Number of scores: 110 of 110
Average score: 51.75
```

- [exercise_experiments.py](exercise_experiments.py) is an optional script that implements the exercise solutions; for more details, see [exercise-solutions.ipynb](exercise-solutions.ipynb). A short usage sketch is shown below.
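A minimal usage sketch for the experiment variants; the keyword arguments mirror the `--exercise_solution` options handled at the bottom of `exercise_experiments.py`, and running them all in one session is for illustration only:

```python
# Minimal sketch: the experiment variants exposed by exercise_experiments.py.
# Each call mirrors one value of the --exercise_solution CLI flag, e.g.
#   python exercise_experiments.py --exercise_solution mask_instructions
from exercise_experiments import main

main()                        # baseline run from the main chapter
main(mask_instructions=True)  # mask instruction tokens in the loss
main(alpaca52k=True)          # finetune on the 52k Alpaca dataset (long runtime)
main(phi3_prompt=True)        # use the Phi-3 prompt template
main(lora=True)               # finetune LoRA layers instead of the full weights
```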
2795 Codes/ch07/01_main-chapter-code/ch07.ipynb (Normal file)
File diff suppressed because one or more lines are too long

1002 Codes/ch07/01_main-chapter-code/exercise-solutions.ipynb (Normal file)
File diff suppressed because it is too large

562 Codes/ch07/01_main-chapter-code/exercise_experiments.py (Normal file)
@@ -0,0 +1,562 @@
|
||||
# Copyright (c) Sebastian Raschka under Apache License 2.0 (see LICENSE.txt).
|
||||
# Source for "Build a Large Language Model From Scratch"
|
||||
# - https://www.manning.com/books/build-a-large-language-model-from-scratch
|
||||
# Code: https://github.com/rasbt/LLMs-from-scratch
|
||||
#
|
||||
# Code to run the exercises; see exercise-solutions.ipynb for more information
|
||||
|
||||
from functools import partial
|
||||
from importlib.metadata import version
|
||||
import json
|
||||
import math
|
||||
import os
|
||||
import re
|
||||
import time
|
||||
import urllib
|
||||
|
||||
import matplotlib.pyplot as plt
|
||||
from matplotlib.ticker import MaxNLocator
|
||||
import tiktoken
|
||||
import torch
|
||||
from torch.utils.data import Dataset, DataLoader
|
||||
from tqdm import tqdm
|
||||
|
||||
# Import from local files in this folder
|
||||
from gpt_download import download_and_load_gpt2
|
||||
from previous_chapters import (
|
||||
calc_loss_loader,
|
||||
generate,
|
||||
GPTModel,
|
||||
load_weights_into_gpt,
|
||||
text_to_token_ids,
|
||||
train_model_simple,
|
||||
token_ids_to_text
|
||||
)
|
||||
|
||||
|
||||
class InstructionDataset(Dataset):
|
||||
def __init__(self, data, tokenizer):
|
||||
self.data = data
|
||||
|
||||
# Pre-tokenize texts
|
||||
self.encoded_texts = []
|
||||
for entry in data:
|
||||
instruction_plus_input = format_input(entry)
|
||||
response_text = f"\n\n### Response:\n{entry['output']}"
|
||||
full_text = instruction_plus_input + response_text
|
||||
self.encoded_texts.append(
|
||||
tokenizer.encode(full_text)
|
||||
)
|
||||
|
||||
def __getitem__(self, index):
|
||||
return self.encoded_texts[index]
|
||||
|
||||
def __len__(self):
|
||||
return len(self.data)
|
||||
|
||||
|
||||
class InstructionDatasetWithMasking(Dataset):
|
||||
def __init__(self, data, tokenizer):
|
||||
self.data = data
|
||||
|
||||
# New: Separate list for instruction lengths
|
||||
self.instruction_lengths = []
|
||||
self.encoded_texts = []
|
||||
|
||||
for entry in data:
|
||||
instruction_plus_input = format_input(entry)
|
||||
response_text = f"\n\n### Response:\n{entry['output']}"
|
||||
full_text = instruction_plus_input + response_text
|
||||
|
||||
self.encoded_texts.append(
|
||||
tokenizer.encode(full_text)
|
||||
)
|
||||
|
||||
# New: collect instruction lengths
|
||||
instruction_length = len(tokenizer.encode(instruction_plus_input))
|
||||
self.instruction_lengths.append(instruction_length)
|
||||
|
||||
def __getitem__(self, index):
|
||||
# New: return both instruction lengths and texts separately
|
||||
return self.instruction_lengths[index], self.encoded_texts[index]
|
||||
|
||||
def __len__(self):
|
||||
return len(self.data)
|
||||
|
||||
|
||||
class InstructionDatasetPhi(Dataset):
|
||||
def __init__(self, data, tokenizer):
|
||||
self.data = data
|
||||
|
||||
# Pre-tokenize texts
|
||||
self.encoded_texts = []
|
||||
for entry in data:
|
||||
|
||||
###################################################################
|
||||
# NEW: Use `format_input_phi` and adjust the response text template
|
||||
instruction_plus_input = format_input_phi(entry)
|
||||
response_text = f"\n<|assistant|>:\n{entry['output']}"
|
||||
###################################################################
|
||||
full_text = instruction_plus_input + response_text
|
||||
self.encoded_texts.append(
|
||||
tokenizer.encode(full_text)
|
||||
)
|
||||
|
||||
def __getitem__(self, index):
|
||||
return self.encoded_texts[index]
|
||||
|
||||
def __len__(self):
|
||||
return len(self.data)
|
||||
|
||||
|
||||
class LinearWithLoRA(torch.nn.Module):
|
||||
def __init__(self, linear, rank, alpha):
|
||||
super().__init__()
|
||||
self.linear = linear
|
||||
self.lora = LoRALayer(
|
||||
linear.in_features, linear.out_features, rank, alpha
|
||||
)
|
||||
|
||||
def forward(self, x):
|
||||
return self.linear(x) + self.lora(x)
|
||||
|
||||
|
||||
class LoRALayer(torch.nn.Module):
|
||||
def __init__(self, in_dim, out_dim, rank, alpha):
|
||||
super().__init__()
|
||||
self.A = torch.nn.Parameter(torch.empty(in_dim, rank))
|
||||
torch.nn.init.kaiming_uniform_(self.A, a=math.sqrt(5)) # similar to standard weight initialization
|
||||
self.B = torch.nn.Parameter(torch.zeros(rank, out_dim))
|
||||
self.alpha = alpha
|
||||
|
||||
def forward(self, x):
|
||||
x = self.alpha * (x @ self.A @ self.B)
|
||||
return x
|
||||
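# Shape sketch (comment added for clarity): with x of shape (batch, seq_len, in_dim),
# A has shape (in_dim, rank) and B has shape (rank, out_dim), so x @ self.A @ self.B
# yields (batch, seq_len, out_dim) -- the same shape as the frozen Linear output it is
# added to in LinearWithLoRA.forward above.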
|
||||
|
||||
def replace_linear_with_lora(model, rank, alpha):
|
||||
for name, module in model.named_children():
|
||||
if isinstance(module, torch.nn.Linear):
|
||||
# Replace the Linear layer with LinearWithLoRA
|
||||
setattr(model, name, LinearWithLoRA(module, rank, alpha))
|
||||
else:
|
||||
# Recursively apply the same function to child modules
|
||||
replace_linear_with_lora(module, rank, alpha)
|
||||
|
||||
|
||||
def custom_collate_fn(
|
||||
batch,
|
||||
pad_token_id=50256,
|
||||
ignore_index=-100,
|
||||
allowed_max_length=None,
|
||||
device="cpu"
|
||||
):
|
||||
# Find the longest sequence in the batch
|
||||
batch_max_length = max(len(item)+1 for item in batch)
|
||||
|
||||
# Pad and prepare inputs and targets
|
||||
inputs_lst, targets_lst = [], []
|
||||
|
||||
for item in batch:
|
||||
new_item = item.copy()
|
||||
# Add an <|endoftext|> token
|
||||
new_item += [pad_token_id]
|
||||
# Pad sequences to max_length
|
||||
padded = new_item + [pad_token_id] * (batch_max_length - len(new_item))
|
||||
inputs = torch.tensor(padded[:-1]) # Truncate the last token for inputs
|
||||
targets = torch.tensor(padded[1:]) # Shift +1 to the right for targets
|
||||
|
||||
# New: Replace all but the first padding tokens in targets by ignore_index
|
||||
mask = targets == pad_token_id
|
||||
indices = torch.nonzero(mask).squeeze()
|
||||
if indices.numel() > 1:
|
||||
targets[indices[1:]] = ignore_index
|
||||
|
||||
# New: Optionally truncate to maximum sequence length
|
||||
if allowed_max_length is not None:
|
||||
inputs = inputs[:allowed_max_length]
|
||||
targets = targets[:allowed_max_length]
|
||||
|
||||
inputs_lst.append(inputs)
|
||||
targets_lst.append(targets)
|
||||
|
||||
# Convert list of inputs and targets to tensors and transfer to target device
|
||||
inputs_tensor = torch.stack(inputs_lst).to(device)
|
||||
targets_tensor = torch.stack(targets_lst).to(device)
|
||||
|
||||
return inputs_tensor, targets_tensor
|
||||
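# Worked example (comment added for clarity): for batch = [[1, 2, 3], [4, 5]] with
# pad_token_id=50256, batch_max_length is 4 and the collate function returns
#   inputs  = [[1, 2, 3],     [4, 5, 50256]]
#   targets = [[2, 3, 50256], [5, 50256, -100]]
# i.e. targets are shifted by one position, and every padding token after the first
# is replaced with ignore_index so it does not contribute to the loss.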
|
||||
|
||||
def custom_collate_with_masking_fn(
|
||||
batch,
|
||||
pad_token_id=50256,
|
||||
ignore_index=-100,
|
||||
allowed_max_length=None,
|
||||
device="cpu"
|
||||
):
|
||||
# Find the longest sequence in the batch
|
||||
batch_max_length = max(len(item)+1 for instruction_length, item in batch) # New: batch is now a tuple
|
||||
|
||||
# Pad and prepare inputs and targets
|
||||
inputs_lst, targets_lst = [], []
|
||||
|
||||
for instruction_length, item in batch: # New: batch is now a tuple
|
||||
new_item = item.copy()
|
||||
# Add an <|endoftext|> token
|
||||
new_item += [pad_token_id]
|
||||
# Pad sequences to max_length
|
||||
padded = new_item + [pad_token_id] * (batch_max_length - len(new_item))
|
||||
inputs = torch.tensor(padded[:-1]) # Truncate the last token for inputs
|
||||
targets = torch.tensor(padded[1:]) # Shift +1 to the right for targets
|
||||
|
||||
# Replace all but the first padding tokens in targets by ignore_index
|
||||
mask = targets == pad_token_id
|
||||
indices = torch.nonzero(mask).squeeze()
|
||||
if indices.numel() > 1:
|
||||
targets[indices[1:]] = ignore_index
|
||||
|
||||
# New: Mask all input and instruction tokens in the targets
|
||||
targets[:instruction_length-1] = ignore_index
|
||||
|
||||
# Optionally truncate to maximum sequence length
|
||||
if allowed_max_length is not None:
|
||||
inputs = inputs[:allowed_max_length]
|
||||
targets = targets[:allowed_max_length]
|
||||
|
||||
inputs_lst.append(inputs)
|
||||
targets_lst.append(targets)
|
||||
|
||||
# Convert list of inputs and targets to tensors and transfer to target device
|
||||
inputs_tensor = torch.stack(inputs_lst).to(device)
|
||||
targets_tensor = torch.stack(targets_lst).to(device)
|
||||
|
||||
return inputs_tensor, targets_tensor
|
||||
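# (comment added for clarity) Compared with custom_collate_fn above, this variant also
# sets the target positions covering the instruction and input tokens to ignore_index,
# so only the response tokens contribute to the training loss (instruction-masking exercise).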
|
||||
|
||||
def download_and_load_file(file_path, url):
|
||||
|
||||
if not os.path.exists(file_path):
|
||||
with urllib.request.urlopen(url) as response:
|
||||
text_data = response.read().decode("utf-8")
|
||||
with open(file_path, "w", encoding="utf-8") as file:
|
||||
file.write(text_data)
|
||||
else:
|
||||
with open(file_path, "r", encoding="utf-8") as file:
|
||||
text_data = file.read()
|
||||
|
||||
with open(file_path, "r") as file:
|
||||
data = json.load(file)
|
||||
|
||||
return data
|
||||
|
||||
|
||||
def format_input_phi(entry):
|
||||
instruction_text = (
|
||||
f"<|user|>\n{entry['instruction']}"
|
||||
)
|
||||
|
||||
input_text = f"\n{entry['input']}" if entry["input"] else ""
|
||||
|
||||
return instruction_text + input_text
|
||||
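# Example (comment added for clarity): for an entry with an empty 'input' field,
# format_input_phi({"instruction": "Convert 5 km to m.", "input": ""}) returns
#   "<|user|>\nConvert 5 km to m."
# and InstructionDatasetPhi above appends the response as "\n<|assistant|>:\n{output}".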
|
||||
|
||||
def format_input(entry):
|
||||
instruction_text = (
|
||||
f"Below is an instruction that describes a task. "
|
||||
f"Write a response that appropriately completes the request."
|
||||
f"\n\n### Instruction:\n{entry['instruction']}"
|
||||
)
|
||||
|
||||
input_text = f"\n\n### Input:\n{entry['input']}" if entry["input"] else ""
|
||||
|
||||
return instruction_text + input_text
|
||||
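# Example (comment added for clarity): for an entry with an empty 'input' field,
# format_input({"instruction": "What is the capital of Denmark?", "input": ""}) returns
#   "Below is an instruction that describes a task. Write a response that appropriately
#    completes the request.\n\n### Instruction:\nWhat is the capital of Denmark?"
# and the "### Input:" section is appended only when entry['input'] is non-empty.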
|
||||
|
||||
def plot_losses(epochs_seen, tokens_seen, train_losses, val_losses, plot_name):
|
||||
fig, ax1 = plt.subplots(figsize=(12, 6))
|
||||
|
||||
# Plot training and validation loss against epochs
|
||||
ax1.plot(epochs_seen, train_losses, label="Training loss")
|
||||
ax1.plot(epochs_seen, val_losses, linestyle="-.", label="Validation loss")
|
||||
ax1.set_xlabel("Epochs")
|
||||
ax1.set_ylabel("Loss")
|
||||
ax1.legend(loc="upper right")
|
||||
ax1.xaxis.set_major_locator(MaxNLocator(integer=True)) # only show integer labels on x-axis
|
||||
|
||||
# Create a second x-axis for tokens seen
|
||||
ax2 = ax1.twiny() # Create a second x-axis that shares the same y-axis
|
||||
ax2.plot(tokens_seen, train_losses, alpha=0) # Invisible plot for aligning ticks
|
||||
ax2.set_xlabel("Tokens seen")
|
||||
|
||||
fig.tight_layout() # Adjust layout to make room
|
||||
print(f"Plot saved as {plot_name}")
|
||||
plt.savefig(plot_name)
|
||||
# plt.show()
|
||||
|
||||
|
||||
def main(mask_instructions=False, alpaca52k=False, phi3_prompt=False, lora=False):
|
||||
#######################################
|
||||
# Print package versions
|
||||
#######################################
|
||||
print()
|
||||
pkgs = [
|
||||
"matplotlib", # Plotting library
|
||||
"tiktoken", # Tokenizer
|
||||
"torch", # Deep learning library
|
||||
"tqdm", # Progress bar
|
||||
"tensorflow", # For OpenAI's pretrained weights
|
||||
]
|
||||
for p in pkgs:
|
||||
print(f"{p} version: {version(p)}")
|
||||
print(50*"-")
|
||||
|
||||
#######################################
|
||||
# Download and prepare dataset
|
||||
#######################################
|
||||
file_path = "instruction-data.json"
|
||||
|
||||
if alpaca52k:
|
||||
url = "https://raw.githubusercontent.com/tatsu-lab/stanford_alpaca/main/alpaca_data.json"
|
||||
else:
|
||||
url = "https://raw.githubusercontent.com/rasbt/LLMs-from-scratch/main/ch07/01_main-chapter-code/instruction-data.json"
|
||||
data = download_and_load_file(file_path, url)
|
||||
|
||||
train_portion = int(len(data) * 0.85) # 85% for training
|
||||
test_portion = int(len(data) * 0.1) # 10% for testing
|
||||
|
||||
train_data = data[:train_portion]
|
||||
test_data = data[train_portion:train_portion + test_portion]
|
||||
val_data = data[train_portion + test_portion:]
|
||||
|
||||
print("Training set length:", len(train_data))
|
||||
print("Validation set length:", len(val_data))
|
||||
print("Test set length:", len(test_data))
|
||||
print(50*"-")
|
||||
|
||||
tokenizer = tiktoken.get_encoding("gpt2")
|
||||
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
|
||||
print("Device:", device)
|
||||
print(50*"-")
|
||||
|
||||
if alpaca52k:
|
||||
allowed_max_length = 512
|
||||
else:
|
||||
allowed_max_length = 1024
|
||||
|
||||
if mask_instructions and phi3_prompt:
|
||||
raise ValueError("Simultaneous support for instruction masking and the Phi-3 prompt template has not been implemented, yet.")
|
||||
|
||||
if mask_instructions:
|
||||
customized_collate_fn = partial(custom_collate_with_masking_fn, device=device, allowed_max_length=allowed_max_length)
|
||||
CustomDataset = InstructionDatasetWithMasking
|
||||
elif phi3_prompt:
|
||||
customized_collate_fn = partial(custom_collate_fn, device=device, allowed_max_length=allowed_max_length)
|
||||
CustomDataset = InstructionDatasetPhi
|
||||
else:
|
||||
customized_collate_fn = partial(custom_collate_fn, device=device, allowed_max_length=allowed_max_length)
|
||||
CustomDataset = InstructionDataset
|
||||
|
||||
num_workers = 0
|
||||
|
||||
if alpaca52k:
|
||||
batch_size = 4
|
||||
else:
|
||||
batch_size = 8
|
||||
|
||||
torch.manual_seed(123)
|
||||
|
||||
train_dataset = CustomDataset(train_data, tokenizer)
|
||||
train_loader = DataLoader(
|
||||
train_dataset,
|
||||
batch_size=batch_size,
|
||||
collate_fn=customized_collate_fn,
|
||||
shuffle=True,
|
||||
drop_last=True,
|
||||
num_workers=num_workers
|
||||
)
|
||||
|
||||
val_dataset = CustomDataset(val_data, tokenizer)
|
||||
val_loader = DataLoader(
|
||||
val_dataset,
|
||||
batch_size=batch_size,
|
||||
collate_fn=customized_collate_fn,
|
||||
shuffle=False,
|
||||
drop_last=False,
|
||||
num_workers=num_workers
|
||||
)
|
||||
|
||||
#######################################
|
||||
# Load pretrained model
|
||||
#######################################
|
||||
BASE_CONFIG = {
|
||||
"vocab_size": 50257, # Vocabulary size
|
||||
"context_length": 1024, # Context length
|
||||
"drop_rate": 0.0, # Dropout rate
|
||||
"qkv_bias": True # Query-key-value bias
|
||||
}
|
||||
|
||||
model_configs = {
|
||||
"gpt2-small (124M)": {"emb_dim": 768, "n_layers": 12, "n_heads": 12},
|
||||
"gpt2-medium (355M)": {"emb_dim": 1024, "n_layers": 24, "n_heads": 16},
|
||||
"gpt2-large (774M)": {"emb_dim": 1280, "n_layers": 36, "n_heads": 20},
|
||||
"gpt2-xl (1558M)": {"emb_dim": 1600, "n_layers": 48, "n_heads": 25},
|
||||
}
|
||||
|
||||
CHOOSE_MODEL = "gpt2-medium (355M)"
|
||||
|
||||
BASE_CONFIG.update(model_configs[CHOOSE_MODEL])
|
||||
|
||||
model_size = CHOOSE_MODEL.split(" ")[-1].lstrip("(").rstrip(")")
|
||||
settings, params = download_and_load_gpt2(model_size=model_size, models_dir="gpt2")
|
||||
|
||||
model = GPTModel(BASE_CONFIG)
|
||||
load_weights_into_gpt(model, params)
|
||||
model.eval()
|
||||
model.to(device)
|
||||
|
||||
print("Loaded model:", CHOOSE_MODEL)
|
||||
print(50*"-")
|
||||
|
||||
if lora:
|
||||
total_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
|
||||
print(f"Total trainable parameters before: {total_params:,}")
|
||||
|
||||
for param in model.parameters():
|
||||
param.requires_grad = False
|
||||
|
||||
total_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
|
||||
print(f"Total trainable parameters after: {total_params:,}")
|
||||
replace_linear_with_lora(model, rank=16, alpha=16)
|
||||
|
||||
total_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
|
||||
print(f"Total trainable LoRA parameters: {total_params:,}")
|
||||
model.to(device)
|
||||
|
||||
#######################################
|
||||
# Finetuning the model
|
||||
#######################################
|
||||
print("Initial losses")
|
||||
with torch.no_grad():
|
||||
train_loss = calc_loss_loader(train_loader, model, device, num_batches=5)
|
||||
val_loss = calc_loss_loader(val_loader, model, device, num_batches=5)
|
||||
|
||||
print(" Training loss:", train_loss)
|
||||
print(" Validation loss:", val_loss)
|
||||
|
||||
start_time = time.time()
|
||||
|
||||
num_epochs = 2
|
||||
optimizer = torch.optim.AdamW(model.parameters(), lr=0.00005, weight_decay=0.1)
|
||||
|
||||
torch.manual_seed(123)
|
||||
|
||||
start_context = format_input_phi(val_data[0]) if phi3_prompt else format_input(val_data[0])
|
||||
|
||||
train_losses, val_losses, tokens_seen = train_model_simple(
|
||||
model, train_loader, val_loader, optimizer, device,
|
||||
num_epochs=num_epochs, eval_freq=5, eval_iter=5,
|
||||
start_context=start_context, tokenizer=tokenizer
|
||||
)
|
||||
|
||||
end_time = time.time()
|
||||
execution_time_minutes = (end_time - start_time) / 60
|
||||
print(f"Training completed in {execution_time_minutes:.2f} minutes.")
|
||||
|
||||
epochs_tensor = torch.linspace(0, num_epochs, len(train_losses))
|
||||
|
||||
plot_name = "loss-plot.pdf"
|
||||
if mask_instructions:
|
||||
plot_name = plot_name.replace(".pdf", "-mask-instructions.pdf")
|
||||
if alpaca52k:
|
||||
plot_name = plot_name.replace(".pdf", "-alpaca52k.pdf")
|
||||
if phi3_prompt:
|
||||
plot_name = plot_name.replace(".pdf", "-phi3-prompt.pdf")
|
||||
if lora:
|
||||
plot_name = plot_name.replace(".pdf", "-lora.pdf")
|
||||
if not any([mask_instructions, alpaca52k, phi3_prompt, lora]):
|
||||
plot_name = plot_name.replace(".pdf", "-baseline.pdf")
|
||||
|
||||
plot_losses(epochs_tensor, tokens_seen, train_losses, val_losses, plot_name)
|
||||
print(50*"-")
|
||||
|
||||
#######################################
|
||||
# Saving results
|
||||
#######################################
|
||||
print("Generating responses")
|
||||
for i, entry in tqdm(enumerate(test_data), total=len(test_data)):
|
||||
|
||||
input_text = format_input_phi(entry) if phi3_prompt else format_input(entry)
|
||||
|
||||
token_ids = generate(
|
||||
model=model,
|
||||
idx=text_to_token_ids(input_text, tokenizer).to(device),
|
||||
max_new_tokens=256,
|
||||
context_size=BASE_CONFIG["context_length"],
|
||||
eos_id=50256
|
||||
)
|
||||
generated_text = token_ids_to_text(token_ids, tokenizer)
|
||||
|
||||
if phi3_prompt:
|
||||
response_text = generated_text[len(input_text):].replace("<|assistant|>:", "").strip()
|
||||
else:
|
||||
response_text = generated_text[len(input_text):].replace("### Response:", "").strip()
|
||||
|
||||
test_data[i]["model_response"] = response_text
|
||||
|
||||
test_data_path = "instruction-data-with-response.json"
|
||||
file_name = f"{re.sub(r'[ ()]', '', CHOOSE_MODEL) }-sft.pth"
|
||||
|
||||
if mask_instructions:
|
||||
test_data_path = test_data_path.replace(".json", "-mask-instructions.json")
|
||||
file_name = file_name.replace(".pth", "-mask-instructions.pth")
|
||||
if alpaca52k:
|
||||
test_data_path = test_data_path.replace(".json", "-alpaca52k.json")
|
||||
file_name = file_name.replace(".pth", "-alpaca52k.pth")
|
||||
if phi3_prompt:
|
||||
test_data_path = test_data_path.replace(".json", "-phi3-prompt.json")
|
||||
file_name = file_name.replace(".pth", "-phi3-prompt.pth")
|
||||
if lora:
|
||||
test_data_path = test_data_path.replace(".json", "-lora.json")
|
||||
file_name = file_name.replace(".pth", "-lora.pth")
|
||||
if not any([mask_instructions, alpaca52k, phi3_prompt, lora]):
|
||||
test_data_path = test_data_path.replace(".json", "-baseline.json")
|
||||
file_name = file_name.replace(".pth", "-baseline.pth")
|
||||
|
||||
with open(test_data_path, "w") as file:
|
||||
json.dump(test_data, file, indent=4) # "indent" for pretty-printing
|
||||
print(f"Responses saved as {test_data_path}")
|
||||
|
||||
torch.save(model.state_dict(), file_name)
|
||||
print(f"Model saved as {file_name}")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
|
||||
import argparse
|
||||
|
||||
parser = argparse.ArgumentParser(
|
||||
description="Instruction finetune a GPT model"
|
||||
)
|
||||
options = {"baseline", "mask_instructions", "alpaca_52k", "phi3_prompt", "lora"}
|
||||
parser.add_argument(
|
||||
"--exercise_solution",
|
||||
type=str,
|
||||
default="last_block",
|
||||
help=(
|
||||
f"Which experiment to run. Options: {options}."
|
||||
)
|
||||
)
|
||||
args = parser.parse_args()
|
||||
|
||||
if args.exercise_solution == "baseline":
|
||||
main()
|
||||
elif args.exercise_solution == "mask_instructions":
|
||||
main(mask_instructions=True)
|
||||
elif args.exercise_solution == "alpaca_52k":
|
||||
main(alpaca52k=True)
|
||||
elif args.exercise_solution == "phi3_prompt":
|
||||
main(phi3_prompt=True)
|
||||
elif args.exercise_solution == "lora":
|
||||
main(lora=True)
|
||||
else:
|
||||
raise ValueError(f"{args.exercise_solution} is not a valid --args.exercise_solution option. Options: {options}")
|
||||
142 Codes/ch07/01_main-chapter-code/gpt_download.py (Normal file)
@@ -0,0 +1,142 @@
|
||||
# Copyright (c) Sebastian Raschka under Apache License 2.0 (see LICENSE.txt).
|
||||
# Source for "Build a Large Language Model From Scratch"
|
||||
# - https://www.manning.com/books/build-a-large-language-model-from-scratch
|
||||
# Code: https://github.com/rasbt/LLMs-from-scratch
|
||||
|
||||
|
||||
import os
|
||||
import urllib.request
|
||||
|
||||
# import requests
|
||||
import json
|
||||
import numpy as np
|
||||
import tensorflow as tf
|
||||
from tqdm import tqdm
|
||||
|
||||
|
||||
def download_and_load_gpt2(model_size, models_dir):
|
||||
# Validate model size
|
||||
allowed_sizes = ("124M", "355M", "774M", "1558M")
|
||||
if model_size not in allowed_sizes:
|
||||
raise ValueError(f"Model size not in {allowed_sizes}")
|
||||
|
||||
# Define paths
|
||||
model_dir = os.path.join(models_dir, model_size)
|
||||
base_url = "https://openaipublic.blob.core.windows.net/gpt-2/models"
|
||||
filenames = [
|
||||
"checkpoint", "encoder.json", "hparams.json",
|
||||
"model.ckpt.data-00000-of-00001", "model.ckpt.index",
|
||||
"model.ckpt.meta", "vocab.bpe"
|
||||
]
|
||||
|
||||
# Download files
|
||||
os.makedirs(model_dir, exist_ok=True)
|
||||
for filename in filenames:
|
||||
file_url = os.path.join(base_url, model_size, filename)
|
||||
file_path = os.path.join(model_dir, filename)
|
||||
download_file(file_url, file_path)
|
||||
|
||||
# Load settings and params
|
||||
tf_ckpt_path = tf.train.latest_checkpoint(model_dir)
|
||||
settings = json.load(open(os.path.join(model_dir, "hparams.json")))
|
||||
params = load_gpt2_params_from_tf_ckpt(tf_ckpt_path, settings)
|
||||
|
||||
return settings, params
|
||||
|
||||
|
||||
def download_file(url, destination):
|
||||
# Send a GET request to download the file
|
||||
|
||||
try:
|
||||
with urllib.request.urlopen(url) as response:
|
||||
# Get the total file size from headers, defaulting to 0 if not present
|
||||
file_size = int(response.headers.get("Content-Length", 0))
|
||||
|
||||
# Check if file exists and has the same size
|
||||
if os.path.exists(destination):
|
||||
file_size_local = os.path.getsize(destination)
|
||||
if file_size == file_size_local:
|
||||
print(f"File already exists and is up-to-date: {destination}")
|
||||
return
|
||||
|
||||
# Define the block size for reading the file
|
||||
block_size = 1024 # 1 Kilobyte
|
||||
|
||||
# Initialize the progress bar with total file size
|
||||
progress_bar_description = os.path.basename(url) # Extract filename from URL
|
||||
with tqdm(total=file_size, unit="iB", unit_scale=True, desc=progress_bar_description) as progress_bar:
|
||||
# Open the destination file in binary write mode
|
||||
with open(destination, "wb") as file:
|
||||
# Read the file in chunks and write to destination
|
||||
while True:
|
||||
chunk = response.read(block_size)
|
||||
if not chunk:
|
||||
break
|
||||
file.write(chunk)
|
||||
progress_bar.update(len(chunk)) # Update progress bar
|
||||
except urllib.error.HTTPError:
|
||||
s = (
|
||||
f"The specified URL ({url}) is incorrect, the internet connection cannot be established,"
|
||||
"\nor the requested file is temporarily unavailable.\nPlease visit the following website"
|
||||
" for help: https://github.com/rasbt/LLMs-from-scratch/discussions/273")
|
||||
print(s)
|
||||
|
||||
|
||||
# Alternative way using `requests`
|
||||
"""
|
||||
def download_file(url, destination):
|
||||
# Send a GET request to download the file in streaming mode
|
||||
response = requests.get(url, stream=True)
|
||||
|
||||
# Get the total file size from headers, defaulting to 0 if not present
|
||||
file_size = int(response.headers.get("content-length", 0))
|
||||
|
||||
# Check if file exists and has the same size
|
||||
if os.path.exists(destination):
|
||||
file_size_local = os.path.getsize(destination)
|
||||
if file_size == file_size_local:
|
||||
print(f"File already exists and is up-to-date: {destination}")
|
||||
return
|
||||
|
||||
# Define the block size for reading the file
|
||||
block_size = 1024 # 1 Kilobyte
|
||||
|
||||
# Initialize the progress bar with total file size
|
||||
progress_bar_description = url.split("/")[-1] # Extract filename from URL
|
||||
with tqdm(total=file_size, unit="iB", unit_scale=True, desc=progress_bar_description) as progress_bar:
|
||||
# Open the destination file in binary write mode
|
||||
with open(destination, "wb") as file:
|
||||
# Iterate over the file data in chunks
|
||||
for chunk in response.iter_content(block_size):
|
||||
progress_bar.update(len(chunk)) # Update progress bar
|
||||
file.write(chunk) # Write the chunk to the file
|
||||
"""
|
||||
|
||||
|
||||
def load_gpt2_params_from_tf_ckpt(ckpt_path, settings):
|
||||
# Initialize parameters dictionary with empty blocks for each layer
|
||||
params = {"blocks": [{} for _ in range(settings["n_layer"])]}
|
||||
|
||||
# Iterate over each variable in the checkpoint
|
||||
for name, _ in tf.train.list_variables(ckpt_path):
|
||||
# Load the variable and remove singleton dimensions
|
||||
variable_array = np.squeeze(tf.train.load_variable(ckpt_path, name))
|
||||
|
||||
# Process the variable name to extract relevant parts
|
||||
variable_name_parts = name.split("/")[1:] # Skip the 'model/' prefix
|
||||
|
||||
# Identify the target dictionary for the variable
|
||||
target_dict = params
|
||||
if variable_name_parts[0].startswith("h"):
|
||||
layer_number = int(variable_name_parts[0][1:])
|
||||
target_dict = params["blocks"][layer_number]
|
||||
|
||||
# Recursively access or create nested dictionaries
|
||||
for key in variable_name_parts[1:-1]:
|
||||
target_dict = target_dict.setdefault(key, {})
|
||||
|
||||
# Assign the variable array to the last key
|
||||
last_key = variable_name_parts[-1]
|
||||
target_dict[last_key] = variable_array
|
||||
|
||||
return params
|
||||
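# Structure sketch (comment added for clarity, based on the GPT-2 TF checkpoint layout):
# the returned `params` dict roughly looks like
#   {"wte": ..., "wpe": ..., "ln_f": {"g": ..., "b": ...},
#    "blocks": [{"attn": {...}, "mlp": {...}, "ln_1": {...}, "ln_2": {...}}, ...]}
# which is the nesting that load_weights_into_gpt in previous_chapters.py consumes.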
351 Codes/ch07/01_main-chapter-code/gpt_instruction_finetuning.py (Normal file)
@@ -0,0 +1,351 @@
|
||||
# Copyright (c) Sebastian Raschka under Apache License 2.0 (see LICENSE.txt).
|
||||
# Source for "Build a Large Language Model From Scratch"
|
||||
# - https://www.manning.com/books/build-a-large-language-model-from-scratch
|
||||
# Code: https://github.com/rasbt/LLMs-from-scratch
|
||||
#
|
||||
# A minimal instruction finetuning file based on the code in chapter 7
|
||||
|
||||
from functools import partial
|
||||
from importlib.metadata import version
|
||||
import json
|
||||
import os
|
||||
import re
|
||||
import time
|
||||
import urllib
|
||||
|
||||
import matplotlib.pyplot as plt
|
||||
import tiktoken
|
||||
import torch
|
||||
from torch.utils.data import Dataset, DataLoader
|
||||
from tqdm import tqdm
|
||||
|
||||
# Import from local files in this folder
|
||||
from gpt_download import download_and_load_gpt2
|
||||
from previous_chapters import (
|
||||
calc_loss_loader,
|
||||
generate,
|
||||
GPTModel,
|
||||
load_weights_into_gpt,
|
||||
text_to_token_ids,
|
||||
train_model_simple,
|
||||
token_ids_to_text
|
||||
)
|
||||
|
||||
|
||||
class InstructionDataset(Dataset):
|
||||
def __init__(self, data, tokenizer):
|
||||
self.data = data
|
||||
|
||||
# Pre-tokenize texts
|
||||
self.encoded_texts = []
|
||||
for entry in data:
|
||||
instruction_plus_input = format_input(entry)
|
||||
response_text = f"\n\n### Response:\n{entry['output']}"
|
||||
full_text = instruction_plus_input + response_text
|
||||
self.encoded_texts.append(
|
||||
tokenizer.encode(full_text)
|
||||
)
|
||||
|
||||
def __getitem__(self, index):
|
||||
return self.encoded_texts[index]
|
||||
|
||||
def __len__(self):
|
||||
return len(self.data)
|
||||
|
||||
|
||||
def custom_collate_fn(
|
||||
batch,
|
||||
pad_token_id=50256,
|
||||
ignore_index=-100,
|
||||
allowed_max_length=None,
|
||||
device="cpu"
|
||||
):
|
||||
# Find the longest sequence in the batch
|
||||
batch_max_length = max(len(item)+1 for item in batch)
|
||||
|
||||
# Pad and prepare inputs and targets
|
||||
inputs_lst, targets_lst = [], []
|
||||
|
||||
for item in batch:
|
||||
new_item = item.copy()
|
||||
# Add an <|endoftext|> token
|
||||
new_item += [pad_token_id]
|
||||
# Pad sequences to max_length
|
||||
padded = new_item + [pad_token_id] * (batch_max_length - len(new_item))
|
||||
inputs = torch.tensor(padded[:-1]) # Truncate the last token for inputs
|
||||
targets = torch.tensor(padded[1:]) # Shift +1 to the right for targets
|
||||
|
||||
# New: Replace all but the first padding tokens in targets by ignore_index
|
||||
mask = targets == pad_token_id
|
||||
indices = torch.nonzero(mask).squeeze()
|
||||
if indices.numel() > 1:
|
||||
targets[indices[1:]] = ignore_index
|
||||
|
||||
# New: Optionally truncate to maximum sequence length
|
||||
if allowed_max_length is not None:
|
||||
inputs = inputs[:allowed_max_length]
|
||||
targets = targets[:allowed_max_length]
|
||||
|
||||
inputs_lst.append(inputs)
|
||||
targets_lst.append(targets)
|
||||
|
||||
# Convert list of inputs and targets to tensors and transfer to target device
|
||||
inputs_tensor = torch.stack(inputs_lst).to(device)
|
||||
targets_tensor = torch.stack(targets_lst).to(device)
|
||||
|
||||
return inputs_tensor, targets_tensor
|
||||
|
||||
|
||||
def download_and_load_file(file_path, url):
|
||||
|
||||
if not os.path.exists(file_path):
|
||||
with urllib.request.urlopen(url) as response:
|
||||
text_data = response.read().decode("utf-8")
|
||||
with open(file_path, "w", encoding="utf-8") as file:
|
||||
file.write(text_data)
|
||||
else:
|
||||
with open(file_path, "r", encoding="utf-8") as file:
|
||||
text_data = file.read()
|
||||
|
||||
with open(file_path, "r") as file:
|
||||
data = json.load(file)
|
||||
|
||||
return data
|
||||
|
||||
|
||||
def format_input(entry):
|
||||
instruction_text = (
|
||||
f"Below is an instruction that describes a task. "
|
||||
f"Write a response that appropriately completes the request."
|
||||
f"\n\n### Instruction:\n{entry['instruction']}"
|
||||
)
|
||||
|
||||
input_text = f"\n\n### Input:\n{entry['input']}" if entry["input"] else ""
|
||||
|
||||
return instruction_text + input_text
|
||||
|
||||
|
||||
def plot_losses(epochs_seen, tokens_seen, train_losses, val_losses):
|
||||
fig, ax1 = plt.subplots(figsize=(12, 6))
|
||||
|
||||
# Plot training and validation loss against epochs
|
||||
ax1.plot(epochs_seen, train_losses, label="Training loss")
|
||||
ax1.plot(epochs_seen, val_losses, linestyle="-.", label="Validation loss")
|
||||
ax1.set_xlabel("Epochs")
|
||||
ax1.set_ylabel("Loss")
|
||||
ax1.legend(loc="upper right")
|
||||
|
||||
# Create a second x-axis for tokens seen
|
||||
ax2 = ax1.twiny() # Create a second x-axis that shares the same y-axis
|
||||
ax2.plot(tokens_seen, train_losses, alpha=0) # Invisible plot for aligning ticks
|
||||
ax2.set_xlabel("Tokens seen")
|
||||
|
||||
fig.tight_layout() # Adjust layout to make room
|
||||
plot_name = "loss-plot-standalone.pdf"
|
||||
print(f"Plot saved as {plot_name}")
|
||||
plt.savefig(plot_name)
|
||||
# plt.show()
|
||||
|
||||
|
||||
def main(test_mode=False):
|
||||
#######################################
|
||||
# Print package versions
|
||||
#######################################
|
||||
print()
|
||||
pkgs = [
|
||||
"matplotlib", # Plotting library
|
||||
"tiktoken", # Tokenizer
|
||||
"torch", # Deep learning library
|
||||
"tqdm", # Progress bar
|
||||
"tensorflow", # For OpenAI's pretrained weights
|
||||
]
|
||||
for p in pkgs:
|
||||
print(f"{p} version: {version(p)}")
|
||||
print(50*"-")
|
||||
|
||||
#######################################
|
||||
# Download and prepare dataset
|
||||
#######################################
|
||||
file_path = "instruction-data.json"
|
||||
url = "https://raw.githubusercontent.com/rasbt/LLMs-from-scratch/main/ch07/01_main-chapter-code/instruction-data.json"
|
||||
data = download_and_load_file(file_path, url)
|
||||
|
||||
train_portion = int(len(data) * 0.85) # 85% for training
|
||||
test_portion = int(len(data) * 0.1) # 10% for testing
|
||||
|
||||
train_data = data[:train_portion]
|
||||
test_data = data[train_portion:train_portion + test_portion]
|
||||
val_data = data[train_portion + test_portion:]
|
||||
|
||||
# Use very small subset for testing purposes
|
||||
if test_mode:
|
||||
train_data = train_data[:10]
|
||||
val_data = val_data[:10]
|
||||
test_data = test_data[:10]
|
||||
|
||||
print("Training set length:", len(train_data))
|
||||
print("Validation set length:", len(val_data))
|
||||
print("Test set length:", len(test_data))
|
||||
print(50*"-")
|
||||
|
||||
tokenizer = tiktoken.get_encoding("gpt2")
|
||||
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
|
||||
print("Device:", device)
|
||||
print(50*"-")
|
||||
|
||||
customized_collate_fn = partial(custom_collate_fn, device=device, allowed_max_length=1024)
|
||||
|
||||
num_workers = 0
|
||||
batch_size = 8
|
||||
|
||||
torch.manual_seed(123)
|
||||
|
||||
train_dataset = InstructionDataset(train_data, tokenizer)
|
||||
train_loader = DataLoader(
|
||||
train_dataset,
|
||||
batch_size=batch_size,
|
||||
collate_fn=customized_collate_fn,
|
||||
shuffle=True,
|
||||
drop_last=True,
|
||||
num_workers=num_workers
|
||||
)
|
||||
|
||||
val_dataset = InstructionDataset(val_data, tokenizer)
|
||||
val_loader = DataLoader(
|
||||
val_dataset,
|
||||
batch_size=batch_size,
|
||||
collate_fn=customized_collate_fn,
|
||||
shuffle=False,
|
||||
drop_last=False,
|
||||
num_workers=num_workers
|
||||
)
|
||||
|
||||
#######################################
|
||||
# Load pretrained model
|
||||
#######################################
|
||||
|
||||
# Small GPT model for testing purposes
|
||||
if test_mode:
|
||||
BASE_CONFIG = {
|
||||
"vocab_size": 50257,
|
||||
"context_length": 120,
|
||||
"drop_rate": 0.0,
|
||||
"qkv_bias": False,
|
||||
"emb_dim": 12,
|
||||
"n_layers": 1,
|
||||
"n_heads": 2
|
||||
}
|
||||
model = GPTModel(BASE_CONFIG)
|
||||
model.eval()
|
||||
device = "cpu"
|
||||
CHOOSE_MODEL = "Small test model"
|
||||
|
||||
# Code as it is used in the main chapter
|
||||
else:
|
||||
BASE_CONFIG = {
|
||||
"vocab_size": 50257, # Vocabulary size
|
||||
"context_length": 1024, # Context length
|
||||
"drop_rate": 0.0, # Dropout rate
|
||||
"qkv_bias": True # Query-key-value bias
|
||||
}
|
||||
|
||||
model_configs = {
|
||||
"gpt2-small (124M)": {"emb_dim": 768, "n_layers": 12, "n_heads": 12},
|
||||
"gpt2-medium (355M)": {"emb_dim": 1024, "n_layers": 24, "n_heads": 16},
|
||||
"gpt2-large (774M)": {"emb_dim": 1280, "n_layers": 36, "n_heads": 20},
|
||||
"gpt2-xl (1558M)": {"emb_dim": 1600, "n_layers": 48, "n_heads": 25},
|
||||
}
|
||||
|
||||
CHOOSE_MODEL = "gpt2-medium (355M)"
|
||||
|
||||
BASE_CONFIG.update(model_configs[CHOOSE_MODEL])
|
||||
|
||||
model_size = CHOOSE_MODEL.split(" ")[-1].lstrip("(").rstrip(")")
|
||||
settings, params = download_and_load_gpt2(model_size=model_size, models_dir="gpt2")
|
||||
|
||||
model = GPTModel(BASE_CONFIG)
|
||||
load_weights_into_gpt(model, params)
|
||||
model.eval()
|
||||
model.to(device)
|
||||
|
||||
print("Loaded model:", CHOOSE_MODEL)
|
||||
print(50*"-")
|
||||
|
||||
#######################################
|
||||
# Finetuning the model
|
||||
#######################################
|
||||
print("Initial losses")
|
||||
with torch.no_grad():
|
||||
train_loss = calc_loss_loader(train_loader, model, device, num_batches=5)
|
||||
val_loss = calc_loss_loader(val_loader, model, device, num_batches=5)
|
||||
|
||||
print(" Training loss:", train_loss)
|
||||
print(" Validation loss:", val_loss)
|
||||
|
||||
start_time = time.time()
|
||||
optimizer = torch.optim.AdamW(model.parameters(), lr=0.00005, weight_decay=0.1)
|
||||
|
||||
num_epochs = 2
|
||||
|
||||
torch.manual_seed(123)
|
||||
train_losses, val_losses, tokens_seen = train_model_simple(
|
||||
model, train_loader, val_loader, optimizer, device,
|
||||
num_epochs=num_epochs, eval_freq=5, eval_iter=5,
|
||||
start_context=format_input(val_data[0]), tokenizer=tokenizer
|
||||
)
|
||||
|
||||
end_time = time.time()
|
||||
execution_time_minutes = (end_time - start_time) / 60
|
||||
print(f"Training completed in {execution_time_minutes:.2f} minutes.")
|
||||
|
||||
epochs_tensor = torch.linspace(0, num_epochs, len(train_losses))
|
||||
plot_losses(epochs_tensor, tokens_seen, train_losses, val_losses)
|
||||
print(50*"-")
|
||||
|
||||
#######################################
|
||||
# Saving results
|
||||
#######################################
|
||||
print("Generating responses")
|
||||
for i, entry in tqdm(enumerate(test_data), total=len(test_data)):
|
||||
|
||||
input_text = format_input(entry)
|
||||
|
||||
token_ids = generate(
|
||||
model=model,
|
||||
idx=text_to_token_ids(input_text, tokenizer).to(device),
|
||||
max_new_tokens=256,
|
||||
context_size=BASE_CONFIG["context_length"],
|
||||
eos_id=50256
|
||||
)
|
||||
generated_text = token_ids_to_text(token_ids, tokenizer)
|
||||
response_text = generated_text[len(input_text):].replace("### Response:", "").strip()
|
||||
|
||||
test_data[i]["model_response"] = response_text
|
||||
|
||||
test_data_path = "instruction-data-with-response-standalone.json"
|
||||
with open(test_data_path, "w") as file:
|
||||
json.dump(test_data, file, indent=4) # "indent" for pretty-printing
|
||||
print(f"Responses saved as {test_data_path}")
|
||||
|
||||
file_name = f"{re.sub(r'[ ()]', '', CHOOSE_MODEL) }-sft-standalone.pth"
|
||||
torch.save(model.state_dict(), file_name)
|
||||
print(f"Model saved as {file_name}")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
|
||||
import argparse
|
||||
|
||||
parser = argparse.ArgumentParser(
|
||||
description="Finetune a GPT model for classification"
|
||||
)
|
||||
parser.add_argument(
|
||||
"--test_mode",
|
||||
default=False,
|
||||
action="store_true",
|
||||
help=("This flag runs the model in test mode for internal testing purposes. "
|
||||
"Otherwise, it runs the model as it is used in the chapter (recommended).")
|
||||
)
|
||||
args = parser.parse_args()
|
||||
|
||||
main(args.test_mode)
|
||||
@@ -0,0 +1,662 @@
|
||||
[
|
||||
{
|
||||
"instruction": "Rewrite the sentence using a simile.",
|
||||
"input": "The car is very fast.",
|
||||
"output": "The car is as fast as lightning.",
|
||||
"model_response": "The car is as fast as a bullet."
|
||||
},
|
||||
{
|
||||
"instruction": "What type of cloud is typically associated with thunderstorms?",
|
||||
"input": "",
|
||||
"output": "The type of cloud typically associated with thunderstorms is cumulonimbus.",
|
||||
"model_response": "The type of cloud associated with thunderstorms is a cumulus cloud."
|
||||
},
|
||||
{
|
||||
"instruction": "Name the author of 'Pride and Prejudice'.",
|
||||
"input": "",
|
||||
"output": "Jane Austen.",
|
||||
"model_response": "The author of 'Pride and Prejudice' is Jane Austen."
|
||||
},
|
||||
{
|
||||
"instruction": "What is the periodic symbol for chlorine?",
|
||||
"input": "",
|
||||
"output": "The periodic symbol for chlorine is Cl.",
|
||||
"model_response": "The periodic symbol for chlorine is C."
|
||||
},
|
||||
{
|
||||
"instruction": "Correct the punctuation in the sentence.",
|
||||
"input": "Its time to go home.",
|
||||
"output": "The corrected sentence should be: 'It's time to go home.'",
|
||||
"model_response": "It's time to go home."
|
||||
},
|
||||
{
|
||||
"instruction": "Rewrite the sentence.",
|
||||
"input": "The lecture was delivered in a clear manner.",
|
||||
"output": "The lecture was delivered clearly.",
|
||||
"model_response": "The lecture was delivered in a clear manner."
|
||||
},
|
||||
{
|
||||
"instruction": "Generate a humorous anecdote.",
|
||||
"input": "",
|
||||
"output": "Why was the math book sad? Because it had too many problems!",
|
||||
"model_response": "The cat jumped over the fence."
|
||||
},
|
||||
{
|
||||
"instruction": "Identify the correct spelling: 'recieve' or 'receive'.",
|
||||
"input": "",
|
||||
"output": "The correct spelling is 'receive'.",
|
||||
"model_response": "The correct spelling is 'receive'."
|
||||
},
|
||||
{
|
||||
"instruction": "Create a sentence using the word 'nostalgia'.",
|
||||
"input": "",
|
||||
"output": "Nostalgia washed over her as she looked through the old photos.",
|
||||
"model_response": "Nostalgia is a strong emotion that can be felt in the past."
|
||||
},
|
||||
{
|
||||
"instruction": "Classify the following numbers as prime or composite.",
|
||||
"input": ": 11, 14, 19.",
|
||||
"output": "Prime numbers: 11, 19\nComposite numbers: 14",
|
||||
"model_response": "Prime numbers: 11, 14, 19.Composite numbers: 14, 19."
|
||||
},
|
||||
{
|
||||
"instruction": "What is the capital of Denmark?",
|
||||
"input": "",
|
||||
"output": "The capital of Denmark is Copenhagen.",
|
||||
"model_response": "The capital of Denmark is Copenhagen."
|
||||
},
|
||||
{
|
||||
"instruction": "What is the opposite of 'wet'?",
|
||||
"input": "",
|
||||
"output": "The opposite of 'wet' is 'dry'.",
|
||||
"model_response": "The opposite of 'wet' is 'dry'."
|
||||
},
|
||||
{
|
||||
"instruction": "Identify the type of sentence.",
|
||||
"input": "Did you finish the report?",
|
||||
"output": "The type of sentence is interrogative.",
|
||||
"model_response": "The type of sentence is imperative."
|
||||
},
|
||||
{
|
||||
"instruction": "What is the chemical symbol for mercury?",
|
||||
"input": "",
|
||||
"output": "Hg.",
|
||||
"model_response": "The chemical symbol for mercury is Hg."
|
||||
},
|
||||
{
|
||||
"instruction": "Describe the process of active transport in cells.",
|
||||
"input": "",
|
||||
"output": "Active transport is the movement of molecules across a cell membrane from a region of lower concentration to a region of higher concentration, using energy in the form of ATP. This process is crucial for maintaining cellular homeostasis.",
|
||||
"model_response": "Transport is the process by which a substance moves from one place to another by chemical change. It typically occurs by chemical reactions between atoms or molecules."
|
||||
},
|
||||
{
|
||||
"instruction": "Classify the following items as either solid, liquid, or gas.",
|
||||
"input": "Mercury, oxygen, wood",
|
||||
"output": "Mercury - Liquid\nOxygen - Gas\nWood - Solid",
|
||||
"model_response": ", water\n\n\nSolid: Carbon\nLiquefied Carbon: Oxygen\nGas: Carbon dioxide\nGaseous Carbon: Carbon monoxide"
|
||||
},
|
||||
{
|
||||
"instruction": "Convert 3 kilometers to meters.",
|
||||
"input": "",
|
||||
"output": "3 kilometers is 3000 meters.",
|
||||
"model_response": "3 kilometers is 3000 meters."
|
||||
},
|
||||
{
|
||||
"instruction": "Convert the sentence to use an indefinite pronoun.",
|
||||
"input": "Someone left a note.",
|
||||
"output": "A note was left by someone.",
|
||||
"model_response": "A note was left by someone."
|
||||
},
|
||||
{
|
||||
"instruction": "Provide a synonym for 'excited'.",
|
||||
"input": "",
|
||||
"output": "A synonym for 'excited' is 'thrilled'.",
|
||||
"model_response": "A synonym for 'excited' is 'enthusiastic'."
|
||||
},
|
||||
{
|
||||
"instruction": "Generate a sentence that follows the pattern: \"Never have I ever _____ without _____\"",
|
||||
"input": "",
|
||||
"output": "Never have I ever traveled without a map.",
|
||||
"model_response": "Never have I ever had without having."
|
||||
},
|
||||
{
|
||||
"instruction": "Pick out the adjective from the following list.",
|
||||
"input": "run, tall, quickly",
|
||||
"output": "The correct adjective from the list is 'tall.'",
|
||||
"model_response": "The adjective from the list is 'quick'."
|
||||
},
|
||||
{
|
||||
"instruction": "Convert 1000 grams to kilograms.",
|
||||
"input": "",
|
||||
"output": "1000 grams is equal to 1 kilogram.",
|
||||
"model_response": "1000 grams is 0.1 kilograms."
|
||||
},
|
||||
{
|
||||
"instruction": "What is the opposite of 'deep'?",
|
||||
"input": "",
|
||||
"output": "The opposite of 'deep' is 'shallow'.",
|
||||
"model_response": "The opposite of 'deep' is 'light'."
|
||||
},
|
||||
{
|
||||
"instruction": "Categorize the given list of animals.",
|
||||
"input": "Shark, Dolphin, Trout",
|
||||
"output": "Fish: Shark, Trout\nMammals: Dolphin",
|
||||
"model_response": "Animals: Shark, Dolphin, Trout"
|
||||
},
|
||||
{
|
||||
"instruction": "Translate 'library' into Spanish.",
|
||||
"input": "",
|
||||
"output": "The Spanish word for 'library' is 'biblioteca'.",
|
||||
"model_response": "The Spanish translation of 'library' is 'l\u00edmite'."
|
||||
},
|
||||
{
|
||||
"instruction": "Define the term 'hyperbole'.",
|
||||
"input": "",
|
||||
"output": "Hyperbole is a figure of speech that involves an exaggeration of ideas for the sake of emphasis.",
|
||||
"model_response": "Hyperbole is a word used to describe a statement that is exaggerated or exaggerated in meaning."
|
||||
},
|
||||
{
|
||||
"instruction": "Translate 'Hello' into Russian.",
|
||||
"input": "",
|
||||
"output": "\u041f\u0440\u0438\u0432\u0435\u0442 (Privet)",
|
||||
"model_response": "The Russian translation of 'Hello' is '\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \u043f\u0440\u043e\u0438\u0437\u0432\u0435\u0434\u0435\u043d\u0438\u0439' (Vyacheslav Krasnoyarski)."
|
||||
},
|
||||
{
|
||||
"instruction": "Define the term 'kinetic energy'.",
|
||||
"input": "",
|
||||
"output": "Kinetic energy is the energy that an object possesses due to its motion.",
|
||||
"model_response": "Kinetic energy is the force that causes an object to accelerate or decelerate in response to an applied force."
|
||||
},
|
||||
{
|
||||
"instruction": "What is the opposite of 'hot'?",
|
||||
"input": "",
|
||||
"output": "The opposite of 'hot' is 'cold'.",
|
||||
"model_response": "The opposite of 'hot' is 'cold'."
|
||||
},
|
||||
{
|
||||
"instruction": "Convert 5 miles to kilometers.",
|
||||
"input": "",
|
||||
"output": "5 miles is approximately 8.05 kilometers.",
|
||||
"model_response": "5 miles is 5000 meters."
|
||||
},
|
||||
{
|
||||
"instruction": "What is the chemical formula for magnesium sulfate?",
|
||||
"input": "",
|
||||
"output": "The chemical formula for magnesium sulfate is MgSO4.",
|
||||
"model_response": "The chemical formula for magnesium sulfate is MgSO4."
|
||||
},
|
||||
{
|
||||
"instruction": "Rewrite the following sentence to replace any clich\u00e9 phrases.",
|
||||
"input": "It's a piece of cake",
|
||||
"output": "It's very easy.",
|
||||
"model_response": ".\n\n\nA piece of cake is a cake."
|
||||
},
|
||||
{
|
||||
"instruction": "List five different types of vegetables.",
|
||||
"input": "",
|
||||
"output": "1. Carrot\n2. Broccoli\n3. Cucumber\n4. Tomato\n5. Spinach",
|
||||
"model_response": "1. Cucumber\n2. Broccoli\n3. Spinach\n4. Tomato\n5. Cucumber"
|
||||
},
|
||||
{
|
||||
"instruction": "Convert 7 kilometers to meters.",
|
||||
"input": "",
|
||||
"output": "7 kilometers is 7000 meters.",
|
||||
"model_response": "7 kilometers is 7.048 meters."
|
||||
},
|
||||
{
|
||||
"instruction": "What is the opposite of 'heavy'?",
|
||||
"input": "",
|
||||
"output": "The opposite of 'heavy' is 'light'.",
|
||||
"model_response": "The opposite of 'heavy' is 'light'."
|
||||
},
|
||||
{
|
||||
"instruction": "What is the past tense of 'sing'?",
|
||||
"input": "",
|
||||
"output": "The past tense of 'sing' is 'sang.'",
|
||||
"model_response": "The past tense of 'sing' is 'to sing.'"
|
||||
},
|
||||
{
|
||||
"instruction": "What is the molecular formula for carbon dioxide?",
|
||||
"input": "",
|
||||
"output": "The molecular formula for carbon dioxide is CO2.",
|
||||
"model_response": "The molecular formula for carbon dioxide is CO2."
|
||||
},
|
||||
{
|
||||
"instruction": "Convert this sentence to passive voice",
|
||||
"input": "The gardener watered the plants.",
|
||||
"output": "The plants were watered by the gardener.",
|
||||
"model_response": "The plants were watered by the gardener."
|
||||
},
|
||||
{
|
||||
"instruction": "What is the past tense of 'throw'?",
|
||||
"input": "",
|
||||
"output": "The past tense of 'throw' is 'threw'.",
|
||||
"model_response": "The past tense of 'throw' is 'to throw'."
|
||||
},
|
||||
{
|
||||
"instruction": "Explain what a sonnet is.",
|
||||
"input": "",
|
||||
"output": "A sonnet is a 14-line poem with a specific rhyme scheme and meter, often written in iambic pentameter.",
|
||||
"model_response": "A sonnet is a poem composed in the first person."
|
||||
},
|
||||
{
|
||||
"instruction": "Generate a sentence using the word 'innovative'.",
|
||||
"input": "",
|
||||
"output": "The company is known for its innovative products.",
|
||||
"model_response": "She was very innovative and always had a new idea."
|
||||
},
|
||||
{
|
||||
"instruction": "Provide the plural form of 'cactus'.",
|
||||
"input": "",
|
||||
"output": "The plural form of 'cactus' is 'cacti'.",
|
||||
"model_response": "The plural form of 'cactus' is 'cacti'."
|
||||
},
|
||||
{
|
||||
"instruction": "Translate the phrase 'Where is the bathroom?' into German.",
|
||||
"input": "",
|
||||
"output": "The German translation of 'Where is the bathroom?' is 'Wo ist die Toilette?'",
|
||||
"model_response": "The German translation of 'Where is the bathroom?' is 'Wie es \u00fcber das B\u00fccher?'."
|
||||
},
|
||||
{
|
||||
"instruction": "Generate a past-tense verb that describes a person laughing.",
|
||||
"input": "",
|
||||
"output": "Laughed.",
|
||||
"model_response": "She laughed."
|
||||
},
|
||||
{
|
||||
"instruction": "Generate a sentence using the word 'transient.'",
|
||||
"input": "",
|
||||
"output": "The transient nature of her visit left a lasting impression.",
|
||||
"model_response": "She was seen by a passing car as it passed by."
|
||||
},
|
||||
{
|
||||
"instruction": "Generate a sentence using the word 'optimistic'.",
|
||||
"input": "",
|
||||
"output": "He remained optimistic despite the challenges he faced.",
|
||||
"model_response": "She was optimistic about the future."
|
||||
},
|
||||
{
|
||||
"instruction": "Re-word this sentence using an indirect question.",
|
||||
"input": "What time is the meeting?",
|
||||
"output": "Could you tell me what time the meeting is?",
|
||||
"model_response": "The meeting is at 7:00 PM."
|
||||
},
|
||||
{
|
||||
"instruction": "Categorize the following sentence as a statement, a question, or an exclamation.",
|
||||
"input": "What a beautiful day!",
|
||||
"output": "Exclamation.",
|
||||
"model_response": "A statement could be classified as a question."
|
||||
},
|
||||
{
|
||||
"instruction": "What is the opposite of 'rich'?",
|
||||
"input": "",
|
||||
"output": "The opposite of 'rich' is 'poor'.",
|
||||
"model_response": "The opposite of 'rich' is 'poor'."
|
||||
},
|
||||
{
|
||||
"instruction": "Find a synonym for the given verb.",
|
||||
"input": "Begin",
|
||||
"output": "Commence",
|
||||
"model_response": "Begin\n\n### Input:\nBegin\n\n\nBegin\n\n### Input:\nBegin\n\nBegin\n### Input:\nBegin\n\nBegin"
|
||||
},
|
||||
{
|
||||
"instruction": "Edit the given text to ensure all plural nouns are spelled correctly.",
|
||||
"input": "The birds sings beautiful songs.",
|
||||
"output": "The birds sing beautiful songs.",
|
||||
"model_response": "The birds sings beautiful songs."
|
||||
},
|
||||
{
|
||||
"instruction": "Transform the following sentence into a question using \"could.\"",
|
||||
"input": "You can help me tomorrow.",
|
||||
"output": "Could you help me tomorrow?",
|
||||
"model_response": "Could you help me tomorrow?"
|
||||
},
|
||||
{
|
||||
"instruction": "Classify the following items: bicycle, rose, tiger.",
|
||||
"input": "",
|
||||
"output": "Vehicles: Bicycle\nPlants: Rose\nAnimals: Tiger",
|
||||
"model_response": "Animals: Tiger\nBicyclist: Tiger\nRose: Rose\nBike: Tiger"
|
||||
},
|
||||
{
|
||||
"instruction": "Define the term 'irony'.",
|
||||
"input": "",
|
||||
"output": "Irony is a figure of speech in which words are used in such a way that their intended meaning is different from the actual meaning of the words.",
|
||||
"model_response": "An irony is a statement that is at odds with the truth."
|
||||
},
|
||||
{
|
||||
"instruction": "Translate 'Welcome' into German.",
|
||||
"input": "",
|
||||
"output": "The German translation of 'Welcome' is 'Willkommen'.",
|
||||
"model_response": "The German translation of 'Welcome' is 'Wie es tut mir leid'."
|
||||
},
|
||||
{
|
||||
"instruction": "Explain the primary function of the human heart.",
|
||||
"input": "",
|
||||
"output": "The primary function of the human heart is to pump blood throughout the body, delivering oxygen and nutrients to tissues and removing carbon dioxide and other wastes.",
|
||||
"model_response": "The primary function of the human heart is to pump blood to the brain and to supply the muscles and nerves that control movement."
|
||||
},
|
||||
{
|
||||
"instruction": "Reword the following sentence to the future tense.",
|
||||
"input": "He is reading a novel inspired by his grandmother.",
|
||||
"output": "He will be reading a novel inspired by his grandmother.",
|
||||
"model_response": "He is reading a novel inspired by his grandmother."
|
||||
},
|
||||
{
|
||||
"instruction": "Convert the given sentence into active voice.",
|
||||
"input": "The law was passed by the government.",
|
||||
"output": "The government passed the law.",
|
||||
"model_response": "The law was passed by the government."
|
||||
},
|
||||
{
|
||||
"instruction": "Create a sentence using the word 'inevitable'.",
|
||||
"input": "",
|
||||
"output": "The confrontation was inevitable given the circumstances.",
|
||||
"model_response": "The storm was inevitable."
|
||||
},
|
||||
{
|
||||
"instruction": "Categorize the following sentence as either factual or opinion-based.",
|
||||
"input": "Chocolate is the best dessert.",
|
||||
"output": "Opinion-based.",
|
||||
"model_response": "Categorize the following sentence as either factual or opinion-based."
|
||||
},
|
||||
{
|
||||
"instruction": "What is an antonym of 'old'?",
|
||||
"input": "",
|
||||
"output": "young.",
|
||||
"model_response": "An antonym of 'old' is 'young'."
|
||||
},
|
||||
{
|
||||
"instruction": "Provide a synonym for 'hardworking'.",
|
||||
"input": "",
|
||||
"output": "A synonym for 'hardworking' is 'diligent'.",
|
||||
"model_response": "A synonym for 'hardworking' is 'smart'."
|
||||
},
|
||||
{
|
||||
"instruction": "What is the boiling point of sulfur in Celsius?",
|
||||
"input": "",
|
||||
"output": "The boiling point of sulfur is 444.6 degrees Celsius.",
|
||||
"model_response": "The boiling point of sulfur is -114.5 degrees Celsius."
|
||||
},
|
||||
{
|
||||
"instruction": "What is the plural form of 'child'?",
|
||||
"input": "",
|
||||
"output": "The plural form of 'child' is 'children'.",
|
||||
"model_response": "The plural form of 'child' is 'chunk'."
|
||||
},
|
||||
{
|
||||
"instruction": "What is an antonym of 'complicated'?",
|
||||
"input": "",
|
||||
"output": "An antonym of 'complicated' is 'simple'.",
|
||||
"model_response": "An antonym of 'complicated' is 'simple'."
|
||||
},
|
||||
{
|
||||
"instruction": "Name three forms of water.",
|
||||
"input": "",
|
||||
"output": "The three forms of water are solid (ice), liquid (water), and gas (steam).",
|
||||
"model_response": "Three forms of water are rain, snow, and water."
|
||||
},
|
||||
{
|
||||
"instruction": "Rewrite this sentence as a question.",
|
||||
"input": "The dog chased the cat.",
|
||||
"output": "Did the dog chase the cat?",
|
||||
"model_response": "What is the cat chasing?"
|
||||
},
|
||||
{
|
||||
"instruction": "Split the following sentence into two declarative sentences: 'The movie was long but interesting.'",
|
||||
"input": "",
|
||||
"output": "The movie was long. It was interesting.",
|
||||
"model_response": "and 'The movie was interesting.'\n\n\nThe movie was long, but it was interesting."
|
||||
},
|
||||
{
|
||||
"instruction": "Classify the following substances as acid, base, or neutral.",
|
||||
"input": "Lemon juice, Soap, Water",
|
||||
"output": "Acid: Lemon juice\nBase: Soap\nNeutral: Water",
|
||||
"model_response": "Acid: Soap\nBase: Soap\nNeutral: Water"
|
||||
},
|
||||
{
|
||||
"instruction": "What is a synonym of 'sad'?",
|
||||
"input": "",
|
||||
"output": "A synonym for 'sad' is 'unhappy'.",
|
||||
"model_response": "A synonym for 'sad' is 'angry'."
|
||||
},
|
||||
{
|
||||
"instruction": "Correct any spelling mistakes in the given sentence.",
|
||||
"input": "I prefer homemade cookies to store boaght.",
|
||||
"output": "I prefer homemade cookies to store bought.",
|
||||
"model_response": "I prefer homemade cookies to store boaght."
|
||||
},
|
||||
{
|
||||
"instruction": "Generate a sentence using the word 'transient'.",
|
||||
"input": "",
|
||||
"output": "His stay in the city was transient, lasting only a couple of days.",
|
||||
"model_response": "She was seen by a passerby as she passed by."
|
||||
},
|
||||
{
|
||||
"instruction": "Translate 'I am lost' into Italian.",
|
||||
"input": "",
|
||||
"output": "The Italian translation of 'I am lost' is 'Mi sono perso' (if male) or 'Mi sono persa' (if female).",
|
||||
"model_response": "Il pi\u00f9 am\u00e9ricaino."
|
||||
},
|
||||
{
|
||||
"instruction": "Classify this text as a technical document or a narrative.",
|
||||
"input": "This manual provides instructions for installing the software.",
|
||||
"output": "Technical document",
|
||||
"model_response": "Technical document"
|
||||
},
|
||||
{
|
||||
"instruction": "Sort the following list in descending order.",
|
||||
"input": "10, 2, 25, 16, 7",
|
||||
"output": "25, 16, 10, 7, 2.",
|
||||
"model_response": ", 3, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25"
|
||||
},
|
||||
{
|
||||
"instruction": "Translate 'Can I have some water?' into French.",
|
||||
"input": "",
|
||||
"output": "Puis-je avoir de l'eau?",
|
||||
"model_response": "Can I have some water?"
|
||||
},
|
||||
{
|
||||
"instruction": "Create a simile with the word 'as cold as'.",
|
||||
"input": "",
|
||||
"output": "Her hands were as cold as ice.",
|
||||
"model_response": "The temperature of 'as cold as' is -38 degrees Celsius."
|
||||
},
|
||||
{
|
||||
"instruction": "Classify the following words by their grammatical categories: swim, beautiful, quickly",
|
||||
"input": "",
|
||||
"output": "Swim: Verb\nBeautiful: Adjective\nQuickly: Adverb",
|
||||
"model_response": "Classifying the following words by their grammatical categories: swim, beautiful, quickly"
|
||||
},
|
||||
{
|
||||
"instruction": "Calculate the density of an object with a mass of 15 grams and a volume of 5 cubic centimeters.",
|
||||
"input": "",
|
||||
"output": "The density of the object is 3 grams per cubic centimeter.",
|
||||
"model_response": "The density of the object is 15 grams/cm\u00b3."
|
||||
},
|
||||
{
|
||||
"instruction": "What is the abbreviation for 'Master of Business Administration'?",
|
||||
"input": "",
|
||||
"output": "The abbreviation for 'Master of Business Administration' is MBA.",
|
||||
"model_response": "The abbreviation for 'Master of Business Administration' is MBA."
|
||||
},
|
||||
{
|
||||
"instruction": "Convert the following number from Roman numerals: IX.",
|
||||
"input": "",
|
||||
"output": "The number IX in Roman numerals is 9.",
|
||||
"model_response": "5 to decimal numbers: 0.5.\n\n\n0.5 is 0.5."
|
||||
},
|
||||
{
|
||||
"instruction": "What is the opposite of 'horizontal'?",
|
||||
"input": "",
|
||||
"output": "The opposite of 'horizontal' is 'vertical'.",
|
||||
"model_response": "The opposite of 'horizontal' is 'vertical'."
|
||||
},
|
||||
{
|
||||
"instruction": "Translate 'Where can I buy tickets?' into Italian.",
|
||||
"input": "",
|
||||
"output": "The Italian translation for 'Where can I buy tickets?' is 'Dove posso comprare i biglietti?'",
|
||||
"model_response": "The Italian translation of 'Where can I buy tickets?' is 'Ti amo'."
|
||||
},
|
||||
{
|
||||
"instruction": "Rewrite the following sentence to replace any clich\u00e9s.",
|
||||
"input": "He was as cool as a cucumber.",
|
||||
"output": "He remained very calm.",
|
||||
"model_response": "He was as cool as a cucumber."
|
||||
},
|
||||
{
|
||||
"instruction": "Identify the main verb in the sentence.",
|
||||
"input": "The dog barked loudly.",
|
||||
"output": "The main verb in the sentence is 'barked'.",
|
||||
"model_response": "The main verb in the sentence is 'bark'."
|
||||
},
|
||||
{
|
||||
"instruction": "Generate a sentence using the word 'elucidate'.",
|
||||
"input": "",
|
||||
"output": "The professor attempted to elucidate the complex topic for his students.",
|
||||
"model_response": "The teacher had studied the book for a long time."
|
||||
},
|
||||
{
|
||||
"instruction": "Correct the sentence.",
|
||||
"input": "Me and my friend went to the store.",
|
||||
"output": "My friend and I went to the store.",
|
||||
"model_response": "We went to the store."
|
||||
},
|
||||
{
|
||||
"instruction": "What is the formula for calculating work done?",
|
||||
"input": "",
|
||||
"output": "The formula for calculating work done is work = force \u00d7 distance.",
|
||||
"model_response": "The formula for calculating work done is W=M*G*A."
|
||||
},
|
||||
{
|
||||
"instruction": "What is the chemical formula for ammonium nitrate?",
|
||||
"input": "",
|
||||
"output": "The chemical formula for ammonium nitrate is NH4NO3.",
|
||||
"model_response": "The chemical formula for ammonium nitrate is NH3."
|
||||
},
|
||||
{
|
||||
"instruction": "What is the molecular formula for water?",
|
||||
"input": "",
|
||||
"output": "The molecular formula for water is H2O.",
|
||||
"model_response": "The molecular formula for water is H2O2."
|
||||
},
|
||||
{
|
||||
"instruction": "Rewrite the given sentence to describe the same thing in a positive way.",
|
||||
"input": "The food was not good.",
|
||||
"output": "The food could use some improvement.",
|
||||
"model_response": "The food was not good."
|
||||
},
|
||||
{
|
||||
"instruction": "What is the opposite of 'lazy'?",
|
||||
"input": "",
|
||||
"output": "The opposite of 'lazy' is 'diligent'.",
|
||||
"model_response": "The opposite of 'lazy' is 'attentive'."
|
||||
},
|
||||
{
|
||||
"instruction": "Name three essential vitamins for human health.",
|
||||
"input": "",
|
||||
"output": "1. Vitamin A\n2. Vitamin C\n3. Vitamin D",
|
||||
"model_response": "1. B6\n2. B12\n3. B12\n### Instruction:\nWhat is the chemical symbol for iron?\n\n\nThe chemical symbol for iron is Fe."
|
||||
},
|
||||
{
|
||||
"instruction": "Define the term 'simile'.",
|
||||
"input": "",
|
||||
"output": "A simile is a figure of speech that directly compares two different things, often introduced with the words 'like' or 'as'.",
|
||||
"model_response": "A simile is a figurative device used to describe a person or thing that is simile or similitude in form or function."
|
||||
},
|
||||
{
|
||||
"instruction": "What is the boiling point of chlorine in Celsius?",
|
||||
"input": "",
|
||||
"output": "The boiling point of chlorine is -34 degrees Celsius.",
|
||||
"model_response": "The boiling point of chlorine is -196 degrees Celsius."
|
||||
},
|
||||
{
|
||||
"instruction": "Translate 'My name is' into French.",
|
||||
"input": "",
|
||||
"output": "The French translation of 'My name is' is 'Je m'appelle'.",
|
||||
"model_response": "The French translation of 'My name is' is 'Comment \u00e7a va?'."
|
||||
},
|
||||
{
|
||||
"instruction": "Convert 200 centimeters to meters.",
|
||||
"input": "",
|
||||
"output": "200 centimeters is 2 meters.",
|
||||
"model_response": "200 centimeters is 0.2 meters."
|
||||
},
|
||||
{
|
||||
"instruction": "What is the chemical symbol for zinc?",
|
||||
"input": "",
|
||||
"output": "Zn.",
|
||||
"model_response": "The chemical symbol for zinc is Zn."
|
||||
},
|
||||
{
|
||||
"instruction": "What is the formula for calculating force?",
|
||||
"input": "",
|
||||
"output": "The formula for calculating force is Force = mass x acceleration.",
|
||||
"model_response": "The formula for calculating force is (F = mv\u00b2 + g)f, where m is mass and g is acceleration."
|
||||
},
|
||||
{
|
||||
"instruction": "Generate a sentence using the word 'innovative'.",
|
||||
"input": "",
|
||||
"output": "The company's innovative approach set it apart from its competitors.",
|
||||
"model_response": "She was very innovative and always had a new idea."
|
||||
},
|
||||
{
|
||||
"instruction": "Generate a sentence using the word 'generous'.",
|
||||
"input": "",
|
||||
"output": "He is very generous and always helps those in need.",
|
||||
"model_response": "She was very generous and gave the money to the poor."
|
||||
},
|
||||
{
|
||||
"instruction": "What is an antonym of 'sharp'?",
|
||||
"input": "",
|
||||
"output": "An antonym of 'sharp' is 'dull'.",
|
||||
"model_response": "An antonym of 'sharp' is 'soft'."
|
||||
},
|
||||
{
|
||||
"instruction": "Describe the basic structure of a neuron.",
|
||||
"input": "",
|
||||
"output": "A neuron consists of three main parts: the cell body, which contains the nucleus; dendrites, which receive signals from other neurons; and an axon, which transmits signals to other neurons, muscles, or glands.",
|
||||
"model_response": "A neuron is a type of synapse in the brain."
|
||||
},
|
||||
{
|
||||
"instruction": "Translate 'see you later' into Spanish.",
|
||||
"input": "",
|
||||
"output": "Hasta luego",
|
||||
"model_response": "Est\u00e1n leyendo see you later."
|
||||
},
|
||||
{
|
||||
"instruction": "What are the first 10 elements of the Fibonacci sequence?",
|
||||
"input": "",
|
||||
"output": "0, 1, 1, 2, 3, 5, 8, 13, 21, 34.",
|
||||
"model_response": "The first 10 elements of the Fibonacci sequence are: 1, 3, 5, 7, 11, 13, 15, 20."
|
||||
},
|
||||
{
|
||||
"instruction": "What is an antonym of 'transparent'?",
|
||||
"input": "",
|
||||
"output": "An antonym of 'transparent' is 'opaque'.",
|
||||
"model_response": "An antonym of 'transparent' is 'illuminate'."
|
||||
},
|
||||
{
|
||||
"instruction": "What is the past tense of 'think'?",
|
||||
"input": "",
|
||||
"output": "The past tense of 'think' is 'thought'.",
|
||||
"model_response": "The past tense of 'think' is 'to think'."
|
||||
},
|
||||
{
|
||||
"instruction": "Classify each sentence as either declarative, interrogative, imperative, or exclamatory.",
|
||||
"input": "Please open the door.",
|
||||
"output": "The classification of the sentence 'Please open the door.' is imperative.",
|
||||
"model_response": "Please open the door."
|
||||
},
|
||||
{
|
||||
"instruction": "Rewrite the sentence to use a negative adverb.",
|
||||
"input": "She always remembers to call.",
|
||||
"output": "She never forgets to call.",
|
||||
"model_response": "She always remembers to call."
|
||||
},
|
||||
{
|
||||
"instruction": "Convert 50 miles per hour to kilometers per hour.",
|
||||
"input": "",
|
||||
"output": "50 miles per hour is approximately 80.47 kilometers per hour.",
|
||||
"model_response": "50 miles per hour is approximately 32.5 kilometers per hour."
|
||||
}
|
||||
]
|
||||
5502
Codes/ch07/01_main-chapter-code/instruction-data.json
Normal file
File diff suppressed because it is too large
219
Codes/ch07/01_main-chapter-code/load-finetuned-model.ipynb
Normal file
@ -0,0 +1,219 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "1545a16b-bc8d-4e49-b9a6-db6631e7483d",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"<table style=\"width:100%\">\n",
|
||||
"<tr>\n",
|
||||
"<td style=\"vertical-align:middle; text-align:left;\">\n",
|
||||
"<font size=\"2\">\n",
|
||||
"Supplementary code for the <a href=\"http://mng.bz/orYv\">Build a Large Language Model From Scratch</a> book by <a href=\"https://sebastianraschka.com\">Sebastian Raschka</a><br>\n",
|
||||
"<br>Code repository: <a href=\"https://github.com/rasbt/LLMs-from-scratch\">https://github.com/rasbt/LLMs-from-scratch</a>\n",
|
||||
"</font>\n",
|
||||
"</td>\n",
|
||||
"<td style=\"vertical-align:middle; text-align:left;\">\n",
|
||||
"<a href=\"http://mng.bz/orYv\"><img src=\"https://sebastianraschka.com/images/LLMs-from-scratch-images/cover-small.webp\" width=\"100px\"></a>\n",
|
||||
"</td>\n",
|
||||
"</tr>\n",
|
||||
"</table>"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "f3f83194-82b9-4478-9550-5ad793467bd0",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Load And Use Finetuned Model"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "466b564e-4fd5-4d76-a3a1-63f9f0993b7e",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"This notebook contains minimal code to load the finetuned model that was instruction finetuned and saved in chapter 7 via [ch07.ipynb](ch07.ipynb)."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 1,
|
||||
"id": "fd80e5f5-0f79-4a6c-bf31-2026e7d30e52",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"tiktoken version: 0.7.0\n",
|
||||
"torch version: 2.4.0\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"from importlib.metadata import version\n",
|
||||
"\n",
|
||||
"pkgs = [\n",
|
||||
" \"tiktoken\", # Tokenizer\n",
|
||||
" \"torch\", # Deep learning library\n",
|
||||
"]\n",
|
||||
"for p in pkgs:\n",
|
||||
" print(f\"{p} version: {version(p)}\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 2,
|
||||
"id": "ed86d6b7-f32d-4601-b585-a2ea3dbf7201",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from pathlib import Path\n",
|
||||
"\n",
|
||||
"finetuned_model_path = Path(\"gpt2-medium355M-sft.pth\")\n",
|
||||
"if not finetuned_model_path.exists():\n",
|
||||
" print(\n",
|
||||
" f\"Could not find '{finetuned_model_path}'.\\n\"\n",
|
||||
" \"Run the `ch07.ipynb` notebook to finetune and save the finetuned model.\"\n",
|
||||
" )"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 3,
|
||||
"id": "fb02584a-5e31-45d5-8377-794876907bc6",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from previous_chapters import GPTModel\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"BASE_CONFIG = {\n",
|
||||
" \"vocab_size\": 50257, # Vocabulary size\n",
|
||||
" \"context_length\": 1024, # Context length\n",
|
||||
" \"drop_rate\": 0.0, # Dropout rate\n",
|
||||
" \"qkv_bias\": True # Query-key-value bias\n",
|
||||
"}\n",
|
||||
"\n",
|
||||
"model_configs = {\n",
|
||||
" \"gpt2-small (124M)\": {\"emb_dim\": 768, \"n_layers\": 12, \"n_heads\": 12},\n",
|
||||
" \"gpt2-medium (355M)\": {\"emb_dim\": 1024, \"n_layers\": 24, \"n_heads\": 16},\n",
|
||||
" \"gpt2-large (774M)\": {\"emb_dim\": 1280, \"n_layers\": 36, \"n_heads\": 20},\n",
|
||||
" \"gpt2-xl (1558M)\": {\"emb_dim\": 1600, \"n_layers\": 48, \"n_heads\": 25},\n",
|
||||
"}\n",
|
||||
"\n",
|
||||
"CHOOSE_MODEL = \"gpt2-medium (355M)\"\n",
|
||||
"\n",
|
||||
"BASE_CONFIG.update(model_configs[CHOOSE_MODEL])\n",
|
||||
"\n",
|
||||
"model_size = CHOOSE_MODEL.split(\" \")[-1].lstrip(\"(\").rstrip(\")\")\n",
|
||||
"model = GPTModel(BASE_CONFIG)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 4,
|
||||
"id": "f1ccf2b7-176e-4cfd-af7a-53fb76010b94",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import torch\n",
|
||||
"\n",
|
||||
"model.load_state_dict(torch.load(\n",
|
||||
" \"gpt2-medium355M-sft.pth\",\n",
|
||||
" map_location=torch.device(\"cpu\"),\n",
|
||||
" weights_only=True\n",
|
||||
"))\n",
|
||||
"model.eval();"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 5,
|
||||
"id": "a1fd174e-9555-46c5-8780-19b0aa4f26e5",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import tiktoken\n",
|
||||
"\n",
|
||||
"tokenizer = tiktoken.get_encoding(\"gpt2\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 6,
|
||||
"id": "2a4c0129-efe5-46e9-bb90-ba08d407c1a2",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"prompt = \"\"\"Below is an instruction that describes a task. Write a response \n",
|
||||
"that appropriately completes the request.\n",
|
||||
"\n",
|
||||
"### Instruction:\n",
|
||||
"Convert the active sentence to passive: 'The chef cooks the meal every day.'\n",
|
||||
"\"\"\""
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 7,
|
||||
"id": "1e26862c-10b5-4a0f-9dd6-b6ddbad2fc3f",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"The meal is cooked every day by the chef.\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"from previous_chapters import (\n",
|
||||
" generate,\n",
|
||||
" text_to_token_ids,\n",
|
||||
" token_ids_to_text\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"def extract_response(response_text, input_text):\n",
|
||||
" return response_text[len(input_text):].replace(\"### Response:\", \"\").strip()\n",
|
||||
"\n",
|
||||
"torch.manual_seed(123)\n",
|
||||
"\n",
|
||||
"token_ids = generate(\n",
|
||||
" model=model,\n",
|
||||
" idx=text_to_token_ids(prompt, tokenizer),\n",
|
||||
" max_new_tokens=35,\n",
|
||||
" context_size=BASE_CONFIG[\"context_length\"],\n",
|
||||
" eos_id=50256\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"response = token_ids_to_text(token_ids, tokenizer)\n",
|
||||
"response = extract_response(response, prompt)\n",
|
||||
"print(response)"
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3 (ipykernel)",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.11.4"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 5
|
||||
}
|
||||
126
Codes/ch07/01_main-chapter-code/ollama_evaluate.py
Normal file
@ -0,0 +1,126 @@
|
||||
# Copyright (c) Sebastian Raschka under Apache License 2.0 (see LICENSE.txt).
|
||||
# Source for "Build a Large Language Model From Scratch"
|
||||
# - https://www.manning.com/books/build-a-large-language-model-from-scratch
|
||||
# Code: https://github.com/rasbt/LLMs-from-scratch
|
||||
#
|
||||
# A minimal script for evaluating instruction-finetuned model responses with Ollama, based on the code in chapter 7
|
||||
|
||||
import json
|
||||
import psutil
|
||||
from tqdm import tqdm
|
||||
import urllib.request
|
||||
|
||||
|
||||
def query_model(prompt, model="llama3", url="http://localhost:11434/api/chat"):
|
||||
# Create the data payload as a dictionary
|
||||
data = {
|
||||
"model": model,
|
||||
"messages": [
|
||||
{"role": "user", "content": prompt}
|
||||
],
|
||||
"options": { # Settings below are required for deterministic responses
|
||||
"seed": 123,
|
||||
"temperature": 0,
|
||||
"num_ctx": 2048
|
||||
}
|
||||
}
|
||||
|
||||
# Convert the dictionary to a JSON formatted string and encode it to bytes
|
||||
payload = json.dumps(data).encode("utf-8")
|
||||
|
||||
# Create a request object, setting the method to POST and adding necessary headers
|
||||
request = urllib.request.Request(url, data=payload, method="POST")
|
||||
request.add_header("Content-Type", "application/json")
|
||||
|
||||
# Send the request and capture the response
|
||||
response_data = ""
|
||||
with urllib.request.urlopen(request) as response:
|
||||
# Read and decode the response
|
||||
while True:
|
||||
line = response.readline().decode("utf-8")
|
||||
if not line:
|
||||
break
|
||||
response_json = json.loads(line)
|
||||
response_data += response_json["message"]["content"]
|
||||
|
||||
return response_data
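# Example usage (illustrative; assumes a local Ollama server with the llama3 model is running):
#   print(query_model("What is 1 + 1?", model="llama3"))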
|
||||
|
||||
|
||||
def check_if_running(process_name):
|
||||
running = False
|
||||
for proc in psutil.process_iter(["name"]):
|
||||
if process_name in proc.info["name"]:
|
||||
running = True
|
||||
break
|
||||
return running
|
||||
|
||||
|
||||
def format_input(entry):
|
||||
instruction_text = (
|
||||
f"Below is an instruction that describes a task. "
|
||||
f"Write a response that appropriately completes the request."
|
||||
f"\n\n### Instruction:\n{entry['instruction']}"
|
||||
)
|
||||
|
||||
input_text = f"\n\n### Input:\n{entry['input']}" if entry["input"] else ""
|
||||
|
||||
return instruction_text + input_text
|
||||
|
||||
|
||||
def main(file_path):
|
||||
ollama_running = check_if_running("ollama")
|
||||
|
||||
if not ollama_running:
|
||||
raise RuntimeError("Ollama not running. Launch ollama before proceeding.")
|
||||
print("Ollama running:", check_if_running("ollama"))
|
||||
|
||||
with open(file_path, "r") as file:
|
||||
test_data = json.load(file)
|
||||
|
||||
model = "llama3"
|
||||
scores = generate_model_scores(test_data, "model_response", model)
|
||||
print(f"Number of scores: {len(scores)} of {len(test_data)}")
|
||||
print(f"Average score: {sum(scores)/len(scores):.2f}\n")
|
||||
|
||||
|
||||
def generate_model_scores(json_data, json_key, model="llama3"):
|
||||
scores = []
|
||||
for entry in tqdm(json_data, desc="Scoring entries"):
|
||||
if entry[json_key] == "":
|
||||
scores.append(0)
|
||||
else:
|
||||
prompt = (
|
||||
f"Given the input `{format_input(entry)}` "
|
||||
f"and correct output `{entry['output']}`, "
|
||||
f"score the model response `{entry[json_key]}`"
|
||||
f" on a scale from 0 to 100, where 100 is the best score. "
|
||||
f"Respond with the integer number only."
|
||||
)
|
||||
score = query_model(prompt, model)
|
||||
try:
|
||||
scores.append(int(score))
|
||||
except ValueError:
|
||||
print(f"Could not convert score: {score}")
|
||||
continue
|
||||
|
||||
return scores
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
|
||||
import argparse
|
||||
|
||||
parser = argparse.ArgumentParser(
|
||||
description="Evaluate model responses with ollama"
|
||||
)
|
||||
parser.add_argument(
|
||||
"--file_path",
|
||||
required=True,
|
||||
help=(
|
||||
"The path to the test dataset `.json` file with the"
|
||||
" `'output'` and `'model_response'` keys"
|
||||
)
|
||||
)
|
||||
args = parser.parse_args()
|
||||
|
||||
main(file_path=args.file_path)
|
||||
470
Codes/ch07/01_main-chapter-code/previous_chapters.py
Normal file
@ -0,0 +1,470 @@
|
||||
# Copyright (c) Sebastian Raschka under Apache License 2.0 (see LICENSE.txt).
|
||||
# Source for "Build a Large Language Model From Scratch"
|
||||
# - https://www.manning.com/books/build-a-large-language-model-from-scratch
|
||||
# Code: https://github.com/rasbt/LLMs-from-scratch
|
||||
#
|
||||
# This file collects all the relevant code that we covered thus far
|
||||
# throughout Chapters 2-6.
|
||||
# This file can be run as a standalone script.
|
||||
|
||||
|
||||
import matplotlib.pyplot as plt
|
||||
from matplotlib.ticker import MaxNLocator
|
||||
import numpy as np
|
||||
import tiktoken
|
||||
import torch
|
||||
import torch.nn as nn
|
||||
from torch.utils.data import Dataset, DataLoader
|
||||
|
||||
|
||||
#####################################
|
||||
# Chapter 2
|
||||
#####################################
|
||||
|
||||
|
||||
class GPTDatasetV1(Dataset):
|
||||
def __init__(self, txt, tokenizer, max_length, stride):
|
||||
self.tokenizer = tokenizer
|
||||
self.input_ids = []
|
||||
self.target_ids = []
|
||||
|
||||
# Tokenize the entire text
|
||||
token_ids = tokenizer.encode(txt, allowed_special={"<|endoftext|>"})
|
||||
|
||||
# Use a sliding window to chunk the book into overlapping sequences of max_length
|
||||
for i in range(0, len(token_ids) - max_length, stride):
|
||||
input_chunk = token_ids[i:i + max_length]
|
||||
target_chunk = token_ids[i + 1: i + max_length + 1]
|
||||
self.input_ids.append(torch.tensor(input_chunk))
|
||||
self.target_ids.append(torch.tensor(target_chunk))
|
||||
|
||||
def __len__(self):
|
||||
return len(self.input_ids)
|
||||
|
||||
def __getitem__(self, idx):
|
||||
return self.input_ids[idx], self.target_ids[idx]
|
||||
|
||||
|
||||
def create_dataloader_v1(txt, batch_size=4, max_length=256,
|
||||
stride=128, shuffle=True, drop_last=True, num_workers=0):
|
||||
# Initialize the tokenizer
|
||||
tokenizer = tiktoken.get_encoding("gpt2")
|
||||
|
||||
# Create dataset
|
||||
dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)
|
||||
|
||||
# Create dataloader
|
||||
dataloader = DataLoader(
|
||||
dataset, batch_size=batch_size, shuffle=shuffle, drop_last=drop_last, num_workers=num_workers)
|
||||
|
||||
return dataloader
|
||||
|
||||
|
||||
#####################################
|
||||
# Chapter 3
|
||||
#####################################
|
||||
class MultiHeadAttention(nn.Module):
|
||||
def __init__(self, d_in, d_out, context_length, dropout, num_heads, qkv_bias=False):
|
||||
super().__init__()
|
||||
assert d_out % num_heads == 0, "d_out must be divisible by n_heads"
|
||||
|
||||
self.d_out = d_out
|
||||
self.num_heads = num_heads
|
||||
self.head_dim = d_out // num_heads # Reduce the projection dim to match desired output dim
|
||||
|
||||
self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
|
||||
self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
|
||||
self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
|
||||
self.out_proj = nn.Linear(d_out, d_out) # Linear layer to combine head outputs
|
||||
self.dropout = nn.Dropout(dropout)
|
||||
self.register_buffer('mask', torch.triu(torch.ones(context_length, context_length), diagonal=1))
|
||||
|
||||
def forward(self, x):
|
||||
b, num_tokens, d_in = x.shape
|
||||
|
||||
keys = self.W_key(x) # Shape: (b, num_tokens, d_out)
|
||||
queries = self.W_query(x)
|
||||
values = self.W_value(x)
|
||||
|
||||
# We implicitly split the matrix by adding a `num_heads` dimension
|
||||
# Unroll last dim: (b, num_tokens, d_out) -> (b, num_tokens, num_heads, head_dim)
|
||||
keys = keys.view(b, num_tokens, self.num_heads, self.head_dim)
|
||||
values = values.view(b, num_tokens, self.num_heads, self.head_dim)
|
||||
queries = queries.view(b, num_tokens, self.num_heads, self.head_dim)
|
||||
|
||||
# Transpose: (b, num_tokens, num_heads, head_dim) -> (b, num_heads, num_tokens, head_dim)
|
||||
keys = keys.transpose(1, 2)
|
||||
queries = queries.transpose(1, 2)
|
||||
values = values.transpose(1, 2)
|
||||
|
||||
# Compute scaled dot-product attention (aka self-attention) with a causal mask
|
||||
attn_scores = queries @ keys.transpose(2, 3) # Dot product for each head
|
||||
|
||||
# Original mask truncated to the number of tokens and converted to boolean
|
||||
mask_bool = self.mask.bool()[:num_tokens, :num_tokens]
|
||||
|
||||
# Use the mask to fill attention scores
|
||||
attn_scores.masked_fill_(mask_bool, -torch.inf)
|
||||
|
||||
attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)
|
||||
attn_weights = self.dropout(attn_weights)
|
||||
|
||||
# Shape: (b, num_tokens, num_heads, head_dim)
|
||||
context_vec = (attn_weights @ values).transpose(1, 2)
|
||||
|
||||
# Combine heads, where self.d_out = self.num_heads * self.head_dim
|
||||
context_vec = context_vec.reshape(b, num_tokens, self.d_out)
|
||||
context_vec = self.out_proj(context_vec) # optional projection
|
||||
|
||||
return context_vec
|
||||
|
||||
|
||||
#####################################
|
||||
# Chapter 4
|
||||
#####################################
|
||||
class LayerNorm(nn.Module):
|
||||
def __init__(self, emb_dim):
|
||||
super().__init__()
|
||||
self.eps = 1e-5
|
||||
self.scale = nn.Parameter(torch.ones(emb_dim))
|
||||
self.shift = nn.Parameter(torch.zeros(emb_dim))
|
||||
|
||||
def forward(self, x):
|
||||
mean = x.mean(dim=-1, keepdim=True)
|
||||
var = x.var(dim=-1, keepdim=True, unbiased=False)
|
||||
norm_x = (x - mean) / torch.sqrt(var + self.eps)
|
||||
return self.scale * norm_x + self.shift
|
||||
|
||||
|
||||
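# GELU activation using the tanh approximation, matching the original GPT-2 implementation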
class GELU(nn.Module):
|
||||
def __init__(self):
|
||||
super().__init__()
|
||||
|
||||
def forward(self, x):
|
||||
return 0.5 * x * (1 + torch.tanh(
|
||||
torch.sqrt(torch.tensor(2.0 / torch.pi)) *
|
||||
(x + 0.044715 * torch.pow(x, 3))
|
||||
))
|
||||
|
||||
|
||||
class FeedForward(nn.Module):
|
||||
def __init__(self, cfg):
|
||||
super().__init__()
|
||||
self.layers = nn.Sequential(
|
||||
nn.Linear(cfg["emb_dim"], 4 * cfg["emb_dim"]),
|
||||
GELU(),
|
||||
nn.Linear(4 * cfg["emb_dim"], cfg["emb_dim"]),
|
||||
)
|
||||
|
||||
def forward(self, x):
|
||||
return self.layers(x)
|
||||
|
||||
|
||||
class TransformerBlock(nn.Module):
|
||||
def __init__(self, cfg):
|
||||
super().__init__()
|
||||
self.att = MultiHeadAttention(
|
||||
d_in=cfg["emb_dim"],
|
||||
d_out=cfg["emb_dim"],
|
||||
context_length=cfg["context_length"],
|
||||
num_heads=cfg["n_heads"],
|
||||
dropout=cfg["drop_rate"],
|
||||
qkv_bias=cfg["qkv_bias"])
|
||||
self.ff = FeedForward(cfg)
|
||||
self.norm1 = LayerNorm(cfg["emb_dim"])
|
||||
self.norm2 = LayerNorm(cfg["emb_dim"])
|
||||
self.drop_resid = nn.Dropout(cfg["drop_rate"])
|
||||
|
||||
def forward(self, x):
|
||||
# Shortcut connection for attention block
|
||||
shortcut = x
|
||||
x = self.norm1(x)
|
||||
x = self.att(x) # Shape [batch_size, num_tokens, emb_size]
|
||||
x = self.drop_resid(x)
|
||||
x = x + shortcut # Add the original input back
|
||||
|
||||
# Shortcut connection for feed-forward block
|
||||
shortcut = x
|
||||
x = self.norm2(x)
|
||||
x = self.ff(x)
|
||||
x = self.drop_resid(x)
|
||||
x = x + shortcut # Add the original input back
|
||||
|
||||
return x
|
||||
|
||||
|
||||
class GPTModel(nn.Module):
|
||||
def __init__(self, cfg):
|
||||
super().__init__()
|
||||
self.tok_emb = nn.Embedding(cfg["vocab_size"], cfg["emb_dim"])
|
||||
self.pos_emb = nn.Embedding(cfg["context_length"], cfg["emb_dim"])
|
||||
self.drop_emb = nn.Dropout(cfg["drop_rate"])
|
||||
|
||||
self.trf_blocks = nn.Sequential(
|
||||
*[TransformerBlock(cfg) for _ in range(cfg["n_layers"])])
|
||||
|
||||
self.final_norm = LayerNorm(cfg["emb_dim"])
|
||||
self.out_head = nn.Linear(cfg["emb_dim"], cfg["vocab_size"], bias=False)
|
||||
|
||||
def forward(self, in_idx):
|
||||
batch_size, seq_len = in_idx.shape
|
||||
tok_embeds = self.tok_emb(in_idx)
|
||||
pos_embeds = self.pos_emb(torch.arange(seq_len, device=in_idx.device))
|
||||
x = tok_embeds + pos_embeds # Shape [batch_size, num_tokens, emb_size]
|
||||
x = self.drop_emb(x)
|
||||
x = self.trf_blocks(x)
|
||||
x = self.final_norm(x)
|
||||
logits = self.out_head(x)
|
||||
return logits
|
||||
|
||||
|
||||
def generate_text_simple(model, idx, max_new_tokens, context_size):
|
||||
# idx is (B, T) array of indices in the current context
|
||||
for _ in range(max_new_tokens):
|
||||
|
||||
# Crop current context if it exceeds the supported context size
|
||||
# E.g., if LLM supports only 5 tokens, and the context size is 10
|
||||
# then only the last 5 tokens are used as context
|
||||
idx_cond = idx[:, -context_size:]
|
||||
|
||||
# Get the predictions
|
||||
with torch.no_grad():
|
||||
logits = model(idx_cond)
|
||||
|
||||
# Focus only on the last time step
|
||||
# (batch, n_token, vocab_size) becomes (batch, vocab_size)
|
||||
logits = logits[:, -1, :]
|
||||
|
||||
# Get the idx of the vocab entry with the highest logits value
|
||||
idx_next = torch.argmax(logits, dim=-1, keepdim=True) # (batch, 1)
|
||||
|
||||
# Append sampled index to the running sequence
|
||||
idx = torch.cat((idx, idx_next), dim=1) # (batch, n_tokens+1)
|
||||
|
||||
return idx
|
||||
|
||||
|
||||
#####################################
|
||||
# Chapter 5
|
||||
#####################################
|
||||
def generate(model, idx, max_new_tokens, context_size, temperature=0.0, top_k=None, eos_id=None):
|
||||
|
||||
# For-loop is the same as before: Get logits, and only focus on last time step
|
||||
for _ in range(max_new_tokens):
|
||||
idx_cond = idx[:, -context_size:]
|
||||
with torch.no_grad():
|
||||
logits = model(idx_cond)
|
||||
logits = logits[:, -1, :]
|
||||
|
||||
# New: Filter logits with top_k sampling
|
||||
if top_k is not None:
|
||||
# Keep only top_k values
|
||||
top_logits, _ = torch.topk(logits, top_k)
|
||||
min_val = top_logits[:, -1]
|
||||
logits = torch.where(logits < min_val, torch.tensor(float('-inf')).to(logits.device), logits)
|
||||
|
||||
# New: Apply temperature scaling
|
||||
if temperature > 0.0:
|
||||
logits = logits / temperature
|
||||
|
||||
# Apply softmax to get probabilities
|
||||
probs = torch.softmax(logits, dim=-1) # (batch_size, context_len)
|
||||
|
||||
# Sample from the distribution
|
||||
idx_next = torch.multinomial(probs, num_samples=1) # (batch_size, 1)
|
||||
|
||||
# Otherwise same as before: get idx of the vocab entry with the highest logits value
|
||||
else:
|
||||
idx_next = torch.argmax(logits, dim=-1, keepdim=True) # (batch_size, 1)
|
||||
|
||||
if idx_next == eos_id: # Stop generating early if end-of-sequence token is encountered and eos_id is specified
|
||||
break
|
||||
|
||||
# Same as before: append sampled index to the running sequence
|
||||
idx = torch.cat((idx, idx_next), dim=1) # (batch_size, num_tokens+1)
|
||||
|
||||
return idx
|
||||
|
||||
|
||||
def train_model_simple(model, train_loader, val_loader, optimizer, device, num_epochs,
|
||||
eval_freq, eval_iter, start_context, tokenizer):
|
||||
# Initialize lists to track losses and tokens seen
|
||||
train_losses, val_losses, track_tokens_seen = [], [], []
|
||||
tokens_seen, global_step = 0, -1
|
||||
|
||||
# Main training loop
|
||||
for epoch in range(num_epochs):
|
||||
model.train() # Set model to training mode
|
||||
|
||||
for input_batch, target_batch in train_loader:
|
||||
optimizer.zero_grad() # Reset loss gradients from previous batch iteration
|
||||
loss = calc_loss_batch(input_batch, target_batch, model, device)
|
||||
loss.backward() # Calculate loss gradients
|
||||
optimizer.step() # Update model weights using loss gradients
|
||||
tokens_seen += input_batch.numel()
|
||||
global_step += 1
|
||||
|
||||
# Optional evaluation step
|
||||
if global_step % eval_freq == 0:
|
||||
train_loss, val_loss = evaluate_model(
|
||||
model, train_loader, val_loader, device, eval_iter)
|
||||
train_losses.append(train_loss)
|
||||
val_losses.append(val_loss)
|
||||
track_tokens_seen.append(tokens_seen)
|
||||
print(f"Ep {epoch+1} (Step {global_step:06d}): "
|
||||
f"Train loss {train_loss:.3f}, Val loss {val_loss:.3f}")
|
||||
|
||||
# Print a sample text after each epoch
|
||||
generate_and_print_sample(
|
||||
model, tokenizer, device, start_context
|
||||
)
|
||||
|
||||
return train_losses, val_losses, track_tokens_seen
|
||||
|
||||
|
||||
def evaluate_model(model, train_loader, val_loader, device, eval_iter):
|
||||
model.eval()
|
||||
with torch.no_grad():
|
||||
train_loss = calc_loss_loader(train_loader, model, device, num_batches=eval_iter)
|
||||
val_loss = calc_loss_loader(val_loader, model, device, num_batches=eval_iter)
|
||||
model.train()
|
||||
return train_loss, val_loss
|
||||
|
||||
|
||||
def generate_and_print_sample(model, tokenizer, device, start_context):
|
||||
model.eval()
|
||||
context_size = model.pos_emb.weight.shape[0]
|
||||
encoded = text_to_token_ids(start_context, tokenizer).to(device)
|
||||
with torch.no_grad():
|
||||
token_ids = generate_text_simple(
|
||||
model=model, idx=encoded,
|
||||
max_new_tokens=50, context_size=context_size
|
||||
)
|
||||
decoded_text = token_ids_to_text(token_ids, tokenizer)
|
||||
print(decoded_text.replace("\n", " ")) # Compact print format
|
||||
model.train()
|
||||
|
||||
|
||||
def assign(left, right):
|
||||
if left.shape != right.shape:
|
||||
raise ValueError(f"Shape mismatch. Left: {left.shape}, Right: {right.shape}")
|
||||
return torch.nn.Parameter(torch.tensor(right))
|
||||
|
||||
|
||||
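# Note: the pretrained OpenAI/TensorFlow checkpoint stores linear-layer weights as
# (in_features, out_features), so the .T transposes below convert them to the
# (out_features, in_features) layout expected by PyTorch's nn.Linear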
def load_weights_into_gpt(gpt, params):
|
||||
gpt.pos_emb.weight = assign(gpt.pos_emb.weight, params['wpe'])
|
||||
gpt.tok_emb.weight = assign(gpt.tok_emb.weight, params['wte'])
|
||||
|
||||
for b in range(len(params["blocks"])):
|
||||
q_w, k_w, v_w = np.split(
|
||||
(params["blocks"][b]["attn"]["c_attn"])["w"], 3, axis=-1)
|
||||
gpt.trf_blocks[b].att.W_query.weight = assign(
|
||||
gpt.trf_blocks[b].att.W_query.weight, q_w.T)
|
||||
gpt.trf_blocks[b].att.W_key.weight = assign(
|
||||
gpt.trf_blocks[b].att.W_key.weight, k_w.T)
|
||||
gpt.trf_blocks[b].att.W_value.weight = assign(
|
||||
gpt.trf_blocks[b].att.W_value.weight, v_w.T)
|
||||
|
||||
q_b, k_b, v_b = np.split(
|
||||
(params["blocks"][b]["attn"]["c_attn"])["b"], 3, axis=-1)
|
||||
gpt.trf_blocks[b].att.W_query.bias = assign(
|
||||
gpt.trf_blocks[b].att.W_query.bias, q_b)
|
||||
gpt.trf_blocks[b].att.W_key.bias = assign(
|
||||
gpt.trf_blocks[b].att.W_key.bias, k_b)
|
||||
gpt.trf_blocks[b].att.W_value.bias = assign(
|
||||
gpt.trf_blocks[b].att.W_value.bias, v_b)
|
||||
|
||||
gpt.trf_blocks[b].att.out_proj.weight = assign(
|
||||
gpt.trf_blocks[b].att.out_proj.weight,
|
||||
params["blocks"][b]["attn"]["c_proj"]["w"].T)
|
||||
gpt.trf_blocks[b].att.out_proj.bias = assign(
|
||||
gpt.trf_blocks[b].att.out_proj.bias,
|
||||
params["blocks"][b]["attn"]["c_proj"]["b"])
|
||||
|
||||
gpt.trf_blocks[b].ff.layers[0].weight = assign(
|
||||
gpt.trf_blocks[b].ff.layers[0].weight,
|
||||
params["blocks"][b]["mlp"]["c_fc"]["w"].T)
|
||||
gpt.trf_blocks[b].ff.layers[0].bias = assign(
|
||||
gpt.trf_blocks[b].ff.layers[0].bias,
|
||||
params["blocks"][b]["mlp"]["c_fc"]["b"])
|
||||
gpt.trf_blocks[b].ff.layers[2].weight = assign(
|
||||
gpt.trf_blocks[b].ff.layers[2].weight,
|
||||
params["blocks"][b]["mlp"]["c_proj"]["w"].T)
|
||||
gpt.trf_blocks[b].ff.layers[2].bias = assign(
|
||||
gpt.trf_blocks[b].ff.layers[2].bias,
|
||||
params["blocks"][b]["mlp"]["c_proj"]["b"])
|
||||
|
||||
gpt.trf_blocks[b].norm1.scale = assign(
|
||||
gpt.trf_blocks[b].norm1.scale,
|
||||
params["blocks"][b]["ln_1"]["g"])
|
||||
gpt.trf_blocks[b].norm1.shift = assign(
|
||||
gpt.trf_blocks[b].norm1.shift,
|
||||
params["blocks"][b]["ln_1"]["b"])
|
||||
gpt.trf_blocks[b].norm2.scale = assign(
|
||||
gpt.trf_blocks[b].norm2.scale,
|
||||
params["blocks"][b]["ln_2"]["g"])
|
||||
gpt.trf_blocks[b].norm2.shift = assign(
|
||||
gpt.trf_blocks[b].norm2.shift,
|
||||
params["blocks"][b]["ln_2"]["b"])
|
||||
|
||||
gpt.final_norm.scale = assign(gpt.final_norm.scale, params["g"])
|
||||
gpt.final_norm.shift = assign(gpt.final_norm.shift, params["b"])
|
||||
gpt.out_head.weight = assign(gpt.out_head.weight, params["wte"])
|
||||
|
||||
|
||||
def text_to_token_ids(text, tokenizer):
|
||||
encoded = tokenizer.encode(text, allowed_special={"<|endoftext|>"})
|
||||
encoded_tensor = torch.tensor(encoded).unsqueeze(0) # add batch dimension
|
||||
return encoded_tensor
|
||||
|
||||
|
||||
def token_ids_to_text(token_ids, tokenizer):
|
||||
flat = token_ids.squeeze(0) # remove batch dimension
|
||||
return tokenizer.decode(flat.tolist())
|
||||
|
||||
|
||||
def calc_loss_batch(input_batch, target_batch, model, device):
|
||||
input_batch, target_batch = input_batch.to(device), target_batch.to(device)
|
||||
logits = model(input_batch)
|
||||
loss = torch.nn.functional.cross_entropy(logits.flatten(0, 1), target_batch.flatten())
|
||||
return loss
|
||||
|
||||
|
||||
def calc_loss_loader(data_loader, model, device, num_batches=None):
|
||||
total_loss = 0.
|
||||
if len(data_loader) == 0:
|
||||
return float("nan")
|
||||
elif num_batches is None:
|
||||
num_batches = len(data_loader)
|
||||
else:
|
||||
# Reduce the number of batches to match the total number of batches in the data loader
|
||||
# if num_batches exceeds the number of batches in the data loader
|
||||
num_batches = min(num_batches, len(data_loader))
|
||||
for i, (input_batch, target_batch) in enumerate(data_loader):
|
||||
if i < num_batches:
|
||||
loss = calc_loss_batch(input_batch, target_batch, model, device)
|
||||
total_loss += loss.item()
|
||||
else:
|
||||
break
|
||||
return total_loss / num_batches
|
||||
|
||||
|
||||
def plot_losses(epochs_seen, tokens_seen, train_losses, val_losses):
|
||||
fig, ax1 = plt.subplots(figsize=(5, 3))
|
||||
|
||||
# Plot training and validation loss against epochs
|
||||
ax1.plot(epochs_seen, train_losses, label="Training loss")
|
||||
ax1.plot(epochs_seen, val_losses, linestyle="-.", label="Validation loss")
|
||||
ax1.set_xlabel("Epochs")
|
||||
ax1.set_ylabel("Loss")
|
||||
ax1.legend(loc="upper right")
|
||||
ax1.xaxis.set_major_locator(MaxNLocator(integer=True)) # only show integer labels on x-axis
|
||||
|
||||
# Create a second x-axis for tokens seen
|
||||
ax2 = ax1.twiny() # Create a second x-axis that shares the same y-axis
|
||||
ax2.plot(tokens_seen, train_losses, alpha=0) # Invisible plot for aligning ticks
|
||||
ax2.set_xlabel("Tokens seen")
|
||||
|
||||
fig.tight_layout() # Adjust layout to make room
|
||||
plt.savefig("loss-plot.pdf")
|
||||
plt.show()
|
||||
16
Codes/ch07/01_main-chapter-code/tests.py
Normal file
@ -0,0 +1,16 @@
|
||||
# Copyright (c) Sebastian Raschka under Apache License 2.0 (see LICENSE.txt).
|
||||
# Source for "Build a Large Language Model From Scratch"
|
||||
# - https://www.manning.com/books/build-a-large-language-model-from-scratch
|
||||
# Code: https://github.com/rasbt/LLMs-from-scratch
|
||||
|
||||
# File for internal use (unit tests)
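# The test below can be run with pytest, e.g.: pytest tests.py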
|
||||
|
||||
|
||||
import subprocess
|
||||
|
||||
|
||||
def test_gpt_class_finetune():
|
||||
command = ["python", "ch06/01_main-chapter-code/gpt_class_finetune.py", "--test_mode"]
|
||||
|
||||
result = subprocess.run(command, capture_output=True, text=True)
|
||||
assert result.returncode == 0, f"Script exited with errors: {result.stderr}"
|
||||
80
Codes/ch07/02_dataset-utilities/README.md
Normal file
@ -0,0 +1,80 @@
|
||||
# Chapter 7: Finetuning to Follow Instructions
|
||||
|
||||
This folder contains utility code that can be used for preparing an instruction dataset.
|
||||
|
||||
Install the additional package requirements via:
|
||||
|
||||
```bash
|
||||
pip install -r requirements-extra.txt
|
||||
```
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
### Finding Near Duplicates
|
||||
|
||||
The `find-near-duplicates.py` script can be used to identify duplicates and near-duplicates in an instruction dataset. For example:
|
||||
|
||||
|
||||
|
||||
```bash
|
||||
python find-near-duplicates.py --json_file instruction-examples.json
|
||||
```
|
||||
|
||||
```
|
||||
scikit-learn version: 1.3.1
|
||||
|
||||
|
||||
==================================================
|
||||
Searching 'instruction' for duplicates ...
|
||||
==================================================
|
||||
Duplicate pair found with similarity 0.94:
|
||||
1. Edit the following sentence to make it more formal.
|
||||
2. Edit the sentence to make it more formal.
|
||||
|
||||
Duplicate pair found with similarity 1.00:
|
||||
1. Name a dwarf planet in our solar system.
|
||||
2. Name a dwarf planet in our solar system.
|
||||
|
||||
Duplicate pair found with similarity 0.91:
|
||||
1. Change the sentences from active voice to passive voice.
|
||||
2. Change the sentence from passive to active voice.
|
||||
|
||||
|
||||
|
||||
==================================================
|
||||
Searching 'input' for duplicates ...
|
||||
==================================================
|
||||
No duplicates found
|
||||
|
||||
|
||||
==================================================
|
||||
Searching 'output' for duplicates ...
|
||||
==================================================
|
||||
Duplicate pair found with similarity 1.00:
|
||||
1. One dwarf planet in our solar system is Pluto.
|
||||
2. One dwarf planet in our solar system is Pluto.
|
||||
|
||||
|
||||
```
|
||||
|
||||
|
||||
You can use the `--threshold` setting with a value between 0 and 1 to decrease or increase the sensitivity.
|
||||
The default threshold is 0.9.
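For example, lowering the threshold flags looser matches as near-duplicates (the value 0.8 below is only an illustration):

```bash
python find-near-duplicates.py --json_file instruction-examples.json --threshold 0.8
```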
|
||||
|
||||
|
||||
|
||||
|
||||
## Creating Passive Voice Entries
|
||||
|
||||
- The [create-passive-voice-entries.ipynb](create-passive-voice-entries.ipynb) notebook uses OpenAI's GPT-4 to create "passive voice" entries for an instruction dataset, as shown in the example below
|
||||
|
||||
```python
|
||||
{
|
||||
'instruction': 'Identify the verb in the following sentence',
|
||||
'input': 'The cat sleeps on the couch.',
|
||||
'output': 'The verb in the sentence is "sleeps."',
|
||||
'output_2': 'The sentence is "sleeps."' # <---- Newly created entry
|
||||
}
|
||||
```
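Below is a minimal sketch of how such an entry can be created with the OpenAI Python client; the prompt wording and model name mirror the notebook shown further down, so treat the snippet as an illustration rather than the notebook's exact code.

```python
from openai import OpenAI

client = OpenAI(api_key="sk-...")  # in the notebook, the key is read from config.json

def to_passive(text, model="gpt-4-turbo"):
    # Ask GPT-4 to rewrite the reference answer in passive voice
    prompt = (
        "Without adding any response or explanation, "
        f"convert the following text to passive voice: {text}"
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,  # deterministic output
    )
    return response.choices[0].message.content

entry = {
    "instruction": "Identify the verb in the following sentence",
    "input": "The cat sleeps on the couch.",
    "output": 'The verb in the sentence is "sleeps."',
}
entry["output_2"] = to_passive(entry["output"])  # newly created passive-voice entry
```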
|
||||
4
Codes/ch07/02_dataset-utilities/config.json
Normal file
@ -0,0 +1,4 @@
|
||||
{
|
||||
"OPENAI_API_KEY": "sk-...",
|
||||
"_comment": "Enter your API key from https://platform.openai.com/api-keys"
|
||||
}
|
||||
426
Codes/ch07/02_dataset-utilities/create-passive-voice-entries.ipynb
Normal file
@ -0,0 +1,426 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "136a4efe-fb99-4311-8679-e0a5b6282755",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"<table style=\"width:100%\">\n",
|
||||
"<tr>\n",
|
||||
"<td style=\"vertical-align:middle; text-align:left;\">\n",
|
||||
"<font size=\"2\">\n",
|
||||
"Supplementary code for the <a href=\"http://mng.bz/orYv\">Build a Large Language Model From Scratch</a> book by <a href=\"https://sebastianraschka.com\">Sebastian Raschka</a><br>\n",
|
||||
"<br>Code repository: <a href=\"https://github.com/rasbt/LLMs-from-scratch\">https://github.com/rasbt/LLMs-from-scratch</a>\n",
|
||||
"</font>\n",
|
||||
"</td>\n",
|
||||
"<td style=\"vertical-align:middle; text-align:left;\">\n",
|
||||
"<a href=\"http://mng.bz/orYv\"><img src=\"https://sebastianraschka.com/images/LLMs-from-scratch-images/cover-small.webp\" width=\"100px\"></a>\n",
|
||||
"</td>\n",
|
||||
"</tr>\n",
|
||||
"</table>"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "b1910a06-e8a3-40ac-8201-ff70615b1ba4",
|
||||
"metadata": {
|
||||
"tags": []
|
||||
},
|
||||
"source": [
|
||||
"# Create \"Passive Voice\" Entries for an Instruction Dataset"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "a128651b-f326-4232-a994-42f38b7ed520",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"- This notebook uses OpenAI's GPT-4 to create \"passive voice\" entries for an instruction dataset, as shown in the example below\n",
|
||||
"\n",
|
||||
"```python\n",
|
||||
"{ \n",
|
||||
" 'instruction': 'Identify the verb in the following sentence',\n",
|
||||
" 'input': 'The cat sleeps on the couch.',\n",
|
||||
" 'output': 'The verb in the sentence is \"sleeps.\"',\n",
|
||||
" 'output_2': 'The sentence is \"sleeps.\"' # <---- Newly created entry\n",
|
||||
"} \n",
|
||||
"```"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 1,
|
||||
"id": "267ba0d1-b884-42df-85bd-0be746fd47a5",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# pip install -r requirements-extra.txt"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 2,
|
||||
"id": "63610acc-db94-437f-8d38-e99dca0299cb",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"openai version: 1.30.3\n",
|
||||
"tqdm version: 4.65.0\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"from importlib.metadata import version\n",
|
||||
"\n",
|
||||
"pkgs = [\"openai\", # OpenAI API\n",
|
||||
" \"tqdm\", # Progress bar\n",
|
||||
" ]\n",
|
||||
"\n",
|
||||
"for p in pkgs:\n",
|
||||
" print(f\"{p} version: {version(p)}\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "8bcdcb34-ac75-4f4f-9505-3ce0666c42d5",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Test OpenAI API"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "9558a522-650d-401a-84fc-9fd7b1f39da7",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"- First, let's test if the OpenAI API is correctly set up\n",
|
||||
"- If you don't have an account yet, you need to create one at https://platform.openai.com/\n",
|
||||
"- Note that you will also have to transfer some funds to your account as the GPT-4 API is not free (see https://platform.openai.com/settings/organization/billing/overview)\n",
|
||||
"- Creating the ~200 passive voice entries using the code in this notebook costs about $0.13 (13 cents)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "89343a84-0ddc-42fc-bf50-298a342b93c0",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"- First, we need to provide our OpenAI API secret key, which can be found at https://platform.openai.com/api-keys\n",
|
||||
"- Make sure not to share this key with anyone\n",
|
||||
"- Add this secret key (`\"sk-...\"`) to the `config.json` file in this folder"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 3,
|
||||
"id": "26900564-aba7-48ba-8ee8-6cc9a505a25c",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import json\n",
|
||||
"from openai import OpenAI\n",
|
||||
"\n",
|
||||
"# Load API key from a JSON file. \n",
|
||||
"# Make sure to replace \"sk-...\" with your actual API key from https://platform.openai.com/api-keys\n",
|
||||
"with open(\"config.json\", \"r\") as config_file:\n",
|
||||
" config = json.load(config_file)\n",
|
||||
" api_key = config[\"OPENAI_API_KEY\"]\n",
|
||||
"\n",
|
||||
"client = OpenAI(api_key=api_key)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "16642a48-1cab-40d2-af08-ab8c2fbf5876",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"- First, let's try the API with a simple example to make sure it works as intended:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 4,
|
||||
"id": "08e9ef2e-e816-4283-840e-43625791ad33",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"'Breakfast was eaten by me.'"
|
||||
]
|
||||
},
|
||||
"execution_count": 4,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"def run_chatgpt(prompt, client, model=\"gpt-4-turbo\"):\n",
|
||||
" response = client.chat.completions.create(\n",
|
||||
" model=model,\n",
|
||||
" messages=[{\"role\": \"user\", \"content\": prompt}],\n",
|
||||
" temperature=0.0,\n",
|
||||
" )\n",
|
||||
" return response.choices[0].message.content\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"# Prepare input\n",
|
||||
"sentence = \"I ate breakfast\"\n",
|
||||
"prompt = f\"Convert the following sentence to passive voice: '{sentence}'\"\n",
|
||||
"run_chatgpt(prompt, client)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "162a4739-6f03-4092-a5c2-f57a0b6a4c4d",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Create JSON Entries"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "ca011a8b-20c5-4101-979e-9b5fccf62f8a",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"- Next, we load the file we want to modify:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 5,
|
||||
"id": "8b2d393a-aa92-4190-9d44-44326a6f699b",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Number of entries: 200\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"import json\n",
|
||||
"\n",
|
||||
"json_file = \"instruction-examples.json\"\n",
|
||||
"\n",
|
||||
"with open(json_file, \"r\") as file:\n",
|
||||
" json_data = json.load(file)\n",
|
||||
" \n",
|
||||
"print(\"Number of entries:\", len(json_data))"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "39a55283-7d51-4136-ba60-f799d49f4098",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"- And we try the OpenAI chat API on a small sample first to ensure that it works correctly:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 6,
|
||||
"id": "735cc089-d127-480a-b39d-0782581f0c41",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"\n",
|
||||
"Input:\n",
|
||||
">> The verb in the sentence is \"sleeps.\"\n",
|
||||
"\n",
|
||||
"Output:\n",
|
||||
">> The sentence is \"sleeps.\"\n",
|
||||
"\n",
|
||||
"-------------------------\n",
|
||||
"\n",
|
||||
"Input:\n",
|
||||
">> The plural form of \"goose\" is \"geese.\"\n",
|
||||
"\n",
|
||||
"Output:\n",
|
||||
">> The plural form of \"goose\" is referred to as \"geese.\"\n",
|
||||
"\n",
|
||||
"-------------------------\n",
|
||||
"\n",
|
||||
"Input:\n",
|
||||
">> The three primary colors are red, blue, and yellow.\n",
|
||||
"\n",
|
||||
"Output:\n",
|
||||
">> Red, blue, and yellow are considered the three primary colors.\n",
|
||||
"\n",
|
||||
"-------------------------\n",
|
||||
"\n",
|
||||
"Input:\n",
|
||||
">> They had finished the game.\n",
|
||||
"\n",
|
||||
"Output:\n",
|
||||
">> The game had been finished by them.\n",
|
||||
"\n",
|
||||
"-------------------------\n",
|
||||
"\n",
|
||||
"Input:\n",
|
||||
">> The abbreviation for \"Doctor of Philosophy\" is Ph.D.\n",
|
||||
"\n",
|
||||
"Output:\n",
|
||||
">> The abbreviation \"Ph.D.\" is used for \"Doctor of Philosophy\".\n",
|
||||
"\n",
|
||||
"-------------------------\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"for entry in json_data[:5]:\n",
|
||||
" text = entry[\"output\"]\n",
|
||||
" prompt = f\"Without adding any response or explanation, convert the following text to passive voice: {text}\"\n",
|
||||
" \n",
|
||||
" print(\"\\nInput:\")\n",
|
||||
" print(\">>\", text)\n",
|
||||
" print(\"\\nOutput:\")\n",
|
||||
" print(\">>\", run_chatgpt(prompt, client))\n",
|
||||
" print(\"\\n-------------------------\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "142dfaa7-429f-4eb0-b74d-ff327f79547a",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"- Let's now extend the code to add the generated entries to the `json_data` and add a progress bar:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 7,
|
||||
"id": "4f700d4b-19e5-4404-afa7-b0f093024232",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stderr",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"100%|██████████████████████████████████████████████████████████████████████| 5/5 [00:04<00:00, 1.23it/s]\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"from tqdm import tqdm # a progress bar tool\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"for i, entry in tqdm(enumerate(json_data[:5]), total=len(json_data[:5])):\n",
|
||||
" text = entry[\"output\"]\n",
|
||||
" prompt = f\"Without adding any response or explanation, convert the following text to passive voice: {text}\"\n",
|
||||
" json_data[i][\"output_2\"] = run_chatgpt(prompt, client)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "cd144282-0596-4e9b-9815-322cff34b400",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"- One more time, let's make sure that the new entries (`\"output_2\"`) look ok"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 8,
|
||||
"id": "5b6eaa87-a86d-42a1-a20a-b764b0d559d4",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"{'instruction': 'Identify the verb in the following sentence: The cat sleeps on the couch.',\n",
|
||||
" 'input': '',\n",
|
||||
" 'output': 'The verb in the sentence is \"sleeps.\"',\n",
|
||||
" 'output_2': 'The sentence is \"sleeps.\"'}"
|
||||
]
|
||||
},
|
||||
"execution_count": 8,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"json_data[0]"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "6970e8cf-2b18-4e3d-9f25-e6a4489c39a7",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"- Finally, if everything above looks ok, let's run the conversion to passive voice on our entire json dataset (this takes about 3 minutes):"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 9,
|
||||
"id": "eef99407-8ffd-4a63-b7ab-ffe30c0f0677",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stderr",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"100%|██████████████████████████████████████████████████████████████████| 200/200 [03:43<00:00, 1.12s/it]\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"for i, entry in tqdm(enumerate(json_data), total=len(json_data)):\n",
|
||||
" text = entry[\"output\"]\n",
|
||||
" prompt = f\"Without adding any response or explanation, convert the following text to passive voice: {text}\"\n",
|
||||
" json_data[i][\"output_2\"] = run_chatgpt(prompt, client)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "ac91ae85-2f0e-456a-be1d-56e1958f30d8",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"- After the conversion is completed, we save the file:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 10,
|
||||
"id": "330cc30a-b08e-4bf0-bee2-bec0da4208de",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"new_json_file = json_file.replace(\".json\", \"-modified.json\")\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"with open(new_json_file, \"w\") as file:\n",
|
||||
" json.dump(json_data, file, indent=4) # \"indent\" for pretty-printing"
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3 (ipykernel)",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.10.6"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 5
|
||||
}
|
||||
153
Codes/ch07/02_dataset-utilities/find-near-duplicates.py
Normal file
@ -0,0 +1,153 @@
# Copyright (c) Sebastian Raschka under Apache License 2.0 (see LICENSE.txt).
# Source for "Build a Large Language Model From Scratch"
# - https://www.manning.com/books/build-a-large-language-model-from-scratch
# Code: https://github.com/rasbt/LLMs-from-scratch

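# Example usage (the output file name below is only an illustration; see the argument parser further down):
#   python find-near-duplicates.py --json_file instruction-examples.json
#   python find-near-duplicates.py --json_file instruction-examples.json \
#       --remove_duplicates --json_output_file instruction-examples-deduped.json
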
import argparse
import json
import re
from sklearn import __version__ as sklearn_version
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


# Sample JSON dataset
example_data = [
    {"instruction": "What is the capital of Italy?",
     "input": "", "output": "The capital of Italy is Rome."
     },
    {"instruction": "What's the capital city of Italy?",
     "input": "", "output": "The capital city is Rome."
     },
    {"instruction": "Identify the main verb in the sentence: 'The cat sleeps on the couch.'",
     "input": "", "output": "The verb is 'sleeps'."
     },
    {"instruction": "Identify the verb in the following sentence: The cat sleeps on the couch.",
     "input": "", "output": "The verb in the sentence is \"sleeps.\""
     },
    # ...
]


def preprocess_text(text):
    # Lowercase the text
    text = text.lower()
    # Remove punctuation
    text = re.sub(r'[^\w\s]', '', text)
    return text


def find_near_duplicates(json_data, threshold=0.75, key="instruction"):
    """The higher the threshold, the more similar the texts have to be to match"""

    # Extract instructions
    text = [preprocess_text(item[key]) for item in json_data if item[key]]
    near_duplicates = []
    indices_to_remove = set()

    if not text:
        return {}, near_duplicates

    # Vectorize the text data
    vectorizer = TfidfVectorizer(stop_words=None, analyzer='char', ngram_range=(1, 3))
    tfidf_matrix = vectorizer.fit_transform(text)

    # Compute cosine similarity between each pair of entries
    cos_sim_matrix = cosine_similarity(tfidf_matrix)

    # Find pairs of near-duplicate instructions based on the threshold

    for i in range(len(cos_sim_matrix)):
        for j in range(i+1, len(cos_sim_matrix)):
            if cos_sim_matrix[i, j] > threshold:
                if len(json_data[i][key]) <= 1 or len(json_data[j][key]) <= 1:
                    continue
                near_duplicates.append((json_data[i], json_data[j], cos_sim_matrix[i, j]))
                if key in ("input", "output"):  # Don't remove duplicates based on the instruction
                    indices_to_remove.add(j)  # Mark the second entry for removal

    # Remove the near-duplicate entries
    filtered_json_data = [item for index, item in enumerate(json_data) if index not in indices_to_remove]

    return filtered_json_data, near_duplicates


def find_print_and_remove_near_duplicates(json_data, remove_duplicates=False, threshold=0.75):
    """
    Searches each key in the first JSON object for duplicates across a list of JSON objects.
    Prints the duplicates if found.
    """
    for key in json_data[0].keys():

        if remove_duplicates:
            json_data, near_duplicates = find_near_duplicates(json_data, key=key, threshold=threshold)
        else:
            _, near_duplicates = find_near_duplicates(json_data, key=key, threshold=threshold)
        separator = 50 * '='
        print(f"\n\n{separator}\nSearching '{key}' for duplicates ...\n{separator}")
        if not near_duplicates:
            print("No duplicates found")
        else:
            for dup in near_duplicates:
                print(
                    f"Duplicate pair found with similarity {dup[2]:.2f}:\n"
                    f"1. {dup[0][key]}\n2. {dup[1][key]}\n"
                )
    return json_data


if __name__ == "__main__":
    print("scikit-learn version:", sklearn_version)

    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--json_file",
        type=str,
        help=("Path to the dataset JSON file")
    )
    parser.add_argument(
        "--threshold",
        type=float,
        default=0.9,
        help=("A sensitivity threshold between 0 and 1 where 1 is strictest")
    )
    parser.add_argument(
        "--remove_duplicates",
        action='store_true',
        default=False,
        help=(
            "Removes duplicates based on the 'input' or 'output' keys "
            " (but not the 'instruction') and saves the cleaned JSON file as --json_output_file"
        )
    )
    parser.add_argument(
        "--json_output_file",
        type=str,
        help=("Path to the dataset JSON file")
    )

    args = parser.parse_args()

    if args.remove_duplicates and not args.json_output_file:
        raise ValueError(
            "Provide an output file via --json_output_file "
            "to save the cleaned JSON data."
        )

    if not args.json_file:
        json_data = example_data

    else:
        with open(args.json_file, "r") as file:
            json_data = json.load(file)

    json_data = find_print_and_remove_near_duplicates(
        json_data=json_data,
        remove_duplicates=args.remove_duplicates,
        threshold=args.threshold
    )

    if args.remove_duplicates:
        with open(args.json_output_file, "w") as file:
            json.dump(json_data, file, indent=4)
1202
Codes/ch07/02_dataset-utilities/instruction-examples-modified.json
Normal file
File diff suppressed because it is too large
1002
Codes/ch07/02_dataset-utilities/instruction-examples.json
Normal file
File diff suppressed because it is too large
3
Codes/ch07/02_dataset-utilities/requirements-extra.txt
Normal file
@ -0,0 +1,3 @@
openai>=1.30.3
scikit-learn>=1.3.1
tqdm>=4.65.0
26
Codes/ch07/03_model-evaluation/README.md
Normal file
@ -0,0 +1,26 @@
# Chapter 7: Finetuning to Follow Instructions

This folder contains utility code that can be used for model evaluation.


## Evaluating Instruction Responses Using the OpenAI API

- The [llm-instruction-eval-openai.ipynb](llm-instruction-eval-openai.ipynb) notebook uses OpenAI's GPT-4 to evaluate responses generated by instruction finetuned models. It works with a JSON file in the following format:

```python
{
    "instruction": "What is the atomic number of helium?",
    "input": "",
    "output": "The atomic number of helium is 2.",                 # <-- The target given in the test set
    "model 1 response": "\nThe atomic number of helium is 2.0.",   # <-- Response by an LLM
    "model 2 response": "\nThe atomic number of helium is 3."      # <-- Response by a 2nd LLM
},
```
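
For orientation, here is a minimal sketch (not the notebooks' exact code) of how such a file can be loaded and each model response turned into a scoring prompt for a judge model; the file path and the exact prompt wording are illustrative assumptions:

```python
import json

# Illustrative path; eval-example-data.json ships in this folder
with open("eval-example-data.json", "r") as f:
    entries = json.load(f)

def build_judge_prompt(entry, response_key):
    # Assemble a simple scoring prompt for a judge model (wording is illustrative)
    return (
        f"Given the instruction `{entry['instruction']}` "
        f"and the correct output `{entry['output']}`, "
        f"score the model response `{entry[response_key]}` "
        f"on a scale from 0 to 100, where 100 is the best score."
    )

print(build_judge_prompt(entries[0], "model 1 response"))
```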

## Evaluating Instruction Responses Locally Using Ollama

- The [llm-instruction-eval-ollama.ipynb](llm-instruction-eval-ollama.ipynb) notebook offers an alternative to the one above, utilizing a locally downloaded Llama 3 model via Ollama.
4
Codes/ch07/03_model-evaluation/config.json
Normal file
@ -0,0 +1,4 @@
{
    "OPENAI_API_KEY": "sk-...",
    "_comment": "Enter your API key from https://platform.openai.com/api-keys"
}
702
Codes/ch07/03_model-evaluation/eval-example-data.json
Normal file
@ -0,0 +1,702 @@
|
||||
[
|
||||
{
|
||||
"instruction": "Calculate the hypotenuse of a right triangle with legs of 6 cm and 8 cm.",
|
||||
"input": "",
|
||||
"output": "The hypotenuse of the triangle is 10 cm.",
|
||||
"model 1 response": "\nThe hypotenuse of the triangle is 3 cm.",
|
||||
"model 2 response": "\nThe hypotenuse of the triangle is 12 cm."
|
||||
},
|
||||
{
|
||||
"instruction": "Name 3 different animals that are active during the day.",
|
||||
"input": "",
|
||||
"output": "1. Squirrel\n2. Eagle\n3. Tiger",
|
||||
"model 1 response": "\n1. Squirrel\n2. Tiger\n3. Eagle\n4. Cobra\n5. Tiger\n6. Cobra",
|
||||
"model 2 response": "\n1. Squirrel\n2. Eagle\n3. Tiger"
|
||||
},
|
||||
{
|
||||
"instruction": "Rewrite the following sentence in a more formal way:\n\"I need to find out what's wrong.\"",
|
||||
"input": "",
|
||||
"output": "I must ascertain what is incorrect.",
|
||||
"model 1 response": "\nWhat is incorrect?",
|
||||
"model 2 response": "\nI need to find out what's wrong."
|
||||
},
|
||||
{
|
||||
"instruction": "Identify the interjection in the sentence.",
|
||||
"input": "Wow, that was an amazing trick!",
|
||||
"output": "The interjection in the sentence is 'Wow'.",
|
||||
"model 1 response": "\nThe interjection in the sentence is 'Wow'.",
|
||||
"model 2 response": "\nThe interjection in the sentence is 'Wow'."
|
||||
},
|
||||
{
|
||||
"instruction": "Identify the type of sentence.",
|
||||
"input": "Did you finish the report?",
|
||||
"output": "The type of sentence is interrogative.",
|
||||
"model 1 response": "\nThe type of sentence is exclamatory.",
|
||||
"model 2 response": "\nThe type of sentence is exclamatory."
|
||||
},
|
||||
{
|
||||
"instruction": "Convert the following sentence into future tense.",
|
||||
"input": "She walks to school.",
|
||||
"output": "She will walk to school.",
|
||||
"model 1 response": "\nShe will walk to school.",
|
||||
"model 2 response": "\nShe will walk to school."
|
||||
},
|
||||
{
|
||||
"instruction": "Convert 500 grams to ounces.",
|
||||
"input": "",
|
||||
"output": "500 grams is approximately 17.64 ounces.",
|
||||
"model 1 response": "\n500 grams is approximately 99.21 ounces.",
|
||||
"model 2 response": "\n500 grams is approximately 0.05 ounces."
|
||||
},
|
||||
{
|
||||
"instruction": "Edit the following sentence to remove all passive voice.",
|
||||
"input": "The song was composed by the artist.",
|
||||
"output": "The artist composed the song.",
|
||||
"model 1 response": "\nThe artist composed the song.",
|
||||
"model 2 response": "\nThe artist composed the song."
|
||||
},
|
||||
{
|
||||
"instruction": "Provide a word that rhymes with \"care.\"",
|
||||
"input": "",
|
||||
"output": "A word that rhymes with \"care\" is \"fare.\"",
|
||||
"model 1 response": "\nA word that rhymes with \"care\" is \"bare.\"",
|
||||
"model 2 response": "\nA word that rhymes with \"care\" is \"care-free.\""
|
||||
},
|
||||
{
|
||||
"instruction": "Rewrite the given sentence to describe the same thing in a positive way.",
|
||||
"input": "The meal was not delicious.",
|
||||
"output": "The meal could use some improvement.",
|
||||
"model 1 response": "\nThe meal could use some improvement.",
|
||||
"model 2 response": "\nThe meal was delicious."
|
||||
},
|
||||
{
|
||||
"instruction": "Generate a sentence using the word 'elusive'.",
|
||||
"input": "",
|
||||
"output": "The meaning of the poem was elusive, escaping easy interpretation.",
|
||||
"model 1 response": "\nThe poem was elusive, elusive, and elusive.",
|
||||
"model 2 response": "\nThe elusive object that is so elusive that it is impossible to describe is the elusive cat."
|
||||
},
|
||||
{
|
||||
"instruction": "Arrange the given numbers in descending order.",
|
||||
"input": "7, 10, 3, 6, 9",
|
||||
"output": "10, 9, 7, 6, 3.",
|
||||
"model 1 response": "\n10 numbers in descending order: 3, 6, 3.",
|
||||
"model 2 response": "\nThe numbers in descending order are: 3, 6, 9."
|
||||
},
|
||||
{
|
||||
"instruction": "Convert this direct speech to indirect speech.",
|
||||
"input": "'I am tired,' he said.",
|
||||
"output": "He said that he was tired.",
|
||||
"model 1 response": "\nHe said that he was tired.",
|
||||
"model 2 response": "\nHe said that he was tired."
|
||||
},
|
||||
{
|
||||
"instruction": "What is the capital of Turkey?",
|
||||
"input": "",
|
||||
"output": "The capital of Turkey is Ankara.",
|
||||
"model 1 response": "\nThe capital of Turkey is Ankara.",
|
||||
"model 2 response": "\nThe capital of Turkey is Ankara."
|
||||
},
|
||||
{
|
||||
"instruction": "Convert the sentence to future tense.",
|
||||
"input": "They play football every Sunday.",
|
||||
"output": "They will play football every Sunday.",
|
||||
"model 1 response": "\nThey will play football every Sunday.",
|
||||
"model 2 response": "\nThey will play football every Sunday."
|
||||
},
|
||||
{
|
||||
"instruction": "Provide a synonym for 'quick'.",
|
||||
"input": "",
|
||||
"output": "A synonym for 'quick' is 'fast'.",
|
||||
"model 1 response": "\nA synonym for 'quick' is 'fast'.",
|
||||
"model 2 response": "\nA synonym for 'quick' is 'fast'."
|
||||
},
|
||||
{
|
||||
"instruction": "What is a synonym of 'luminous'?",
|
||||
"input": "",
|
||||
"output": "A synonym for 'luminous' is 'bright'.",
|
||||
"model 1 response": "\nA synonym for 'luminous' is 'bright'.",
|
||||
"model 2 response": "\nA synonym for 'luminous' is 'bright'."
|
||||
},
|
||||
{
|
||||
"instruction": "Name the process by which plants lose water vapor through their leaves.",
|
||||
"input": "",
|
||||
"output": "The process by which plants lose water vapor through their leaves is called transpiration.",
|
||||
"model 1 response": "\nThe process is called evaporation.",
|
||||
"model 2 response": "\nThe process by which plants lose water vapor through their leaves is called transpiration."
|
||||
},
|
||||
{
|
||||
"instruction": "Identify the verb in the sentence.",
|
||||
"input": "Birds fly south for the winter.",
|
||||
"output": "The verb in the sentence is 'fly'.",
|
||||
"model 1 response": "\nThe verb in the sentence is \"fly.\"",
|
||||
"model 2 response": "\nThe verb in the sentence is 'fly'."
|
||||
},
|
||||
{
|
||||
"instruction": "What type of bond is formed when one atom donates an electron to another atom?",
|
||||
"input": "",
|
||||
"output": "The type of bond formed when one atom donates an electron to another atom is an ionic bond.",
|
||||
"model 1 response": "\nThe type of bond formed when one atom donates an electron to another atom is an ionic bond.",
|
||||
"model 2 response": "\nThe type of bond formed when one atom donates an electron to another atom is ionic."
|
||||
},
|
||||
{
|
||||
"instruction": "Rewrite the following sentence to include a simile.",
|
||||
"input": "The car moved fast.",
|
||||
"output": "The car moved as fast as a cheetah.",
|
||||
"model 1 response": "\nThe car moved as fast as a cheetah.",
|
||||
"model 2 response": "\nThe car moved at a fast pace."
|
||||
},
|
||||
{
|
||||
"instruction": "Identify the correct spelling of the following word.",
|
||||
"input": "Necesary",
|
||||
"output": "The correct spelling is \"Necessary\".",
|
||||
"model 1 response": "\nThe correct spelling is \"Necesary\".",
|
||||
"model 2 response": "\nThe correct spelling of the word \"Necessary\" is \"Necessary\"."
|
||||
},
|
||||
{
|
||||
"instruction": "Rewrite this sentence in the active voice.",
|
||||
"input": "A novel was written by Mary.",
|
||||
"output": "Mary wrote a novel.",
|
||||
"model 1 response": "\nMary wrote a novel.",
|
||||
"model 2 response": "\nMary wrote a novel."
|
||||
},
|
||||
{
|
||||
"instruction": "rewrite this sentence as a question.",
|
||||
"input": "Your brother went to London last year.",
|
||||
"output": "Did your brother go to London last year?",
|
||||
"model 1 response": "\nDid your brother go to London last year?",
|
||||
"model 2 response": "\nDid your brother go to London last year?"
|
||||
},
|
||||
{
|
||||
"instruction": "Explain what a haiku is.",
|
||||
"input": "",
|
||||
"output": "A haiku is a form of traditional Japanese poetry that consists of three lines with a syllable pattern of 5-7-5.",
|
||||
"model 1 response": "\nA haiku is a form of traditional Japanese poetry that consists of three lines with a syllable pattern of 5-7-5.",
|
||||
"model 2 response": "\nA haiku is a form of traditional Japanese poetry that consists of three lines that are composed of three syllables."
|
||||
},
|
||||
{
|
||||
"instruction": "Split the following sentence into two declarative sentences: \"The book is interesting but the cover is dull\".",
|
||||
"input": "",
|
||||
"output": "The book is interesting.\nThe cover is dull.",
|
||||
"model 1 response": "\nThe book is interesting.\nThe cover is dull.",
|
||||
"model 2 response": "\nThe book is dull."
|
||||
},
|
||||
{
|
||||
"instruction": "What is the antonym of 'sparse'?",
|
||||
"input": "",
|
||||
"output": "The antonym of 'sparse' is 'dense'.",
|
||||
"model 1 response": "\nThe antonym of 'sparse' is 'dense'.",
|
||||
"model 2 response": "\nThe antonym of 'sparse' is 'dense'."
|
||||
},
|
||||
{
|
||||
"instruction": "Identify the type of sentence.",
|
||||
"input": "What a beautiful view!",
|
||||
"output": "The type of sentence is exclamatory.",
|
||||
"model 1 response": "\nThe type of sentence is exclamatory.",
|
||||
"model 2 response": "\nThe type of sentence is subjunctive."
|
||||
},
|
||||
{
|
||||
"instruction": "What is the boiling point of mercury in Celsius?",
|
||||
"input": "",
|
||||
"output": "The boiling point of mercury is 356.7 degrees Celsius.",
|
||||
"model 1 response": "\nThe boiling point of mercury is approximately 100 degrees Celsius.",
|
||||
"model 2 response": "\nThe boiling point of mercury is 100 degrees Celsius."
|
||||
},
|
||||
{
|
||||
"instruction": "Define the term 'kinetic energy'.",
|
||||
"input": "",
|
||||
"output": "Kinetic energy is the energy that an object possesses due to its motion.",
|
||||
"model 1 response": "\nKinetic energy is the energy that an object possesses due to its motion.",
|
||||
"model 2 response": "\nKinetic energy is the energy that an object possesses due to its motion."
|
||||
},
|
||||
{
|
||||
"instruction": "Change the following sentence to past perfect tense.",
|
||||
"input": "They finish the game.",
|
||||
"output": "They had finished the game.",
|
||||
"model 1 response": "\nThey had finished the game.",
|
||||
"model 2 response": "\nThey will finish the game."
|
||||
},
|
||||
{
|
||||
"instruction": "Classify the following items as either solid, liquid, or gas.",
|
||||
"input": "Mercury, oxygen, wood",
|
||||
"output": "Mercury - Liquid\nOxygen - Gas\nWood - Solid",
|
||||
"model 1 response": "\nMercury: Solid\nOxygen: Gas\nWood: Solid\nPlants: Vegetable\nMinerals: Iron\nMinerals: Gold\nMinerals: Coal\nMinerals: Bacteria: E. coli\nFruits: Cherries\nMinerals: Wheat\nMinerals: Vinegar\nMinerals: Fish\nMinerals: Mercury\nMinerals: Oxygen\nMinerals: Carbon\nMinerals: Wood\nMinerals: Gas\nMinerals: Carbon\nMinerals: Wood\nMinerals: Solid\nMinerals: Gas\nMinerals: Solid\nMinerals: Gas\nMinerals: Solid\nMinerals: Solid\nMinerals: Gas\nMinerals: Solid\nMinerals: Solid\nMinerals: Solid\nMinerals: Solid\nMinerals: Solid\nMinerals: Solid\nMinerals: Solid\nMinerals: Solid\nMinerals: Solid\nMinerals: Solid\nMinerals: Solid\nMinerals: Solid\nMinerals: Solid\nMinerals: Solid\nMinerals: Solid\n",
|
||||
"model 2 response": "\nMercury is a solid. Oxygen is a liquid. Wood is a gas."
|
||||
},
|
||||
{
|
||||
"instruction": "What is the synonym for 'enormous'?",
|
||||
"input": "",
|
||||
"output": "A synonym for 'enormous' is 'gigantic'.",
|
||||
"model 1 response": "\nThe synonym for 'enormous' is 'gigantic'.",
|
||||
"model 2 response": "\nThe synonym for 'enormous' is 'gigantic'."
|
||||
},
|
||||
{
|
||||
"instruction": "Translate the term 'sustainable development' into German.",
|
||||
"input": "",
|
||||
"output": "The German translation of 'sustainable development' is 'nachhaltige Entwicklung'.",
|
||||
"model 1 response": "\nThe German translation of 'sustainable development' is 'nachhaltige Entwicklung'.",
|
||||
"model 2 response": "\nThe German translation of 'sustainable development' is 'sustainable development f\u00fcr das Entwicklung'."
|
||||
},
|
||||
{
|
||||
"instruction": "Name three essential elements found in nature.",
|
||||
"input": "",
|
||||
"output": "1. Carbon\n2. Oxygen\n3. Iron",
|
||||
"model 1 response": "\n1. Carbon\n2. Oxygen\n3. Oxygen\n### Response:\n1. Carbon\n2. Oxygen\n3. Oxygen",
|
||||
"model 2 response": "\n1. Carbon\n2. Iron\n3. Titanium"
|
||||
},
|
||||
{
|
||||
"instruction": "Rewrite the following phrase as a question.",
|
||||
"input": "We must finish the project by tomorrow.",
|
||||
"output": "Must we finish the project by tomorrow?",
|
||||
"model 1 response": "\nMust we finish the project by tomorrow?",
|
||||
"model 2 response": "\nWe must finish the project by tomorrow."
|
||||
},
|
||||
{
|
||||
"instruction": "Combine the given sentences into one cohesive sentence.",
|
||||
"input": "It is raining. I cannot go to the park.",
|
||||
"output": "Since it is raining, I cannot go to the park.",
|
||||
"model 1 response": "\nIt is raining. I cannot go to the park.",
|
||||
"model 2 response": "\nIt is raining. I cannot go to the park."
|
||||
},
|
||||
{
|
||||
"instruction": "Calculate the volume of a cube with side length of 3 meters.",
|
||||
"input": "",
|
||||
"output": "The volume of the cube is 27 cubic meters.",
|
||||
"model 1 response": "\nThe volume of the cube is approximately 3 cubic meters.",
|
||||
"model 2 response": "\nThe volume of the cube is approximately 3.5 cubic meters."
|
||||
},
|
||||
{
|
||||
"instruction": "Re-word this sentence using an active voice.",
|
||||
"input": "The decision was made by the team.",
|
||||
"output": "The team made the decision.",
|
||||
"model 1 response": "\nThe team made the decision.",
|
||||
"model 2 response": "\nThe team made the decision."
|
||||
},
|
||||
{
|
||||
"instruction": "Convert the sentence to use an indefinite pronoun.",
|
||||
"input": "Someone left a message.",
|
||||
"output": "A message was left by someone.",
|
||||
"model 1 response": "\nA message was left by someone.",
|
||||
"model 2 response": "\nA message was left by someone."
|
||||
},
|
||||
{
|
||||
"instruction": "Classify the following statement into one of these labels: [positive, negative, neutral]",
|
||||
"input": "My computer crashed.",
|
||||
"output": "The statement \"My computer crashed\" can be classified as negative.",
|
||||
"model 1 response": "\nThe statement \"My computer crashed\" can be classified as negative.",
|
||||
"model 2 response": "\nThe statement \"My computer crashed\" is classified as negative."
|
||||
},
|
||||
{
|
||||
"instruction": "What is the derivative of x^2?",
|
||||
"input": "",
|
||||
"output": "The derivative of x^2 is 2x.",
|
||||
"model 1 response": "\nThe derivative of x^2 is 2x.",
|
||||
"model 2 response": "\nThe derivative of x^2 is 2x."
|
||||
},
|
||||
{
|
||||
"instruction": "Fill in the blanks with appropriate words.",
|
||||
"input": "The sky was filled with colorful ________ at sunset.",
|
||||
"output": "The sky was filled with colorful clouds at sunset.",
|
||||
"model 1 response": "\nThe sky was filled with colorful clouds at sunset.",
|
||||
"model 2 response": "\nThe sky was filled with colorful clouds at sunset."
|
||||
},
|
||||
{
|
||||
"instruction": "Change the sentence from passive to active voice.",
|
||||
"input": "The song was sung by the choir.",
|
||||
"output": "The choir sang the song.",
|
||||
"model 1 response": "\nThe choir sang the song.",
|
||||
"model 2 response": "\nThe choir sang the song."
|
||||
},
|
||||
{
|
||||
"instruction": "What is the atomic mass of carbon?",
|
||||
"input": "",
|
||||
"output": "The atomic mass of carbon is approximately 12.01 atomic mass units.",
|
||||
"model 1 response": "\nThe atomic mass of carbon is approximately 3.01 atomic mass units.",
|
||||
"model 2 response": "\nThe atomic mass of carbon is approximately 3.0 atomic mass units."
|
||||
},
|
||||
{
|
||||
"instruction": "Convert the sentence to present perfect tense: 'He wins the race.'",
|
||||
"input": "He wins the race.",
|
||||
"output": "He has won the race.",
|
||||
"model 1 response": "\nHe has won the race.",
|
||||
"model 2 response": "\nHe will win the race."
|
||||
},
|
||||
{
|
||||
"instruction": "Rewrite the following sentence to include an idiom.",
|
||||
"input": "It started raining suddenly.",
|
||||
"output": "It started raining cats and dogs.",
|
||||
"model 1 response": "\nIt started raining cats and dogs.",
|
||||
"model 2 response": "\nIt started raining suddenly."
|
||||
},
|
||||
{
|
||||
"instruction": "What is the atomic number of helium?",
|
||||
"input": "",
|
||||
"output": "The atomic number of helium is 2.",
|
||||
"model 1 response": "\nThe atomic number of helium is 2.0.",
|
||||
"model 2 response": "\nThe atomic number of helium is 3."
|
||||
},
|
||||
{
|
||||
"instruction": "Name three types of clouds.",
|
||||
"input": "",
|
||||
"output": "Three types of clouds are cumulus, stratus, and cirrus.",
|
||||
"model 1 response": "\nThree types of clouds are cirrus, cirrus-like, and cirrus-like-ring.",
|
||||
"model 2 response": "\nThree types of clouds are cumulus, stratus, and cirrus."
|
||||
},
|
||||
{
|
||||
"instruction": "Identify the onomatopoeia in the sentence.",
|
||||
"input": "The bees buzzed as they flew past.",
|
||||
"output": "The onomatopoeia in the sentence is 'buzzed'.",
|
||||
"model 1 response": "\nThe onomatopoeia in the sentence is that the bees buzzed as they flew past.",
|
||||
"model 2 response": "\nThe onomatopoeia in the sentence is 'buzzed'."
|
||||
},
|
||||
{
|
||||
"instruction": "Name the process by which water changes from liquid to gas.",
|
||||
"input": "",
|
||||
"output": "The process by which water changes from liquid to gas is called evaporation.",
|
||||
"model 1 response": "\nThe process is called evaporation.",
|
||||
"model 2 response": "\nThe process by which water changes from liquid to gas is called evaporation."
|
||||
},
|
||||
{
|
||||
"instruction": "Find the synonym for the given verb.",
|
||||
"input": "Begin",
|
||||
"output": "Commence",
|
||||
"model 1 response": "\nCommence",
|
||||
"model 2 response": "\nBegin\n\n### Input:\nbegin\n\n### Response:\nBegin\n\n### Input:\nbegin\n### Response:\nbegin\n### Input:\nbegin\n### Response:\nbegin\n### Input:\nbegin\n### Response:\nbegin\n### Input:\nbegin\n### Response:\nbegin\n### Input:\nbegin\n### Response:\nbegin\n### Input:\nbegin\n### Response:\nbegin\n### Input:\nbegin\n### Response:\nbegin\n### Input:\nbegin\n### Response:\nbegin\n### Input:\nbegin\n### Response:\nbegin\n### Input:\nbegin\n### Response:\nbegin\n### Input:\nbegin\n### Response:\nbegin\n### Input:\nbegin\n### Response:\nbegin\n### Input:\nbegin\n### Response:\nbegin\n### Input:\nbegin\n### Response:\nbegin\n### Input:\nbegin\n### Response:\nbegin\n### Input:\nbegin\n### Response:\nbegin\n### Input:\nbegin\n### Response:\nbegin\n### Input:\nbegin\n### Response:\nbegin\n### Input:\nbegin\n### Response:\nbegin\n### Input:\nbegin"
|
||||
},
|
||||
{
|
||||
"instruction": "Classify the following numbers as prime or composite.",
|
||||
"input": ": 11, 14, 19.",
|
||||
"output": "Prime numbers: 11, 19\nComposite numbers: 14",
|
||||
"model 1 response": "\nPrime numbers: 11, 19\n\nComposite numbers: 14\n\n### Response:\nPrime numbers: 11, 19\nComposite numbers: 14",
|
||||
"model 2 response": "\nPrime numbers: 11, 19, 19. Composite numbers: 14, 19, 19."
|
||||
},
|
||||
{
|
||||
"instruction": "Classify the item as either a grain or a fruit.",
|
||||
"input": "Wheat",
|
||||
"output": "Grain",
|
||||
"model 1 response": "\nGrain\n\n### Response:\nGrain\n### Response:\nWheat\n### Response:\nGrain",
|
||||
"model 2 response": "\nGrain"
|
||||
},
|
||||
{
|
||||
"instruction": "Provide the plural form of 'thesis'.",
|
||||
"input": "",
|
||||
"output": "The plural form of 'thesis' is 'theses'.",
|
||||
"model 1 response": "\nThe plural form of 'thesis' is 'theses'.",
|
||||
"model 2 response": "\nThe plural form of 'thesis' is 'thesis'."
|
||||
},
|
||||
{
|
||||
"instruction": "Generate three verbs that mean the same as \"to run.\"",
|
||||
"input": "",
|
||||
"output": "1. To sprint\n2. To dash\n3. To jog",
|
||||
"model 1 response": "\n1. To sprint\n2. To dash\n3. To dash\n4. To dash\n5. To dash\n6. To dash",
|
||||
"model 2 response": "\n1. To sprint\n2. To dash\n3. To jog"
|
||||
},
|
||||
{
|
||||
"instruction": "Generate a question that could logically follow the statement below.",
|
||||
"input": "I just returned from a vacation in Spain.",
|
||||
"output": "How was your vacation in Spain?",
|
||||
"model 1 response": "\nHow was your vacation in Spain?",
|
||||
"model 2 response": "\nWhat was your vacation in Spain?"
|
||||
},
|
||||
{
|
||||
"instruction": "What is the normal boiling point of water?",
|
||||
"input": "",
|
||||
"output": "The normal boiling point of water is 100 degrees Celsius.",
|
||||
"model 1 response": "\nThe normal boiling point of water is 100 degrees Celsius.",
|
||||
"model 2 response": "\nThe normal boiling point of water is 100 degrees Celsius."
|
||||
},
|
||||
{
|
||||
"instruction": "Convert the sentence to active voice.",
|
||||
"input": "The book was read by many.",
|
||||
"output": "Many people read the book.",
|
||||
"model 1 response": "\nMany people read the book.",
|
||||
"model 2 response": "\nMany people read the book."
|
||||
},
|
||||
{
|
||||
"instruction": "Identify the tense used in the sentence.",
|
||||
"input": "They are watching a movie.",
|
||||
"output": "The tense used in the sentence \"They are watching a movie\" is the present continuous tense.",
|
||||
"model 1 response": "\nThe tense used in the sentence is future continuous.",
|
||||
"model 2 response": "\nThe tense used in the sentence is future continuous."
|
||||
},
|
||||
{
|
||||
"instruction": "Explain what GDP stands for and what it measures.",
|
||||
"input": "",
|
||||
"output": "GDP stands for Gross Domestic Product, which measures the total value of all goods and services produced over a specific time period within a country's borders.",
|
||||
"model 1 response": "\nGDP stands for Gross Domestic Product, which measures the total value of all goods and services produced over a specific time period within a country's borders.",
|
||||
"model 2 response": "\nGDP stands for Gross Domestic Product, which measures the total value of all goods and services produced over a specific time period within a country's borders."
|
||||
},
|
||||
{
|
||||
"instruction": "Identify the verb in the following sentence",
|
||||
"input": "The sun shines brightly.",
|
||||
"output": "The verb in the sentence is 'shines'.",
|
||||
"model 1 response": "\nThe verb in the sentence is \"shines.\"",
|
||||
"model 2 response": "\nThe verb in the sentence is 'shines'."
|
||||
},
|
||||
{
|
||||
"instruction": "Rewrite the following sentence to include a metaphor.'",
|
||||
"input": "Life can be hard.",
|
||||
"output": "Life is a rocky road.",
|
||||
"model 1 response": "\nLife is a rocky road.",
|
||||
"model 2 response": "\nLife is a rocky road."
|
||||
},
|
||||
{
|
||||
"instruction": "What is the derivative of the function f(x) = x^4?",
|
||||
"input": "",
|
||||
"output": "The derivative of the function f(x) = x^4 is f'(x) = 4x^3.",
|
||||
"model 1 response": "\nThe derivative of the function f(x) = x^4 is f'(x) = 4x^3.",
|
||||
"model 2 response": "\nThe derivative of the function f(x) = x^4 is f'(x) = 4x."
|
||||
},
|
||||
{
|
||||
"instruction": "Convert the temperature from Celsius to Kelvin.",
|
||||
"input": "25 degrees Celsius",
|
||||
"output": "25 degrees Celsius is 298.15 Kelvin.",
|
||||
"model 1 response": "\n25 degrees Celsius is approximately 298.15 Kelvin.",
|
||||
"model 2 response": "\n25 degrees Celsius is approximately 298.15 Kelvin."
|
||||
},
|
||||
{
|
||||
"instruction": "Rewrite the following sentence to include a hyperbole: 'I'm very hungry.'",
|
||||
"input": "I'm very hungry.",
|
||||
"output": "I'm so hungry I could eat a horse.",
|
||||
"model 1 response": "\nI'm very hungry.",
|
||||
"model 2 response": "\nI'm very hungry."
|
||||
},
|
||||
{
|
||||
"instruction": "List the ingredients required to make a basic salad.",
|
||||
"input": "",
|
||||
"output": "Ingredients for a basic salad typically include lettuce, tomatoes, cucumbers, and dressing.",
|
||||
"model 1 response": "\nIngredients for a basic salad typically include lettuce, tomatoes, cucumbers, and cucumbers.",
|
||||
"model 2 response": "\nIngredients for a basic salad typically include lettuce, tomatoes, dressing, and dressingings."
|
||||
},
|
||||
{
|
||||
"instruction": "Select the noun phrase in the following sentence.",
|
||||
"input": "He enjoyed a delicious cake at the party.",
|
||||
"output": "The noun phrases in the given sentence are \"a delicious cake\" and \"the party.\"",
|
||||
"model 1 response": "\nThe noun phrases in the sentence are \"a delicious cake\" and \"the party.\"",
|
||||
"model 2 response": "\nThe noun phrases in the sentence are \"a delicious cake\" and \"the party.\""
|
||||
},
|
||||
{
|
||||
"instruction": "What is the contraction for \"will not\"?",
|
||||
"input": "",
|
||||
"output": "The contraction for \"will not\" is \"won't\".",
|
||||
"model 1 response": "\nThe contraction for \"will not\" is \"won't\".",
|
||||
"model 2 response": "\nThe contraction for \"will not\" is \"won't\"."
|
||||
},
|
||||
{
|
||||
"instruction": "Generate a sentence using the word 'elucidate'.",
|
||||
"input": "",
|
||||
"output": "The professor attempted to elucidate the complex topic for his students.",
|
||||
"model 1 response": "\nThe professor elucidated the complex topic for his students.",
|
||||
"model 2 response": "\nThe professor elucidated the topic of the class."
|
||||
},
|
||||
{
|
||||
"instruction": "Convert this sentence to passive voice",
|
||||
"input": "The chef cooked a delicious meal.",
|
||||
"output": "A delicious meal was cooked by the chef.",
|
||||
"model 1 response": "\nA delicious meal was cooked by the chef.",
|
||||
"model 2 response": "\nThe chef cooked a delicious meal."
|
||||
},
|
||||
{
|
||||
"instruction": "What is the pH of lemon juice?",
|
||||
"input": "",
|
||||
"output": "The pH of lemon juice is approximately 2.0.",
|
||||
"model 1 response": "\nThe pH of lemon juice is approximately 2.0.",
|
||||
"model 2 response": "\nThe pH of lemon juice is typically around 3.0."
|
||||
},
|
||||
{
|
||||
"instruction": "What is the capital of Norway?",
|
||||
"input": "",
|
||||
"output": "The capital of Norway is Oslo.",
|
||||
"model 1 response": "\nThe capital of Norway is Oslo.",
|
||||
"model 2 response": "\nThe capital of Norway is Oslo."
|
||||
},
|
||||
{
|
||||
"instruction": "Generate a question with the keyword \u201cbacteria\u201d",
|
||||
"input": "",
|
||||
"output": "How do bacteria affect human health?",
|
||||
"model 1 response": "\nWhat is the pH of baking soda solution?\n\n### Response:\nThe pH of a baking soda solution is typically around 2.0.",
|
||||
"model 2 response": "\nWhat is the pH of a neutral solution at 25\u00b0C?\n\n### Input:\nThe pH of a neutral solution at 25\u00b0C is typically around 3.0."
|
||||
},
|
||||
{
|
||||
"instruction": "Generate a sentence using the word 'spontaneous'.",
|
||||
"input": "",
|
||||
"output": "His decision to travel was completely spontaneous, made on a whim.",
|
||||
"model 1 response": "\nHis decision to travel was completely spontaneous, made on a whim.",
|
||||
"model 2 response": "\nThe decision was made by the team."
|
||||
},
|
||||
{
|
||||
"instruction": "Convert the sentence to future perfect tense.",
|
||||
"input": "They close the store.",
|
||||
"output": "They will have closed the store.",
|
||||
"model 1 response": "\nThey will have closed the store.",
|
||||
"model 2 response": "\nThey will have closed the store."
|
||||
},
|
||||
{
|
||||
"instruction": "Provide the past participle form of 'break'.",
|
||||
"input": "",
|
||||
"output": "The past participle form of 'break' is 'broken'.",
|
||||
"model 1 response": "\nThe past participle form of 'break' is 'broken'.",
|
||||
"model 2 response": "\nThe past participle form of 'break' is 'broken'."
|
||||
},
|
||||
{
|
||||
"instruction": "Provide the correct contraction for 'they are'.",
|
||||
"input": "",
|
||||
"output": "The contraction for 'they are' is 'they're'.",
|
||||
"model 1 response": "\nThe contraction for 'they are' is 'they're'.",
|
||||
"model 2 response": "\nThe contraction for 'they are' is 'they're'."
|
||||
},
|
||||
{
|
||||
"instruction": "Convert the temperature from Fahrenheit to Celsius: 68\u00b0F.",
|
||||
"input": "68 degrees Fahrenheit",
|
||||
"output": "68 degrees Fahrenheit is approximately 20 degrees Celsius.",
|
||||
"model 1 response": "\n68 degrees Fahrenheit is approximately 37 degrees Celsius.",
|
||||
"model 2 response": "\n68 degrees Fahrenheit is approximately 37 degrees Celsius."
|
||||
},
|
||||
{
|
||||
"instruction": "Turn the following statement into a question.",
|
||||
"input": "The Golden State Warriors are a professional basketball team based in San Francisco.",
|
||||
"output": "Are the Golden State Warriors a professional basketball team based in San Francisco?",
|
||||
"model 1 response": "\nAre the Golden State Warriors a professional basketball team based in San Francisco?",
|
||||
"model 2 response": "\nAre the Golden State Warriors a professional basketball team based in San Francisco?"
|
||||
},
|
||||
{
|
||||
"instruction": "What is the derivative of the function f(x) = 3x^3?",
|
||||
"input": "",
|
||||
"output": "The derivative of the function f(x) = 3x^3 is f'(x) = 9x^2.",
|
||||
"model 1 response": "\nThe derivative of the function f(x) = 3x^3 is f'(x) = 6x.",
|
||||
"model 2 response": "\nThe derivative of the function f(x) = 3x^3 is f'(x) = 4x."
|
||||
},
|
||||
{
|
||||
"instruction": "Rewrite the following phrase in the present tense.",
|
||||
"input": "They had played soccer all afternoon.",
|
||||
"output": "They play soccer all afternoon.",
|
||||
"model 1 response": "\nThey play soccer all afternoon.",
|
||||
"model 2 response": "\nThey played soccer all afternoon."
|
||||
},
|
||||
{
|
||||
"instruction": "Rewrite this sentence to start with \"Despite\": She managed to finish her work, although she was tired.",
|
||||
"input": "",
|
||||
"output": "Despite being tired, she managed to finish her work.",
|
||||
"model 1 response": "\nDespite being tired, she managed to finish her work.",
|
||||
"model 2 response": "\nDespite being tired, she managed to finish her work."
|
||||
},
|
||||
{
|
||||
"instruction": "Convert the sentence to future continuous tense.",
|
||||
"input": "She plays the piano.",
|
||||
"output": "She will be playing the piano.",
|
||||
"model 1 response": "\nShe will play the piano.",
|
||||
"model 2 response": "\nShe will be playing the piano."
|
||||
},
|
||||
{
|
||||
"instruction": "Capitalize each word in the sentence.",
|
||||
"input": "i love watching the sunrise",
|
||||
"output": "I Love Watching The Sunrise",
|
||||
"model 1 response": "\nI love watching the sunrise.",
|
||||
"model 2 response": "\nI love watching the sunrise."
|
||||
},
|
||||
{
|
||||
"instruction": "What is the antonym of 'complicated'?",
|
||||
"input": "",
|
||||
"output": "The antonym of 'complicated' is 'simple'.",
|
||||
"model 1 response": "\nThe antonym of 'complicated' is 'simple'.",
|
||||
"model 2 response": "\nThe antonym of 'complicated' is 'simple'."
|
||||
},
|
||||
{
|
||||
"instruction": "Sort the following numbers in ascending order.",
|
||||
"input": "5, 18, 3, 11.",
|
||||
"output": "The numbers in ascending order would be: 3, 5, 11, 18.",
|
||||
"model 1 response": "\nThe numbers in ascending order are: 3, 5, 11.",
|
||||
"model 2 response": "\nThe numbers in ascending order are: 3, 11, 5, 18."
|
||||
},
|
||||
{
|
||||
"instruction": "Identify the theme of the story.",
|
||||
"input": "The Tortoise and the Hare",
|
||||
"output": "The theme of the story is that slow and steady wins the race.",
|
||||
"model 1 response": "\nThe theme of the story is that slow and steady wins the race.",
|
||||
"model 2 response": "\nThe theme of the story is that the Tortoise and the Hare are the two most popular animals in the world."
|
||||
},
|
||||
{
|
||||
"instruction": "Correct the verb tense error in the sentence.",
|
||||
"input": "She go to school every day.",
|
||||
"output": "The corrected sentence should be: 'She goes to school every day.'",
|
||||
"model 1 response": "\nThe corrected sentence should be: 'She goes to school every day.'",
|
||||
"model 2 response": "\nThe corrected sentence should be: 'She goes to school every day.'"
|
||||
},
|
||||
{
|
||||
"instruction": "Evaluate this sentence for spelling and grammar mistakes.",
|
||||
"input": "I recieved an invitation to an exciting evenement.",
|
||||
"output": "There are two errors in the sentence. The corrected sentence should be: \"I received an invitation to an exciting event.\"",
|
||||
"model 1 response": "\nAn invitation to an exciting event?",
|
||||
"model 2 response": "\nThe corrected sentence should be: 'I received an invitation to an exciting event.'"
|
||||
},
|
||||
{
|
||||
"instruction": "Identify the hypernym for the word \"car\".",
|
||||
"input": "",
|
||||
"output": "The hypernym for the word \"car\" is \"vehicle\".",
|
||||
"model 1 response": "\nThe hypernym for the word \"car\" is \"vehicle\".",
|
||||
"model 2 response": "\nThe hypernym for the word \"car\" is \"vehicle\"."
|
||||
},
|
||||
{
|
||||
"instruction": "Name an antonym for \"joy.\"",
|
||||
"input": "",
|
||||
"output": "An antonym for \"joy\" is \"sorrow.\"",
|
||||
"model 1 response": "\nAn antonym for \"joy\" is \"sorrow.\"",
|
||||
"model 2 response": "\nAn antonym for \"joy\" is \"contentment.\""
|
||||
},
|
||||
{
|
||||
"instruction": "Convert the sentence into a question",
|
||||
"input": "She likes ice cream.",
|
||||
"output": "Does she like ice cream?",
|
||||
"model 1 response": "\nDoes she like ice cream?",
|
||||
"model 2 response": "\nDoes she like ice cream?"
|
||||
},
|
||||
{
|
||||
"instruction": "Rewrite this sentence in the form of a question.",
|
||||
"input": "This is the biggest tree in the park.",
|
||||
"output": "Is this the biggest tree in the park?",
|
||||
"model 1 response": "\nIs this the biggest tree in the park?",
|
||||
"model 2 response": "\nIs this the biggest tree in the park?"
|
||||
},
|
||||
{
|
||||
"instruction": "Find a rhyme for the word 'sing'.",
|
||||
"input": "",
|
||||
"output": "A word that rhymes with 'sing' is 'ring'.",
|
||||
"model 1 response": "\nA word that rhymes with 'ring' is 'ring-a'.",
|
||||
"model 2 response": "\nThe word 'sing' is a derivative of the word 'sing'."
|
||||
},
|
||||
{
|
||||
"instruction": "Convert this sentence to future perfect tense.",
|
||||
"input": "She will finish the project.",
|
||||
"output": "She will have finished the project.",
|
||||
"model 1 response": "\nShe will have finished the project.",
|
||||
"model 2 response": "\nShe will have finished the project."
|
||||
},
|
||||
{
|
||||
"instruction": "What type of number is 0?",
|
||||
"input": "",
|
||||
"output": "0 is an integer.",
|
||||
"model 1 response": "\n0 is an integer.",
|
||||
"model 2 response": "\n0 is an integer."
|
||||
},
|
||||
{
|
||||
"instruction": "Classify the following animals",
|
||||
"input": "Eagle, Shark, Cobra.",
|
||||
"output": "Birds: Eagle\nFish: Shark\nReptiles: Cobra",
|
||||
"model 1 response": "\nBirds: Eagle\nFish: Shark\nReptiles: Cobra\nReptiles: Squirrel\nReptiles: Tiger\nReptiles: Cobra\nReptiles: Tiger\nReptiles: Squirrel\nReptiles: Tiger\nReptiles: Cobra\nReptiles: Tiger\nReptiles: Cobra",
|
||||
"model 2 response": "\nBirds: Eagle\nFish: Shark\nReptiles: Cobra\nPlants: Eagle\nMinerals: Eagle\nFruits: Shark\nFruits: Shark"
|
||||
},
|
||||
{
|
||||
"instruction": "Take the following noun and turn it into a verb.",
|
||||
"input": "Light",
|
||||
"output": "To light.",
|
||||
"model 1 response": "\nTo light.",
|
||||
"model 2 response": "\nLight is the most abundant element in the periodic table."
|
||||
},
|
||||
{
|
||||
"instruction": "Generate a random alphanumeric code of 10 characters.",
|
||||
"input": "",
|
||||
"output": "Here is a randomly generated 10-character alphanumeric code: 3F8G6H7JQ2.",
|
||||
"model 1 response": "\nHere is a randomly generated sequence of 10 characters: 3F8G6H7JQHWQ2JQ3H6H-\n\n### Response:\nHere is a sequence of 10 characters: 3F8G6H7JQHWQ2JQ3H6H-",
|
||||
"model 2 response": "\nThe alphanumeric code for the word 'happy' is 'happiness'."
|
||||
}
|
||||
]
|
||||
648
Codes/ch07/03_model-evaluation/llm-instruction-eval-ollama.ipynb
Normal file
@ -0,0 +1,648 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "136a4efe-fb99-4311-8679-e0a5b6282755",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"<table style=\"width:100%\">\n",
|
||||
"<tr>\n",
|
||||
"<td style=\"vertical-align:middle; text-align:left;\">\n",
|
||||
"<font size=\"2\">\n",
|
||||
"Supplementary code for the <a href=\"http://mng.bz/orYv\">Build a Large Language Model From Scratch</a> book by <a href=\"https://sebastianraschka.com\">Sebastian Raschka</a><br>\n",
|
||||
"<br>Code repository: <a href=\"https://github.com/rasbt/LLMs-from-scratch\">https://github.com/rasbt/LLMs-from-scratch</a>\n",
|
||||
"</font>\n",
|
||||
"</td>\n",
|
||||
"<td style=\"vertical-align:middle; text-align:left;\">\n",
|
||||
"<a href=\"http://mng.bz/orYv\"><img src=\"https://sebastianraschka.com/images/LLMs-from-scratch-images/cover-small.webp\" width=\"100px\"></a>\n",
|
||||
"</td>\n",
|
||||
"</tr>\n",
|
||||
"</table>"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "b1910a06-e8a3-40ac-8201-ff70615b1ba4",
|
||||
"metadata": {
|
||||
"tags": []
|
||||
},
|
||||
"source": [
|
||||
"# Evaluating Instruction Responses Locally Using a Llama 3 Model Via Ollama"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "a128651b-f326-4232-a994-42f38b7ed520",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"- This notebook uses an 8-billion-parameter Llama 3 model through ollama to evaluate responses of instruction finetuned LLMs based on a dataset in JSON format that includes the generated model responses, for example:\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"```python\n",
|
||||
"{\n",
|
||||
" \"instruction\": \"What is the atomic number of helium?\",\n",
|
||||
" \"input\": \"\",\n",
|
||||
" \"output\": \"The atomic number of helium is 2.\", # <-- The target given in the test set\n",
|
||||
" \"model 1 response\": \"\\nThe atomic number of helium is 2.0.\", # <-- Response by an LLM\n",
|
||||
" \"model 2 response\": \"\\nThe atomic number of helium is 3.\" # <-- Response by a 2nd LLM\n",
|
||||
"},\n",
|
||||
"```\n",
|
||||
"\n",
|
||||
"- The code doesn't require a GPU and runs on a laptop (it was tested on a M3 MacBook Air)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 1,
|
||||
"id": "63610acc-db94-437f-8d38-e99dca0299cb",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"tqdm version: 4.66.4\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"from importlib.metadata import version\n",
|
||||
"\n",
|
||||
"pkgs = [\"tqdm\", # Progress bar\n",
|
||||
" ]\n",
|
||||
"\n",
|
||||
"for p in pkgs:\n",
|
||||
" print(f\"{p} version: {version(p)}\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "8bcdcb34-ac75-4f4f-9505-3ce0666c42d5",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Installing Ollama and Downloading Llama 3"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "5a092280-5462-4709-a3fe-8669a4a8a0a6",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"- Ollama is an application to run LLMs efficiently\n",
|
||||
"- It is a wrapper around [llama.cpp](https://github.com/ggerganov/llama.cpp), which implements LLMs in pure C/C++ to maximize efficiency\n",
|
||||
"- Note that it is a tool for using LLMs to generate text (inference), not training or finetuning LLMs\n",
|
||||
"- Prior to running the code below, install ollama by visiting [https://ollama.com](https://ollama.com) and following the instructions (for instance, clicking on the \"Download\" button and downloading the ollama application for your operating system)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "9558a522-650d-401a-84fc-9fd7b1f39da7",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"- For macOS and Windows users, click on the ollama application you downloaded; if it prompts you to install the command line usage, say \"yes\"\n",
|
||||
"- Linux users can use the installation command provided on the ollama website\n",
|
||||
"\n",
|
||||
"- In general, before we can use ollama from the command line, we have to either start the ollama application or run `ollama serve` in a separate terminal\n",
|
||||
"\n",
|
||||
"<img src=\"https://sebastianraschka.com/images/LLMs-from-scratch-images/bonus/ollama-eval/ollama-serve.webp?1\">\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"- With the ollama application or `ollama serve` running, in a different terminal, on the command line, execute the following command to try out the 8-billion-parameter Llama 3 model (the model, which takes up 4.7 GB of storage space, will be automatically downloaded the first time you execute this command)\n",
|
||||
"\n",
|
||||
"```bash\n",
|
||||
"# 8B model\n",
|
||||
"ollama run llama3\n",
|
||||
"```\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"The output looks like as follows:\n",
|
||||
"\n",
|
||||
"```\n",
|
||||
"$ ollama run llama3\n",
|
||||
"pulling manifest \n",
|
||||
"pulling 6a0746a1ec1a... 100% ▕████████████████▏ 4.7 GB \n",
|
||||
"pulling 4fa551d4f938... 100% ▕████████████████▏ 12 KB \n",
|
||||
"pulling 8ab4849b038c... 100% ▕████████████████▏ 254 B \n",
|
||||
"pulling 577073ffcc6c... 100% ▕████████████████▏ 110 B \n",
|
||||
"pulling 3f8eb4da87fa... 100% ▕████████████████▏ 485 B \n",
|
||||
"verifying sha256 digest \n",
|
||||
"writing manifest \n",
|
||||
"removing any unused layers \n",
|
||||
"success \n",
|
||||
"```\n",
|
||||
"\n",
|
||||
"- Note that `llama3` refers to the instruction finetuned 8-billion-parameter Llama 3 model\n",
|
||||
"\n",
|
||||
"- Alternatively, you can also use the larger 70-billion-parameter Llama 3 model, if your machine supports it, by replacing `llama3` with `llama3:70b`\n",
|
||||
"\n",
|
||||
"- After the download has been completed, you will see a command line prompt that allows you to chat with the model\n",
|
||||
"\n",
|
||||
"- Try a prompt like \"What do llamas eat?\", which should return an output similar to the following:\n",
|
||||
"\n",
|
||||
"```\n",
|
||||
">>> What do llamas eat?\n",
|
||||
"Llamas are ruminant animals, which means they have a four-chambered \n",
|
||||
"stomach and eat plants that are high in fiber. In the wild, llamas \n",
|
||||
"typically feed on:\n",
|
||||
"1. Grasses: They love to graze on various types of grasses, including tall \n",
|
||||
"grasses, wheat, oats, and barley.\n",
|
||||
"```"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "0b5addcb-fc7d-455d-bee9-6cc7a0d684c7",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"- You can end this session using the input `/bye`"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "dda155ee-cf36-44d3-b634-20ba8e1ca38a",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Using Ollama's REST API"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "89343a84-0ddc-42fc-bf50-298a342b93c0",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"- Now, an alternative way to interact with the model is via its REST API in Python via the following function\n",
|
||||
"- Before you run the next cells in this notebook, make sure that ollama is still running, as described above, via\n",
|
||||
" - `ollama serve` in a terminal\n",
|
||||
" - the ollama application\n",
|
||||
"- Next, run the following code cell to query the model"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "16642a48-1cab-40d2-af08-ab8c2fbf5876",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"- First, let's try the API with a simple example to make sure it works as intended:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 2,
|
||||
"id": "65b0ba76-1fb1-4306-a7c2-8f3bb637ccdb",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Llamas are herbivores, which means they primarily feed on plant-based foods. Their diet typically consists of:\n",
|
||||
"\n",
|
||||
"1. Grasses: Llamas love to graze on various types of grasses, including tall grasses, short grasses, and even weeds.\n",
|
||||
"2. Hay: High-quality hay, such as alfalfa or timothy hay, is a staple in a llama's diet. They enjoy the sweet taste and texture of fresh hay.\n",
|
||||
"3. Grains: Llamas may receive grains like oats, barley, or corn as part of their daily ration. However, it's essential to provide these grains in moderation, as they can be high in calories.\n",
|
||||
"4. Fruits and vegetables: Llamas enjoy a variety of fruits and veggies, such as apples, carrots, sweet potatoes, and leafy greens like kale or spinach.\n",
|
||||
"5. Minerals: Llamas require access to mineral supplements, which help maintain their overall health and well-being.\n",
|
||||
"\n",
|
||||
"In the wild, llamas might also eat:\n",
|
||||
"\n",
|
||||
"1. Leaves: They'll munch on leaves from trees and shrubs, including plants like willow, alder, and birch.\n",
|
||||
"2. Bark: In some cases, llamas may eat the bark of certain trees, like aspen or cottonwood.\n",
|
||||
"3. Mosses and lichens: These non-vascular plants can be a tasty snack for llamas.\n",
|
||||
"\n",
|
||||
"In captivity, llama owners typically provide a balanced diet that includes a mix of hay, grains, and fruits/vegetables. It's essential to consult with a veterinarian or experienced llama breeder to determine the best feeding plan for your llama.\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"import urllib.request\n",
|
||||
"import json\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"def query_model(prompt, model=\"llama3\", url=\"http://localhost:11434/api/chat\"):\n",
|
||||
" # Create the data payload as a dictionary\n",
|
||||
" data = {\n",
|
||||
" \"model\": model,\n",
|
||||
" \"messages\": [\n",
|
||||
" {\n",
|
||||
" \"role\": \"user\",\n",
|
||||
" \"content\": prompt\n",
|
||||
" }\n",
|
||||
" ],\n",
|
||||
" \"options\": { # Settings below are required for deterministic responses\n",
|
||||
" \"seed\": 123,\n",
|
||||
" \"temperature\": 0,\n",
|
||||
" \"num_ctx\": 2048\n",
|
||||
" }\n",
|
||||
" }\n",
|
||||
"\n",
|
||||
" # Convert the dictionary to a JSON formatted string and encode it to bytes\n",
|
||||
" payload = json.dumps(data).encode(\"utf-8\")\n",
|
||||
"\n",
|
||||
" # Create a request object, setting the method to POST and adding necessary headers\n",
|
||||
" request = urllib.request.Request(url, data=payload, method=\"POST\")\n",
|
||||
" request.add_header(\"Content-Type\", \"application/json\")\n",
|
||||
"\n",
|
||||
" # Send the request and capture the response\n",
|
||||
" response_data = \"\"\n",
|
||||
" with urllib.request.urlopen(request) as response:\n",
|
||||
" # Read and decode the response\n",
|
||||
" while True:\n",
|
||||
" line = response.readline().decode(\"utf-8\")\n",
|
||||
" if not line:\n",
|
||||
" break\n",
|
||||
" response_json = json.loads(line)\n",
|
||||
" response_data += response_json[\"message\"][\"content\"]\n",
|
||||
"\n",
|
||||
" return response_data\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"result = query_model(\"What do Llamas eat?\")\n",
|
||||
"print(result)"
|
||||
]
|
||||
},
|
||||
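{
"cell_type": "markdown",
"id": "7f3e2d1c-4b5a-4c6d-8e9f-0a1b2c3d4e50",
"metadata": {},
"source": [
"- The `model` argument of `query_model` defaults to `\"llama3\"`; as a small optional sketch (assuming you also downloaded the larger model mentioned above via `ollama run llama3:70b`), you could query it explicitly:\n",
"\n",
"```python\n",
"# Optional: requires the 70B model to have been downloaded first\n",
"result = query_model(\"What do Llamas eat?\", model=\"llama3:70b\")\n",
"print(result)\n",
"```"
]
},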
{
|
||||
"cell_type": "markdown",
|
||||
"id": "162a4739-6f03-4092-a5c2-f57a0b6a4c4d",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Load JSON Entries"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "ca011a8b-20c5-4101-979e-9b5fccf62f8a",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"- Now, let's get to the data evaluation part\n",
|
||||
"- Here, we assume that we saved the test dataset and the model responses as a JSON file that we can load as follows:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 3,
|
||||
"id": "8b2d393a-aa92-4190-9d44-44326a6f699b",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Number of entries: 100\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"json_file = \"eval-example-data.json\"\n",
|
||||
"\n",
|
||||
"with open(json_file, \"r\") as file:\n",
|
||||
" json_data = json.load(file)\n",
|
||||
"\n",
|
||||
"print(\"Number of entries:\", len(json_data))"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "b6c9751b-59b7-43fe-acc7-14e8daf2fa66",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"- The structure of this file is as follows, where we have the given response in the test dataset (`'output'`) and responses by two different models (`'model 1 response'` and `'model 2 response'`):"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 4,
|
||||
"id": "7222fdc0-5684-4f2b-b741-3e341851359e",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"{'instruction': 'Calculate the hypotenuse of a right triangle with legs of 6 cm and 8 cm.',\n",
|
||||
" 'input': '',\n",
|
||||
" 'output': 'The hypotenuse of the triangle is 10 cm.',\n",
|
||||
" 'model 1 response': '\\nThe hypotenuse of the triangle is 3 cm.',\n",
|
||||
" 'model 2 response': '\\nThe hypotenuse of the triangle is 12 cm.'}"
|
||||
]
|
||||
},
|
||||
"execution_count": 4,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"json_data[0]"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "fcf0331b-6024-4bba-89a9-a088b14a1046",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"- Below is a small utility function that formats the input for visualization purposes later:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 5,
|
||||
"id": "43263cd3-e5fb-4ab5-871e-3ad6e7d21a8c",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"def format_input(entry):\n",
|
||||
" instruction_text = (\n",
|
||||
" f\"Below is an instruction that describes a task. Write a response that \"\n",
|
||||
" f\"appropriately completes the request.\"\n",
|
||||
" f\"\\n\\n### Instruction:\\n{entry['instruction']}\"\n",
|
||||
" )\n",
|
||||
"\n",
|
||||
" input_text = f\"\\n\\n### Input:\\n{entry['input']}\" if entry[\"input\"] else \"\"\n",
|
||||
" instruction_text + input_text\n",
|
||||
"\n",
|
||||
" return instruction_text + input_text"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "39a55283-7d51-4136-ba60-f799d49f4098",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"- Now, let's try the ollama API to compare the model responses (we only evaluate the first 5 responses for a visual comparison):"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 6,
|
||||
"id": "735cc089-d127-480a-b39d-0782581f0c41",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"\n",
|
||||
"Dataset response:\n",
|
||||
">> The hypotenuse of the triangle is 10 cm.\n",
|
||||
"\n",
|
||||
"Model response:\n",
|
||||
">> \n",
|
||||
"The hypotenuse of the triangle is 3 cm.\n",
|
||||
"\n",
|
||||
"Score:\n",
|
||||
">> I'd score this response as 0 out of 100.\n",
|
||||
"\n",
|
||||
"The correct answer is \"The hypotenuse of the triangle is 10 cm.\", not \"3 cm.\". The model failed to accurately calculate the length of the hypotenuse, which is a fundamental concept in geometry and trigonometry.\n",
|
||||
"\n",
|
||||
"-------------------------\n",
|
||||
"\n",
|
||||
"Dataset response:\n",
|
||||
">> 1. Squirrel\n",
|
||||
"2. Eagle\n",
|
||||
"3. Tiger\n",
|
||||
"\n",
|
||||
"Model response:\n",
|
||||
">> \n",
|
||||
"1. Squirrel\n",
|
||||
"2. Tiger\n",
|
||||
"3. Eagle\n",
|
||||
"4. Cobra\n",
|
||||
"5. Tiger\n",
|
||||
"6. Cobra\n",
|
||||
"\n",
|
||||
"Score:\n",
|
||||
">> I'd rate this model response as 60 out of 100.\n",
|
||||
"\n",
|
||||
"Here's why:\n",
|
||||
"\n",
|
||||
"* The model correctly identifies two animals that are active during the day: Squirrel and Eagle.\n",
|
||||
"* However, it incorrectly includes Tiger twice, which is not a different animal from the original list.\n",
|
||||
"* Cobra is also an incorrect answer, as it is typically nocturnal or crepuscular (active at twilight).\n",
|
||||
"* The response does not meet the instruction to provide three different animals that are active during the day.\n",
|
||||
"\n",
|
||||
"To achieve a higher score, the model should have provided three unique and correct answers that fit the instruction.\n",
|
||||
"\n",
|
||||
"-------------------------\n",
|
||||
"\n",
|
||||
"Dataset response:\n",
|
||||
">> I must ascertain what is incorrect.\n",
|
||||
"\n",
|
||||
"Model response:\n",
|
||||
">> \n",
|
||||
"What is incorrect?\n",
|
||||
"\n",
|
||||
"Score:\n",
|
||||
">> A clever test!\n",
|
||||
"\n",
|
||||
"Here's my attempt at rewriting the sentence in a more formal way:\n",
|
||||
"\n",
|
||||
"\"I require an identification of the issue.\"\n",
|
||||
"\n",
|
||||
"Now, let's evaluate the model response \"What is incorrect?\" against the correct output \"I must ascertain what is incorrect.\".\n",
|
||||
"\n",
|
||||
"To me, this seems like a completely different question being asked. The original instruction was to rewrite the sentence in a more formal way, and the model response doesn't even attempt to do that. It's asking a new question altogether!\n",
|
||||
"\n",
|
||||
"So, I'd score this response a 0 out of 100.\n",
|
||||
"\n",
|
||||
"-------------------------\n",
|
||||
"\n",
|
||||
"Dataset response:\n",
|
||||
">> The interjection in the sentence is 'Wow'.\n",
|
||||
"\n",
|
||||
"Model response:\n",
|
||||
">> \n",
|
||||
"The interjection in the sentence is 'Wow'.\n",
|
||||
"\n",
|
||||
"Score:\n",
|
||||
">> I'd score this model response as 100.\n",
|
||||
"\n",
|
||||
"Here's why:\n",
|
||||
"\n",
|
||||
"1. The instruction asks to identify the interjection in the sentence.\n",
|
||||
"2. The input sentence is provided: \"Wow, that was an amazing trick!\"\n",
|
||||
"3. The model correctly identifies the interjection as \"Wow\", which is a common English interjection used to express surprise or excitement.\n",
|
||||
"4. The response accurately answers the question and provides the correct information.\n",
|
||||
"\n",
|
||||
"Overall, the model's response perfectly completes the request, making it a 100% accurate answer!\n",
|
||||
"\n",
|
||||
"-------------------------\n",
|
||||
"\n",
|
||||
"Dataset response:\n",
|
||||
">> The type of sentence is interrogative.\n",
|
||||
"\n",
|
||||
"Model response:\n",
|
||||
">> \n",
|
||||
"The type of sentence is exclamatory.\n",
|
||||
"\n",
|
||||
"Score:\n",
|
||||
">> I'd rate this model response as 20 out of 100.\n",
|
||||
"\n",
|
||||
"Here's why:\n",
|
||||
"\n",
|
||||
"* The input sentence \"Did you finish the report?\" is indeed an interrogative sentence, which asks a question.\n",
|
||||
"* The model response says it's exclamatory, which is incorrect. Exclamatory sentences are typically marked by an exclamation mark (!) and express strong emotions or emphasis, whereas this sentence is simply asking a question.\n",
|
||||
"\n",
|
||||
"The correct output \"The type of sentence is interrogative.\" is the best possible score (100), while the model response is significantly off the mark, hence the low score.\n",
|
||||
"\n",
|
||||
"-------------------------\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"for entry in json_data[:5]:\n",
|
||||
" prompt = (f\"Given the input `{format_input(entry)}` \"\n",
|
||||
" f\"and correct output `{entry['output']}`, \"\n",
|
||||
" f\"score the model response `{entry['model 1 response']}`\"\n",
|
||||
" f\" on a scale from 0 to 100, where 100 is the best score. \"\n",
|
||||
" )\n",
|
||||
" print(\"\\nDataset response:\")\n",
|
||||
" print(\">>\", entry['output'])\n",
|
||||
" print(\"\\nModel response:\")\n",
|
||||
" print(\">>\", entry[\"model 1 response\"])\n",
|
||||
" print(\"\\nScore:\")\n",
|
||||
" print(\">>\", query_model(prompt))\n",
|
||||
" print(\"\\n-------------------------\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "142dfaa7-429f-4eb0-b74d-ff327f79547a",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"- Note that the responses are very verbose; to quantify which model is better, we only want to return the scores:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 7,
|
||||
"id": "3552bdfb-7511-42ac-a9ec-da672e2a5468",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from tqdm import tqdm\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"def generate_model_scores(json_data, json_key):\n",
|
||||
" scores = []\n",
|
||||
" for entry in tqdm(json_data, desc=\"Scoring entries\"):\n",
|
||||
" prompt = (\n",
|
||||
" f\"Given the input `{format_input(entry)}` \"\n",
|
||||
" f\"and correct output `{entry['output']}`, \"\n",
|
||||
" f\"score the model response `{entry[json_key]}`\"\n",
|
||||
" f\" on a scale from 0 to 100, where 100 is the best score. \"\n",
|
||||
" f\"Respond with the integer number only.\"\n",
|
||||
" )\n",
|
||||
" score = query_model(prompt)\n",
|
||||
" try:\n",
|
||||
" scores.append(int(score))\n",
|
||||
" except ValueError:\n",
|
||||
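" # Skip responses that cannot be parsed as an integer score\n",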
" continue\n",
|
||||
"\n",
|
||||
" return scores"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "b071ce84-1866-427f-a272-b46700f364b2",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"- Let's now apply this evaluation to the whole dataset and compute the average score of each model (this takes about 1 minute per model on an M3 MacBook Air laptop)\n",
|
||||
"- Note that ollama is not fully deterministic across operating systems (as of this writing) so the numbers you are getting might slightly differ from the ones shown below"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 8,
|
||||
"id": "4f700d4b-19e5-4404-afa7-b0f093024232",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stderr",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Scoring entries: 100%|████████████████████████| 100/100 [01:02<00:00, 1.59it/s]\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"\n",
|
||||
"model 1 response\n",
|
||||
"Number of scores: 100 of 100\n",
|
||||
"Average score: 78.48\n",
|
||||
"\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"name": "stderr",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Scoring entries: 100%|████████████████████████| 100/100 [01:10<00:00, 1.42it/s]"
|
||||
]
|
||||
},
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"\n",
|
||||
"model 2 response\n",
|
||||
"Number of scores: 99 of 100\n",
|
||||
"Average score: 64.98\n",
|
||||
"\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"name": "stderr",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"from pathlib import Path\n",
|
||||
"\n",
|
||||
"for model in (\"model 1 response\", \"model 2 response\"):\n",
|
||||
"\n",
|
||||
" scores = generate_model_scores(json_data, model)\n",
|
||||
" print(f\"\\n{model}\")\n",
|
||||
" print(f\"Number of scores: {len(scores)} of {len(json_data)}\")\n",
|
||||
" print(f\"Average score: {sum(scores)/len(scores):.2f}\\n\")\n",
|
||||
"\n",
|
||||
" # Optionally save the scores\n",
|
||||
" save_path = Path(\"scores\") / f\"llama3-8b-{model.replace(' ', '-')}.json\"\n",
|
||||
" with open(save_path, \"w\") as file:\n",
|
||||
" json.dump(scores, file)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "8169d534-1fec-43c4-9550-5cb701ff7f05",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"- Based on the evaluation above, we can say that the 1st model is better than the 2nd model"
|
||||
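"\n",
"- As an optional sanity check, a minimal sketch (assuming the two `scores/llama3-8b-*.json` files from the previous cell were written successfully) to reload the saved scores and recompute the averages could look like this:\n",
"\n",
"```python\n",
"import json\n",
"from pathlib import Path\n",
"\n",
"for model in (\"model 1 response\", \"model 2 response\"):\n",
"    score_path = Path(\"scores\") / f\"llama3-8b-{model.replace(' ', '-')}.json\"\n",
"    with open(score_path, \"r\") as file:\n",
"        scores = json.load(file)\n",
"    print(f\"{model}: {sum(scores)/len(scores):.2f}\")\n",
"```"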
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3 (ipykernel)",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.10.6"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 5
|
||||
}
|
||||
571
Codes/ch07/03_model-evaluation/llm-instruction-eval-openai.ipynb
Normal file
@ -0,0 +1,571 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "136a4efe-fb99-4311-8679-e0a5b6282755",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"<table style=\"width:100%\">\n",
|
||||
"<tr>\n",
|
||||
"<td style=\"vertical-align:middle; text-align:left;\">\n",
|
||||
"<font size=\"2\">\n",
|
||||
"Supplementary code for the <a href=\"http://mng.bz/orYv\">Build a Large Language Model From Scratch</a> book by <a href=\"https://sebastianraschka.com\">Sebastian Raschka</a><br>\n",
|
||||
"<br>Code repository: <a href=\"https://github.com/rasbt/LLMs-from-scratch\">https://github.com/rasbt/LLMs-from-scratch</a>\n",
|
||||
"</font>\n",
|
||||
"</td>\n",
|
||||
"<td style=\"vertical-align:middle; text-align:left;\">\n",
|
||||
"<a href=\"http://mng.bz/orYv\"><img src=\"https://sebastianraschka.com/images/LLMs-from-scratch-images/cover-small.webp\" width=\"100px\"></a>\n",
|
||||
"</td>\n",
|
||||
"</tr>\n",
|
||||
"</table>"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "b1910a06-e8a3-40ac-8201-ff70615b1ba4",
|
||||
"metadata": {
|
||||
"tags": []
|
||||
},
|
||||
"source": [
|
||||
"# Evaluating Instruction Responses Using the OpenAI API"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "a128651b-f326-4232-a994-42f38b7ed520",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"- This notebook uses OpenAI's GPT-4 API to evaluate responses by a instruction finetuned LLMs based on an dataset in JSON format that includes the generated model responses, for example:\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"```python\n",
|
||||
"{\n",
|
||||
" \"instruction\": \"What is the atomic number of helium?\",\n",
|
||||
" \"input\": \"\",\n",
|
||||
" \"output\": \"The atomic number of helium is 2.\", # <-- The target given in the test set\n",
|
||||
" \"model 1 response\": \"\\nThe atomic number of helium is 2.0.\", # <-- Response by an LLM\n",
|
||||
" \"model 2 response\": \"\\nThe atomic number of helium is 3.\" # <-- Response by a 2nd LLM\n",
|
||||
"},\n",
|
||||
"```"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 1,
|
||||
"id": "267ba0d1-b884-42df-85bd-0be746fd47a5",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# pip install -r requirements-extra.txt"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 2,
|
||||
"id": "63610acc-db94-437f-8d38-e99dca0299cb",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"openai version: 1.30.3\n",
|
||||
"tqdm version: 4.66.2\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"from importlib.metadata import version\n",
|
||||
"\n",
|
||||
"pkgs = [\"openai\", # OpenAI API\n",
|
||||
" \"tqdm\", # Progress bar\n",
|
||||
" ]\n",
|
||||
"\n",
|
||||
"for p in pkgs:\n",
|
||||
" print(f\"{p} version: {version(p)}\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "8bcdcb34-ac75-4f4f-9505-3ce0666c42d5",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Test OpenAI API"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "9558a522-650d-401a-84fc-9fd7b1f39da7",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"- First, let's test if the OpenAI API is correctly set up\n",
|
||||
"- If you don't have an account yet, you need to create one at https://platform.openai.com/\n",
|
||||
"- Note that you will also have to transfer some funds to your account as the GPT-4 API is not free (see https://platform.openai.com/settings/organization/billing/overview)\n",
|
||||
"- Running the experiments and creating the ~200 evaluations using the code in this notebook costs about $0.26 (26 cents) as of this writing"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "89343a84-0ddc-42fc-bf50-298a342b93c0",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"- First, we need to provide our OpenAI API secret key, which can be found at https://platform.openai.com/api-keys\n",
|
||||
"- Make sure not to share this key with anyone\n",
|
||||
"- Add this secret key (`\"sk-...\"`) to the `config.json` file in this folder"
|
||||
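"\n",
"- For reference, a minimal `config.json` could look like the following sketch (the `OPENAI_API_KEY` field name is what the code cell below reads; the `\"sk-...\"` value is just a placeholder for your own key):\n",
"\n",
"```json\n",
"{\n",
"    \"OPENAI_API_KEY\": \"sk-...\"\n",
"}\n",
"```"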
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 3,
|
||||
"id": "65b0ba76-1fb1-4306-a7c2-8f3bb637ccdb",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import json\n",
|
||||
"from openai import OpenAI\n",
|
||||
"\n",
|
||||
"# Load API key from a JSON file.\n",
|
||||
"# Make sure to replace \"sk-...\" with your actual API key from https://platform.openai.com/api-keys\n",
|
||||
"with open(\"config.json\", \"r\") as config_file:\n",
|
||||
" config = json.load(config_file)\n",
|
||||
" api_key = config[\"OPENAI_API_KEY\"]\n",
|
||||
"\n",
|
||||
"client = OpenAI(api_key=api_key)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "16642a48-1cab-40d2-af08-ab8c2fbf5876",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"- First, let's try the API with a simple example to make sure it works as intended:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 4,
|
||||
"id": "08e9ef2e-e816-4283-840e-43625791ad33",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"'hello world'"
|
||||
]
|
||||
},
|
||||
"execution_count": 4,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"def run_chatgpt(prompt, client, model=\"gpt-4-turbo\"):\n",
|
||||
" response = client.chat.completions.create(\n",
|
||||
" model=model,\n",
|
||||
" messages=[{\"role\": \"user\", \"content\": prompt}],\n",
|
||||
" temperature=0.0,\n",
|
||||
" seed=123,\n",
|
||||
" )\n",
|
||||
" return response.choices[0].message.content\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"prompt = f\"Respond with 'hello world' if you got this message.\"\n",
|
||||
"run_chatgpt(prompt, client)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "162a4739-6f03-4092-a5c2-f57a0b6a4c4d",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Load JSON Entries"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "ca011a8b-20c5-4101-979e-9b5fccf62f8a",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"- Here, we assume that we saved the test dataset and the model responses as a JSON file that we can load as follows:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 5,
|
||||
"id": "8b2d393a-aa92-4190-9d44-44326a6f699b",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Number of entries: 100\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"json_file = \"eval-example-data.json\"\n",
|
||||
"\n",
|
||||
"with open(json_file, \"r\") as file:\n",
|
||||
" json_data = json.load(file)\n",
|
||||
"\n",
|
||||
"print(\"Number of entries:\", len(json_data))"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "b6c9751b-59b7-43fe-acc7-14e8daf2fa66",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"- The structure of this file is as follows, where we have the given response in the test dataset (`'output'`) and responses by two different models (`'model 1 response'` and `'model 2 response'`):"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 6,
|
||||
"id": "7222fdc0-5684-4f2b-b741-3e341851359e",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"{'instruction': 'Calculate the hypotenuse of a right triangle with legs of 6 cm and 8 cm.',\n",
|
||||
" 'input': '',\n",
|
||||
" 'output': 'The hypotenuse of the triangle is 10 cm.',\n",
|
||||
" 'model 1 response': '\\nThe hypotenuse of the triangle is 3 cm.',\n",
|
||||
" 'model 2 response': '\\nThe hypotenuse of the triangle is 12 cm.'}"
|
||||
]
|
||||
},
|
||||
"execution_count": 6,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"json_data[0]"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "fcf0331b-6024-4bba-89a9-a088b14a1046",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"- Below is a small utility function that formats the input for visualization purposes later:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 7,
|
||||
"id": "43263cd3-e5fb-4ab5-871e-3ad6e7d21a8c",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"def format_input(entry):\n",
|
||||
" instruction_text = (\n",
|
||||
" f\"Below is an instruction that describes a task. Write a response that \"\n",
|
||||
" f\"appropriately completes the request.\"\n",
|
||||
" f\"\\n\\n### Instruction:\\n{entry['instruction']}\"\n",
|
||||
" )\n",
|
||||
"\n",
|
||||
" input_text = f\"\\n\\n### Input:\\n{entry['input']}\" if entry[\"input\"] else \"\"\n",
|
||||
" instruction_text + input_text\n",
|
||||
"\n",
|
||||
" return instruction_text + input_text"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "39a55283-7d51-4136-ba60-f799d49f4098",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"- Now, let's try the OpenAI API to compare the model responses (we only evaluate the first 5 responses for a visual comparison):"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 8,
|
||||
"id": "735cc089-d127-480a-b39d-0782581f0c41",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"\n",
|
||||
"Dataset response:\n",
|
||||
">> The hypotenuse of the triangle is 10 cm.\n",
|
||||
"\n",
|
||||
"Model response:\n",
|
||||
">> \n",
|
||||
"The hypotenuse of the triangle is 3 cm.\n",
|
||||
"\n",
|
||||
"Score:\n",
|
||||
">> The model response \"The hypotenuse of the triangle is 3 cm.\" is incorrect. The correct calculation of the hypotenuse for a right triangle with legs of 6 cm and 8 cm can be found using the Pythagorean theorem, which states that the square of the hypotenuse (c) is equal to the sum of the squares of the other two sides (a and b). Mathematically, this is expressed as:\n",
|
||||
"\n",
|
||||
"\\[ c = \\sqrt{a^2 + b^2} \\]\n",
|
||||
"\\[ c = \\sqrt{6^2 + 8^2} \\]\n",
|
||||
"\\[ c = \\sqrt{36 + 64} \\]\n",
|
||||
"\\[ c = \\sqrt{100} \\]\n",
|
||||
"\\[ c = 10 \\text{ cm} \\]\n",
|
||||
"\n",
|
||||
"The correct answer should be 10 cm. The response given as 3 cm is not only incorrect but also significantly off from the correct value. This error could lead to misunderstandings or incorrect applications in practical scenarios where precise measurements are crucial.\n",
|
||||
"\n",
|
||||
"Given the scale from 0 to 100, where 100 is the best score, the response would score very low due to its inaccuracy. However, since the response format is correct (stating the measurement and unit), it does not score the absolute minimum.\n",
|
||||
"\n",
|
||||
"**Score: 10/100**\n",
|
||||
"\n",
|
||||
"This score reflects that while the format of the response is correct, the content is highly inaccurate.\n",
|
||||
"\n",
|
||||
"-------------------------\n",
|
||||
"\n",
|
||||
"Dataset response:\n",
|
||||
">> 1. Squirrel\n",
|
||||
"2. Eagle\n",
|
||||
"3. Tiger\n",
|
||||
"\n",
|
||||
"Model response:\n",
|
||||
">> \n",
|
||||
"1. Squirrel\n",
|
||||
"2. Tiger\n",
|
||||
"3. Eagle\n",
|
||||
"4. Cobra\n",
|
||||
"5. Tiger\n",
|
||||
"6. Cobra\n",
|
||||
"\n",
|
||||
"Score:\n",
|
||||
">> The model response lists six animals, three of which (squirrel, tiger, eagle) are indeed active during the day, making them correct responses to the instruction. However, the instruction specifically asked for three different animals, and the model response includes repetitions (tiger and cobra are each listed twice) and also exceeds the requested number of animals.\n",
|
||||
"\n",
|
||||
"The inclusion of \"cobra\" is incorrect as most cobras are not diurnal (active during the day); they are generally more active during the early morning and late evening, which can be considered crepuscular rather than diurnal.\n",
|
||||
"\n",
|
||||
"### Scoring Breakdown:\n",
|
||||
"- **Relevance to the task**: The response correctly identifies three diurnal animals but also includes additional animals, which was not requested.\n",
|
||||
"- **Accuracy**: Including animals not active during the day (cobra) and repeating animals reduces the accuracy.\n",
|
||||
"- **Adherence to instructions**: The task was to name three different animals, but the response included six names with repetitions.\n",
|
||||
"\n",
|
||||
"Given these points, the response partially meets the requirements but also deviates significantly in terms of the number of animals and the inclusion of incorrect and repeated entries.\n",
|
||||
"\n",
|
||||
"### Score: 50/100\n",
|
||||
"This score reflects that while the response did include three correct animals, it failed to strictly follow the instructions by listing only three different animals and included incorrect information.\n",
|
||||
"\n",
|
||||
"-------------------------\n",
|
||||
"\n",
|
||||
"Dataset response:\n",
|
||||
">> I must ascertain what is incorrect.\n",
|
||||
"\n",
|
||||
"Model response:\n",
|
||||
">> \n",
|
||||
"What is incorrect?\n",
|
||||
"\n",
|
||||
"Score:\n",
|
||||
">> The model response \"What is incorrect?\" scores low in terms of fulfilling the instruction to rewrite the sentence in a more formal way. The original sentence \"I need to find out what's wrong.\" expresses a personal obligation and a process of discovery, which is not captured in the model response. The model response turns the sentence into a direct question and loses the nuance of needing to discover or investigate the issue.\n",
|
||||
"\n",
|
||||
"**Score: 20/100**\n",
|
||||
"\n",
|
||||
"**Reasoning:**\n",
|
||||
"- **Formality:** The response is slightly more formal than casual speech but does not elevate the formality significantly or appropriately. It does use \"incorrect\" which is slightly more formal than \"wrong.\"\n",
|
||||
"- **Completeness:** The response fails to include the aspect of needing to find out or ascertain, which is a critical part of the original sentence.\n",
|
||||
"- **Accuracy:** The response changes the structure and intent by converting it into a direct question, which does not align with the instruction to rewrite the statement while maintaining its original intent.\n",
|
||||
"\n",
|
||||
"Overall, the response does not adequately meet the requirements of the task as it significantly alters the meaning and omits key elements of the original sentence.\n",
|
||||
"\n",
|
||||
"-------------------------\n",
|
||||
"\n",
|
||||
"Dataset response:\n",
|
||||
">> The interjection in the sentence is 'Wow'.\n",
|
||||
"\n",
|
||||
"Model response:\n",
|
||||
">> \n",
|
||||
"The interjection in the sentence is 'Wow'.\n",
|
||||
"\n",
|
||||
"Score:\n",
|
||||
">> The model response `The interjection in the sentence is 'Wow'.` accurately identifies the interjection in the provided sentence. The response is clear, directly addresses the instruction, and correctly identifies \"Wow\" as the interjection, which is used to express surprise or admiration, fitting the context of the sentence. Therefore, the response is fully correct and meets all the requirements of the task.\n",
|
||||
"\n",
|
||||
"Score: 100/100\n",
|
||||
"\n",
|
||||
"-------------------------\n",
|
||||
"\n",
|
||||
"Dataset response:\n",
|
||||
">> The type of sentence is interrogative.\n",
|
||||
"\n",
|
||||
"Model response:\n",
|
||||
">> \n",
|
||||
"The type of sentence is exclamatory.\n",
|
||||
"\n",
|
||||
"Score:\n",
|
||||
">> The model response \"The type of sentence is exclamatory.\" is incorrect. The input sentence \"Did you finish the report?\" is clearly an interrogative sentence as it is asking a question, indicated by the question mark at the end and the structure of the sentence.\n",
|
||||
"\n",
|
||||
"Given the scoring criteria where 100 is the best score and should be awarded to a correct and precise response, the model's response should receive a low score because it incorrectly identifies the type of sentence. An exclamatory sentence typically expresses strong emotion and ends with an exclamation mark, which is not the case here.\n",
|
||||
"\n",
|
||||
"Therefore, the score for the model response would be 0 out of 100, as it completely misidentifies the type of sentence, providing incorrect information.\n",
|
||||
"\n",
|
||||
"-------------------------\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"for entry in json_data[:5]:\n",
|
||||
" prompt = (f\"Given the input `{format_input(entry)}` \"\n",
|
||||
" f\"and correct output `{entry['output']}`, \"\n",
|
||||
" f\"score the model response `{entry['model 1 response']}`\"\n",
|
||||
" f\" on a scale from 0 to 100, where 100 is the best score. \"\n",
|
||||
" )\n",
|
||||
" print(\"\\nDataset response:\")\n",
|
||||
" print(\">>\", entry['output'])\n",
|
||||
" print(\"\\nModel response:\")\n",
|
||||
" print(\">>\", entry[\"model 1 response\"])\n",
|
||||
" print(\"\\nScore:\")\n",
|
||||
" print(\">>\", run_chatgpt(prompt, client))\n",
|
||||
" print(\"\\n-------------------------\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "142dfaa7-429f-4eb0-b74d-ff327f79547a",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"- Note that the responses are very verbose; to quantify which model is better, we only want to return the scores:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 9,
|
||||
"id": "3552bdfb-7511-42ac-a9ec-da672e2a5468",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from tqdm import tqdm\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"def generate_model_scores(json_data, json_key, client):\n",
|
||||
" scores = []\n",
|
||||
" for entry in tqdm(json_data, desc=\"Scoring entries\"):\n",
|
||||
" prompt = (\n",
|
||||
" f\"Given the input `{format_input(entry)}` \"\n",
|
||||
" f\"and correct output `{entry['output']}`, \"\n",
|
||||
" f\"score the model response `{entry[json_key]}`\"\n",
|
||||
" f\" on a scale from 0 to 100, where 100 is the best score. \"\n",
|
||||
" f\"Respond with the number only.\"\n",
|
||||
" )\n",
|
||||
" score = run_chatgpt(prompt, client)\n",
|
||||
" try:\n",
|
||||
" scores.append(int(score))\n",
|
||||
" except ValueError:\n",
|
||||
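" # Skip responses that cannot be parsed as an integer score\n",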
" continue\n",
|
||||
"\n",
|
||||
" return scores"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "71974dea-31ed-49af-abba-5c858bbbf49c",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"- Please note that the response scores may vary because OpenAI's GPT models are not deterministic despite setting a random number seed, etc."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "b071ce84-1866-427f-a272-b46700f364b2",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"- Let's now apply this evaluation to the whole dataset and compute the average score of each model:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 11,
|
||||
"id": "4f700d4b-19e5-4404-afa7-b0f093024232",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stderr",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Scoring entries: 100%|████████████████████████| 100/100 [01:03<00:00, 1.56it/s]\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"\n",
|
||||
"model 1 response\n",
|
||||
"Number of scores: 100 of 100\n",
|
||||
"Average score: 74.09\n",
|
||||
"\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"name": "stderr",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Scoring entries: 100%|████████████████████████| 100/100 [01:06<00:00, 1.50it/s]"
|
||||
]
|
||||
},
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"\n",
|
||||
"model 2 response\n",
|
||||
"Number of scores: 100 of 100\n",
|
||||
"Average score: 56.57\n",
|
||||
"\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"name": "stderr",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"from pathlib import Path\n",
|
||||
"\n",
|
||||
"for model in (\"model 1 response\", \"model 2 response\"):\n",
|
||||
"\n",
|
||||
" scores = generate_model_scores(json_data, model, client)\n",
|
||||
" print(f\"\\n{model}\")\n",
|
||||
" print(f\"Number of scores: {len(scores)} of {len(json_data)}\")\n",
|
||||
" print(f\"Average score: {sum(scores)/len(scores):.2f}\\n\")\n",
|
||||
"\n",
|
||||
" # Optionally save the scores\n",
|
||||
" save_path = Path(\"scores\") / f\"gpt4-{model.replace(' ', '-')}.json\"\n",
|
||||
" with open(save_path, \"w\") as file:\n",
|
||||
" json.dump(scores, file)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "8169d534-1fec-43c4-9550-5cb701ff7f05",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"- Based on the evaluation above, we can say that the 1st model is substantially better than the 2nd model"
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3 (ipykernel)",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.10.11"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 5
|
||||
}
|
||||
2
Codes/ch07/03_model-evaluation/requirements-extra.txt
Normal file
@ -0,0 +1,2 @@
|
||||
openai>=1.30.3
|
||||
tqdm>=4.65.0
|
||||
269
Codes/ch07/03_model-evaluation/scores/correlation-analysis.ipynb
Normal file
File diff suppressed because one or more lines are too long
@ -0,0 +1 @@
|
||||
[0, 50, 20, 100, 0, 100, 0, 100, 100, 100, 55, 0, 100, 100, 100, 100, 100, 0, 98, 100, 100, 0, 100, 100, 100, 100, 100, 100, 0, 100, 100, 0, 100, 100, 85, 100, 0, 0, 100, 100, 100, 100, 100, 100, 0, 100, 100, 95, 20, 50, 85, 100, 100, 100, 100, 55, 100, 100, 100, 0, 100, 98, 100, 100, 100, 0, 85, 100, 100, 98, 100, 100, 100, 0, 100, 100, 100, 100, 0, 100, 0, 100, 100, 0, 0, 100, 50, 100, 100, 10, 100, 100, 100, 100, 0, 100, 100, 25, 100, 30]
|
||||
@ -0,0 +1 @@
|
||||
[0, 100, 0, 100, 0, 100, 0, 100, 0, 0, 50, 0, 100, 100, 100, 100, 100, 100, 100, 95, 0, 50, 100, 100, 0, 0, 100, 0, 0, 100, 0, 0, 100, 0, 67, 0, 0, 0, 100, 100, 95, 100, 100, 100, 0, 0, 0, 0, 100, 100, 100, 0, 55, 100, 0, 100, 65, 100, 100, 0, 100, 100, 100, 0, 100, 0, 85, 100, 100, 85, 0, 75, 100, 0, 0, 100, 100, 100, 0, 100, 0, 50, 100, 100, 0, 100, 0, 0, 100, 85, 100, 0, 100, 100, 0, 100, 100, 0, 0, 0]
|
||||
@ -0,0 +1 @@
|
||||
[20, 92, 85, 90, 20, 90, 22, 97, 60, 96, 20, 20, 98, 95, 90, 98, 95, 20, 98, 98, 92, 20, 96, 96, 100, 98, 98, 95, 20, 95, 98, 20, 85, 95, 80, 97, 40, 21, 100, 85, 95, 98, 92, 98, 69, 98, 80, 60, 60, 20, 80, 68, 80, 96, 96, 68, 80, 95, 80, 20, 95, 98, 80, 98, 94, 20, 40, 98, 100, 85, 98, 90, 95, 85, 95, 80, 98, 98, 25, 98, 40, 92, 95, 82, 87, 98, 80, 90, 95, 4, 90, 90, 80, 98, 20, 98, 98, 40, 92, 98]
|
||||
@ -0,0 +1 @@
|
||||
[76, 85, 67, 90, 20, 98, 22, 96, 40, 80, 40, 20, 90, 98, 80, 92, 98, 98, 95, 99, 55, 99, 80, 90, 20, 4, 98, 4, 40, 95, 14, 44, 95, 44, 80, 4, 4, 40, 95, 80, 98, 95, 92, 98, 68, 20, 20, 60, 95, 90, 98, 0, 20, 80, 20, 80, 92, 98, 98, 20, 95, 100, 95, 85, 98, 4, 40, 98, 98, 65, 20, 76, 100, 67, 44, 92, 75, 97, 27, 98, 20, 60, 90, 96, 67, 98, 80, 10, 80, 98, 100, 40, 92, 98, 20, 98, 98, 20, 20]
|
||||
7
Codes/ch07/04_preference-tuning-with-dpo/README.md
Normal file
@ -0,0 +1,7 @@
|
||||
# Chapter 7: Finetuning to Follow Instructions
|
||||
|
||||
- [create-preference-data-ollama.ipynb](create-preference-data-ollama.ipynb): A notebook that creates a synthetic dataset for preference finetuning using Llama 3.1 and Ollama
|
||||
|
||||
- [dpo-from-scratch.ipynb](dpo-from-scratch.ipynb): This notebook implements Direct Preference Optimization (DPO) for LLM alignment
|
||||
|
||||
|
||||
@ -0,0 +1,595 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "136a4efe-fb99-4311-8679-e0a5b6282755",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"<table style=\"width:100%\">\n",
|
||||
"<tr>\n",
|
||||
"<td style=\"vertical-align:middle; text-align:left;\">\n",
|
||||
"<font size=\"2\">\n",
|
||||
"Supplementary code for the <a href=\"http://mng.bz/orYv\">Build a Large Language Model From Scratch</a> book by <a href=\"https://sebastianraschka.com\">Sebastian Raschka</a><br>\n",
|
||||
"<br>Code repository: <a href=\"https://github.com/rasbt/LLMs-from-scratch\">https://github.com/rasbt/LLMs-from-scratch</a>\n",
|
||||
"</font>\n",
|
||||
"</td>\n",
|
||||
"<td style=\"vertical-align:middle; text-align:left;\">\n",
|
||||
"<a href=\"http://mng.bz/orYv\"><img src=\"https://sebastianraschka.com/images/LLMs-from-scratch-images/cover-small.webp\" width=\"100px\"></a>\n",
|
||||
"</td>\n",
|
||||
"</tr>\n",
|
||||
"</table>"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "b1910a06-e8a3-40ac-8201-ff70615b1ba4",
|
||||
"metadata": {
|
||||
"tags": []
|
||||
},
|
||||
"source": [
|
||||
"# Generating A Preference Dataset With Llama 3.1 70B And Ollama"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "a128651b-f326-4232-a994-42f38b7ed520",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"- Preference finetuning is a process to align an instruction-finetuned LLM with human preferences\n",
|
||||
"- There are multiple ways to create a dataset for preference finetuning an LLM\n",
|
||||
" 1. We use the instruction-finetuned LLM to generate multiple responses and have humans rank them based on their preference and/or given preference criteria\n",
|
||||
" 2. We use the instruction-finetuned LLM to generate multiple responses and have LLMs rank them based on given preference criteria\n",
|
||||
" 3. We use an LLM to generate preferred and dispreferred responses given certain preference criteria\n",
|
||||
"- In this notebook, we consider approach 3\n",
|
||||
"- This notebook uses a 70-billion-parameter Llama 3.1-Instruct model through ollama to generate preference labels for an instruction dataset\n",
|
||||
"- The expected format of the instruction dataset is as follows:\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"### Input\n",
|
||||
"\n",
|
||||
"```json\n",
|
||||
"[\n",
|
||||
" {\n",
|
||||
" \"instruction\": \"What is the state capital of California?\",\n",
|
||||
" \"input\": \"\",\n",
|
||||
" \"output\": \"The state capital of California is Sacramento.\",\n",
|
||||
" },\n",
|
||||
" {\n",
|
||||
" \"instruction\": \"Provide a synonym for 'fast'.\",\n",
|
||||
" \"input\": \"\",\n",
|
||||
" \"output\": \"A synonym for 'fast' is 'quick'.\",\n",
|
||||
" },\n",
|
||||
" {\n",
|
||||
" \"instruction\": \"What is the capital of Greece?\",\n",
|
||||
" \"input\": \"\",\n",
|
||||
" \"output\": \"The capital of Greece is Athens.\",\n",
|
||||
"\n",
|
||||
" },\n",
|
||||
"...\n",
|
||||
"]\n",
|
||||
"```\n",
|
||||
"\n",
|
||||
"The output dataset will look as follows, where more polite responses are preferred (`'chosen'`), and more impolite responses are dispreferred (`'rejected'`):\n",
|
||||
"\n",
|
||||
"```json\n",
|
||||
"[\n",
|
||||
" {\n",
|
||||
" \"instruction\": \"What is the state capital of California?\",\n",
|
||||
" \"input\": \"\",\n",
|
||||
" \"output\": \"The state capital of California is Sacramento.\",\n",
|
||||
" \"rejected\": \"Look, the state capital of California is obviously Sacramento.\",\n",
|
||||
" \"chosen\": \"The state capital of California is Sacramento.\"\n",
|
||||
" },\n",
|
||||
" {\n",
|
||||
" \"instruction\": \"Provide a synonym for 'fast'.\",\n",
|
||||
" \"input\": \"\",\n",
|
||||
" \"output\": \"A synonym for 'fast' is 'quick'.\",\n",
|
||||
" \"chosen\": \"A suitable alternative to 'fast' would be 'quick'.\",\n",
|
||||
" \"rejected\": \"A synonym for 'fast' is 'quick'.\"\n",
|
||||
" },\n",
|
||||
" {\n",
|
||||
" \"instruction\": \"What is the capital of Greece?\",\n",
|
||||
" \"input\": \"\",\n",
|
||||
" \"output\": \"The capital of Greece is Athens.\",\n",
|
||||
" \"chosen\": \"I'd be happy to help! The capital of Greece is indeed Athens.\",\n",
|
||||
" \"rejected\": \"The capital of Greece is Athens.\"\n",
|
||||
" },\n",
|
||||
"...\n",
|
||||
"]\n",
|
||||
"```\n",
|
||||
"\n",
|
||||
"### Output\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"- The code doesn't require a GPU and runs on a laptop given enough RAM"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 1,
|
||||
"id": "63610acc-db94-437f-8d38-e99dca0299cb",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"tqdm version: 4.66.4\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"from importlib.metadata import version\n",
|
||||
"\n",
|
||||
"pkgs = [\"tqdm\", # Progress bar\n",
|
||||
" ]\n",
|
||||
"\n",
|
||||
"for p in pkgs:\n",
|
||||
" print(f\"{p} version: {version(p)}\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "8bcdcb34-ac75-4f4f-9505-3ce0666c42d5",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Installing Ollama and Downloading Llama 3.1"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "5a092280-5462-4709-a3fe-8669a4a8a0a6",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"- Ollama is an application to run LLMs efficiently\n",
|
||||
"- It is a wrapper around [llama.cpp](https://github.com/ggerganov/llama.cpp), which implements LLMs in pure C/C++ to maximize efficiency\n",
|
||||
"- Note that it is a tool for using LLMs to generate text (inference), not training or finetuning LLMs\n",
|
||||
"- Prior to running the code below, install ollama by visiting [https://ollama.com](https://ollama.com) and following the instructions (for instance, clicking on the \"Download\" button and downloading the ollama application for your operating system)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "9558a522-650d-401a-84fc-9fd7b1f39da7",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"- For macOS and Windows users, click on the ollama application you downloaded; if it prompts you to install the command line usage, say \"yes\"\n",
|
||||
"- Linux users can use the installation command provided on the ollama website\n",
|
||||
"\n",
|
||||
"- In general, before we can use ollama from the command line, we have to either start the ollama application or run `ollama serve` in a separate terminal\n",
|
||||
"\n",
|
||||
"<img src=\"https://sebastianraschka.com/images/LLMs-from-scratch-images/bonus/ollama-eval/ollama-serve.webp?1\">\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"- With the ollama application or `ollama serve` running, in a different terminal, on the command line, execute the following command to try out the 70-billion-parameter Llama 3.1 model \n",
|
||||
"\n",
|
||||
"```bash\n",
|
||||
"# 70B model\n",
|
||||
"ollama run llama3.1:70b\n",
|
||||
"```\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"The output looks like as follows:\n",
|
||||
"\n",
|
||||
"```\n",
|
||||
"$ ollama run llama3.1:70b\n",
|
||||
"pulling manifest\n",
|
||||
"pulling aa81b541aae6... 100% ▕████████████████▏ 39 GB\n",
|
||||
"pulling 8cf247399e57... 100% ▕████████████████▏ 1.7 KB\n",
|
||||
"pulling f1cd752815fc... 100% ▕████████████████▏ 12 KB\n",
|
||||
"pulling 56bb8bd477a5... 100% ▕████████████████▏ 96 B\n",
|
||||
"pulling 3c1c2d3df5b3... 100% ▕████████████████▏ 486 B\n",
|
||||
"verifying sha256 digest\n",
|
||||
"writing manifest\n",
|
||||
"removing any unused layers\n",
|
||||
"success\n",
|
||||
"```\n",
|
||||
"\n",
|
||||
"- Note that `llama3.1:70b` refers to the instruction finetuned 70-billion-parameter Llama 3.1 model\n",
|
||||
"\n",
|
||||
"- Alternatively, you can also use the smaller, more resource-effiicent 8-billion-parameters Llama 3.1 model, by replacing `llama3.1:70b` with `llama3.1`\n",
|
||||
"\n",
|
||||
"- After the download has been completed, you will see a command line prompt that allows you to chat with the model\n",
|
||||
"\n",
|
||||
"- Try a prompt like \"What do llamas eat?\", which should return an output similar to the following:\n",
|
||||
"\n",
|
||||
"```\n",
|
||||
">>> What do llamas eat?\n",
|
||||
"Llamas are ruminant animals, which means they have a four-chambered \n",
|
||||
"stomach and eat plants that are high in fiber. In the wild, llamas \n",
|
||||
"typically feed on:\n",
|
||||
"1. Grasses: They love to graze on various types of grasses, including tall \n",
|
||||
"grasses, wheat, oats, and barley.\n",
|
||||
"```"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "0b5addcb-fc7d-455d-bee9-6cc7a0d684c7",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"- You can end this session using the input `/bye`"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "dda155ee-cf36-44d3-b634-20ba8e1ca38a",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Using Ollama's REST API"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "89343a84-0ddc-42fc-bf50-298a342b93c0",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"- Now, an alternative way to interact with the model is via its REST API in Python via the following function\n",
|
||||
"- Before you run the next cells in this notebook, make sure that ollama is still running, as described above, via\n",
|
||||
" - `ollama serve` in a terminal\n",
|
||||
" - the ollama application\n",
|
||||
"- Next, run the following code cell to query the model"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "16642a48-1cab-40d2-af08-ab8c2fbf5876",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"- First, let's try the API with a simple example to make sure it works as intended:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 2,
|
||||
"id": "65b0ba76-1fb1-4306-a7c2-8f3bb637ccdb",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Llamas are herbivores, which means they primarily eat plants and plant-based foods. Their diet consists of:\n",
|
||||
"\n",
|
||||
"1. **Grasses**: Various types of grasses, including timothy grass, orchard grass, and brome grass.\n",
|
||||
"2. **Hay**: High-quality hay, such as alfalfa or clover hay, is a staple in a llama's diet.\n",
|
||||
"3. **Leaves**: Leaves from trees and shrubs, like willow, cottonwood, and mesquite, are also eaten.\n",
|
||||
"4. **Fruits and vegetables**: Llamas enjoy fruits like apples, carrots, and sweet potatoes, as well as leafy greens like kale and spinach.\n",
|
||||
"5. **Grains**: In moderation, llamas can eat grains like oats, barley, and corn.\n",
|
||||
"\n",
|
||||
"It's essential to note that llamas have a unique digestive system, with a three-part stomach and a large cecum (a specialized part of the large intestine). This allows them to break down and extract nutrients from plant material more efficiently than many other animals.\n",
|
||||
"\n",
|
||||
"A typical llama diet might consist of:\n",
|
||||
"\n",
|
||||
"* 1-2% of their body weight in hay per day\n",
|
||||
"* 0.5-1% of their body weight in grains per day (if fed)\n",
|
||||
"* Free-choice access to fresh water\n",
|
||||
"* Limited amounts of fruits and vegetables as treats\n",
|
||||
"\n",
|
||||
"It's also important to ensure that llamas have access to a mineral supplement, such as a salt lick or loose minerals, to help maintain optimal health.\n",
|
||||
"\n",
|
||||
"Remember, every llama is different, and their dietary needs may vary depending on factors like age, size, and activity level. Consult with a veterinarian or experienced llama breeder for specific guidance on feeding your llama.\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"import urllib.request\n",
|
||||
"import json\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"def query_model(prompt, model=\"llama3.1:70b\", url=\"http://localhost:11434/api/chat\"):\n",
|
||||
" # Create the data payload as a dictionary\n",
|
||||
" data = {\n",
|
||||
" \"model\": model,\n",
|
||||
" \"messages\": [\n",
|
||||
" {\n",
|
||||
" \"role\": \"user\",\n",
|
||||
" \"content\": prompt\n",
|
||||
" }\n",
|
||||
" ],\n",
|
||||
" \"options\": {\n",
|
||||
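" # Settings for (mostly) deterministic responses\n",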
" \"seed\": 123,\n",
|
||||
" \"temperature\": 0,\n",
|
||||
" }\n",
|
||||
" }\n",
|
||||
"\n",
|
||||
" # Convert the dictionary to a JSON formatted string and encode it to bytes\n",
|
||||
" payload = json.dumps(data).encode(\"utf-8\")\n",
|
||||
"\n",
|
||||
" # Create a request object, setting the method to POST and adding necessary headers\n",
|
||||
" request = urllib.request.Request(url, data=payload, method=\"POST\")\n",
|
||||
" request.add_header(\"Content-Type\", \"application/json\")\n",
|
||||
"\n",
|
||||
" # Send the request and capture the response\n",
|
||||
" response_data = \"\"\n",
|
||||
" with urllib.request.urlopen(request) as response:\n",
|
||||
" # Read and decode the response\n",
|
||||
" while True:\n",
|
||||
" line = response.readline().decode(\"utf-8\")\n",
|
||||
" if not line:\n",
|
||||
" break\n",
|
||||
" response_json = json.loads(line)\n",
|
||||
" response_data += response_json[\"message\"][\"content\"]\n",
|
||||
"\n",
|
||||
" return response_data\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"result = query_model(\"What do Llamas eat?\")\n",
|
||||
"print(result)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "162a4739-6f03-4092-a5c2-f57a0b6a4c4d",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Load JSON Entries"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "ca011a8b-20c5-4101-979e-9b5fccf62f8a",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"- Now, let's get to the data generation part\n",
|
||||
"- Here, for a hands-on example, we use the `instruction-data.json` file that we originally used to instruction-finetune the model in chapter 7:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 3,
|
||||
"id": "8b2d393a-aa92-4190-9d44-44326a6f699b",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Number of entries: 1100\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"from pathlib import Path\n",
|
||||
"\n",
|
||||
"json_file = Path(\"..\", \"01_main-chapter-code\", \"instruction-data.json\")\n",
|
||||
"\n",
|
||||
"with open(json_file, \"r\") as file:\n",
|
||||
" json_data = json.load(file)\n",
|
||||
"\n",
|
||||
"print(\"Number of entries:\", len(json_data))"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "b6c9751b-59b7-43fe-acc7-14e8daf2fa66",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"- The structure of this file is as follows, where we have the given response in the test dataset (`'output'`) that we trained the model to generate via instruction finetuning based on the `'input'` and `'instruction'`"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 4,
|
||||
"id": "7222fdc0-5684-4f2b-b741-3e341851359e",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"{'instruction': 'Evaluate the following phrase by transforming it into the spelling given.',\n",
|
||||
" 'input': 'freind --> friend',\n",
|
||||
" 'output': 'The spelling of the given phrase \"freind\" is incorrect, the correct spelling is \"friend\".'}"
|
||||
]
|
||||
},
|
||||
"execution_count": 4,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"json_data[0]"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "fcf0331b-6024-4bba-89a9-a088b14a1046",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"- Below is a small utility function that formats the instruction and input:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 5,
|
||||
"id": "43263cd3-e5fb-4ab5-871e-3ad6e7d21a8c",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"def format_input(entry):\n",
|
||||
" instruction_text = (\n",
|
||||
" f\"Below is an instruction that describes a task. Write a response that \"\n",
|
||||
" f\"appropriately completes the request.\"\n",
|
||||
" f\"\\n\\n### Instruction:\\n{entry['instruction']}\"\n",
|
||||
" )\n",
|
||||
"\n",
|
||||
" input_text = f\"\\n\\n### Input:\\n{entry['input']}\" if entry[\"input\"] else \"\"\n",
|
||||
" instruction_text + input_text\n",
|
||||
"\n",
|
||||
" return instruction_text + input_text"
|
||||
]
|
||||
},
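For reference, applying `format_input` to the first entry shown above yields an Alpaca-style prompt along these lines:

```
Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Evaluate the following phrase by transforming it into the spelling given.

### Input:
freind --> friend
```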
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "39a55283-7d51-4136-ba60-f799d49f4098",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"- Now, let's try the ollama API to generate a `'chosen'` and `'rejected'` response for preference tuning a model\n",
|
||||
"- Here, to for illustration purposes, we create answers that are more or less polite\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 6,
|
||||
"id": "735cc089-d127-480a-b39d-0782581f0c41",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"\n",
|
||||
"Dataset response:\n",
|
||||
">> The spelling of the given phrase \"freind\" is incorrect, the correct spelling is \"friend\".\n",
|
||||
"\n",
|
||||
"impolite response:\n",
|
||||
">> The spelling of the given phrase \"freind\" is flat out wrong, get it together, the correct spelling is \"friend\".\n",
|
||||
"\n",
|
||||
"Dataset response:\n",
|
||||
">> He goes to the park every day.\n",
|
||||
"\n",
|
||||
"polite response:\n",
|
||||
">> He goes to the park daily, if I'm not mistaken.\n",
|
||||
"\n",
|
||||
"Dataset response:\n",
|
||||
">> 45 kilometers is 45000 meters.\n",
|
||||
"\n",
|
||||
"polite response:\n",
|
||||
">> 45 kilometers is equivalent to 45000 meters.\n",
|
||||
"\n",
|
||||
"Dataset response:\n",
|
||||
">> Although it was raining, they went for a walk.\n",
|
||||
"\n",
|
||||
"polite response:\n",
|
||||
">> Although it was raining outside, they still decided to go for a walk.\n",
|
||||
"\n",
|
||||
"Dataset response:\n",
|
||||
">> 1, 4, 9, 16, 25, 36, 49, 64, 81, 100.\n",
|
||||
"\n",
|
||||
"impolite response:\n",
|
||||
">> Here are your precious square numbers: 1, 4, 9, 16, 25, 36, 49, 64, 81, 100.\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"import random\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"for entry in json_data[:5]:\n",
|
||||
" \n",
|
||||
" politeness = random.choice([\"polite\", \"impolite\"]) \n",
|
||||
" prompt = (\n",
|
||||
" f\"Given the input `{format_input(entry)}` \"\n",
|
||||
" f\"and correct output `{entry['output']}`, \"\n",
|
||||
" f\"slightly rewrite the output to be more {politeness}.\"\n",
|
||||
" \"Keep the modification minimal.\"\n",
|
||||
" \"Only return return the generated response and nothing else.\"\n",
|
||||
" )\n",
|
||||
" print(\"\\nDataset response:\")\n",
|
||||
" print(\">>\", entry['output'])\n",
|
||||
" print(f\"\\n{politeness} response:\")\n",
|
||||
" print(\">>\", query_model(prompt)) "
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "142dfaa7-429f-4eb0-b74d-ff327f79547a",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"- If we find that the generated responses above look reasonable, we can go to the next step and apply the prompt to the whole dataset\n",
|
||||
"- Here, we add a `'chosen'` key for the preferred response and a `'rejected'` response for the dispreferred response"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 7,
|
||||
"id": "3349dbbc-963f-4af3-9790-12dbfdca63c3",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import random\n",
|
||||
"from tqdm import tqdm\n",
|
||||
"\n",
|
||||
"def generate_model_responses(json_data):\n",
|
||||
"\n",
|
||||
" for i, entry in enumerate(tqdm(json_data, desc=\"Writing entries\")):\n",
|
||||
" politeness = random.choice([\"polite\", \"impolite\"]) \n",
|
||||
" prompt = (\n",
|
||||
" f\"Given the input `{format_input(entry)}` \"\n",
|
||||
" f\"and correct output `{entry['output']}`, \"\n",
|
||||
" f\"slightly rewrite the output to be more {politeness}.\"\n",
|
||||
" \"Keep the modification minimal.\"\n",
|
||||
" \"Only return return the generated response and nothing else.\"\n",
|
||||
" )\n",
|
||||
" response = query_model(prompt)\n",
|
||||
" \n",
|
||||
" if politeness == \"polite\":\n",
|
||||
" json_data[i][\"chosen\"] = response\n",
|
||||
" json_data[i][\"rejected\"] = entry[\"output\"]\n",
|
||||
" else:\n",
|
||||
" json_data[i][\"rejected\"] = response\n",
|
||||
" json_data[i][\"chosen\"] = entry[\"output\"] "
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "b071ce84-1866-427f-a272-b46700f364b2",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"- Let's now apply this evaluation to the whole dataset and compute the average score of each model (this takes about 1 minute per model on an M3 MacBook Air laptop)\n",
|
||||
"- Note that ollama is not fully deterministic across operating systems (as of this writing) so the numbers you are getting might slightly differ from the ones shown below"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 8,
|
||||
"id": "4f700d4b-19e5-4404-afa7-b0f093024232",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stderr",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Writing entries: 100%|██████████| 1100/1100 [17:20<00:00, 1.06it/s]\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"generate_model_responses(json_data)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 10,
|
||||
"id": "838d9747-0f7d-46fe-aab5-9ee6b765d021",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"with open(\"instruction-data-with-preference.json\", \"w\") as file:\n",
|
||||
" json.dump(json_data, file, indent=4)"
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3 (ipykernel)",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.10.6"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 5
|
||||
}
|
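After running the cells above, each record in `instruction-data-with-preference.json` carries two additional keys; based on the first entry and the impolite rewrite shown in the notebook output, an entry looks roughly like this:

```
{
    "instruction": "Evaluate the following phrase by transforming it into the spelling given.",
    "input": "freind --> friend",
    "output": "The spelling of the given phrase \"freind\" is incorrect, the correct spelling is \"friend\".",
    "rejected": "The spelling of the given phrase \"freind\" is flat out wrong, get it together, the correct spelling is \"friend\".",
    "chosen": "The spelling of the given phrase \"freind\" is incorrect, the correct spelling is \"friend\"."
}
```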
||||
3096
Codes/ch07/04_preference-tuning-with-dpo/dpo-from-scratch.ipynb
Normal file
File diff suppressed because it is too large
470
Codes/ch07/04_preference-tuning-with-dpo/previous_chapters.py
Normal file
@ -0,0 +1,470 @@
|
||||
# Copyright (c) Sebastian Raschka under Apache License 2.0 (see LICENSE.txt).
|
||||
# Source for "Build a Large Language Model From Scratch"
|
||||
# - https://www.manning.com/books/build-a-large-language-model-from-scratch
|
||||
# Code: https://github.com/rasbt/LLMs-from-scratch
|
||||
#
|
||||
# This file collects all the relevant code that we covered thus far
|
||||
# throughout Chapters 2-6.
|
||||
# This file can be run as a standalone script.
|
||||
|
||||
|
||||
import matplotlib.pyplot as plt
|
||||
from matplotlib.ticker import MaxNLocator
|
||||
import numpy as np
|
||||
import tiktoken
|
||||
import torch
|
||||
import torch.nn as nn
|
||||
from torch.utils.data import Dataset, DataLoader
|
||||
|
||||
|
||||
#####################################
|
||||
# Chapter 2
|
||||
#####################################
|
||||
|
||||
|
||||
class GPTDatasetV1(Dataset):
|
||||
def __init__(self, txt, tokenizer, max_length, stride):
|
||||
self.tokenizer = tokenizer
|
||||
self.input_ids = []
|
||||
self.target_ids = []
|
||||
|
||||
# Tokenize the entire text
|
||||
token_ids = tokenizer.encode(txt, allowed_special={"<|endoftext|>"})
|
||||
|
||||
# Use a sliding window to chunk the book into overlapping sequences of max_length
|
||||
for i in range(0, len(token_ids) - max_length, stride):
|
||||
input_chunk = token_ids[i:i + max_length]
|
||||
target_chunk = token_ids[i + 1: i + max_length + 1]
|
||||
self.input_ids.append(torch.tensor(input_chunk))
|
||||
self.target_ids.append(torch.tensor(target_chunk))
|
||||
|
||||
def __len__(self):
|
||||
return len(self.input_ids)
|
||||
|
||||
def __getitem__(self, idx):
|
||||
return self.input_ids[idx], self.target_ids[idx]
|
||||
|
||||
|
||||
def create_dataloader_v1(txt, batch_size=4, max_length=256,
|
||||
stride=128, shuffle=True, drop_last=True, num_workers=0):
|
||||
# Initialize the tokenizer
|
||||
tokenizer = tiktoken.get_encoding("gpt2")
|
||||
|
||||
# Create dataset
|
||||
dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)
|
||||
|
||||
# Create dataloader
|
||||
dataloader = DataLoader(
|
||||
dataset, batch_size=batch_size, shuffle=shuffle, drop_last=drop_last, num_workers=num_workers)
|
||||
|
||||
return dataloader
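

# Optional usage sketch: how the dataloader above is typically constructed;
# "the-verdict.txt" is a placeholder for any plain-text training file.
def _demo_dataloader(txt_path="the-verdict.txt"):
    with open(txt_path, "r", encoding="utf-8") as f:
        raw_text = f.read()
    loader = create_dataloader_v1(raw_text, batch_size=8, max_length=4, stride=4)
    inputs, targets = next(iter(loader))
    # The targets are the inputs shifted by one token (next-token prediction)
    print(inputs.shape, targets.shape)  # e.g. torch.Size([8, 4]) torch.Size([8, 4])
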
|
||||
|
||||
|
||||
#####################################
|
||||
# Chapter 3
|
||||
#####################################
|
||||
class MultiHeadAttention(nn.Module):
|
||||
def __init__(self, d_in, d_out, context_length, dropout, num_heads, qkv_bias=False):
|
||||
super().__init__()
|
||||
assert d_out % num_heads == 0, "d_out must be divisible by n_heads"
|
||||
|
||||
self.d_out = d_out
|
||||
self.num_heads = num_heads
|
||||
self.head_dim = d_out // num_heads # Reduce the projection dim to match desired output dim
|
||||
|
||||
self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
|
||||
self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
|
||||
self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
|
||||
self.out_proj = nn.Linear(d_out, d_out) # Linear layer to combine head outputs
|
||||
self.dropout = nn.Dropout(dropout)
|
||||
self.register_buffer('mask', torch.triu(torch.ones(context_length, context_length), diagonal=1))
|
||||
|
||||
def forward(self, x):
|
||||
b, num_tokens, d_in = x.shape
|
||||
|
||||
keys = self.W_key(x) # Shape: (b, num_tokens, d_out)
|
||||
queries = self.W_query(x)
|
||||
values = self.W_value(x)
|
||||
|
||||
# We implicitly split the matrix by adding a `num_heads` dimension
|
||||
# Unroll last dim: (b, num_tokens, d_out) -> (b, num_tokens, num_heads, head_dim)
|
||||
keys = keys.view(b, num_tokens, self.num_heads, self.head_dim)
|
||||
values = values.view(b, num_tokens, self.num_heads, self.head_dim)
|
||||
queries = queries.view(b, num_tokens, self.num_heads, self.head_dim)
|
||||
|
||||
# Transpose: (b, num_tokens, num_heads, head_dim) -> (b, num_heads, num_tokens, head_dim)
|
||||
keys = keys.transpose(1, 2)
|
||||
queries = queries.transpose(1, 2)
|
||||
values = values.transpose(1, 2)
|
||||
|
||||
# Compute scaled dot-product attention (aka self-attention) with a causal mask
|
||||
attn_scores = queries @ keys.transpose(2, 3) # Dot product for each head
|
||||
|
||||
# Original mask truncated to the number of tokens and converted to boolean
|
||||
mask_bool = self.mask.bool()[:num_tokens, :num_tokens]
|
||||
|
||||
# Use the mask to fill attention scores
|
||||
attn_scores.masked_fill_(mask_bool, -torch.inf)
|
||||
|
||||
attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)
|
||||
attn_weights = self.dropout(attn_weights)
|
||||
|
||||
# Shape: (b, num_tokens, num_heads, head_dim)
|
||||
context_vec = (attn_weights @ values).transpose(1, 2)
|
||||
|
||||
# Combine heads, where self.d_out = self.num_heads * self.head_dim
|
||||
context_vec = context_vec.reshape(b, num_tokens, self.d_out)
|
||||
context_vec = self.out_proj(context_vec) # optional projection
|
||||
|
||||
return context_vec
|
||||
|
||||
|
||||
#####################################
|
||||
# Chapter 4
|
||||
#####################################
|
||||
class LayerNorm(nn.Module):
|
||||
def __init__(self, emb_dim):
|
||||
super().__init__()
|
||||
self.eps = 1e-5
|
||||
self.scale = nn.Parameter(torch.ones(emb_dim))
|
||||
self.shift = nn.Parameter(torch.zeros(emb_dim))
|
||||
|
||||
def forward(self, x):
|
||||
mean = x.mean(dim=-1, keepdim=True)
|
||||
var = x.var(dim=-1, keepdim=True, unbiased=False)
|
||||
norm_x = (x - mean) / torch.sqrt(var + self.eps)
|
||||
return self.scale * norm_x + self.shift
|
||||
|
||||
|
||||
class GELU(nn.Module):
|
||||
def __init__(self):
|
||||
super().__init__()
|
||||
|
||||
def forward(self, x):
|
||||
return 0.5 * x * (1 + torch.tanh(
|
||||
torch.sqrt(torch.tensor(2.0 / torch.pi)) *
|
||||
(x + 0.044715 * torch.pow(x, 3))
|
||||
))
|
||||
|
||||
|
||||
class FeedForward(nn.Module):
|
||||
def __init__(self, cfg):
|
||||
super().__init__()
|
||||
self.layers = nn.Sequential(
|
||||
nn.Linear(cfg["emb_dim"], 4 * cfg["emb_dim"]),
|
||||
GELU(),
|
||||
nn.Linear(4 * cfg["emb_dim"], cfg["emb_dim"]),
|
||||
)
|
||||
|
||||
def forward(self, x):
|
||||
return self.layers(x)
|
||||
|
||||
|
||||
class TransformerBlock(nn.Module):
|
||||
def __init__(self, cfg):
|
||||
super().__init__()
|
||||
self.att = MultiHeadAttention(
|
||||
d_in=cfg["emb_dim"],
|
||||
d_out=cfg["emb_dim"],
|
||||
context_length=cfg["context_length"],
|
||||
num_heads=cfg["n_heads"],
|
||||
dropout=cfg["drop_rate"],
|
||||
qkv_bias=cfg["qkv_bias"])
|
||||
self.ff = FeedForward(cfg)
|
||||
self.norm1 = LayerNorm(cfg["emb_dim"])
|
||||
self.norm2 = LayerNorm(cfg["emb_dim"])
|
||||
self.drop_resid = nn.Dropout(cfg["drop_rate"])
|
||||
|
||||
def forward(self, x):
|
||||
# Shortcut connection for attention block
|
||||
shortcut = x
|
||||
x = self.norm1(x)
|
||||
x = self.att(x) # Shape [batch_size, num_tokens, emb_size]
|
||||
x = self.drop_resid(x)
|
||||
x = x + shortcut # Add the original input back
|
||||
|
||||
# Shortcut connection for feed-forward block
|
||||
shortcut = x
|
||||
x = self.norm2(x)
|
||||
x = self.ff(x)
|
||||
x = self.drop_resid(x)
|
||||
x = x + shortcut # Add the original input back
|
||||
|
||||
return x
|
||||
|
||||
|
||||
class GPTModel(nn.Module):
|
||||
def __init__(self, cfg):
|
||||
super().__init__()
|
||||
self.tok_emb = nn.Embedding(cfg["vocab_size"], cfg["emb_dim"])
|
||||
self.pos_emb = nn.Embedding(cfg["context_length"], cfg["emb_dim"])
|
||||
self.drop_emb = nn.Dropout(cfg["drop_rate"])
|
||||
|
||||
self.trf_blocks = nn.Sequential(
|
||||
*[TransformerBlock(cfg) for _ in range(cfg["n_layers"])])
|
||||
|
||||
self.final_norm = LayerNorm(cfg["emb_dim"])
|
||||
self.out_head = nn.Linear(cfg["emb_dim"], cfg["vocab_size"], bias=False)
|
||||
|
||||
def forward(self, in_idx):
|
||||
batch_size, seq_len = in_idx.shape
|
||||
tok_embeds = self.tok_emb(in_idx)
|
||||
pos_embeds = self.pos_emb(torch.arange(seq_len, device=in_idx.device))
|
||||
x = tok_embeds + pos_embeds # Shape [batch_size, num_tokens, emb_size]
|
||||
x = self.drop_emb(x)
|
||||
x = self.trf_blocks(x)
|
||||
x = self.final_norm(x)
|
||||
logits = self.out_head(x)
|
||||
return logits
|
||||
|
||||
|
||||
def generate_text_simple(model, idx, max_new_tokens, context_size):
|
||||
# idx is (B, T) array of indices in the current context
|
||||
for _ in range(max_new_tokens):
|
||||
|
||||
# Crop current context if it exceeds the supported context size
|
||||
# E.g., if LLM supports only 5 tokens, and the context size is 10
|
||||
# then only the last 5 tokens are used as context
|
||||
idx_cond = idx[:, -context_size:]
|
||||
|
||||
# Get the predictions
|
||||
with torch.no_grad():
|
||||
logits = model(idx_cond)
|
||||
|
||||
# Focus only on the last time step
|
||||
# (batch, n_token, vocab_size) becomes (batch, vocab_size)
|
||||
logits = logits[:, -1, :]
|
||||
|
||||
# Get the idx of the vocab entry with the highest logits value
|
||||
idx_next = torch.argmax(logits, dim=-1, keepdim=True) # (batch, 1)
|
||||
|
||||
# Append sampled index to the running sequence
|
||||
idx = torch.cat((idx, idx_next), dim=1) # (batch, n_tokens+1)
|
||||
|
||||
return idx
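

# Optional usage sketch: how GPTModel and generate_text_simple fit together.
# The config below corresponds to the 124M-parameter GPT-2 configuration used in the book.
def _demo_generate_text_simple():
    cfg = {
        "vocab_size": 50257, "context_length": 1024, "emb_dim": 768,
        "n_heads": 12, "n_layers": 12, "drop_rate": 0.1, "qkv_bias": False,
    }
    torch.manual_seed(123)
    model = GPTModel(cfg)
    model.eval()  # disable dropout during generation
    tokenizer = tiktoken.get_encoding("gpt2")
    idx = torch.tensor(tokenizer.encode("Every effort moves you")).unsqueeze(0)
    out = generate_text_simple(model, idx, max_new_tokens=10, context_size=cfg["context_length"])
    print(tokenizer.decode(out.squeeze(0).tolist()))
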
|
||||
|
||||
|
||||
#####################################
|
||||
# Chapter 5
|
||||
#####################################
|
||||
def generate(model, idx, max_new_tokens, context_size, temperature=0.0, top_k=None, eos_id=None):
|
||||
|
||||
# For-loop is the same as before: Get logits, and only focus on last time step
|
||||
for _ in range(max_new_tokens):
|
||||
idx_cond = idx[:, -context_size:]
|
||||
with torch.no_grad():
|
||||
logits = model(idx_cond)
|
||||
logits = logits[:, -1, :]
|
||||
|
||||
# New: Filter logits with top_k sampling
|
||||
if top_k is not None:
|
||||
# Keep only top_k values
|
||||
top_logits, _ = torch.topk(logits, top_k)
|
||||
min_val = top_logits[:, -1]
|
||||
logits = torch.where(logits < min_val, torch.tensor(float('-inf')).to(logits.device), logits)
|
||||
|
||||
# New: Apply temperature scaling
|
||||
if temperature > 0.0:
|
||||
logits = logits / temperature
|
||||
|
||||
# Apply softmax to get probabilities
|
||||
            probs = torch.softmax(logits, dim=-1)  # (batch_size, vocab_size)
|
||||
|
||||
# Sample from the distribution
|
||||
idx_next = torch.multinomial(probs, num_samples=1) # (batch_size, 1)
|
||||
|
||||
# Otherwise same as before: get idx of the vocab entry with the highest logits value
|
||||
else:
|
||||
idx_next = torch.argmax(logits, dim=-1, keepdim=True) # (batch_size, 1)
|
||||
|
||||
if idx_next == eos_id: # Stop generating early if end-of-sequence token is encountered and eos_id is specified
|
||||
break
|
||||
|
||||
# Same as before: append sampled index to the running sequence
|
||||
idx = torch.cat((idx, idx_next), dim=1) # (batch_size, num_tokens+1)
|
||||
|
||||
return idx
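

# Optional usage sketch for the sampling-based decoder above; the top_k and
# temperature values are arbitrary examples, not recommendations.
def _demo_generate(model, tokenizer):
    token_ids = generate(
        model=model,
        idx=text_to_token_ids("Every effort moves you", tokenizer),
        max_new_tokens=25,
        context_size=1024,  # should match the model's context_length
        top_k=50,
        temperature=1.5,
    )
    return token_ids_to_text(token_ids, tokenizer)
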
|
||||
|
||||
|
||||
def train_model_simple(model, train_loader, val_loader, optimizer, device, num_epochs,
|
||||
eval_freq, eval_iter, start_context, tokenizer):
|
||||
# Initialize lists to track losses and tokens seen
|
||||
train_losses, val_losses, track_tokens_seen = [], [], []
|
||||
tokens_seen, global_step = 0, -1
|
||||
|
||||
# Main training loop
|
||||
for epoch in range(num_epochs):
|
||||
model.train() # Set model to training mode
|
||||
|
||||
for input_batch, target_batch in train_loader:
|
||||
optimizer.zero_grad() # Reset loss gradients from previous batch iteration
|
||||
loss = calc_loss_batch(input_batch, target_batch, model, device)
|
||||
loss.backward() # Calculate loss gradients
|
||||
optimizer.step() # Update model weights using loss gradients
|
||||
tokens_seen += input_batch.numel()
|
||||
global_step += 1
|
||||
|
||||
# Optional evaluation step
|
||||
if global_step % eval_freq == 0:
|
||||
train_loss, val_loss = evaluate_model(
|
||||
model, train_loader, val_loader, device, eval_iter)
|
||||
train_losses.append(train_loss)
|
||||
val_losses.append(val_loss)
|
||||
track_tokens_seen.append(tokens_seen)
|
||||
print(f"Ep {epoch+1} (Step {global_step:06d}): "
|
||||
f"Train loss {train_loss:.3f}, Val loss {val_loss:.3f}")
|
||||
|
||||
# Print a sample text after each epoch
|
||||
generate_and_print_sample(
|
||||
model, tokenizer, device, start_context
|
||||
)
|
||||
|
||||
return train_losses, val_losses, track_tokens_seen
|
||||
|
||||
|
||||
def evaluate_model(model, train_loader, val_loader, device, eval_iter):
|
||||
model.eval()
|
||||
with torch.no_grad():
|
||||
train_loss = calc_loss_loader(train_loader, model, device, num_batches=eval_iter)
|
||||
val_loss = calc_loss_loader(val_loader, model, device, num_batches=eval_iter)
|
||||
model.train()
|
||||
return train_loss, val_loss
|
||||
|
||||
|
||||
def generate_and_print_sample(model, tokenizer, device, start_context):
|
||||
model.eval()
|
||||
context_size = model.pos_emb.weight.shape[0]
|
||||
encoded = text_to_token_ids(start_context, tokenizer).to(device)
|
||||
with torch.no_grad():
|
||||
token_ids = generate_text_simple(
|
||||
model=model, idx=encoded,
|
||||
max_new_tokens=50, context_size=context_size
|
||||
)
|
||||
decoded_text = token_ids_to_text(token_ids, tokenizer)
|
||||
print(decoded_text.replace("\n", " ")) # Compact print format
|
||||
model.train()
|
||||
|
||||
|
||||
def assign(left, right):
|
||||
if left.shape != right.shape:
|
||||
raise ValueError(f"Shape mismatch. Left: {left.shape}, Right: {right.shape}")
|
||||
return torch.nn.Parameter(torch.tensor(right))
|
||||
|
||||
|
||||
def load_weights_into_gpt(gpt, params):
|
||||
gpt.pos_emb.weight = assign(gpt.pos_emb.weight, params['wpe'])
|
||||
gpt.tok_emb.weight = assign(gpt.tok_emb.weight, params['wte'])
|
||||
|
||||
for b in range(len(params["blocks"])):
|
||||
q_w, k_w, v_w = np.split(
|
||||
(params["blocks"][b]["attn"]["c_attn"])["w"], 3, axis=-1)
|
||||
gpt.trf_blocks[b].att.W_query.weight = assign(
|
||||
gpt.trf_blocks[b].att.W_query.weight, q_w.T)
|
||||
gpt.trf_blocks[b].att.W_key.weight = assign(
|
||||
gpt.trf_blocks[b].att.W_key.weight, k_w.T)
|
||||
gpt.trf_blocks[b].att.W_value.weight = assign(
|
||||
gpt.trf_blocks[b].att.W_value.weight, v_w.T)
|
||||
|
||||
q_b, k_b, v_b = np.split(
|
||||
(params["blocks"][b]["attn"]["c_attn"])["b"], 3, axis=-1)
|
||||
gpt.trf_blocks[b].att.W_query.bias = assign(
|
||||
gpt.trf_blocks[b].att.W_query.bias, q_b)
|
||||
gpt.trf_blocks[b].att.W_key.bias = assign(
|
||||
gpt.trf_blocks[b].att.W_key.bias, k_b)
|
||||
gpt.trf_blocks[b].att.W_value.bias = assign(
|
||||
gpt.trf_blocks[b].att.W_value.bias, v_b)
|
||||
|
||||
gpt.trf_blocks[b].att.out_proj.weight = assign(
|
||||
gpt.trf_blocks[b].att.out_proj.weight,
|
||||
params["blocks"][b]["attn"]["c_proj"]["w"].T)
|
||||
gpt.trf_blocks[b].att.out_proj.bias = assign(
|
||||
gpt.trf_blocks[b].att.out_proj.bias,
|
||||
params["blocks"][b]["attn"]["c_proj"]["b"])
|
||||
|
||||
gpt.trf_blocks[b].ff.layers[0].weight = assign(
|
||||
gpt.trf_blocks[b].ff.layers[0].weight,
|
||||
params["blocks"][b]["mlp"]["c_fc"]["w"].T)
|
||||
gpt.trf_blocks[b].ff.layers[0].bias = assign(
|
||||
gpt.trf_blocks[b].ff.layers[0].bias,
|
||||
params["blocks"][b]["mlp"]["c_fc"]["b"])
|
||||
gpt.trf_blocks[b].ff.layers[2].weight = assign(
|
||||
gpt.trf_blocks[b].ff.layers[2].weight,
|
||||
params["blocks"][b]["mlp"]["c_proj"]["w"].T)
|
||||
gpt.trf_blocks[b].ff.layers[2].bias = assign(
|
||||
gpt.trf_blocks[b].ff.layers[2].bias,
|
||||
params["blocks"][b]["mlp"]["c_proj"]["b"])
|
||||
|
||||
gpt.trf_blocks[b].norm1.scale = assign(
|
||||
gpt.trf_blocks[b].norm1.scale,
|
||||
params["blocks"][b]["ln_1"]["g"])
|
||||
gpt.trf_blocks[b].norm1.shift = assign(
|
||||
gpt.trf_blocks[b].norm1.shift,
|
||||
params["blocks"][b]["ln_1"]["b"])
|
||||
gpt.trf_blocks[b].norm2.scale = assign(
|
||||
gpt.trf_blocks[b].norm2.scale,
|
||||
params["blocks"][b]["ln_2"]["g"])
|
||||
gpt.trf_blocks[b].norm2.shift = assign(
|
||||
gpt.trf_blocks[b].norm2.shift,
|
||||
params["blocks"][b]["ln_2"]["b"])
|
||||
|
||||
gpt.final_norm.scale = assign(gpt.final_norm.scale, params["g"])
|
||||
gpt.final_norm.shift = assign(gpt.final_norm.shift, params["b"])
|
||||
gpt.out_head.weight = assign(gpt.out_head.weight, params["wte"])
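

# Optional usage sketch: loading pretrained OpenAI GPT-2 weights into GPTModel.
# This assumes the download_and_load_gpt2 helper from this chapter's gpt_download.py;
# the "355M" size matches the model used in the chapter's finetuning script.
def _demo_load_pretrained_weights():
    from gpt_download import download_and_load_gpt2  # assumed helper module
    settings, params = download_and_load_gpt2(model_size="355M", models_dir="gpt2")
    cfg = {
        "vocab_size": 50257, "context_length": 1024, "emb_dim": 1024,
        "n_heads": 16, "n_layers": 24, "drop_rate": 0.0, "qkv_bias": True,
    }
    model = GPTModel(cfg)
    load_weights_into_gpt(model, params)
    model.eval()
    return model
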
|
||||
|
||||
|
||||
def text_to_token_ids(text, tokenizer):
|
||||
encoded = tokenizer.encode(text, allowed_special={"<|endoftext|>"})
|
||||
encoded_tensor = torch.tensor(encoded).unsqueeze(0) # add batch dimension
|
||||
return encoded_tensor
|
||||
|
||||
|
||||
def token_ids_to_text(token_ids, tokenizer):
|
||||
flat = token_ids.squeeze(0) # remove batch dimension
|
||||
return tokenizer.decode(flat.tolist())
|
||||
|
||||
|
||||
def calc_loss_batch(input_batch, target_batch, model, device):
|
||||
input_batch, target_batch = input_batch.to(device), target_batch.to(device)
|
||||
logits = model(input_batch)
|
||||
loss = torch.nn.functional.cross_entropy(logits.flatten(0, 1), target_batch.flatten())
|
||||
return loss
|
||||
|
||||
|
||||
def calc_loss_loader(data_loader, model, device, num_batches=None):
|
||||
total_loss = 0.
|
||||
if len(data_loader) == 0:
|
||||
return float("nan")
|
||||
elif num_batches is None:
|
||||
num_batches = len(data_loader)
|
||||
else:
|
||||
# Reduce the number of batches to match the total number of batches in the data loader
|
||||
# if num_batches exceeds the number of batches in the data loader
|
||||
num_batches = min(num_batches, len(data_loader))
|
||||
for i, (input_batch, target_batch) in enumerate(data_loader):
|
||||
if i < num_batches:
|
||||
loss = calc_loss_batch(input_batch, target_batch, model, device)
|
||||
total_loss += loss.item()
|
||||
else:
|
||||
break
|
||||
return total_loss / num_batches
|
||||
|
||||
|
||||
def plot_losses(epochs_seen, tokens_seen, train_losses, val_losses, label="loss"):
|
||||
fig, ax1 = plt.subplots(figsize=(5, 3))
|
||||
|
||||
# Plot training and validation loss against epochs
|
||||
ax1.plot(epochs_seen, train_losses, label=f"Training {label}")
|
||||
ax1.plot(epochs_seen, val_losses, linestyle="-.", label=f"Validation {label}")
|
||||
ax1.set_xlabel("Epochs")
|
||||
ax1.set_ylabel(label.capitalize())
|
||||
ax1.legend()
|
||||
ax1.xaxis.set_major_locator(MaxNLocator(integer=True)) # only show integer labels on x-axis
|
||||
|
||||
# Create a second x-axis for tokens seen
|
||||
ax2 = ax1.twiny() # Create a second x-axis that shares the same y-axis
|
||||
ax2.plot(tokens_seen, train_losses, alpha=0) # Invisible plot for aligning ticks
|
||||
ax2.set_xlabel("Tokens seen")
|
||||
|
||||
fig.tight_layout() # Adjust layout to make room
|
||||
plt.savefig(f"{label}-plot.pdf")
|
||||
plt.show()
|
||||
6
Codes/ch07/05_dataset-generation/README.md
Normal file
@ -0,0 +1,6 @@
|
||||
# Generating a Dataset for Instruction Finetuning
|
||||
|
||||
This folder contains utility code that can be used for generating a dataset for instruction finetuning.
|
||||
|
||||
- [llama3-ollama.ipynb](llama3-ollama.ipynb): A notebook that creates a synthetic instruction finetuning dataset using Llama 3 and Ollama
|
||||
|
||||
@ -0,0 +1,22 @@
|
||||
[
|
||||
{
|
||||
"instruction": "What is the significance of the number 7 in various cultures and religions?",
|
||||
"output": "The number 7 has been a significant and recurring theme across many cultures and religions, often imbuing it with special meaning and symbolism. Here are some examples:\n\n1. **Numerology**: In numerology, the number 7 is considered sacred and mystical, associated with spiritual awakening, introspection, and enlightenment.\n2. **Judaism**: The Torah has seven days of creation, seven weeks in the wilderness, and seven years of rest (Sabbatical year). Seven is also a symbol of completion or perfection.\n3. **Christianity**: In Christianity, there are seven deadly sins, seven virtues, and seven sacraments. Jesus was said to have spoken seven sermons, and the number 7 appears in various biblical accounts, such as the seven days of creation and the seven angels who appear before God.\n4. **Islam**: In Islamic tradition, there are seven heavens, seven earths, and seven veils that separate the physical world from the divine realm. The Quran mentions seven verses (Ayats) that were revealed to Muhammad in a single revelation.\n5. **Hinduism**: In Hindu mythology, there are seven chakras (energy centers) in the human body, seven colors of the rainbow, and seven planets (including the sun and moon). The number 7 is also associated with the seven aspects of the divine feminine (Saptamatrikas).\n6. **Buddhism**: In Buddhist cosmology, there are seven levels of existence (dhatus) and seven stages of enlightenment (bodhisattva path). The seven days of meditation retreats in Tibetan Buddhism are also a nod to this number.\n7. **Ancient Greek and Roman cultures**: The number 7 was considered perfect or complete, with seven being the number of planets known at the time (including Earth, Sun, Moon, Mars, Mercury, Jupiter, and Venus). Seven days were also the typical duration of a week in these cultures.\n8. **Chinese culture**: In Chinese astrology, there are seven celestial stems and 12 earthly branches, creating a system of cyclical time reckoning. The number 7 is also associated with good luck and prosperity.\n9. **African cultures**: In some African traditions, the number 7 is associated with spiritual powers, such as divination and healing. Seven is also considered a sacred number in Yoruba culture, representing the seven orishas (deities).\n10. **Esoteric traditions**: Many esoteric traditions, including Hermeticism, Gnosticism, and Theosophy, have employed the number 7 to represent spiritual principles, such as the seven planes of existence, seven chakras, or seven sacred seals.\n\nThese examples illustrate the widespread significance of the number 7 across various cultures and religions. While the specific meanings may vary, the number 7 often represents completion, perfection, spiritual growth, and mystical connections."
|
||||
},
|
||||
{
|
||||
"instruction": "I've been wondering, what is the most important thing you can learn from a child?",
|
||||
"output": "What a great question!\n\nChildren have so much to teach us, and I think one of the most important things we can learn from them is the power of simplicity, curiosity, and joy. Children are often unencumbered by the complexities and worries that come with adulthood, and they approach life with a fresh perspective.\n\nHere are some specific things children can teach us:\n\n1. **Simplicity**: Children are masters of simplifying things. They don't get bogged down in details or worry about what others think. They just enjoy the moment and find joy in simple things like playing outside, drawing, or reading.\n2. **Curiosity**: Kids are naturally curious and love to explore their world. They ask questions, investigate, and learn from experience. This curiosity can be a powerful reminder to us adults to stay open-minded and seek new knowledge and experiences.\n3. **Joy and playfulness**: Children have an amazing capacity for joy and playfulness. They find happiness in the simplest things, like playing with blocks or having a water balloon fight. We could all benefit from embracing our inner child and finding more joy in life's simple pleasures.\n4. **Unconditional love**: Children are capable of loving unconditionally, without judgment or expectation. This can be a powerful reminder to us adults to practice self-love, self-acceptance, and kindness towards others.\n5. **Creativity and imagination**: Kids are incredibly creative and imaginative, often thinking outside the box and coming up with innovative solutions. This creativity can inspire us adults to think differently, try new things, and approach problems from unique angles.\n6. **Resilience**: Children are surprisingly resilient in the face of challenges and setbacks. They learn to adapt, cope, and move forward, teaching us valuable lessons about perseverance and bouncing back from adversity.\n\nIn summary, learning from children can help us regain a sense of simplicity, curiosity, joy, and playfulness, while also reminding us of the importance of unconditional love, creativity, and resilience."
|
||||
},
|
||||
{
|
||||
"instruction": null,
|
||||
"output": ""
|
||||
},
|
||||
{
|
||||
"instruction": "What is the best way to deal with a difficult person?",
|
||||
"output": "Dealing with a difficult person can be challenging, but there are strategies that can help you navigate the situation effectively. Here are some tips:\n\n1. **Stay calm**: Take a deep breath and try not to take their behavior personally. A calm demeanor can help de-escalate tensions and prevent misunderstandings.\n2. **Listen actively**: Sometimes, people act out because they feel unheard or misunderstood. Make an effort to listen carefully to what they're saying, and respond thoughtfully.\n3. **Set boundaries**: Establish clear limits on what you are and aren't willing to engage in. Be firm but respectful when communicating your needs.\n4. **Avoid taking the bait**: Don't let their provocations get under your skin. Stay focused on the issue at hand and avoid getting drawn into an argument or debate.\n5. **Use \"I\" statements**: When expressing yourself, use \"I\" statements instead of \"you\" statements, which can come across as accusatory. This helps to reduce defensiveness and promotes a more constructive conversation.\n6. **Practice empathy**: Try to understand where the other person is coming from, even if you don't agree with their perspective. Showing that you care about their feelings can help diffuse tension.\n7. **Don't take it personally**: Remember that the difficult person's behavior is often a reflection of themselves, not you. Keep your self-worth and confidence intact.\n8. **Seek common ground**: Look for areas of agreement or shared interests. This can help to build bridges and create a more positive atmosphere.\n9. **Use humor (carefully)**: A well-timed, lighthearted joke or witty remark can help diffuse tension and lighten the mood. However, be cautious not to offend or make light of serious issues.\n10. **Know when to disengage**: If the situation becomes too heated or toxic, it may be necessary to take a step back and re-engage at a later time when emotions have cooled down.\n11. **Seek support**: Don't be afraid to ask for help from friends, family, or a professional if you're struggling to manage your interactions with a difficult person.\n12. **Practice self-care**: Take care of yourself physically, emotionally, and mentally. Engage in activities that bring you joy and help you maintain your energy and resilience.\n\nRemember, dealing with a difficult person is not about winning an argument or changing their behavior; it's about maintaining your own emotional well-being and responding constructively to the situation."
|
||||
},
|
||||
{
|
||||
"instruction": "I'm looking for a way to get my cat's attention when they're hiding under the bed or in a closet.",
|
||||
"output": "The classic \"where'd my cat go?\" conundrum! Don't worry, I've got some tips to help you coax your kitty out from their hiding spots:\n\n1. **Use their favorite treats**: Cats love food, and familiar treats can be a powerful lure. Try calling your cat's name and saying \"treat time\" in a playful tone. This might encourage them to emerge and investigate.\n2. **Make some noise**: Cats have poor eyesight but excellent hearing. Gently knock on the bed frame or closet door with your knuckles, making a soft, rhythmic sound. This can help your cat pinpoint where you are and entice them to come out.\n3. **Speak softly and calmly**: When speaking to your cat, use a gentle, soothing tone. Avoid loud or harsh voices, as these might scare them even further into hiding.\n4. **Use verbal cues**: Establish a consistent verbal cue, like \"come on out\" or \"let's play,\" which can become associated with the idea of leaving their hiding spot.\n5. **Create a \"safe zone\"**: If your cat is hiding due to fear or anxiety (e.g., from loud noises or other pets), try creating a safe, comfortable space for them to emerge into. This could be a cozy blanket or a familiar toy.\n6. **Patiently wait it out**: Sometimes, cats just need time and space to feel secure enough to come out. Give your cat the opportunity to leave their hiding spot at their own pace.\n7. **Use a flashlight (carefully)**: If your cat is hiding in a dark space, try using a flashlight to create a gentle beam of light. Be cautious not to shine it directly into their eyes, as this could startle them further.\n8. **Offer a familiar object**: Place a familiar toy or blanket near the entrance to the hiding spot, which can help your cat feel more comfortable coming out.\n9. **Make the space inviting**: If your cat is hiding under the bed, try moving any clutter or dust bunnies away from the area. Make the space underneath the bed a pleasant place for them to emerge into.\n10. **Be patient and don't force it**: Respect your cat's boundaries and allow them to come out when they're ready. Forcing them to leave their hiding spot can create negative associations and make them more likely to hide in the future.\n\nRemember, every cat is different, so try a combination of these methods to see what works best for your feline friend."
|
||||
}
|
||||
]
|
||||
506
Codes/ch07/05_dataset-generation/llama3-ollama.ipynb
Normal file
@ -0,0 +1,506 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "136a4efe-fb99-4311-8679-e0a5b6282755",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"<table style=\"width:100%\">\n",
|
||||
"<tr>\n",
|
||||
"<td style=\"vertical-align:middle; text-align:left;\">\n",
|
||||
"<font size=\"2\">\n",
|
||||
"Supplementary code for the <a href=\"http://mng.bz/orYv\">Build a Large Language Model From Scratch</a> book by <a href=\"https://sebastianraschka.com\">Sebastian Raschka</a><br>\n",
|
||||
"<br>Code repository: <a href=\"https://github.com/rasbt/LLMs-from-scratch\">https://github.com/rasbt/LLMs-from-scratch</a>\n",
|
||||
"</font>\n",
|
||||
"</td>\n",
|
||||
"<td style=\"vertical-align:middle; text-align:left;\">\n",
|
||||
"<a href=\"http://mng.bz/orYv\"><img src=\"https://sebastianraschka.com/images/LLMs-from-scratch-images/cover-small.webp\" width=\"100px\"></a>\n",
|
||||
"</td>\n",
|
||||
"</tr>\n",
|
||||
"</table>"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "b1910a06-e8a3-40ac-8201-ff70615b1ba4",
|
||||
"metadata": {
|
||||
"tags": []
|
||||
},
|
||||
"source": [
|
||||
"# Generating An Instruction Dataset via Llama 3 and Ollama"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "a128651b-f326-4232-a994-42f38b7ed520",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"- This notebook uses an 8-billion-parameter Llama 3 model through ollama to generate a synthetic dataset using the \"hack\" proposed in the \"Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing\" paper ([https://arxiv.org/abs/2406.08464](https://arxiv.org/abs/2406.08464))\n",
|
||||
"\n",
|
||||
"- The generated dataset will be an instruction dataset with \"instruction\" and \"output\" field similar to what can be found in Alpaca:\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"```python\n",
|
||||
"{\n",
|
||||
" \"instruction\": \"What is the atomic number of helium?\",\n",
|
||||
" \"output\": \"The atomic number of helium is 2.\",\n",
|
||||
"},\n",
|
||||
"```\n",
|
||||
"\n",
|
||||
"- The code doesn't require a GPU and runs on a laptop (it was tested on a M3 MacBook Air)\n",
|
||||
"\n",
|
||||
"*Note that the instruction datasets created here are for educational purposes. However, it is the users' duty to ensure that their use adheres to the terms of the relevant licensing agreements with Meta AI's Llama 3.*"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 1,
|
||||
"id": "63610acc-db94-437f-8d38-e99dca0299cb",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"tqdm version: 4.66.4\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"from importlib.metadata import version\n",
|
||||
"\n",
|
||||
"pkgs = [\n",
|
||||
" \"tqdm\", # Progress bar\n",
|
||||
"]\n",
|
||||
"\n",
|
||||
"for p in pkgs:\n",
|
||||
" print(f\"{p} version: {version(p)}\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "8bcdcb34-ac75-4f4f-9505-3ce0666c42d5",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Installing Ollama and Downloading Llama 3"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "5a092280-5462-4709-a3fe-8669a4a8a0a6",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"- Ollama is an application to run LLMs efficiently\n",
|
||||
"- It is a wrapper around [llama.cpp](https://github.com/ggerganov/llama.cpp), which implements LLMs in pure C/C++ to maximize efficiency\n",
|
||||
"- Note that it is a tool for using LLMs to generate text (inference), not training or finetuning LLMs\n",
|
||||
"- Prior to running the code below, install ollama by visiting [https://ollama.com](https://ollama.com) and following the instructions (for instance, clicking on the \"Download\" button and downloading the ollama application for your operating system)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "9558a522-650d-401a-84fc-9fd7b1f39da7",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"- For macOS and Windows users, click on the ollama application you downloaded; if it prompts you to install the command line usage, say \"yes\"\n",
|
||||
"- Linux users can use the installation command provided on the ollama website\n",
|
||||
"\n",
|
||||
"- In general, before we can use ollama from the command line, we have to either start the ollama application or run `ollama serve` in a separate terminal\n",
|
||||
"\n",
|
||||
"<img src=\"https://sebastianraschka.com/images/LLMs-from-scratch-images/bonus/ollama-eval/ollama-serve.webp?1\">\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"- With the ollama application or `ollama serve` running, in a different terminal, on the command line, execute the following command to try out the 8-billion-parameter Llama 3 model (the model, which takes up 4.7 GB of storage space, will be automatically downloaded the first time you execute this command)\n",
|
||||
"\n",
|
||||
"```bash\n",
|
||||
"# 8B model\n",
|
||||
"ollama run llama3\n",
|
||||
"```\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"The output looks like as follows:\n",
|
||||
"\n",
|
||||
"```\n",
|
||||
"$ ollama run llama3\n",
|
||||
"pulling manifest \n",
|
||||
"pulling 6a0746a1ec1a... 100% ▕████████████████▏ 4.7 GB \n",
|
||||
"pulling 4fa551d4f938... 100% ▕████████████████▏ 12 KB \n",
|
||||
"pulling 8ab4849b038c... 100% ▕████████████████▏ 254 B \n",
|
||||
"pulling 577073ffcc6c... 100% ▕████████████████▏ 110 B \n",
|
||||
"pulling 3f8eb4da87fa... 100% ▕████████████████▏ 485 B \n",
|
||||
"verifying sha256 digest \n",
|
||||
"writing manifest \n",
|
||||
"removing any unused layers \n",
|
||||
"success \n",
|
||||
"```\n",
|
||||
"\n",
|
||||
"- Note that `llama3` refers to the instruction finetuned 8-billion-parameter Llama 3 model\n",
|
||||
"\n",
|
||||
"- Alternatively, you can also use the larger 70-billion-parameter Llama 3 model, if your machine supports it, by replacing `llama3` with `llama3:70b`\n",
|
||||
"\n",
|
||||
"- After the download has been completed, you will see a command line prompt that allows you to chat with the model\n",
|
||||
"\n",
|
||||
"- Try a prompt like \"What do llamas eat?\", which should return an output similar to the following:\n",
|
||||
"\n",
|
||||
"```\n",
|
||||
">>> What do llamas eat?\n",
|
||||
"Llamas are ruminant animals, which means they have a four-chambered \n",
|
||||
"stomach and eat plants that are high in fiber. In the wild, llamas \n",
|
||||
"typically feed on:\n",
|
||||
"1. Grasses: They love to graze on various types of grasses, including tall \n",
|
||||
"grasses, wheat, oats, and barley.\n",
|
||||
"```"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "0b5addcb-fc7d-455d-bee9-6cc7a0d684c7",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"- You can end this session using the input `/bye`"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "dda155ee-cf36-44d3-b634-20ba8e1ca38a",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Using Ollama's REST API"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "89343a84-0ddc-42fc-bf50-298a342b93c0",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"- Now, an alternative way to interact with the model is via its REST API in Python via the following function\n",
|
||||
"- Before you run the next cells in this notebook, make sure that ollama is still running, as described above, via\n",
|
||||
" - `ollama serve` in a terminal\n",
|
||||
" - the ollama application\n",
|
||||
"- Next, run the following code cell to query the model"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "16642a48-1cab-40d2-af08-ab8c2fbf5876",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"- First, let's try the API with a simple example to make sure it works as intended:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 2,
|
||||
"id": "65b0ba76-1fb1-4306-a7c2-8f3bb637ccdb",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import urllib.request\n",
|
||||
"import json\n",
|
||||
"\n",
|
||||
"def query_model(prompt, model=\"llama3\", url=\"http://localhost:11434/api/chat\", role=\"user\"):\n",
|
||||
" # Create the data payload as a dictionary\n",
|
||||
" data = {\n",
|
||||
" \"model\": model,\n",
|
||||
" \"seed\": 123, # for deterministic responses\n",
|
||||
" \"temperature\": 1., # for deterministic responses\n",
|
||||
" \"top_p\": 1, \n",
|
||||
" \"messages\": [\n",
|
||||
" {\"role\": role, \"content\": prompt}\n",
|
||||
" ]\n",
|
||||
" }\n",
|
||||
"\n",
|
||||
" # Convert the dictionary to a JSON formatted string and encode it to bytes\n",
|
||||
" payload = json.dumps(data).encode(\"utf-8\")\n",
|
||||
"\n",
|
||||
" # Create a request object, setting the method to POST and adding necessary headers\n",
|
||||
" request = urllib.request.Request(url, data=payload, method=\"POST\")\n",
|
||||
" request.add_header(\"Content-Type\", \"application/json\")\n",
|
||||
"\n",
|
||||
" # Send the request and capture the response\n",
|
||||
" response_data = \"\"\n",
|
||||
" with urllib.request.urlopen(request) as response:\n",
|
||||
" # Read and decode the response\n",
|
||||
" while True:\n",
|
||||
" line = response.readline().decode(\"utf-8\")\n",
|
||||
" if not line:\n",
|
||||
" break\n",
|
||||
" response_json = json.loads(line)\n",
|
||||
" response_data += response_json[\"message\"][\"content\"]\n",
|
||||
"\n",
|
||||
" return response_data"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 3,
|
||||
"id": "4fb61a4e-2706-431a-835e-7e472b42989e",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Llamas are herbivores, which means they primarily eat plants and plant-based foods. Their diet typically consists of:\n",
|
||||
"\n",
|
||||
"1. Grasses: Llamas love to graze on various types of grasses, including tall grasses, short grasses, and even weeds.\n",
|
||||
"2. Hay: They enjoy eating hay, such as alfalfa or timothy hay, which provides them with fiber, protein, and other essential nutrients.\n",
|
||||
"3. Grains: Llamas may eat grains like oats, barley, or corn as a supplement to their diet.\n",
|
||||
"4. Leaves: They will also munch on leaves from trees and shrubs, including clover, alfalfa, and various types of leaves.\n",
|
||||
"5. Fruits and vegetables: In the wild, llamas might eat fruits and vegetables that grow in their natural habitat, such as apples, carrots, or potatoes.\n",
|
||||
"\n",
|
||||
"In general, a llama's diet should consist of:\n",
|
||||
"\n",
|
||||
"* 50% grasses and hay\n",
|
||||
"* 20% grains (like oats or corn)\n",
|
||||
"* 10% leaves and other plant material\n",
|
||||
"* 5% fruits and vegetables (as treats)\n",
|
||||
"\n",
|
||||
"It's essential to provide llamas with a balanced diet that meets their nutritional needs, as they can be prone to health issues if they don't receive the right combination of nutrients.\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"result = query_model(\"What do Llamas eat?\")\n",
|
||||
"print(result)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "8c079c6c-5845-4b31-a648-060d0273cd1d",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Extract Instructions"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "92b09132-4a92-4492-9b67-24a644767498",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"- Now, let's use the \"hack\" proposed in the paper: we provide the empty prompt template `\"<|begin_of_text|><|start_header_id|>user<|end_header_id|>\"` prompt, which will cause the instruction-finetuned Llama 3 model to generate an instruction"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 4,
|
||||
"id": "d7320a41-ed86-49e9-8eb1-5d609a82ad74",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"def extract_instruction(text):\n",
|
||||
" for content in text.split(\"\\n\"):\n",
|
||||
" if content:\n",
|
||||
" return content.strip()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 5,
|
||||
"id": "bc41b72f-a8cf-4367-b0ca-0bf8c1f094fd",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"I am trying to find a way to make my child's birthday party more special and unique. What are some creative ideas you have?\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"query = \"<|begin_of_text|><|start_header_id|>user<|end_header_id|>\"\n",
|
||||
"\n",
|
||||
"result = query_model(query, role=\"assistant\")\n",
|
||||
"instruction = extract_instruction(result)\n",
|
||||
"print(instruction)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "04d04ba7-bffc-47f0-87dc-d60fc676b14a",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"- As we can see above, surprisingly, the model indeed generated an instruction"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "162a4739-6f03-4092-a5c2-f57a0b6a4c4d",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Generate Responses"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "2542f8d3-2db2-4a89-ae50-8825eb19d3b6",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"- Now, the next step is to create the corresponding response, which can be done by simply passing the instruction as input"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 6,
|
||||
"id": "2349eb06-710f-4459-8a03-1a3b2e1e8905",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"What an exciting question! I'd be delighted to help you come up with some creative and unique ideas to make your child's birthday party truly special!\n",
|
||||
"\n",
|
||||
"Here are a few ideas to get you started:\n",
|
||||
"\n",
|
||||
"1. **Themed Scavenger Hunt**: Plan a scavenger hunt based on the birthday child's favorite theme (e.g., superheroes, animals, or princesses). Hide clues and challenges throughout the party area, leading up to a final surprise.\n",
|
||||
"2. **DIY Crafts Station**: Set up a craft station where kids can create their own party favors, such as customized t-shirts, crowns, or jewelry. This activity encourages creativity and makes for a memorable keepsake.\n",
|
||||
"3. **Mystery Box Challenge**: Fill mystery boxes with different textures, smells, and sounds. Have the kids guess what's inside each box without looking. This game promotes problem-solving and teamwork.\n",
|
||||
"4. **Indoor Camping Adventure**: Set up a cozy indoor \"camping\" area with sleeping bags, flashlights, and s'mores-making stations. Kids can enjoy a camping experience without leaving the party location.\n",
|
||||
"5. **Personalized Photo Booth**: Create a customized photo booth with props and backdrops that match the birthday child's theme. This activity allows kids to take home special memories and share them on social media.\n",
|
||||
"6. **Foodie Fun**: Plan a cooking or baking station where kids can make their own treats, such as cupcakes, pizzas, or trail mix. This activity teaches valuable skills and lets kids enjoy their creations.\n",
|
||||
"7. **Outdoor Movie Night**: Set up an outdoor movie screen (or projector) with cozy seating and snacks. Screen the birthday child's favorite film or a classic kid-friendly movie.\n",
|
||||
"8. **Science Experiments**: Host a science-themed party where kids can conduct fun experiments, such as making slime, creating lava lamps, or growing crystals.\n",
|
||||
"9. **Karaoke Contest**: Set up a karaoke machine with popular kids' songs and have a singing competition. Offer prizes for the best performances, and provide fun props like microphones and costumes.\n",
|
||||
"10. **Time Capsule Ceremony**: Have each guest bring a small item that represents their favorite memory or something they're looking forward to in the future. Bury the time capsule together as a group, with instructions to open it on a specific date (e.g., next year's birthday party).\n",
|
||||
"11. **Special Guest Appearance**: Arrange for a special guest, such as a superhero, princess, or even a real-life animal (if feasible), to make an appearance at the party.\n",
|
||||
"12. **Customized Games**: Design custom games and activities that fit the birthday child's interests and personality. This could include a customized version of a favorite game or a new game altogether.\n",
|
||||
"\n",
|
||||
"Remember, the key to making your child's birthday party unique is to incorporate elements that reflect their personality and interests. Mix and match these ideas or come up with something entirely new – the possibilities are endless!\n",
|
||||
"\n",
|
||||
"What do you think? Is there anything in particular that resonates with you, or would you like more suggestions?\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"response = query_model(instruction, role=\"user\")\n",
|
||||
"print(response)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "b12cf92c-3272-4b36-ae30-d1135af56328",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Generate Dataset"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "470037f3-64f4-4465-9f00-55b69e883a04",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"- We can scale up this approach to an arbitrary number of data samples (you may want to apply some optional filtering length or quality (e.g., using another LLM to rate the generated data)\n",
|
||||
"- Below, we generate 5 synthetic instruction-response pairs, which takes about 3 minutes on an M3 MacBook Air\n",
|
||||
"- (To generate a dataset suitable for instruction finetuning, we want to increase this to at least 1k to 50k and perhaps run it on a GPU to generate the examples in a more timely fashion)\n",
|
||||
"\n",
|
||||
"**Tip**\n",
|
||||
"\n",
|
||||
"- You can generate even higher-quality responses by changing `model=\"llama3\"` to `model=\"llama3:70b\"`, however, this will require more computational resources"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 7,
|
||||
"id": "3b9e94ab-02ef-4372-91cd-60128159fd83",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stderr",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"100%|█████████████████████████████████████████████| 5/5 [02:37<00:00, 31.41s/it]\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"from tqdm import tqdm\n",
|
||||
"\n",
|
||||
"dataset_size = 5\n",
|
||||
"dataset = []\n",
|
||||
"\n",
|
||||
"for i in tqdm(range(dataset_size)):\n",
|
||||
"\n",
|
||||
" result = query_model(query, role=\"assistant\")\n",
|
||||
" instruction = extract_instruction(result)\n",
|
||||
" response = query_model(instruction, role=\"user\")\n",
|
||||
" entry = {\n",
|
||||
" \"instruction\": instruction,\n",
|
||||
" \"output\": response\n",
|
||||
" }\n",
|
||||
" dataset.append(entry)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 8,
|
||||
"id": "5fdbc194-c12a-4138-96d1-51bf66ca1574",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"with open(\"instruction-data-llama3-7b.json\", \"w\") as file:\n",
|
||||
" json.dump(dataset, file, indent=4)"
|
||||
]
|
||||
},
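{
"cell_type": "markdown",
"id": "quality-filter-sketch-note",
"metadata": {},
"source": [
"- **Optional sketch (not part of the original chapter code):** the cell below shows one possible way to implement the quality filtering mentioned above, reusing the `query_model` helper so that the same local Llama 3 model rates each generated pair\n",
"- Empty entries (such as the `null` instruction visible in the file printed below) and low-rated entries are dropped; the rating prompt and the score threshold are illustrative assumptions, not values from the chapter"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "quality-filter-sketch-code",
"metadata": {},
"outputs": [],
"source": [
"# Optional sketch: rate each generated pair with the same local Llama 3 model\n",
"# via query_model and keep only non-empty, reasonably well-rated entries.\n",
"# The prompt wording and the threshold of 6 are assumptions for illustration.\n",
"\n",
"def rate_entry(entry):\n",
"    if not entry['instruction'] or not entry['output']:\n",
"        return 0  # drop empty or failed generations outright\n",
"    prompt = (\n",
"        'Rate the quality of the following instruction-response pair '\n",
"        'on a scale from 0 to 10 and reply with the integer only. '\n",
"        f\"Instruction: {entry['instruction']} \"\n",
"        f\"Response: {entry['output']}\"\n",
"    )\n",
"    reply = query_model(prompt, role=\"user\")\n",
"    try:\n",
"        return int(reply.strip().split()[0])\n",
"    except (ValueError, IndexError):\n",
"        return 0  # treat unparseable ratings as low quality\n",
"\n",
"filtered_dataset = [entry for entry in dataset if rate_entry(entry) >= 6]\n",
"print(f'Kept {len(filtered_dataset)} of {len(dataset)} entries')"
]
},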
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 9,
|
||||
"id": "b4027ead-bba4-49b7-9965-47532c3fdeee",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"[\n",
|
||||
" {\n",
|
||||
" \"instruction\": \"What is the significance of the number 7 in various cultures and religions?\",\n",
|
||||
" \"output\": \"The number 7 has been a significant and recurring theme across many cultures and religions, often imbuing it with special meaning and symbolism. Here are some examples:\\n\\n1. **Numerology**: In numerology, the number 7 is considered sacred and mystical, associated with spiritual awakening, introspection, and enlightenment.\\n2. **Judaism**: The Torah has seven days of creation, seven weeks in the wilderness, and seven years of rest (Sabbatical year). Seven is also a symbol of completion or perfection.\\n3. **Christianity**: In Christianity, there are seven deadly sins, seven virtues, and seven sacraments. Jesus was said to have spoken seven sermons, and the number 7 appears in various biblical accounts, such as the seven days of creation and the seven angels who appear before God.\\n4. **Islam**: In Islamic tradition, there are seven heavens, seven earths, and seven veils that separate the physical world from the divine realm. The Quran mentions seven verses (Ayats) that were revealed to Muhammad in a single revelation.\\n5. **Hinduism**: In Hindu mythology, there are seven chakras (energy centers) in the human body, seven colors of the rainbow, and seven planets (including the sun and moon). The number 7 is also associated with the seven aspects of the divine feminine (Saptamatrikas).\\n6. **Buddhism**: In Buddhist cosmology, there are seven levels of existence (dhatus) and seven stages of enlightenment (bodhisattva path). The seven days of meditation retreats in Tibetan Buddhism are also a nod to this number.\\n7. **Ancient Greek and Roman cultures**: The number 7 was considered perfect or complete, with seven being the number of planets known at the time (including Earth, Sun, Moon, Mars, Mercury, Jupiter, and Venus). Seven days were also the typical duration of a week in these cultures.\\n8. **Chinese culture**: In Chinese astrology, there are seven celestial stems and 12 earthly branches, creating a system of cyclical time reckoning. The number 7 is also associated with good luck and prosperity.\\n9. **African cultures**: In some African traditions, the number 7 is associated with spiritual powers, such as divination and healing. Seven is also considered a sacred number in Yoruba culture, representing the seven orishas (deities).\\n10. **Esoteric traditions**: Many esoteric traditions, including Hermeticism, Gnosticism, and Theosophy, have employed the number 7 to represent spiritual principles, such as the seven planes of existence, seven chakras, or seven sacred seals.\\n\\nThese examples illustrate the widespread significance of the number 7 across various cultures and religions. While the specific meanings may vary, the number 7 often represents completion, perfection, spiritual growth, and mystical connections.\"\n",
|
||||
" },\n",
|
||||
" {\n",
|
||||
" \"instruction\": \"I've been wondering, what is the most important thing you can learn from a child?\",\n",
|
||||
" \"output\": \"What a great question!\\n\\nChildren have so much to teach us, and I think one of the most important things we can learn from them is the power of simplicity, curiosity, and joy. Children are often unencumbered by the complexities and worries that come with adulthood, and they approach life with a fresh perspective.\\n\\nHere are some specific things children can teach us:\\n\\n1. **Simplicity**: Children are masters of simplifying things. They don't get bogged down in details or worry about what others think. They just enjoy the moment and find joy in simple things like playing outside, drawing, or reading.\\n2. **Curiosity**: Kids are naturally curious and love to explore their world. They ask questions, investigate, and learn from experience. This curiosity can be a powerful reminder to us adults to stay open-minded and seek new knowledge and experiences.\\n3. **Joy and playfulness**: Children have an amazing capacity for joy and playfulness. They find happiness in the simplest things, like playing with blocks or having a water balloon fight. We could all benefit from embracing our inner child and finding more joy in life's simple pleasures.\\n4. **Unconditional love**: Children are capable of loving unconditionally, without judgment or expectation. This can be a powerful reminder to us adults to practice self-love, self-acceptance, and kindness towards others.\\n5. **Creativity and imagination**: Kids are incredibly creative and imaginative, often thinking outside the box and coming up with innovative solutions. This creativity can inspire us adults to think differently, try new things, and approach problems from unique angles.\\n6. **Resilience**: Children are surprisingly resilient in the face of challenges and setbacks. They learn to adapt, cope, and move forward, teaching us valuable lessons about perseverance and bouncing back from adversity.\\n\\nIn summary, learning from children can help us regain a sense of simplicity, curiosity, joy, and playfulness, while also reminding us of the importance of unconditional love, creativity, and resilience.\"\n",
|
||||
" },\n",
|
||||
" {\n",
|
||||
" \"instruction\": null,\n",
|
||||
" \"output\": \"\"\n",
|
||||
" },\n",
|
||||
" {\n",
|
||||
" \"instruction\": \"What is the best way to deal with a difficult person?\",\n",
|
||||
" \"output\": \"Dealing with a difficult person can be challenging, but there are strategies that can help you navigate the situation effectively. Here are some tips:\\n\\n1. **Stay calm**: Take a deep breath and try not to take their behavior personally. A calm demeanor can help de-escalate tensions and prevent misunderstandings.\\n2. **Listen actively**: Sometimes, people act out because they feel unheard or misunderstood. Make an effort to listen carefully to what they're saying, and respond thoughtfully.\\n3. **Set boundaries**: Establish clear limits on what you are and aren't willing to engage in. Be firm but respectful when communicating your needs.\\n4. **Avoid taking the bait**: Don't let their provocations get under your skin. Stay focused on the issue at hand and avoid getting drawn into an argument or debate.\\n5. **Use \\\"I\\\" statements**: When expressing yourself, use \\\"I\\\" statements instead of \\\"you\\\" statements, which can come across as accusatory. This helps to reduce defensiveness and promotes a more constructive conversation.\\n6. **Practice empathy**: Try to understand where the other person is coming from, even if you don't agree with their perspective. Showing that you care about their feelings can help diffuse tension.\\n7. **Don't take it personally**: Remember that the difficult person's behavior is often a reflection of themselves, not you. Keep your self-worth and confidence intact.\\n8. **Seek common ground**: Look for areas of agreement or shared interests. This can help to build bridges and create a more positive atmosphere.\\n9. **Use humor (carefully)**: A well-timed, lighthearted joke or witty remark can help diffuse tension and lighten the mood. However, be cautious not to offend or make light of serious issues.\\n10. **Know when to disengage**: If the situation becomes too heated or toxic, it may be necessary to take a step back and re-engage at a later time when emotions have cooled down.\\n11. **Seek support**: Don't be afraid to ask for help from friends, family, or a professional if you're struggling to manage your interactions with a difficult person.\\n12. **Practice self-care**: Take care of yourself physically, emotionally, and mentally. Engage in activities that bring you joy and help you maintain your energy and resilience.\\n\\nRemember, dealing with a difficult person is not about winning an argument or changing their behavior; it's about maintaining your own emotional well-being and responding constructively to the situation.\"\n",
|
||||
" },\n",
|
||||
" {\n",
|
||||
" \"instruction\": \"I'm looking for a way to get my cat's attention when they're hiding under the bed or in a closet.\",\n",
|
||||
" \"output\": \"The classic \\\"where'd my cat go?\\\" conundrum! Don't worry, I've got some tips to help you coax your kitty out from their hiding spots:\\n\\n1. **Use their favorite treats**: Cats love food, and familiar treats can be a powerful lure. Try calling your cat's name and saying \\\"treat time\\\" in a playful tone. This might encourage them to emerge and investigate.\\n2. **Make some noise**: Cats have poor eyesight but excellent hearing. Gently knock on the bed frame or closet door with your knuckles, making a soft, rhythmic sound. This can help your cat pinpoint where you are and entice them to come out.\\n3. **Speak softly and calmly**: When speaking to your cat, use a gentle, soothing tone. Avoid loud or harsh voices, as these might scare them even further into hiding.\\n4. **Use verbal cues**: Establish a consistent verbal cue, like \\\"come on out\\\" or \\\"let's play,\\\" which can become associated with the idea of leaving their hiding spot.\\n5. **Create a \\\"safe zone\\\"**: If your cat is hiding due to fear or anxiety (e.g., from loud noises or other pets), try creating a safe, comfortable space for them to emerge into. This could be a cozy blanket or a familiar toy.\\n6. **Patiently wait it out**: Sometimes, cats just need time and space to feel secure enough to come out. Give your cat the opportunity to leave their hiding spot at their own pace.\\n7. **Use a flashlight (carefully)**: If your cat is hiding in a dark space, try using a flashlight to create a gentle beam of light. Be cautious not to shine it directly into their eyes, as this could startle them further.\\n8. **Offer a familiar object**: Place a familiar toy or blanket near the entrance to the hiding spot, which can help your cat feel more comfortable coming out.\\n9. **Make the space inviting**: If your cat is hiding under the bed, try moving any clutter or dust bunnies away from the area. Make the space underneath the bed a pleasant place for them to emerge into.\\n10. **Be patient and don't force it**: Respect your cat's boundaries and allow them to come out when they're ready. Forcing them to leave their hiding spot can create negative associations and make them more likely to hide in the future.\\n\\nRemember, every cat is different, so try a combination of these methods to see what works best for your feline friend.\"\n",
|
||||
" }\n",
|
||||
"]"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"!cat instruction-data-llama3-7b.json"
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3 (ipykernel)",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.10.6"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 5
|
||||
}
|
||||
15
Codes/ch07/README.md
Normal file
@ -0,0 +1,15 @@
|
||||
# Chapter 7: Finetuning to Follow Instructions
|
||||
|
||||
## Main Chapter Code
|
||||
|
||||
- [01_main-chapter-code](01_main-chapter-code) contains the main chapter code and exercise solutions
|
||||
|
||||
## Bonus Materials
|
||||
|
||||
- [02_dataset-utilities](02_dataset-utilities) contains utility code that can be used for preparing an instruction dataset.
|
||||
|
||||
- [03_model-evaluation](03_model-evaluation) contains utility code for evaluating instruction responses using a local Llama 3 model and the GPT-4 API.
|
||||
|
||||
- [04_preference-tuning-with-dpo](04_preference-tuning-with-dpo) implements code for preference finetuning with Direct Preference Optimization (DPO).
|
||||
|
||||
- [05_dataset-generation](05_dataset-generation) contains code to generate synthetic datasets for instruction finetuning.
|
||||