# Quick Start

Get MiniMind up and running in minutes!
## 📋 Requirements

### Hardware

- GPU Memory: 8 GB minimum (24 GB recommended for comfortable development)
- Recommended GPU: NVIDIA RTX 3090 (24 GB)

### Software

- Python: 3.10+
- PyTorch: 2.0+ (with CUDA 12.2+ for GPU support)
- CUDA: 12.2+ (optional, for GPU acceleration)
!!! tip "Hardware Configuration Reference"
    - CPU: Intel i9-10980XE @ 3.00GHz
    - RAM: 128 GB
    - GPU: NVIDIA GeForce RTX 3090 (24GB) × 8
    - OS: Ubuntu 20.04
    - CUDA: 12.2
    - Python: 3.10.16
## 🚀 Step 0: Clone the Repository

```bash
git clone https://github.com/jingyaogong/minimind.git
cd minimind
```
## 🎯 Section I: Testing Existing Models

### 1. Environment Setup

```bash
pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple
```
!!! warning "Verify CUDA Support"
    After installation, verify that PyTorch can access CUDA:

    ```python
    import torch
    print(torch.cuda.is_available())
    print(torch.cuda.get_device_name(0))
    ```

    If this prints `False`, download the correct PyTorch build from the official PyTorch site.
### 2. Download Pretrained Models

Choose one option:

From HuggingFace (recommended for international users):

```bash
git clone https://huggingface.co/jingyaogong/MiniMind2
```

From ModelScope (recommended for users in China):

```bash
git clone https://www.modelscope.cn/models/gongjy/MiniMind2.git
```
### 3. Command-Line Chat

```bash
# load=0: load a native PyTorch checkpoint, load=1: load a transformers-format model
python eval_model.py --load 1 --model_mode 2
```
Model Modes:

- `model_mode 0`: Pretrain model (word continuation)
- `model_mode 1`: SFT chat model (conversation)
- `model_mode 2`: RLHF model (refined responses; currently same as SFT for small models)
- `model_mode 3`: Reasoning model (with thinking chains)
- `model_mode 4/5`: RLAIF models (PPO/GRPO trained)
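As a quick reference, the modes above can be summarized in a lookup table. The labels below are paraphrased from this guide for illustration; they are not identifiers from the MiniMind codebase:

```python
# Descriptive summary of the eval_model.py --model_mode values.
# Labels are paraphrased from this guide, not names from the repo.
MODEL_MODES = {
    0: "pretrain (word continuation)",
    1: "SFT chat (conversation)",
    2: "RLHF (refined responses)",
    3: "reasoning (thinking chains)",
    4: "RLAIF (PPO)",
    5: "RLAIF (GRPO)",
}

def describe_mode(mode: int) -> str:
    """Return a human-readable description for a --model_mode value."""
    return MODEL_MODES.get(mode, "unknown mode")
```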
Example Session:

```text
👶: Hello, please introduce yourself.
🤖️: I am MiniMind, an AI assistant developed by Jingyao Gong.
    I use natural language processing and machine learning algorithms to interact with users.

👶: What's the capital of France?
🤖️: The capital of France is Paris, which is located in the northern central part of France.
    It is the largest city in France and serves as its political, economic, and cultural center.
```
### 4. Web UI Demo (Optional)

```bash
# Requires Python >= 3.10
pip install streamlit
cd scripts
streamlit run web_demo.py
```

Visit http://localhost:8501 to use the interactive web interface.
### 5. RoPE Length Extrapolation with YaRN

Extend the context length beyond what the model was trained on using RoPE extrapolation:

```bash
python eval_model.py --inference_rope_scaling True
```

This enables the YaRN algorithm to handle sequences longer than the 2K training context, which is useful for processing long documents and conversations.
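MiniMind's actual YaRN code lives in the model implementation; as a rough sketch of one core ingredient, NTK-aware base scaling enlarges the RoPE base when extrapolating, so rotation frequencies slow down and longer positions stay closer to the training distribution. Everything below is illustrative, not the repo's implementation:

```python
def rope_inv_freq(dim, base=10000.0, scale=1.0):
    """RoPE inverse frequencies for a head dimension `dim`.

    With scale > 1 (target_len / train_len), apply NTK-aware base
    scaling, a simplified relative of YaRN: a larger base lowers the
    rotation frequencies so extrapolated positions rotate more slowly.
    """
    if scale > 1.0:
        base = base * scale ** (dim / (dim - 2))
    return [1.0 / base ** (2 * i / dim) for i in range(dim // 2)]

orig = rope_inv_freq(64)               # frequencies at training length
extrap = rope_inv_freq(64, scale=4.0)  # e.g. extending a 2K context to 8K
```

Note that full YaRN additionally interpolates per-frequency and rescales attention temperature; this sketch only shows the base-scaling idea.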
## 🔧 Third-Party Inference Frameworks

MiniMind is compatible with popular inference engines:

### Ollama (Easiest)

```bash
ollama run jingyaogong/minimind2
```

### vLLM (Fastest)

```bash
vllm serve ./MiniMind2/ --served-model-name "minimind" --port 8000
```
```bash
# Test with curl
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "minimind",
    "messages": [{"role": "user", "content": "Hello!"}],
    "temperature": 0.7,
    "max_tokens": 512
  }'
```
### llama.cpp (CPU-Friendly)

```bash
# Convert to GGUF format
python scripts/convert_model.py ./MiniMind2/ --output ./MiniMind2.gguf

# Quantize for size reduction
./llama-quantize ./MiniMind2.gguf ./MiniMind2-Q4.gguf Q4_K_M

# Run inference
./llama-cli -m ./MiniMind2-Q4.gguf -p "Hello" -n 128
```
## 🔌 OpenAI API Server (For Integration)

Run MiniMind as an OpenAI-API-compatible service:

```bash
python scripts/serve_openai_api.py
```

Test the API:

```bash
# In another terminal
python scripts/chat_openai_api.py
```
cURL Example:

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "minimind",
    "messages": [
      {"role": "user", "content": "Explain machine learning in one sentence."}
    ],
    "temperature": 0.7,
    "max_tokens": 256,
    "stream": true
  }'
```
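Since the server speaks the OpenAI chat-completions protocol, it can also be called from Python with only the standard library. The URL and payload below mirror the cURL example (adjust the host/port if your server differs); streaming is disabled here to keep the response a single JSON object:

```python
import json
import urllib.request

API_URL = "http://localhost:8000/v1/chat/completions"  # matches the cURL example

def build_payload(messages, model="minimind", temperature=0.7, max_tokens=256):
    """Assemble an OpenAI-style chat-completion request body."""
    return {
        "model": model,
        "messages": messages,
        "temperature": temperature,
        "max_tokens": max_tokens,
        "stream": False,  # one JSON response, no SSE parsing needed
    }

def chat(messages, **kwargs):
    """POST to the server and return the assistant's reply text."""
    data = json.dumps(build_payload(messages, **kwargs)).encode("utf-8")
    req = urllib.request.Request(
        API_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        body = json.loads(resp.read())
    return body["choices"][0]["message"]["content"]

# Usage (requires the server from scripts/serve_openai_api.py to be running):
# print(chat([{"role": "user", "content": "Hello!"}]))
```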
This enables integration with:
- FastGPT
- Open-WebUI
- Dify
- Any OpenAI API-compatible client
## 📊 Model Selection Guide
| Use Case | Recommended Model | Memory | Speed |
|---|---|---|---|
| Learning/Testing | MiniMind2-small (26M) | ~0.5 GB | Fastest |
| Balanced | MiniMind2 (104M) | ~1.0 GB | Fast |
| Expert System (MoE) | MiniMind2-MoE (145M) | ~1.0 GB | Dynamic |
| Reasoning/Complex | MiniMind-Reason (104M) | ~1.0 GB | Standard |
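The memory figures in the table are rough inference-time totals; a lower bound from the weights alone is simply parameter count × bytes per parameter. The helper below is an illustrative back-of-the-envelope calculation, not part of the repo:

```python
def weight_memory_gb(n_params: int, bytes_per_param: int = 4) -> float:
    """Lower-bound memory for model weights alone (fp32 = 4 bytes/param).

    Runtime totals are higher: activations, KV cache, and framework
    overhead come on top, which is why the table quotes ~1.0 GB for 104M.
    """
    return n_params * bytes_per_param / 1024**3

fp32 = weight_memory_gb(104_000_000)     # MiniMind2 weights in fp32, ~0.39 GB
fp16 = weight_memory_gb(104_000_000, 2)  # halved in fp16
```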
## ⚡ Quick Test Results

Model: MiniMind2 (104M parameters)

```text
Q: What is photosynthesis?
A: Photosynthesis is a process in which plants convert light energy from the sun
   into chemical energy to produce glucose. This process occurs mainly in leaves
   and is essential for plant growth and survival.
```
```python
# Q: Write a Python function to calculate Fibonacci numbers.
def fibonacci(n):
    if n <= 1:
        return n
    return fibonacci(n-1) + fibonacci(n-2)

# For better performance, use dynamic programming:
def fibonacci_dp(n):
    if n <= 1:
        return n
    dp = [0] * (n + 1)
    dp[1] = 1
    for i in range(2, n + 1):
        dp[i] = dp[i-1] + dp[i-2]
    return dp[n]
```
```text
Q: 世界上最高的山峰是什么? (What is the highest mountain in the world?)
A: 珠穆朗玛峰(Mount Everest)是世界上最高的山峰,位于喜马拉雅山脉...
   (Mount Everest is the world's highest mountain, located in the Himalayas...)
```
## 🆘 Troubleshooting

### Issue: CUDA Out of Memory

Solution:

```bash
# Reduce batch size
python eval_model.py --batch_size 1

# Or use CPU (slow but works)
python eval_model.py --device cpu
```
### Issue: Slow Inference

Solutions:

- Use vLLM or llama.cpp for faster inference
- Enable quantization (4-bit, 8-bit)
- Use a GPU instead of the CPU
- Reduce the `max_tokens` parameter
### Issue: Model Responses Are Poor Quality

Possible Causes:

- Using the pretrain model (`model_mode 0`) instead of SFT (`model_mode 1`)
- Model is undertrained: download the full checkpoint instead
- Input prompt is too short: provide more context
### Issue: Python/PyTorch Version Mismatch

Solution:

```bash
# Use conda for a clean environment
conda create -n minimind python=3.10
conda activate minimind
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu122
pip install -r requirements.txt
```
## 📖 Next Steps
- Model Training Guide - Train your own MiniMind from scratch
- Source Code - Explore and learn LLM implementation
- Inference Benchmarks - See model performance comparisons
## 💡 Pro Tips

- GPU Memory Optimization: call `torch.cuda.empty_cache()` periodically
- Batch Processing: for efficiency, process multiple prompts in batches
- Temperature Tuning: lower (0.3-0.7) = more consistent; higher (0.8-1.0) = more creative
- Prompt Engineering: better prompts → better results, even for small models
- Model Quantization: use 4-bit quantization to run on smaller GPUs
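To see why temperature behaves this way: sampling divides the logits by the temperature before the softmax, so a low temperature sharpens the distribution around the top token while a high one flattens it. A minimal illustration (logit values are made up):

```python
import math

def sample_probs(logits, temperature=1.0):
    """Convert raw logits to a probability distribution at a given temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)                                # subtract max for stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]                           # hypothetical next-token logits
low = sample_probs(logits, temperature=0.3)        # sharper: top token dominates
high = sample_probs(logits, temperature=1.0)       # flatter: more diverse sampling
```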
Done! Now you're ready to use MiniMind. Start with the Quick Start, then move to Model Training to learn how to train your own models.