Data Science & ML Development: Using Claudio with Machine Learning Projects¶
Time: 25-35 minutes
This tutorial explains how to use Claudio effectively with data science and machine learning projects, covering Jupyter notebooks, experiment tracking, model development, and GPU resource management.
Overview¶
Claudio's git worktree architecture offers several benefits for ML projects:
- Experiment isolation: Each worktree can run different experiments
- Model versioning: Track model iterations across branches
- Notebook management: Work on multiple notebooks simultaneously
- Data pipeline separation: Develop data processing in parallel
- Resource coordination: Manage GPU allocation across experiments
Prerequisites¶
- Claudio initialized in your project (claudio init)
- Python environment with ML libraries (PyTorch/TensorFlow/scikit-learn)
- Jupyter Lab/Notebook or VS Code with Jupyter extension
- Familiarity with basic Claudio operations (see Quick Start)
Understanding ML Projects and Git Worktrees¶
Typical ML Project Structure¶
ml-project/
├── data/ # Data directory (often gitignored)
├── notebooks/ # Jupyter notebooks
│ ├── exploration/ # EDA notebooks
│ ├── training/ # Training notebooks
│ └── evaluation/ # Evaluation notebooks
├── src/
│ ├── data/ # Data loading and processing
│ ├── models/ # Model architectures
│ ├── training/ # Training loops
│ └── utils/ # Utilities
├── experiments/ # Experiment configs
├── outputs/ # Model checkpoints, logs
├── tests/ # Unit tests
├── requirements.txt
└── pyproject.toml
Worktree Considerations¶
Each worktree needs:
- Isolated virtual environment: Different dependencies per experiment
- Separate output directories: Model checkpoints, logs
- Experiment configuration: Hyperparameters, data paths
- Optionally shared data: Large datasets can be symlinked
Strategy 1: Experiment-Based Development (Recommended)¶
Best for: Comparing different model architectures or hyperparameters.
Concept¶
Each instance runs a different experiment:
Instance 1: Baseline model training
Instance 2: Model with attention mechanism
Instance 3: Model with different optimizer
Instance 4: Data augmentation experiments
Workflow¶
Task 1 - Baseline:
Train baseline model without attention.
Setup:
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
Experiment:
1. Create experiments/baseline.yaml with config
2. Train: python src/training/train.py --config experiments/baseline.yaml
3. Log metrics to experiments/baseline/metrics.json
4. Evaluate: python src/evaluation/evaluate.py --model outputs/baseline/model.pt
Task 2 - Attention Model:
Train model with attention mechanism.
Setup:
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
Experiment:
1. Implement attention in src/models/attention.py
2. Create experiments/attention.yaml
3. Train: python src/training/train.py --config experiments/attention.yaml
4. Compare metrics with baseline
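The comparison in step 4 could look roughly like the sketch below. It assumes each experiment writes a flat JSON object of numeric metrics to experiments/<name>/metrics.json, as the baseline task does. The function names and metric keys are illustrative and not part of Claudio. Because each worktree has its own checkout, the baseline metrics file may need to be merged or copied in before it can be read.

```python
import json
from pathlib import Path


def load_metrics(path):
    """Read a flat {"metric": number} JSON file (assumed layout)."""
    return json.loads(Path(path).read_text())


def compare_metrics(baseline, candidate):
    """Return candidate-minus-baseline deltas for metrics present in both runs."""
    shared = sorted(set(baseline) & set(candidate))
    return {name: candidate[name] - baseline[name] for name in shared}


# Made-up numbers standing in for two metrics.json files.
baseline = {"accuracy": 0.5, "loss": 1.0}
attention = {"accuracy": 0.75, "loss": 0.5, "f1": 0.7}

for name, delta in compare_metrics(baseline, attention).items():
    print(f"{name}: {delta:+.4f}")
```

Metrics that appear in only one run (f1 here) are skipped rather than reported as a change, which keeps the comparison honest when the two experiments log different things.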
Experiment Configuration¶
Structure experiments for parallel development:
# experiments/baseline.yaml
model:
  name: baseline
  hidden_size: 256
  layers: 4
training:
  epochs: 100
  batch_size: 32
  learning_rate: 0.001
output:
  dir: outputs/${experiment_name}
  checkpoint_freq: 10
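YAML does not expand ${experiment_name} on its own, so whatever loads the config has to substitute it. The sketch below shows one way to do that with the standard library. It assumes the experiment name comes from model.name, and the helper resolve_placeholders is hypothetical. The dict stands in for the parsed YAML file.

```python
from string import Template


def resolve_placeholders(node, values):
    """Recursively substitute ${name} placeholders in every string of a config tree."""
    if isinstance(node, dict):
        return {key: resolve_placeholders(val, values) for key, val in node.items()}
    if isinstance(node, list):
        return [resolve_placeholders(item, values) for item in node]
    if isinstance(node, str):
        # substitute() raises KeyError on an unknown placeholder, which
        # surfaces typos instead of writing to a literal "${...}" directory.
        return Template(node).substitute(values)
    return node


# Mirrors experiments/baseline.yaml after parsing (the YAML parser is not shown).
config = {
    "model": {"name": "baseline", "hidden_size": 256, "layers": 4},
    "training": {"epochs": 100, "batch_size": 32, "learning_rate": 0.001},
    "output": {"dir": "outputs/${experiment_name}", "checkpoint_freq": 10},
}

resolved = resolve_placeholders(config, {"experiment_name": config["model"]["name"]})
print(resolved["output"]["dir"])  # outputs/baseline
```

Deriving the output directory from the experiment name means two worktrees cannot overwrite each other's checkpoints as long as their experiment names differ.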
Strategy 2: Pipeline-Based Development¶
Best for: Data pipeline and feature engineering work.
Task Assignment¶
Instance 1 - Data Loading:
Implement data loading pipeline.
Setup:
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
Implementation:
1. Create src/data/dataset.py with PyTorch Dataset
2. Add data augmentation transforms
3. Write unit tests
pytest tests/data/
Instance 2 - Feature Engineering:
Implement feature engineering pipeline.
Setup:
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
Implementation:
1. Create src/data/features.py with feature extractors
2. Add feature normalization
3. Write validation tests
pytest tests/data/
Instance 3 - Preprocessing:
Implement data preprocessing.
Setup:
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
Implementation:
1. Create src/data/preprocess.py
2. Add cleaning and validation
3. Create preprocessing pipeline
pytest tests/data/
Strategy 3: Model Architecture Exploration¶
Best for: Neural architecture search and model comparison.
Parallel Architecture Development¶
Instance 1 - CNN Architecture:
Implement CNN model architecture.
Setup:
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
Implementation:
1. Create src/models/cnn.py
2. Add residual connections
3. Implement forward pass
4. Test with dummy data
pytest tests/models/
Instance 2 - Transformer Architecture:
Implement Transformer model.
Setup:
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
Implementation:
1. Create src/models/transformer.py
2. Implement multi-head attention
3. Add positional encoding
4. Test with dummy data
pytest tests/models/
Jupyter Notebook Management¶
Notebook Isolation¶
Each worktree can run Jupyter independently:
Start Jupyter in this worktree:
source .venv/bin/activate
jupyter lab --port 8889 --no-browser
# Different port per worktree:
# Worktree 1: port 8888
# Worktree 2: port 8889
# Worktree 3: port 8890
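If the ports are not assigned by hand, a small helper can pick the first free one. This is a sketch using only the standard library, and the function names are illustrative. Jupyter will usually fall back to another port on its own, so the main benefit of choosing explicitly is a predictable worktree-to-port mapping.

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if nothing is currently bound to host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False
        return True


def pick_jupyter_port(start=8888, attempts=20):
    """Return the first free port in [start, start + attempts)."""
    for port in range(start, start + attempts):
        if port_is_free(port):
            return port
    raise RuntimeError(f"no free port in {start}-{start + attempts - 1}")
```

The result can be passed straight to the command above, for example jupyter lab --port <picked port> --no-browser. There is a small race between checking and starting Jupyter, which is acceptable for local development.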
Notebook Version Control¶
Use jupytext for cleaner diffs:
Set up jupytext for notebook versioning:
pip install jupytext
jupytext --set-formats ipynb,py:percent notebooks/*/*.ipynb
# Commit the paired .py files to git
# .ipynb can be gitignored or tracked
Notebook Tasks¶
Task 1 - EDA Notebook:
Create data exploration notebook.
Setup:
source .venv/bin/activate
pip install -r requirements.txt
Create notebooks/exploration/data_eda.py:
1. Load and describe dataset
2. Visualize distributions
3. Identify missing values
4. Document findings
Convert to notebook:
jupytext notebooks/exploration/data_eda.py --to notebook
Task 2 - Training Notebook:
Create training experiment notebook.
Setup:
source .venv/bin/activate
pip install -r requirements.txt
Create notebooks/training/experiment_01.py:
1. Set up experiment tracking
2. Define training loop
3. Log metrics and artifacts
4. Save checkpoints
jupytext notebooks/training/experiment_01.py --to notebook
GPU Resource Management¶
Single GPU Coordination¶
When multiple worktrees share a GPU:
Option A: Sequential Training
claudio add "Train baseline model" --start
claudio add "Train attention model" --depends-on "baseline"
Option B: GPU Memory Allocation
Option C: CPU for Development
Multi-GPU Setup¶
With multiple GPUs, assign per worktree:
Worktree 1:
Worktree 2:
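One common way to give each worktree its own GPU is the CUDA_VISIBLE_DEVICES environment variable, which CUDA-based frameworks read at startup. The sketch below builds a per-process environment and launches the tutorial's train.py with it. The helper names are hypothetical, and passing no GPU index hides all GPUs, which also covers CPU-only development.

```python
import os
import subprocess
import sys


def gpu_env(gpu_index=None, base_env=None):
    """Build an environment that exposes one GPU, or none when gpu_index is None."""
    env = dict(os.environ if base_env is None else base_env)
    # An empty value hides every GPU, so frameworks fall back to the CPU.
    env["CUDA_VISIBLE_DEVICES"] = "" if gpu_index is None else str(gpu_index)
    return env


def run_training(config_path, gpu_index=None):
    """Launch src/training/train.py (assumed to exist) pinned to one GPU."""
    cmd = [sys.executable, "src/training/train.py", "--config", config_path]
    return subprocess.run(cmd, env=gpu_env(gpu_index), check=True)
```

For example, worktree 1 might call run_training("experiments/baseline.yaml", gpu_index=0) and worktree 2 run_training("experiments/attention.yaml", gpu_index=1). Inside each process the selected GPU then appears as device 0.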
Remote GPU Resources¶
For cloud GPU usage:
Submit training job to cloud:
# AWS SageMaker
python scripts/submit_sagemaker.py --config experiments/baseline.yaml
# Google Cloud Vertex AI (formerly AI Platform)
python scripts/submit_vertex.py --config experiments/baseline.yaml
Experiment Tracking¶
MLflow Integration¶
Set up MLflow experiment tracking:
Setup:
pip install mlflow
Usage in training:
import mlflow
mlflow.set_experiment("model-comparison")
with mlflow.start_run(run_name="baseline"):
    mlflow.log_params(config)
    # ... training loop
    mlflow.log_metric("accuracy", accuracy)
    mlflow.log_artifact("model.pt")
Weights & Biases Integration¶
Set up W&B tracking:
Setup:
pip install wandb
wandb login
Usage:
import wandb
wandb.init(project="my-project", name="baseline-experiment")
wandb.config.update(config)
# ... training loop
wandb.log({"loss": loss, "accuracy": accuracy})
DVC Integration¶
For data and model versioning:
Set up DVC for data versioning:
pip install dvc
dvc init
dvc add data/raw/dataset.csv
git add data/raw/dataset.csv.dvc data/raw/.gitignore
Data Management¶
Large Dataset Handling¶
For large datasets, symlink to shared location:
# Create shared data directory
mkdir -p /data/shared/ml-project/data
# Symlink in each worktree
ln -s /data/shared/ml-project/data ./data
Task instruction:
This worktree uses shared data.
Data location: /data/shared/ml-project/data
Symlink created: ./data -> /data/shared/ml-project/data
Do not modify files in ./data directly.
Create processed versions in ./processed/
Data Version Control¶
Track data versions with DVC:
dvc pull # Get data for this experiment
# Define the stage, then run it (dvc run was removed in DVC 3.0)
dvc stage add -n preprocess -d data/raw -o data/processed python preprocess.py
dvc repro preprocess
Testing Strategies¶
Unit Tests¶
Test model components independently:
Task 1:
Task 2:
Integration Tests¶
Test full training pipeline:
Model Validation¶
Validate model outputs:
Run model validation:
python src/evaluation/validate.py --model outputs/model.pt
python src/evaluation/sanity_check.py --model outputs/model.pt
Common Conflict Points¶
File Conflicts¶
| File | Risk | Mitigation |
|---|---|---|
| requirements.txt | HIGH | Coordinate dependency changes |
| pyproject.toml | HIGH | One instance for config changes |
| Notebooks (.ipynb) | HIGH | Use jupytext, different notebooks |
| Model code | LOW | Different model files |
| Experiment configs | LOW | Different experiment names |
Task Design for ML¶
Good decomposition:
├── src/data/ (Data team)
├── src/models/ (Model team)
├── src/training/ (Training team)
└── notebooks/ (Different notebooks)
Risky decomposition:
├── Full experiment 1 (touches all files)
├── Full experiment 2 (touches all files)
└── Full experiment 3 (touches all files)
Environment Management¶
Conda Environments¶
Create conda environment for this worktree:
conda create -p ./.conda python=3.11 -y
conda activate ./.conda
pip install -r requirements.txt
Poetry¶
Docker¶
Build and run in Docker:
docker build -t ml-experiment:baseline .
docker run --gpus all -v "$(pwd)/data:/app/data" ml-experiment:baseline
Example: Complete ML Feature¶
Scenario¶
Implementing a new model with:
- Data preprocessing pipeline
- Model architecture
- Training loop
- Evaluation metrics
Session Setup¶
Tasks¶
Task 1 - Data Pipeline:
Implement data preprocessing.
Setup:
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
Implementation:
1. Create src/data/preprocess.py with:
- Data loading from CSV
- Missing value handling
- Feature normalization
- Train/val/test split
2. Create src/data/dataset.py with:
- PyTorch Dataset class
- Data augmentation
Test:
pytest tests/data/ -v
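The train/val/test split in Task 1 should be reproducible, so that every worktree evaluates on the same held-out rows. The sketch below shows one way to do that with a fixed seed. The function name, default fractions, and seed are illustrative assumptions, not values from the project.

```python
import random


def split_indices(n_rows, val_frac=0.1, test_frac=0.1, seed=42):
    """Shuffle row indices with a fixed seed and cut them into train/val/test."""
    if not 0 <= val_frac + test_frac < 1:
        raise ValueError("val_frac + test_frac must be in [0, 1)")
    indices = list(range(n_rows))
    # A private Random instance avoids disturbing the global random state.
    random.Random(seed).shuffle(indices)
    n_val = int(n_rows * val_frac)
    n_test = int(n_rows * test_frac)
    val = indices[:n_val]
    test = indices[n_val:n_val + n_test]
    train = indices[n_val + n_test:]
    return train, val, test
```

Committing the seed alongside the experiment config lets the evaluation task rebuild the exact same test set without sharing any files.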
Task 2 - Model Architecture:
Implement model architecture.
Setup:
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
Implementation:
1. Create src/models/new_model.py with:
- Model class
- Forward pass
- Weight initialization
2. Create config in experiments/new_model.yaml
Test:
pytest tests/models/ -v
Task 3 - Training Loop:
Implement training pipeline.
Setup:
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
Implementation:
1. Create src/training/trainer.py with:
- Training loop
- Validation step
- Checkpointing
- Early stopping
2. Add experiment tracking
Test:
python src/training/train.py --config experiments/new_model.yaml --epochs 1 --debug
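Early stopping, listed in Task 3, is small enough to sketch without a framework. The class below is a hypothetical helper, not an existing API. It stops once the validation loss has failed to improve for patience consecutive epochs. The loss values in the usage loop are made up.

```python
class EarlyStopping:
    """Stop when validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=5, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True when training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Made-up validation losses standing in for a real training loop.
stopper = EarlyStopping(patience=2)
for epoch, val_loss in enumerate([0.9, 0.7, 0.71, 0.72, 0.6]):
    if stopper.step(val_loss):
        print(f"stopping after epoch {epoch}")  # stopping after epoch 3
        break
```

A real trainer would typically save a checkpoint whenever best_loss improves, so that the model restored after stopping is the best one rather than the last one.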
Task 4 - Evaluation:
Implement evaluation metrics.
Setup:
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
Implementation:
1. Create src/evaluation/metrics.py with:
- Accuracy, precision, recall, F1
- Confusion matrix
- ROC/AUC
2. Create src/evaluation/evaluate.py
Test:
python src/evaluation/evaluate.py --model outputs/new_model/checkpoint.pt
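The classification metrics named in Task 4 all derive from the four confusion-matrix counts. The sketch below computes them for binary 0/1 labels in plain Python to make the definitions explicit. The function name is illustrative, and a real project would more likely rely on a library such as scikit-learn.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for 0/1 labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    total = tp + fp + fn + tn
    # Each ratio falls back to 0.0 when its denominator is empty.
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "confusion_matrix": [[tn, fp], [fn, tp]],  # rows: true 0/1, cols: predicted 0/1
    }
```

Returning a plain dict makes it straightforward to dump the result to a metrics.json file or pass it to an experiment tracker.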
Configuration Recommendations¶
For ML projects:
# ~/.config/claudio/config.yaml
# ML training can take a long time
instance:
  activity_timeout_minutes: 120
  completion_timeout_minutes: 240
# Assign reviewers by area
pr:
  reviewers:
    by_path:
      "src/data/**": [data-team]
      "src/models/**": [ml-team]
      "notebooks/**": [data-science-team]
      "experiments/**": [ml-team, tech-lead]
      "requirements*.txt": [tech-lead]
# ML development can be very expensive
resources:
  cost_warning_threshold: 25.00
CI Integration¶
Example GitHub Actions workflow:
name: ML CI
on:
  pull_request:
    paths:
      - '**/*.py'
      - 'requirements*.txt'
      - 'experiments/**'
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
          cache: 'pip'
      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install -r requirements.txt -r requirements-dev.txt
      - name: Lint
        run: |
          ruff check .
          black --check .
      - name: Unit tests
        run: pytest tests/ -v --ignore=tests/integration
      - name: Model smoke test
        run: |
          python -c "from src.models import NewModel; m = NewModel(); print('Model loads OK')"
  integration:
    runs-on: ubuntu-latest
    needs: test
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
          cache: 'pip'
      - name: Install dependencies
        run: pip install -r requirements.txt
      - name: Training smoke test
        run: |
          python src/training/train.py \
            --config experiments/test.yaml \
            --epochs 1 \
            --batch-size 2 \
            --device cpu
Troubleshooting¶
CUDA out of memory¶
Training exhausting GPU memory.
Solution:
# Reduce batch size
--batch-size 16
# Gradient accumulation
--accumulation-steps 4
# Mixed precision training
--fp16
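Gradient accumulation works because averaging the gradients of equally sized micro-batches gives the same result as one gradient over the full batch. A batch size of 16 with 4 accumulation steps therefore behaves like a batch of 64 while holding only 16 samples in memory at a time. The sketch below checks that claim on a toy one-parameter regression in plain Python. The flags above belong to the tutorial's own train.py, and nothing here depends on them.

```python
def mse_grad(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to the scalar weight w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def accumulated_grad(w, xs, ys, micro_batch_size):
    """Average micro-batch gradients, as gradient accumulation does."""
    starts = range(0, len(xs), micro_batch_size)
    steps = len(starts)
    total = 0.0
    for start in starts:
        stop = start + micro_batch_size
        # Dividing by the step count keeps the result a mean, not a sum.
        total += mse_grad(w, xs[start:stop], ys[start:stop]) / steps
    return total


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0]

full = mse_grad(0.5, xs, ys)
accum = accumulated_grad(0.5, xs, ys, micro_batch_size=2)
# Equal only because every micro-batch has the same size.
assert abs(full - accum) < 1e-9
```

The equivalence is exact only when the micro-batches are the same size and the model has no batch-dependent layers. Batch normalization, for instance, still sees the smaller batch.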
Package version conflicts¶
Different package versions needed.
Solution:
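Giving each worktree its own virtual environment, as in the setup steps of every task, keeps conflicting versions apart. To confirm that an environment actually matches its pins, a check along these lines can help. It is a sketch that only understands simple name==version lines in requirements.txt, and the helper names are hypothetical.

```python
from importlib import metadata


def parse_pins(lines):
    """Extract {package: version} from 'name==version' requirement lines."""
    pins = {}
    for raw in lines:
        # Drop comments and environment markers before looking for a pin.
        line = raw.split("#", 1)[0].split(";", 1)[0].strip()
        if "==" in line:
            name, version = line.split("==", 1)
            pins[name.strip().lower()] = version.strip()
    return pins


def find_mismatches(pins):
    """Return {package: (pinned, installed_or_None)} for pins the environment violates."""
    mismatches = {}
    for name, pinned in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != pinned:
            mismatches[name] = (pinned, installed)
    return mismatches
```

Running find_mismatches(parse_pins(open("requirements.txt"))) inside the activated environment lists every package whose installed version differs from the pin, with None marking packages that are not installed at all.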
Notebook kernel not found¶
Kernel not matching environment.
Solution:
Data not found¶
Symlink or path issues.
Solution:
Experiment tracking conflicts¶
Multiple runs with same name.
Solution:
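One way to avoid name collisions is to derive the run name from the experiment, the worktree's branch, and a timestamp. The sketch below assumes git is on the PATH. The naming scheme and helper names are illustrative and are not a feature of MLflow, W&B, or Claudio.

```python
import subprocess
from datetime import datetime, timezone


def current_branch(default="detached"):
    """Return the checked-out git branch name, or `default` outside a branch."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "--abbrev-ref", "HEAD"],
            capture_output=True, text=True, check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return default
    name = result.stdout.strip()
    # git prints "HEAD" when the checkout is detached.
    return default if name in ("", "HEAD") else name


def unique_run_name(experiment, branch=None, now=None):
    """Combine experiment, branch, and a UTC timestamp into a run name."""
    branch = current_branch() if branch is None else branch
    now = datetime.now(timezone.utc) if now is None else now
    safe_branch = branch.replace("/", "-")
    return f"{experiment}-{safe_branch}-{now:%Y%m%d-%H%M%S}"
```

The result can be passed as run_name to mlflow.start_run or as name to wandb.init in place of a fixed string such as "baseline".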
What You Learned¶
- Experiment isolation strategies
- Jupyter notebook management in worktrees
- GPU resource coordination
- Experiment tracking integration
- Data management for ML projects
- CI integration patterns
Next Steps¶
- Python Development - Python-specific patterns
- Full-Stack Development - ML with web services
- Configuration Guide - Customize for your team