
A template for fine-tuning 4-bit LLMs on a single GPU (Unsloth) or across a multi-GPU fleet (FSDP) via SageMaker. Deployed with Terraform.

FSDP-Multi-GPU-Training

Overview

A production-ready template for training LLMs with two strategies:

  • FSDP for multi-GPU sharded training and sharded checkpoints
  • Unsloth for efficient 4-bit fine-tuning on a single/multi-GPU node

It is designed for AWS SageMaker Spot with preemption-safe saves (SIGTERM handling) and automatic resume from the latest checkpoint. It includes hardened data loading (HF Hub and S3 parquet) and robust checkpoint sync to S3.

For non-developers (quick start): use the provided commands without changing code. For developers: code lives under src/ in a standard Python package layout.

Features

  • FSDP strategy with sharded initialization and sharded checkpoints
  • Unsloth strategy with 4-bit training and a fallback to Transformers
  • Preemption-safe: SIGTERM/SIGINT triggers emergency_stop() with a final checkpoint (see the sketch after this list)
  • Auto-resume: if --resume is not provided, uses the most recent checkpoint in checkpoint.output_dir
  • Hardened dataloader: HF datasets or S3 parquet, batched mapping with prompt templating
  • S3 checkpoint sync with retries and exponential backoff
  • Dockerfile with CUDA PyTorch, awscli, and s5cmd preinstalled
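
The preemption-safe behavior boils down to a signal handler that sets a flag, which the training loop checks at safe points. A minimal sketch follows; emergency_stop() is the hook named above, while the class and its wiring here are assumptions, not the repo's actual code.

import signal

class PreemptionGuard:
    """Illustrative: catch SIGTERM/SIGINT and request a final checkpoint."""

    def __init__(self, trainer):
        self.trainer = trainer
        self.stop_requested = False
        signal.signal(signal.SIGTERM, self._handle)  # Spot preemption notice
        signal.signal(signal.SIGINT, self._handle)   # manual interrupt

    def _handle(self, signum, frame):
        # Keep the handler tiny; the training loop reacts to the flag.
        self.stop_requested = True

    def maybe_stop(self, step):
        if self.stop_requested:
            self.trainer.emergency_stop(step)  # save a final checkpoint, then exit
            return True
        return False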

Repository Structure

  • src/fsdp_unsloth/
    • core/ trainers, strategy selection, security checks
    • common/ logging, memory, checkpoint utils, and config adapter/schema
  • scripts/
    • train.py CLI entry (thin wrapper; you can also use the installed CLI)
    • infer.py example inference script
    • configs/ example configs (FSDP/Unsloth + smoke)
  • .github/workflows/ GitHub Actions for CI and pre-commit

Setup (dev-friendly, using uv)

python -m venv .venv
. .venv/bin/activate
pip install uv
uv pip install -e ".[dev]"
pre-commit install

Secure Job Submission

  • A template notebook is provided at notebooks/secure_submit.ipynb which demonstrates:
    • Building SageMaker guardrails via scripts/core/security.py::build_sagemaker_guardrails()
    • Redacting secrets before logging configs
    • Merging guardrails into a job request (example boto3 call commented out; a condensed sketch follows after this list)
  • Configure environment values using .env.example (copy to .env).
  • Optional: python-dotenv is already included in the requirements; load the environment in your scripts:
    from dotenv import load_dotenv
    load_dotenv()
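
A condensed sketch of that flow. build_sagemaker_guardrails() comes from the repo; the import path, the redaction helper, and the job-request fields shown here are assumptions.

import os
from dotenv import load_dotenv

# Import path assumed from the src/ package layout; adjust to match the notebook.
from fsdp_unsloth.core.security import build_sagemaker_guardrails

load_dotenv()  # pulls HF_TOKEN / AWS settings from .env

# 1) Redact secrets before logging any config (illustrative helper).
def redact(cfg):
    return {k: "***" if "token" in k.lower() or "secret" in k.lower() else v
            for k, v in cfg.items()}

hyperparams = {"config": "scripts/configs/fsdp/llama-7b.yaml",
               "hf_token": os.environ.get("HF_TOKEN", "")}
print(redact(hyperparams))

# 2) Build guardrails and merge them into the job request.
job_request = {
    "TrainingJobName": "fsdp-unsloth-demo",
    "AlgorithmSpecification": {"TrainingImage": "<ecr-image-uri>",
                               "TrainingInputMode": "File"},
    "ResourceConfig": {"InstanceType": "ml.p4d.24xlarge",
                       "InstanceCount": 1, "VolumeSizeInGB": 200},
    "StoppingCondition": {"MaxRuntimeInSeconds": 36000,
                          "MaxWaitTimeInSeconds": 72000},
    "EnableManagedSpotTraining": True,
    # RoleArn, OutputDataConfig, CheckpointConfig, etc. also belong here.
}
job_request.update(build_sagemaker_guardrails())

# 3) Submission stays commented out, as in the notebook:
# import boto3
# boto3.client("sagemaker").create_training_job(**job_request)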

Preflight Safety Checks

  • src/fsdp_unsloth/core/strategy_selector.py runs preflight checks (HF token format, S3/local path safety, W&B readiness) before trainer construction; the sketch after this list shows the idea.

  • Enable strict mode to fail fast:

    security:
      strict_preflight: true
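
In spirit, the checks look roughly like this (an illustrative sketch, not the code in strategy_selector.py; the field names mirror the config schema described below):

import os

def preflight(cfg, strict=False):
    """Illustrative preflight: collect problems, fail fast when strict is set."""
    problems = []

    token = cfg.get("model", {}).get("hf_token") or os.environ.get("HF_TOKEN", "")
    if token and not token.startswith("hf_"):
        problems.append("HF token does not look like an hf_... token")

    out_dir = cfg.get("checkpoint", {}).get("output_dir", "")
    if not (out_dir.startswith("s3://") or out_dir.startswith("outputs/") or os.path.isabs(out_dir)):
        problems.append(f"suspicious checkpoint.output_dir: {out_dir!r}")

    if cfg.get("logging", {}).get("wandb_project") and not os.environ.get("WANDB_API_KEY"):
        problems.append("wandb_project is set but WANDB_API_KEY is missing")

    if strict and problems:
        raise RuntimeError("; ".join(problems))
    return problems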

Docker Image (recommended for SageMaker)

docker build -t unsloth-fsdp-training:latest .

Configs

  • Base schema: src/fsdp_unsloth/common/configs/base_config.yaml
  • Examples:
    • FSDP: scripts/configs/fsdp/llama-7b.yaml
    • Unsloth: scripts/configs/unsloth/finance-alpaca.yaml
    • Smoke tests: scripts/configs/{fsdp,unsloth}/smoke.yaml

Backend selection is explicit:

  • Set backend: fsdp or backend: unsloth at the top of the config.
  • CLI override available via --backend (alias of --strategy).

Key fields (a load-and-override sketch follows after this list):

  • training.* (batch sizes, lr, steps)
  • checkpoint.save_interval, checkpoint.output_dir
  • logging.log_interval, logging.wandb_project
  • model.name, model.max_length, model.load_in_4bit, model.hf_token
  • fsdp.mixed_precision and other sharding params
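
A minimal load-and-override sketch; the nesting mirrors the fields above, but exact key spellings (e.g. learning_rate for the lr field) are assumptions, so check base_config.yaml for the authoritative schema:

import yaml

with open("scripts/configs/fsdp/smoke.yaml") as f:
    cfg = yaml.safe_load(f)

# Tweak a few of the fields listed above before launching a run.
cfg["training"]["learning_rate"] = 2e-5
cfg["checkpoint"]["output_dir"] = "outputs/demo"
cfg["logging"]["log_interval"] = 10
cfg["model"]["load_in_4bit"] = True

with open("scripts/configs/fsdp/demo.yaml", "w") as f:
    yaml.safe_dump(cfg, f)

# Then: fsdp-train --config scripts/configs/fsdp/demo.yaml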

Running Training

  • Using the installed CLI (recommended):
fsdp-train --config scripts/configs/fsdp/smoke.yaml --smoke
fsdp-train --config scripts/configs/unsloth/smoke.yaml --backend unsloth --smoke
  • Via provided script wrapper (equivalent):
python scripts/train.py --config scripts/configs/fsdp/llama-7b.yaml
  • Multi-GPU (torchrun):
make train-fsdp-mgpu NGPU=8
make train-unsloth-mgpu NGPU=8

Optional NCCL hints for multi-node networking are included (commented out) in the Makefile.

Checkpoints & Resume

  • FSDP saves sharded checkpoints into folders like checkpoint_<step>/ under checkpoint.output_dir.
  • Unsloth saves a single-file checkpoint checkpoint_<step>.bin.
  • Auto-resume: when --resume is not provided, the trainer picks the most recent checkpoint in checkpoint.output_dir (see the discovery sketch below).
  • SageMaker: set CheckpointConfig (S3 URI). The trainer will sync to SM_CHECKPOINT_DIR automatically.
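
The discovery step can be approximated like this (a sketch based on the naming conventions above, not the trainer's exact code):

from pathlib import Path

def latest_checkpoint(output_dir):
    """Return the newest checkpoint_<step> path, or None if there is nothing to resume."""
    candidates = []
    for p in Path(output_dir).glob("checkpoint_*"):
        # FSDP: directory checkpoint_<step>/ ; Unsloth: file checkpoint_<step>.bin
        stem = p.stem if p.is_file() else p.name
        try:
            step = int(stem.split("_")[-1])
        except ValueError:
            continue
        candidates.append((step, p))
    return max(candidates)[1] if candidates else None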

Data Loading

  • HF dataset: data.name = HF dataset ID, supports streaming.
  • S3 parquet: data.name = s3://bucket/path/file.parquet (parquet only). Uses s3fs.
  • Prompt templating: define data.prompt_template using {instruction}, {input}, {output}, {eos_token} (see the example after this list).
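
For example, an Alpaca-style template rendered against one record (the template text and field values here are illustrative):

template = (
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n{output}{eos_token}"
)

record = {
    "instruction": "Summarize the filing in one sentence.",
    "input": "The company reported quarterly results...",
    "output": "Revenue grew 12% year over year.",
}

print(template.format(**record, eos_token="</s>"))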

SageMaker Spot Training

  • Spot preemption triggers SIGTERM; the trainer catches it and performs an emergency checkpoint save.
  • Recommended GPU instances:
    • FSDP: p4d.24xlarge (A100, 8x GPU) or p5.48xlarge (H100) for larger models
    • Unsloth: g5.12xlarge (A10G) or p4d.24xlarge depending on model size
  • Use CheckpointConfig for S3 checkpointing and enable Managed Spot Training. Ensure MaxWaitTimeInSeconds is at least as large as MaxRuntimeInSeconds so the job can wait for Spot capacity (see the Estimator sketch below).
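
A minimal SageMaker Python SDK sketch with Spot and checkpointing enabled; the image URI, role ARN, bucket, and timeout values are placeholders you must replace:

from sagemaker.estimator import Estimator

estimator = Estimator(
    image_uri="<account>.dkr.ecr.<region>.amazonaws.com/unsloth-fsdp-training:latest",
    role="arn:aws:iam::<account>:role/<sagemaker-execution-role>",
    instance_type="ml.p4d.24xlarge",
    instance_count=1,
    hyperparameters={"config": "scripts/configs/fsdp/llama-7b.yaml"},
    use_spot_instances=True,              # Managed Spot Training
    max_run=36000,                        # MaxRuntimeInSeconds
    max_wait=72000,                       # MaxWaitTimeInSeconds (>= max_run)
    checkpoint_s3_uri="s3://<bucket>/checkpoints/",  # CheckpointConfig S3 URI
    checkpoint_local_path="/opt/ml/checkpoints",
)
estimator.fit(wait=False)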

Smoke Tests

  • Minimal runs to validate wiring and error handling.
make train-unsloth-smoke
make train-fsdp-smoke  # requires GPU

Checkpoint Conversion Tool

Use scripts/tools/convert_checkpoint.py to convert between FSDP sharded directories and Unsloth single-file checkpoints.

  • Convert FSDP shards to a single Unsloth file:
python -m scripts.tools.convert_checkpoint \
  --source_path outputs/checkpoint_1000 \
  --target_path outputs/unsloth_1000.bin \
  --strategy fsdp --target_strategy unsloth
  • Convert Unsloth file to FSDP shards directory:
python -m scripts.tools.convert_checkpoint \
  --source_path outputs/unsloth_1000.bin \
  --target_path outputs/fsdp_1000 \
  --strategy unsloth --target_strategy fsdp
  • Inference with an optional checkpoint (file or shard dir):
python scripts/infer.py \
  --config scripts/configs/unsloth/smoke.yaml \
  --prompt "Hello" \
  --checkpoint outputs/unsloth_1000.bin

Contributing

  • Prereqs
    • Python 3.10+, CUDA drivers for GPU runs
    • HF credentials (HF_TOKEN) if using gated models/datasets
    • AWS credentials for S3 (optional for S3 paths)
  • Workflow
    • Branch from main, implement changes
    • Run format/lint/tests:
      pre-commit run --all-files
      pytest -v
    • Submit a PR with a concise description and test plan

License

  • This project is licensed under Apache License 2.0 (see LICENSE).

TODO (Continuous Improvement)

  • [docs] Add SageMaker job submission examples (Estimator config, Spot flags, CheckpointConfig)
  • [fsdp] Add richer sharding options in fsdp config (activation checkpointing policies, CPU offload)
  • [resume] Write a latest pointer file after each save to speed up auto-resume discovery
  • [inference] Validate and document scripts/infer.py for both strategies
  • [tests] Add CPU-only unit tests and a small CI workflow for lint + schema checks
  • [monitoring] Add optional CloudWatch/W&B guidance and Makefile targets for metrics sync
  • [datasets] Add JSONL and multi-file S3 dataset examples
