@viliakov commented Oct 24, 2025

This PR adds Stackgraph backup/restore functionality to the StackState Backup CLI and introduces a new
layered architecture for better code organization and maintainability.

Architecture Changes

The directory layout has been reorganized to follow a clean layered architecture with explicit dependency
rules. The new structure consists of 4 layers:

  • Layer 0 (internal/foundation/): Core utilities with no internal dependencies (config, logger,
    output)
  • Layer 1 (internal/clients/): Service client wrappers (Kubernetes, Elasticsearch, S3/Minio)
  • Layer 2 (internal/orchestration/): High-level workflows that coordinate multiple services
    (portforward, scale)
  • Layer 3 (cmd/): User-facing CLI commands

Each layer can only depend on lower layers, preventing circular dependencies and making the codebase more
maintainable and testable.

📖 For detailed architecture documentation, see ARCHITECTURE.md
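The dependency rule can be illustrated with a single-file sketch. All names here (Config, Logger, MinioClient, LatestBackup) are hypothetical stand-ins, not the repo's actual code; the point is only the direction of the calls, each layer reaching downward and never upward:

```go
package main

import "fmt"

// Layer 0 (internal/foundation): core utilities with no internal dependencies.
type Config struct {
	Namespace string
}

type Logger struct{}

func (Logger) Infof(format string, args ...any) { fmt.Printf(format+"\n", args...) }

// Layer 1 (internal/clients): wraps one external service, depends only on foundation.
type MinioClient struct {
	cfg Config
	log Logger
}

func (c MinioClient) ListBackups() []string {
	c.log.Infof("listing backups in namespace %s", c.cfg.Namespace)
	return []string{"sts-backup-20251029-0300.graph"}
}

// Layer 2 (internal/orchestration): coordinates clients, depends on layers 0-1.
func LatestBackup(c MinioClient) string {
	backups := c.ListBackups()
	return backups[len(backups)-1]
}

// Layer 3 (cmd): the user-facing entry point calls downward only.
func main() {
	c := MinioClient{cfg: Config{Namespace: "demo"}, log: Logger{}}
	fmt.Println(LatestBackup(c))
}
```

Because a lower layer never imports a higher one, the import graph is acyclic by construction, which is what makes each layer testable in isolation.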

New Commands

stackgraph list

Lists available Stackgraph backups from Minio S3 storage.

Example:

go run main.go stackgraph list --namespace stac-23374-nonha
Setting up port-forward to suse-observability-minio:9000 in namespace stac-23374-nonha...
✓ Port-forward established successfully
Listing Stackgraph backups in bucket 'sts-stackgraph-backup'...
NAME                            LAST MODIFIED            SIZE
sts-backup-20251029-0300.graph  2025-10-29 03:01:31 UTC  293MiB
sts-backup-20251028-0300.graph  2025-10-28 03:00:56 UTC  166MiB
sts-backup-20251027-1138.graph  2025-10-27 11:39:28 UTC  107MiB
sts-backup-20251024-1130.graph  2025-10-24 11:31:19 UTC  50MiB
sts-backup-20251024-0953.graph  2025-10-24 09:53:44 UTC  35MiB

stackgraph restore

Restores a Stackgraph backup from Minio S3 storage with automatic deployment scaling and Kubernetes job
orchestration.

Restore Workflow

When you run stackgraph restore, the CLI performs the following steps:

  1. Backup Selection
    - If --latest flag is specified: Automatically fetches the most recent backup from Minio
    - If --archive flag is specified: Uses the explicitly provided archive name
  2. User Confirmation (unless --yes is used)
    - Warns that restore will PURGE all existing Stackgraph data
    - Displays backup file and namespace
    - Prompts for confirmation: Do you want to continue? (yes/no):
    - Use --yes or -y flag to skip confirmation (useful for automation)
  3. Scale Down Deployments
    - Identifies and scales down affected deployments to zero replicas
    - Waits for all pods to terminate gracefully
    - Stores original replica counts for restoration
  4. Create Kubernetes Resources
    - ConfigMap: Contains the restore script and configuration
    - Secret: Mounts Minio credentials for S3 access
    - PersistentVolumeClaim (PVC): Temporary storage for backup data
    - Job: Executes the restore operation in a pod with FORCE_DELETE="-force" environment variable
  5. Job Execution (conditional)
    - If --background flag is NOT set:
      • Waits for the Kubernetes Job to complete
      • Streams job logs to stdout in real-time
      • Reports success/failure status
    - If --background flag IS set:
      • Creates the job and returns immediately
      • User can monitor job status separately with kubectl
  6. Scale Up Deployments
    - Restores deployments to their original replica counts
    - Triggered automatically after job completion; with --background, scale-up is deferred until stackgraph check-and-finalize is run
  7. Cleanup
    - Kubernetes Job is automatically cleaned up via TTL (10 minutes after completion)
    - PVC remains for troubleshooting and is cleaned up in the next restore
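The --latest selection in step 1 can be sketched as follows, assuming backup names follow the sts-backup-YYYYMMDD-HHMM.graph convention seen in the listing above (the actual CLI may instead sort on the objects' last-modified timestamps):

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// latestBackup picks the backup --latest would choose by parsing the
// timestamp embedded in each name (sketch only, not the CLI's real code).
func latestBackup(names []string) (string, error) {
	var best string
	var bestTime time.Time
	for _, name := range names {
		stamp := strings.TrimSuffix(strings.TrimPrefix(name, "sts-backup-"), ".graph")
		t, err := time.Parse("20060102-1504", stamp)
		if err != nil {
			return "", fmt.Errorf("unrecognized backup name %q: %w", name, err)
		}
		if best == "" || t.After(bestTime) {
			best, bestTime = name, t
		}
	}
	if best == "" {
		return "", fmt.Errorf("no backups found")
	}
	return best, nil
}

func main() {
	names := []string{
		"sts-backup-20251028-0300.graph",
		"sts-backup-20251029-0300.graph",
		"sts-backup-20251027-1138.graph",
	}
	latest, err := latestBackup(names)
	if err != nil {
		panic(err)
	}
	fmt.Println(latest) // sts-backup-20251029-0300.graph
}
```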

Usage

Usage:
  sts-backup stackgraph restore [flags]

Flags:
      --archive string   Specific archive name to restore (e.g., sts-backup-20210216-0300.graph)
      --background       Run restore job in background without waiting for completion
  -h, --help             help for restore
      --latest           Restore from the most recent backup
  -y, --yes              Skip confirmation prompt

Example 1: Restore with Background Execution (Interactive)

go run main.go stackgraph restore --namespace stac-23374-nonha --archive sts-backup-20251029-0300.graph --background

Warning: WARNING: Restoring from backup will PURGE all existing Stackgraph data!
Warning: This operation cannot be undone.

Backup to restore: sts-backup-20251029-0300.graph
Namespace: stac-23374-nonha

Do you want to continue? (yes/no): yes

Scaling down deployments (selector: stackstate.com/connects-to-stackgraph=true)...
✓ Scaled down 1 deployment(s):
  - suse-observability-server (replicas: 0 -> 0)
Waiting for pods to terminate...
✓ All pods have terminated

Ensuring backup scripts ConfigMap exists...
✓ Backup scripts ConfigMap ready
Ensuring Minio keys secret exists...
✓ Minio keys secret ready

Creating restore job for backup: sts-backup-20251029-0300.graph
✓ Restore job created: stackgraph-restore-20251029t142254

Job is running in background: stackgraph-restore-20251029t142254

Monitoring commands:
  kubectl logs --follow job/stackgraph-restore-20251029t142254 -n stac-23374-nonha
  kubectl get job stackgraph-restore-20251029t142254 -n stac-23374-nonha

To wait for completion, scaling up the necessary deployments and cleanup, run:
  sts-backup stackgraph check-and-finalize --job stackgraph-restore-20251029t142254 --wait -n stac-23374-nonha

Example 2: Restore the Latest Backup with Automatic Approval

❯ go run main.go stackgraph restore --namespace stac-23374-ha --latest --yes
Finding latest backup...
Setting up port-forward to suse-observability-minio:9000 in namespace stac-23374-ha...
✓ Port-forward established successfully
Using latest backup: sts-backup-20251028-1535.graph

Scaling down deployments (selector: stackstate.com/connects-to-stackgraph=true)...
✓ Scaled down 9 deployment(s):
  - suse-observability-api (replicas: 1 -> 0)
  - suse-observability-authorization-sync (replicas: 1 -> 0)
  - suse-observability-checks (replicas: 1 -> 0)
  - suse-observability-health-sync (replicas: 1 -> 0)
  - suse-observability-initializer (replicas: 1 -> 0)
  - suse-observability-notification (replicas: 1 -> 0)
  - suse-observability-slicing (replicas: 1 -> 0)
  - suse-observability-state (replicas: 1 -> 0)
  - suse-observability-sync (replicas: 1 -> 0)
Waiting for pods to terminate...
Waiting for 3 pod(s) to terminate...
✓ All pods have terminated

Ensuring backup scripts ConfigMap exists...
✓ Backup scripts ConfigMap ready
Ensuring Minio keys secret exists...
✓ Minio keys secret ready

Creating restore job for backup: sts-backup-20251028-1535.graph
✓ Restore job created: stackgraph-restore-20251029t145215

Waiting for restore job to complete (this may take several minutes)...

You can safely interrupt this command with Ctrl+C.
To check status, scale up the required deployments and cleanup later, run:
  sts-backup stackgraph check-and-finalize --job stackgraph-restore-20251029t145215 --wait -n stac-23374-ha

✓ Restore completed successfully

Cleaning up job and PVC...
✓ Job deleted: stackgraph-restore-20251029t145215
✓ PVC deleted: stackgraph-restore-20251029t145215

Scaling up deployments from annotations (selector: stackstate.com/connects-to-stackgraph=true)...
✓ Scaled up 9 deployment(s) successfully:
  - suse-observability-api (replicas: 0 -> 1)
  - suse-observability-authorization-sync (replicas: 0 -> 1)
  - suse-observability-checks (replicas: 0 -> 1)
  - suse-observability-health-sync (replicas: 0 -> 1)
  - suse-observability-initializer (replicas: 0 -> 1)
  - suse-observability-notification (replicas: 0 -> 1)
  - suse-observability-slicing (replicas: 0 -> 1)
  - suse-observability-state (replicas: 0 -> 1)
  - suse-observability-sync (replicas: 0 -> 1)
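The scale-down/scale-up cycle shown above stores the original replica counts and restores them later. A sketch under the assumption that the counts live in a deployment annotation (the annotation key and the pared-down deployment type here are invented; the real code would use the Kubernetes API types):

```go
package main

import (
	"fmt"
	"strconv"
)

// deployment is a simplified stand-in for a Kubernetes Deployment.
type deployment struct {
	Name        string
	Replicas    int32
	Annotations map[string]string
}

// replicasAnnotation is an assumed key; the CLI's actual annotation may differ.
const replicasAnnotation = "sts-backup/original-replicas"

// scaleDown records the current replica count in an annotation, then zeroes it.
func scaleDown(d *deployment) {
	if d.Annotations == nil {
		d.Annotations = map[string]string{}
	}
	d.Annotations[replicasAnnotation] = strconv.Itoa(int(d.Replicas))
	d.Replicas = 0
}

// scaleUp restores the replica count recorded by scaleDown and clears the annotation.
func scaleUp(d *deployment) error {
	saved, ok := d.Annotations[replicasAnnotation]
	if !ok {
		return fmt.Errorf("%s: missing %s annotation", d.Name, replicasAnnotation)
	}
	n, err := strconv.Atoi(saved)
	if err != nil {
		return err
	}
	d.Replicas = int32(n)
	delete(d.Annotations, replicasAnnotation)
	return nil
}

func main() {
	d := deployment{Name: "suse-observability-api", Replicas: 1}
	scaleDown(&d)
	fmt.Printf("%s (replicas: -> %d)\n", d.Name, d.Replicas)
	if err := scaleUp(&d); err != nil {
		panic(err)
	}
	fmt.Printf("%s (replicas: -> %d)\n", d.Name, d.Replicas)
}
```

Persisting the count on the deployment itself (rather than in CLI memory) is what lets a separate check-and-finalize invocation scale things back up after a --background run.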

stackgraph check-and-finalize

Check the status of a background Stackgraph restore job and clean up resources.

Usage
sts-backup stackgraph check-and-finalize --job <job-name> [--wait] -n <namespace>

Flags:

  • --job, -j - Stackgraph restore job name (required)
  • --wait, -w - Wait for job to complete before cleanup

Note: This command automatically scales up deployments that were scaled down during restore.
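The status check can be sketched as a decision over the Job's status counters. The jobStatus type is a simplified stand-in for the Kubernetes JobStatus, and the action strings paraphrase the behavior described in this PR rather than quoting the actual implementation:

```go
package main

import "fmt"

// jobStatus mirrors the counters that matter here (simplified stand-in).
type jobStatus struct {
	Active    int32
	Succeeded int32
	Failed    int32
}

// nextAction sketches the decision check-and-finalize makes from job status.
func nextAction(s jobStatus) string {
	switch {
	case s.Succeeded > 0:
		return "scale up deployments, then delete job and PVC"
	case s.Failed > 0:
		return "report failure; keep PVC for troubleshooting"
	case s.Active > 0:
		return "still running; print monitoring commands (or wait with --wait)"
	default:
		return "pending; wait for pods to start"
	}
}

func main() {
	fmt.Println(nextAction(jobStatus{Active: 1}))
	fmt.Println(nextAction(jobStatus{Succeeded: 1}))
}
```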

Example: Checking if the job is still running

❯ go run main.go stackgraph check-and-finalize --job stackgraph-restore-20251029t142254 -n stac-23374-nonha
Checking status of job: stackgraph-restore-20251029t142254

Job is running in background: stackgraph-restore-20251029t142254
  Active pods: 1

Monitoring commands:
  kubectl logs --follow job/stackgraph-restore-20251029t142254 -n stac-23374-nonha
  kubectl get job stackgraph-restore-20251029t142254 -n stac-23374-nonha

To wait for completion, scaling up the necessary deployments and cleanup, run:
  sts-backup stackgraph check-and-finalize --job stackgraph-restore-20251029t142254 --wait -n stac-23374-nonha

Example: Waiting for the job to finish

❯ go run main.go stackgraph check-and-finalize --job stackgraph-restore-20251029t162400 --wait -n stac-23374-nonha
Checking status of job: stackgraph-restore-20251029t162400

Waiting for restore job to complete (this may take several minutes)...

You can safely interrupt this command with Ctrl+C.
To check status, scale up the required deployments and cleanup later, run:
  sts-backup stackgraph check-and-finalize --job stackgraph-restore-20251029t162400 --wait -n stac-23374-nonha

✓ Job completed successfully: stackgraph-restore-20251029t162400

Scaling up deployments from annotations (selector: stackstate.com/connects-to-stackgraph=true)...
✓ Scaled up 1 deployment(s) successfully:
  - suse-observability-server (replicas: 0 -> 1)

Cleaning up job and PVC...
✓ Job deleted: stackgraph-restore-20251029t162400
✓ PVC deleted: stackgraph-restore-20251029t162400

@viliakov viliakov merged commit 06be20e into main Oct 30, 2025
5 checks passed
@viliakov viliakov deleted the STAC-23598 branch October 30, 2025 15:44