@viliakov commented Oct 24, 2025

This PR adds Stackgraph backup/restore functionality to the StackState Backup CLI and introduces a new
layered architecture for better code organization and maintainability.

Architecture Changes

The directory layout has been reorganized to follow a clean layered architecture with explicit dependency
rules. The new structure consists of 4 layers:

  • Layer 0 (internal/foundation/): Core utilities with no internal dependencies (config, logger,
    output)
  • Layer 1 (internal/clients/): Service client wrappers (Kubernetes, Elasticsearch, S3/Minio)
  • Layer 2 (internal/orchestration/): High-level workflows that coordinate multiple services
    (portforward, scale)
  • Layer 3 (cmd/): User-facing CLI commands

Each layer can only depend on lower layers, preventing circular dependencies and making the codebase more
maintainable and testable.

📖 For detailed architecture documentation, see ARCHITECTURE.md
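The dependency rule can be illustrated with a single-file sketch. All names here (Config, Logger, MinioClient, LatestBackup) are hypothetical stand-ins, not the repo's actual code; the point is only the direction of the calls, each layer reaching downward and never upward:

```go
package main

import "fmt"

// Layer 0 (internal/foundation): core utilities with no internal dependencies.
type Config struct {
	Namespace string
}

type Logger struct{}

func (Logger) Infof(format string, args ...any) { fmt.Printf(format+"\n", args...) }

// Layer 1 (internal/clients): wraps one external service, depends only on foundation.
type MinioClient struct {
	cfg Config
	log Logger
}

func (c MinioClient) ListBackups() []string {
	c.log.Infof("listing backups in namespace %s", c.cfg.Namespace)
	return []string{"sts-backup-20251029-0300.graph"}
}

// Layer 2 (internal/orchestration): coordinates clients, depends on layers 0-1.
func LatestBackup(c MinioClient) string {
	backups := c.ListBackups()
	return backups[len(backups)-1]
}

// Layer 3 (cmd): the user-facing entry point calls downward only.
func main() {
	c := MinioClient{cfg: Config{Namespace: "demo"}, log: Logger{}}
	fmt.Println(LatestBackup(c))
}
```

Because a lower layer never imports a higher one, the import graph is acyclic by construction, which is what makes each layer testable in isolation.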

New Commands

stackgraph list

Lists available Stackgraph backups from Minio S3 storage.

Example:

go run main.go stackgraph list --namespace stac-23374-nonha
Setting up port-forward to suse-observability-minio:9000 in namespace stac-23374-nonha...
✓ Port-forward established successfully
Listing Stackgraph backups in bucket 'sts-stackgraph-backup'...
NAME                            LAST MODIFIED            SIZE
sts-backup-20251029-0300.graph  2025-10-29 03:01:31 UTC  293MiB
sts-backup-20251028-0300.graph  2025-10-28 03:00:56 UTC  166MiB
sts-backup-20251027-1138.graph  2025-10-27 11:39:28 UTC  107MiB
sts-backup-20251024-1130.graph  2025-10-24 11:31:19 UTC  50MiB
sts-backup-20251024-0953.graph  2025-10-24 09:53:44 UTC  35MiB

stackgraph restore

Restores a Stackgraph backup from Minio S3 storage with automatic deployment scaling and Kubernetes job
orchestration.

Restore Workflow

When you run stackgraph restore, the CLI performs the following steps:

  1. Backup Selection
    - If --latest flag is specified: Automatically fetches the most recent backup from Minio
    - If --archive flag is specified: Uses the explicitly provided archive name
  2. User Confirmation (unless --yes is used)
    - Warns that restore will PURGE all existing Stackgraph data
    - Displays backup file and namespace
    - Prompts for confirmation: Do you want to continue? (yes/no):
    - Use --yes or -y flag to skip confirmation (useful for automation)
  3. Scale Down Deployments
    - Identifies and scales down affected deployments to zero replicas
    - Waits for all pods to terminate gracefully
    - Stores original replica counts for restoration
  4. Create Kubernetes Resources
    - ConfigMap: Contains the restore script and configuration
    - Secret: Mounts Minio credentials for S3 access
    - PersistentVolumeClaim (PVC): Temporary storage for backup data
    - Job: Executes the restore operation in a pod with FORCE_DELETE="-force" environment variable
  5. Job Execution (conditional)
    - If --background flag is NOT set:
      • Waits for the Kubernetes Job to complete
      • Streams job logs to stdout in real-time
      • Reports success/failure status
    - If --background flag IS set:
      • Creates the job and returns immediately
      • User can monitor job status separately with kubectl
  6. Scale Up Deployments
    - Restores deployments to their original replica counts
    - Triggered automatically after job completion; with --background, scale-up is deferred until stackgraph check-and-finalize is run
  7. Cleanup
    - Kubernetes Job is automatically cleaned up via TTL (10 minutes after completion)
    - PVC remains for troubleshooting and is cleaned up in the next restore
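The --latest selection in step 1 can be sketched as follows, assuming backup names follow the sts-backup-YYYYMMDD-HHMM.graph convention seen in the listing above (the actual CLI may instead sort on the objects' last-modified timestamps):

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// latestBackup picks the backup --latest would choose by parsing the
// timestamp embedded in each name (sketch only, not the CLI's real code).
func latestBackup(names []string) (string, error) {
	var best string
	var bestTime time.Time
	for _, name := range names {
		stamp := strings.TrimSuffix(strings.TrimPrefix(name, "sts-backup-"), ".graph")
		t, err := time.Parse("20060102-1504", stamp)
		if err != nil {
			return "", fmt.Errorf("unrecognized backup name %q: %w", name, err)
		}
		if best == "" || t.After(bestTime) {
			best, bestTime = name, t
		}
	}
	if best == "" {
		return "", fmt.Errorf("no backups found")
	}
	return best, nil
}

func main() {
	names := []string{
		"sts-backup-20251028-0300.graph",
		"sts-backup-20251029-0300.graph",
		"sts-backup-20251027-1138.graph",
	}
	latest, err := latestBackup(names)
	if err != nil {
		panic(err)
	}
	fmt.Println(latest) // sts-backup-20251029-0300.graph
}
```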

Usage

Usage:
  sts-backup stackgraph restore [flags]

Flags:
      --archive string   Specific archive name to restore (e.g., sts-backup-20210216-0300.graph)
      --background       Run restore job in background without waiting for completion
  -h, --help             help for restore
      --latest           Restore from the most recent backup
  -y, --yes              Skip confirmation prompt

Example 1: Restore with Background Execution (Interactive)

go run main.go stackgraph restore --namespace stac-23374-nonha --archive sts-backup-20251029-0300.graph --background

Warning: WARNING: Restoring from backup will PURGE all existing Stackgraph data!
Warning: This operation cannot be undone.

Backup to restore: sts-backup-20251029-0300.graph
Namespace: stac-23374-nonha

Do you want to continue? (yes/no): yes

Scaling down deployments (selector: stackstate.com/connects-to-stackgraph=true)...
✓ Scaled down 1 deployment(s):
  - suse-observability-server (replicas: 0 -> 0)
Waiting for pods to terminate...
✓ All pods have terminated

Ensuring backup scripts ConfigMap exists...
✓ Backup scripts ConfigMap ready
Ensuring Minio keys secret exists...
✓ Minio keys secret ready

Creating restore job for backup: sts-backup-20251029-0300.graph
✓ Restore job created: stackgraph-restore-20251029t142254

Job is running in background: stackgraph-restore-20251029t142254

Monitoring commands:
  kubectl logs --follow job/stackgraph-restore-20251029t142254 -n stac-23374-nonha
  kubectl get job stackgraph-restore-20251029t142254 -n stac-23374-nonha

To wait for completion, scaling up the necessary deployments and cleanup, run:
  sts-backup stackgraph check-and-finalize --job stackgraph-restore-20251029t142254 --wait -n stac-23374-nonha

Example 2: Restore the Latest Backup with Automatic Approval

❯ go run main.go stackgraph restore --namespace stac-23374-ha --latest --yes
Finding latest backup...
Setting up port-forward to suse-observability-minio:9000 in namespace stac-23374-ha...
✓ Port-forward established successfully
Using latest backup: sts-backup-20251028-1535.graph

Scaling down deployments (selector: stackstate.com/connects-to-stackgraph=true)...
✓ Scaled down 9 deployment(s):
  - suse-observability-api (replicas: 1 -> 0)
  - suse-observability-authorization-sync (replicas: 1 -> 0)
  - suse-observability-checks (replicas: 1 -> 0)
  - suse-observability-health-sync (replicas: 1 -> 0)
  - suse-observability-initializer (replicas: 1 -> 0)
  - suse-observability-notification (replicas: 1 -> 0)
  - suse-observability-slicing (replicas: 1 -> 0)
  - suse-observability-state (replicas: 1 -> 0)
  - suse-observability-sync (replicas: 1 -> 0)
Waiting for pods to terminate...
Waiting for 3 pod(s) to terminate...
✓ All pods have terminated

Ensuring backup scripts ConfigMap exists...
✓ Backup scripts ConfigMap ready
Ensuring Minio keys secret exists...
✓ Minio keys secret ready

Creating restore job for backup: sts-backup-20251028-1535.graph
✓ Restore job created: stackgraph-restore-20251029t145215

Waiting for restore job to complete (this may take several minutes)...

You can safely interrupt this command with Ctrl+C.
To check status, scale up the required deployments and cleanup later, run:
  sts-backup stackgraph check-and-finalize --job stackgraph-restore-20251029t145215 --wait -n stac-23374-ha

✓ Restore completed successfully

Cleaning up job and PVC...
✓ Job deleted: stackgraph-restore-20251029t145215
✓ PVC deleted: stackgraph-restore-20251029t145215

Scaling up deployments from annotations (selector: stackstate.com/connects-to-stackgraph=true)...
✓ Scaled up 9 deployment(s) successfully:
  - suse-observability-api (replicas: 0 -> 1)
  - suse-observability-authorization-sync (replicas: 0 -> 1)
  - suse-observability-checks (replicas: 0 -> 1)
  - suse-observability-health-sync (replicas: 0 -> 1)
  - suse-observability-initializer (replicas: 0 -> 1)
  - suse-observability-notification (replicas: 0 -> 1)
  - suse-observability-slicing (replicas: 0 -> 1)
  - suse-observability-state (replicas: 0 -> 1)
  - suse-observability-sync (replicas: 0 -> 1)
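The scale-down/scale-up cycle shown above stores the original replica counts and restores them later. A sketch under the assumption that the counts live in a deployment annotation (the annotation key and the pared-down deployment type here are invented; the real code would use the Kubernetes API types):

```go
package main

import (
	"fmt"
	"strconv"
)

// deployment is a simplified stand-in for a Kubernetes Deployment.
type deployment struct {
	Name        string
	Replicas    int32
	Annotations map[string]string
}

// replicasAnnotation is an assumed key; the CLI's actual annotation may differ.
const replicasAnnotation = "sts-backup/original-replicas"

// scaleDown records the current replica count in an annotation, then zeroes it.
func scaleDown(d *deployment) {
	if d.Annotations == nil {
		d.Annotations = map[string]string{}
	}
	d.Annotations[replicasAnnotation] = strconv.Itoa(int(d.Replicas))
	d.Replicas = 0
}

// scaleUp restores the replica count recorded by scaleDown and clears the annotation.
func scaleUp(d *deployment) error {
	saved, ok := d.Annotations[replicasAnnotation]
	if !ok {
		return fmt.Errorf("%s: missing %s annotation", d.Name, replicasAnnotation)
	}
	n, err := strconv.Atoi(saved)
	if err != nil {
		return err
	}
	d.Replicas = int32(n)
	delete(d.Annotations, replicasAnnotation)
	return nil
}

func main() {
	d := deployment{Name: "suse-observability-api", Replicas: 1}
	scaleDown(&d)
	fmt.Printf("%s (replicas: -> %d)\n", d.Name, d.Replicas)
	if err := scaleUp(&d); err != nil {
		panic(err)
	}
	fmt.Printf("%s (replicas: -> %d)\n", d.Name, d.Replicas)
}
```

Persisting the count on the deployment itself (rather than in CLI memory) is what lets a separate check-and-finalize invocation scale things back up after a --background run.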

stackgraph check-and-finalize

Check the status of a background Stackgraph restore job and clean up resources.

Usage
sts-backup stackgraph check-and-finalize --job <job-name> [--wait] -n <namespace>

Flags:

  • --job, -j - Stackgraph restore job name (required)
  • --wait, -w - Wait for job to complete before cleanup

Note: This command automatically scales up deployments that were scaled down during restore.
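The status check can be sketched as a decision over the Job's status counters. The jobStatus type is a simplified stand-in for the Kubernetes JobStatus, and the action strings paraphrase the behavior described in this PR rather than quoting the actual implementation:

```go
package main

import "fmt"

// jobStatus mirrors the counters that matter here (simplified stand-in).
type jobStatus struct {
	Active    int32
	Succeeded int32
	Failed    int32
}

// nextAction sketches the decision check-and-finalize makes from job status.
func nextAction(s jobStatus) string {
	switch {
	case s.Succeeded > 0:
		return "scale up deployments, then delete job and PVC"
	case s.Failed > 0:
		return "report failure; keep PVC for troubleshooting"
	case s.Active > 0:
		return "still running; print monitoring commands (or wait with --wait)"
	default:
		return "pending; wait for pods to start"
	}
}

func main() {
	fmt.Println(nextAction(jobStatus{Active: 1}))
	fmt.Println(nextAction(jobStatus{Succeeded: 1}))
}
```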

Example: Checking if the job is still running

❯ go run main.go stackgraph check-and-finalize --job stackgraph-restore-20251029t142254 -n stac-23374-nonha
Checking status of job: stackgraph-restore-20251029t142254

Job is running in background: stackgraph-restore-20251029t142254
  Active pods: 1

Monitoring commands:
  kubectl logs --follow job/stackgraph-restore-20251029t142254 -n stac-23374-nonha
  kubectl get job stackgraph-restore-20251029t142254 -n stac-23374-nonha

To wait for completion, scaling up the necessary deployments and cleanup, run:
  sts-backup stackgraph check-and-finalize --job stackgraph-restore-20251029t142254 --wait -n stac-23374-nonha

Example: Waiting for the job to finish

❯ go run main.go stackgraph check-and-finalize --job stackgraph-restore-20251029t162400 --wait -n stac-23374-nonha
Checking status of job: stackgraph-restore-20251029t162400

Waiting for restore job to complete (this may take several minutes)...

You can safely interrupt this command with Ctrl+C.
To check status, scale up the required deployments and cleanup later, run:
  sts-backup stackgraph check-and-finalize --job stackgraph-restore-20251029t162400 --wait -n stac-23374-nonha

✓ Job completed successfully: stackgraph-restore-20251029t162400

Scaling up deployments from annotations (selector: stackstate.com/connects-to-stackgraph=true)...
✓ Scaled up 1 deployment(s) successfully:
  - suse-observability-server (replicas: 0 -> 1)

Cleaning up job and PVC...
✓ Job deleted: stackgraph-restore-20251029t162400
✓ PVC deleted: stackgraph-restore-20251029t162400

@viliakov viliakov merged commit 06be20e into main Oct 30, 2025
5 checks passed
@viliakov viliakov deleted the STAC-23598 branch October 30, 2025 15:44