@viliakov viliakov commented Nov 4, 2025

This PR adds VictoriaMetrics backup/restore functionality and introduces a new
internal/orchestration/restore/ package that eliminates code duplication between the Stackgraph
and VictoriaMetrics restore operations. Possible future Clickhouse and Configuration restore
operations can reuse it as well.

Architecture Improvements

Code Deduplication

This PR extracts common restore patterns into a reusable orchestration
layer:

  • internal/orchestration/restore/: New package containing shared restore operations
    • confirmation.go: User confirmation prompts
    • job.go: Kubernetes Job lifecycle management (wait, monitor, logs)
    • finalize.go: Background job status checking and cleanup orchestration
    • resources.go: ConfigMap and Secret resource management

StatefulSet Scaling Support

Extended the internal/orchestration/scale/ package to support both Deployments and StatefulSets
through a unified interface, enabling VictoriaMetrics StatefulSet scaling during restore
operations.
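A unified scaling interface of this kind could look roughly like the sketch below. It uses plain in-memory types so it runs without a cluster; the Scalable interface, the workload type, and the annotation key example.com/pre-restore-replicas are all hypothetical and stand in for whatever the scale package really uses on top of the Kubernetes API.

```go
package main

import (
	"fmt"
	"strconv"
)

// replicasAnnotation is a made-up key used only in this sketch.
const replicasAnnotation = "example.com/pre-restore-replicas"

// Scalable is a hypothetical abstraction over any workload with a replica
// count, so Deployments and StatefulSets can share one code path.
type Scalable interface {
	Kind() string
	Name() string
	Replicas() int32
	SetReplicas(n int32)
	Annotations() map[string]string
}

// workload is an in-memory stand-in for a Deployment or StatefulSet.
type workload struct {
	kind, name  string
	replicas    int32
	annotations map[string]string
}

func (w *workload) Kind() string        { return w.kind }
func (w *workload) Name() string        { return w.name }
func (w *workload) Replicas() int32     { return w.replicas }
func (w *workload) SetReplicas(n int32) { w.replicas = n }
func (w *workload) Annotations() map[string]string {
	if w.annotations == nil {
		w.annotations = map[string]string{}
	}
	return w.annotations
}

// scaleDown records the current replica count in an annotation and sets the
// workload to zero. A workload already at zero keeps any earlier record.
func scaleDown(items []Scalable) {
	for _, it := range items {
		if it.Replicas() > 0 {
			it.Annotations()[replicasAnnotation] = strconv.Itoa(int(it.Replicas()))
		}
		it.SetReplicas(0)
	}
}

// scaleUp restores the recorded replica count and removes the annotation.
func scaleUp(items []Scalable) error {
	for _, it := range items {
		raw, ok := it.Annotations()[replicasAnnotation]
		if !ok {
			continue
		}
		n, err := strconv.ParseInt(raw, 10, 32)
		if err != nil {
			return fmt.Errorf("%s %s: invalid replica annotation %q: %w", it.Kind(), it.Name(), raw, err)
		}
		it.SetReplicas(int32(n))
		delete(it.Annotations(), replicasAnnotation)
	}
	return nil
}

func main() {
	items := []Scalable{
		&workload{kind: "StatefulSet", name: "victoria-metrics-0", replicas: 1},
		&workload{kind: "Deployment", name: "example-consumer", replicas: 2},
	}
	scaleDown(items)
	if err := scaleUp(items); err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, it := range items {
		fmt.Printf("%s %s: replicas=%d\n", it.Kind(), it.Name(), it.Replicas())
	}
}
```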

📖 Updated architecture documentation in ARCHITECTURE.md and
README.md

New Commands

victoriametrics list

Lists available VictoriaMetrics backups from Minio S3 storage.

Examples:

List backups for a single-node VM setup

❯ go run main.go victoria-metrics list --namespace stac-23374-nonha
Setting up port-forward to suse-observability-minio:9000 in namespace stac-23374-nonha...
✓ Port-forward established successfully
Listing VictoriaMetrics backups in bucket ...

NAME ({bucket}/{instance}-{created})                           UPDATED
sts-victoria-metrics-backup/victoria-metrics-0-20251030152500  2025-11-04 10:25:07 UTC

List backups for an HA VM setup (mirrored by vmagent)

❯ go run main.go victoria-metrics list --namespace stac-23374-ha
Setting up port-forward to suse-observability-minio:9000 in namespace stac-23374-ha...
✓ Port-forward established successfully
Listing VictoriaMetrics backups in bucket ...

NOTE: In HA mode, backups from both instances (victoria-metrics-0 and victoria-metrics-1) are listed.
      The restore command accepts either backup to restore both instances.

NAME ({bucket}/{instance}-{created})                           UPDATED
sts-victoria-metrics-backup/victoria-metrics-1-20251030152500  2025-11-04 13:35:02 UTC
sts-victoria-metrics-backup/victoria-metrics-0-20251030152500  2025-11-04 13:25:01 UTC
sts-victoria-metrics-backup/victoria-metrics-1-20251030143500  2025-11-04 09:35:02 UTC
sts-victoria-metrics-backup/victoria-metrics-0-20251030143500  2025-11-04 09:25:02 UTC
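The NAME column follows the {bucket}/{instance}-{created} pattern shown in the table header. A minimal sketch of splitting such a name into its parts is given below. The backupRef type and parseBackupName function are hypothetical, and reading the {created} suffix as a yyyyMMddHHmmss timestamp is an assumption based on the sample values, not something the PR states.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// createdLayout assumes the {created} suffix is yyyyMMddHHmmss.
const createdLayout = "20060102150405"

// backupRef is a hypothetical parsed form of a backup name.
type backupRef struct {
	Bucket   string
	Instance string
	Created  time.Time
}

// parseBackupName splits "{bucket}/{instance}-{created}" into its parts.
// The instance itself may contain dashes, so the split uses the last dash.
func parseBackupName(name string) (backupRef, error) {
	bucket, rest, ok := strings.Cut(name, "/")
	if !ok || bucket == "" {
		return backupRef{}, fmt.Errorf("backup name %q has no bucket prefix", name)
	}
	i := strings.LastIndex(rest, "-")
	if i <= 0 || i == len(rest)-1 {
		return backupRef{}, fmt.Errorf("backup name %q has no -{created} suffix", name)
	}
	created, err := time.Parse(createdLayout, rest[i+1:])
	if err != nil {
		return backupRef{}, fmt.Errorf("backup name %q: %w", name, err)
	}
	return backupRef{Bucket: bucket, Instance: rest[:i], Created: created}, nil
}

func main() {
	ref, err := parseBackupName("sts-victoria-metrics-backup/victoria-metrics-0-20251030152500")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(ref.Bucket, ref.Instance, ref.Created.Format(time.RFC3339))
}
```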

victoriametrics restore

Restores VictoriaMetrics from a backup archive with automatic StatefulSet scaling and Kubernetes
job orchestration.

Restore Workflow

  1. Backup Selection
    - --latest flag: Automatically fetches the most recent backup
    - --archive flag: Uses the explicitly provided backup name
  2. User Confirmation (unless --yes is used)
    - Warns that restore will PURGE all existing VictoriaMetrics data
    - Displays backup file and namespace
    - Prompts: Do you want to continue? (yes/no):
  3. Scale Down StatefulSets
    - Scales down affected StatefulSets to zero replicas
    - Waits for all pods to terminate gracefully
    - Stores original replica counts in annotations
  4. Create Kubernetes Resources
    - ConfigMap: Contains restore script
    - Secret: Mounts Minio credentials
    - Job: Executes restore in containers (one per HA instance)
  5. Job Execution
    - Without --background: Waits for completion and streams logs
    - With --background: Returns immediately; the job is monitored separately
  6. Scale Up StatefulSets
    - Restores StatefulSets to original replica counts
    - Triggered after job completion (with --background, deferred to check-and-finalize)
  7. Cleanup
    - Job is automatically cleaned up via TTL (24 hours after completion)

Usage:
sts-backup victoriametrics restore [flags]

Flags:
      --archive string   Specific backup name to restore (e.g., sts-victoria-metrics-backup/victoria-metrics-0-20251030143500)
      --background       Run restore job in background without waiting for completion
      --latest           Restore from the most recent backup
  -y, --yes              Skip confirmation prompt

Example 1: Restore Latest Backup
sts-backup victoriametrics restore --namespace --latest --yes

Example 2: Restore Specific Backup in Background
sts-backup victoriametrics restore --namespace
--archive sts-victoria-metrics-backup/victoria-metrics-0-20251030143500
--background

Examples:
Restoring the latest available VM backup with auto-confirmation

❯ go run main.go victoria-metrics restore --namespace stac-23374-nonha --latest --yes
Finding latest backup...
Setting up port-forward to suse-observability-minio:9000 in namespace stac-23374-nonha...
✓ Port-forward established successfully
Listing VictoriaMetrics backups in bucket ...
Using latest backup: sts-victoria-metrics-backup/victoria-metrics-0-20251030152500

Scaling down deployments (selector: observability.suse.com/scalable-during-vm-restore=true)...
✓ Scaled down 0 deployment(s):
✓ Scaled down 1 statefulsets(s):
  - suse-observability-victoria-metrics-0 (replicas: 0 -> 0)
Waiting for pods to terminate...
✓ All pods have terminated

Ensuring backup scripts ConfigMap exists...
✓ Backup scripts ConfigMap ready
Ensuring Minio keys secret exists...
✓ Minio keys secret ready

Creating restore job for backup: sts-victoria-metrics-backup/victoria-metrics-0-20251030152500
✓ Restore job created: victoriametrics-restore-20251104t143638

Waiting for restore job to complete (this may take significant amount of time depending on the archive size)...

You can safely interrupt this command with Ctrl+C.
To check status, scale up the required deployments and cleanup later, run:
  sts-backup victoria-metrics check-and-finalize --job victoriametrics-restore-20251104t143638 --wait -n stac-23374-nonha

✓ Restore completed successfully

Cleaning up resources...
✓ Job deleted: victoriametrics-restore-20251104t143638

Scaling up deployments from annotations (selector: observability.suse.com/scalable-during-vm-restore=true)...
✓ Scaled up 0 deployment(s) successfully:
✓ Scaled up 1 statefulset(s) successfully:
  - suse-observability-victoria-metrics-0 (replicas: 0 -> 1)

Running restore operation in background

❯ go run main.go victoria-metrics restore --namespace stac-23374-ha --archive sts-victoria-metrics-backup/victoria-metrics-1-20251030152500 --background

Warning: WARNING: Restoring from backup will PURGE all existing VictoriaMetrics data!
Warning: This operation cannot be undone.

Backup to restore: sts-victoria-metrics-backup/victoria-metrics-1-20251030152500
Namespace: stac-23374-ha

Do you want to continue? (yes/no): yes

Scaling down deployments (selector: observability.suse.com/scalable-during-vm-restore=true)...
✓ Scaled down 0 deployment(s):
✓ Scaled down 2 statefulsets(s):
  - suse-observability-victoria-metrics-0 (replicas: 1 -> 0)
  - suse-observability-victoria-metrics-1 (replicas: 1 -> 0)
Waiting for pods to terminate...
✓ All pods have terminated

Ensuring backup scripts ConfigMap exists...
✓ Backup scripts ConfigMap ready
Ensuring Minio keys secret exists...
✓ Minio keys secret ready

Creating restore job for backup: sts-victoria-metrics-backup/victoria-metrics-1-20251030152500
✓ Restore job created: victoriametrics-restore-20251104t143755

Job is running in background: victoriametrics-restore-20251104t143755

Monitoring commands:
  kubectl logs --follow job/victoriametrics-restore-20251104t143755 -n stac-23374-ha
  kubectl get job victoriametrics-restore-20251104t143755 -n stac-23374-ha

To wait for completion, scaling up the necessary deployments and cleanup, run:
  sts-backup victoria-metrics check-and-finalize --job victoriametrics-restore-20251104t143755 --wait -n stac-23374-ha

victoriametrics check-and-finalize

Checks the status of a background VictoriaMetrics restore job and cleans up resources.

Usage:
sts-backup victoriametrics check-and-finalize --job [--wait] -n

Flags:
  -j, --job string   VictoriaMetrics restore job name (required)
  -w, --wait         Wait for job to complete before cleanup

Note: This command automatically scales up StatefulSets that were scaled down during restore.
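The status check behind this command can be pictured with the sketch below. Kubernetes Jobs report a "Complete" or "Failed" condition with status "True" once they finish; the rest (the jobCondition and jobState types, classify, nextAction, and in particular what happens on failure or without --wait) is an assumption made for illustration and may differ from the actual command.

```go
package main

import "fmt"

// jobCondition is a trimmed-down stand-in for a Kubernetes Job condition.
type jobCondition struct {
	Type   string
	Status string
}

type jobState string

const (
	jobRunning   jobState = "running"
	jobSucceeded jobState = "succeeded"
	jobFailed    jobState = "failed"
)

// classify maps Job conditions to a coarse state. A Job with neither a
// true "Complete" nor a true "Failed" condition is treated as running.
func classify(conds []jobCondition) jobState {
	for _, c := range conds {
		if c.Status != "True" {
			continue
		}
		switch c.Type {
		case "Complete":
			return jobSucceeded
		case "Failed":
			return jobFailed
		}
	}
	return jobRunning
}

// nextAction is a hypothetical decision table for the finalize step.
func nextAction(state jobState, wait bool) string {
	switch {
	case state == jobSucceeded:
		return "scale up and clean up"
	case state == jobFailed:
		return "report failure" // assumed behaviour, not confirmed by the PR
	case wait:
		return "wait for completion"
	default:
		return "report still running" // assumed behaviour without --wait
	}
}

func main() {
	conds := []jobCondition{{Type: "Complete", Status: "True"}}
	state := classify(conds)
	fmt.Println(state, "->", nextAction(state, false))
}
```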

Example: Check Job Status
sts-backup victoriametrics check-and-finalize
--job victoriametrics-restore-20251104t143000
-n

Example: Wait for Completion and Cleanup
sts-backup victoriametrics check-and-finalize
--job victoriametrics-restore-20251104t143000
--wait
-n

Examples:
Waiting for the restore job to complete, then scaling up and cleaning up

❯ go run main.go victoria-metrics check-and-finalize --job victoriametrics-restore-20251104t143930 --wait -n stac-23374-ha
Checking status of job: victoriametrics-restore-20251104t143930

Waiting for restore job to complete (this may take significant amount of time depending on the archive size)...

You can safely interrupt this command with Ctrl+C.
To check status, scale up the required deployments and cleanup later, run:
  sts-backup victoria-metrics check-and-finalize --job victoriametrics-restore-20251104t143930 --wait -n stac-23374-ha

✓ Job completed successfully: victoriametrics-restore-20251104t143930

Scaling up deployments from annotations (selector: observability.suse.com/scalable-during-vm-restore=true)...
✓ Scaled up 0 deployment(s) successfully:
✓ Scaled up 2 statefulset(s) successfully:
  - suse-observability-victoria-metrics-0 (replicas: 0 -> 1)
  - suse-observability-victoria-metrics-1 (replicas: 0 -> 1)

Cleaning up resources...
✓ Job deleted: victoriametrics-restore-20251104t143930

Stackgraph Updates

The stackgraph commands now also benefit from the shared orchestration layer. The
check-and-finalize command was refactored to use the same orchestration functions as
VictoriaMetrics, ensuring consistent behavior across services.

@viliakov viliakov merged commit 6d11f50 into main Nov 6, 2025
5 checks passed
@viliakov viliakov deleted the STAC-23599 branch November 6, 2025 07:20