-
Notifications
You must be signed in to change notification settings - Fork 28
feat(backup): add global lock to prevent concurrent backup OOM crashes #474
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Add proper shell argument escaping to prevent command injection in rclone operations: - rclone size --json (line 995) - rclone lsf --dirs-only (line 1172) - rclone copy in rclone_download() (line 1251) - rclone copy in rclone_upload() (line 1280) - rclone lsf after upload (line 1292) Note: rclone purge commands in cleanup_old_backups() and rollback_failed_backup() were already properly escaped. This is a defense-in-depth measure as paths are derived from site_url in the database, but protects against potential command injection if malicious data were to be inserted.
…kups Implement flock()-based serialization to ensure only one backup runs at a time. This prevents OOM killer (exit code 137) when multiple backups are triggered simultaneously by EasyDash, as each backup calculates available RAM independently. Problem: When 5 backups trigger at once, each checks available RAM (e.g., 8GB) and allocates buffers accordingly. Total memory = 5 × 8GB = 40GB → OOM killer. Solution: Global lock ensures backups run sequentially. Each backup gets accurate RAM reading and full system resources. Implementation details: - Uses PHP flock() for atomic, race-condition-free locking - Waiting backups poll every 30 seconds with status messages - Maximum wait time of 2 hours before timeout (error code 5001) - Lock file stores current backup site URL and PID for debugging - Shutdown handler ensures lock release on any exit (error, crash, kill) - release_global_backup_lock() is idempotent (safe to call multiple times) Lock file location: EE_ROOT_DIR/services/backup-global.lock Error codes: - 5001: Timeout waiting for another backup to complete - 5002: Cannot create backup lock file
Move backup-global.lock from EE_ROOT_DIR/services/ to EE_BACKUP_DIR/ to keep all backup-related files in one location. Lock file location: EE_BACKUP_DIR/backup-global.lock
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR adds a global file-based locking mechanism to serialize concurrent site backups and prevent OOM crashes, while also improving security by escaping shell arguments in rclone commands.
Key Changes:
- Implements flock-based global backup lock with configurable 24-hour timeout and 60-second polling interval
- Adds escapeshellarg() to all rclone command invocations to prevent shell injection vulnerabilities
- Introduces new error codes (5001, 5002) and ERROR_TYPE_LOCK constant for lock-related failures
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Error code 5001 was already used for PHP Fatal Errors in dash_shutdown_handler. Changed lock timeout to 5003 to avoid collision.
Summary
escapeshellarg()to all rclone shell commands to prevent shell injection vulnerabilitiesProblem
When EasyEngine Dashboard triggers
ee site backupfor multiple sites simultaneously, each backup process independently checks available RAM and allocates memory buffers. With 5+ concurrent backups on an 8GB server, total memory allocation exceeds available RAM, causing the OOM killer to terminate processes (exit code 137).Solution
Implement a global backup lock using
flock():Lock file location:
EE_BACKUP_DIR/backup-global.lockError Codes
5002: Cannot create backup lock file5003: Timeout waiting for another backup to completeNotes
flock()may not work reliably on NFS or other network filesystems;EE_BACKUP_DIRshould be on a local filesystem--listflag returns early without acquiring the lock (read-only operation)