@mrrobot47 commented Dec 26, 2025

Summary

  • Add flock-based global lock to serialize concurrent backups, preventing OOM crashes when multiple backups are triggered simultaneously (e.g., from EasyEngine Dashboard)
  • Add escapeshellarg() to all rclone shell commands to prevent shell injection vulnerabilities

Problem

When EasyEngine Dashboard triggers ee site backup for multiple sites simultaneously, each backup process independently checks available RAM and allocates memory buffers. With 5+ concurrent backups on an 8GB server, total memory allocation exceeds available RAM, causing the OOM killer to terminate processes (exit code 137).

Solution

Implement a global backup lock using flock():

  • Only one backup runs at a time; others wait (polling every 60 seconds)
  • Maximum wait time: 24 hours (configurable via constant)
  • Lock auto-releases on process death (kernel handles this)
  • Shutdown handler ensures cleanup on errors

Lock file location: EE_BACKUP_DIR/backup-global.lock
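
The wait-and-poll behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the code merged in this PR: the function name acquire_backup_lock() and both constant names are hypothetical, and only the 60-second poll, the 24-hour limit, and error codes 5002/5003 come from the description.

```php
<?php
// Hypothetical constant names; the PR only says the values are configurable.
const BACKUP_LOCK_POLL_SECONDS = 60;
const BACKUP_LOCK_MAX_WAIT_SECONDS = 86400; // 24 hours

/**
 * Poll for an exclusive, non-blocking flock() on $lock_file until it is
 * granted or $max_wait seconds have passed.
 *
 * @return resource|null Open handle that holds the lock, or null on timeout.
 * @throws RuntimeException If the lock file cannot be opened (code 5002).
 */
function acquire_backup_lock(
    string $lock_file,
    int $poll = BACKUP_LOCK_POLL_SECONDS,
    int $max_wait = BACKUP_LOCK_MAX_WAIT_SECONDS
) {
    // Mode 'c' creates the file if it is missing and never truncates it.
    $handle = @fopen($lock_file, 'c');
    if ($handle === false) {
        throw new RuntimeException("Cannot create backup lock file: $lock_file", 5002);
    }

    $deadline = time() + $max_wait;
    // LOCK_NB makes flock() return false immediately instead of blocking.
    while (!flock($handle, LOCK_EX | LOCK_NB)) {
        if (time() >= $deadline) {
            fclose($handle);
            return null; // caller would report error 5003
        }
        sleep($poll);
    }

    return $handle;
}
```

Keeping the returned handle open for the lifetime of the backup is what keeps the lock held; closing it, or the process dying, releases it.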

Error Codes

  • 5002: Cannot create backup lock file
  • 5003: Timeout waiting for another backup to complete

Notes

  • flock() may not work reliably on NFS or other network filesystems; EE_BACKUP_DIR should be on a local filesystem
  • The --list flag returns early without acquiring the lock (read-only operation)

Add proper shell argument escaping to prevent command injection in rclone operations:

- rclone size --json (line 995)
- rclone lsf --dirs-only (line 1172)
- rclone copy in rclone_download() (line 1251)
- rclone copy in rclone_upload() (line 1280)
- rclone lsf after upload (line 1292)

Note: rclone purge commands in cleanup_old_backups() and
rollback_failed_backup() were already properly escaped.

This is a defense-in-depth measure as paths are derived from
site_url in the database, but protects against potential
command injection if malicious data were to be inserted.
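
As a rough illustration of the escaping described above (not the actual diff), the sketch below quotes a remote path before it is interpolated into an rclone command line. The function name build_rclone_size_command() and the sample path are hypothetical.

```php
<?php
/**
 * Build an `rclone size --json` command line with the remote path quoted.
 * The function and parameter names are illustrative, not taken from the PR.
 */
function build_rclone_size_command(string $remote_path): string
{
    // escapeshellarg() quotes the value so the shell sees one literal argument.
    return sprintf('rclone size --json %s', escapeshellarg($remote_path));
}

// A path carrying shell metacharacters stays a single quoted argument.
$hostile = 'remote:backups/example.com; rm -rf /';
echo build_rclone_size_command($hostile), PHP_EOL;
// On Unix-like systems this prints:
// rclone size --json 'remote:backups/example.com; rm -rf /'
```

Without the escaping, the `;` would end the rclone command and the shell would run the remainder as a second command.
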
…kups

Implement flock()-based serialization to ensure only one backup runs at a time.
This prevents OOM killer (exit code 137) when multiple backups are triggered
simultaneously by EasyDash, as each backup calculates available RAM independently.

Problem:
When 5 backups trigger at once, each checks available RAM (e.g., 8GB) and
allocates buffers accordingly. Total memory = 5 × 8GB = 40GB → OOM killer.
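
The arithmetic above, written out as a small sketch. It assumes the worst case from the commit message, where every process sizes its buffers from the same free-RAM reading; the function name is hypothetical.

```php
<?php
/**
 * Worst-case memory demand when $backups processes each size their buffers
 * from the same $available_gb reading. Illustrative only.
 */
function worst_case_demand_gb(int $available_gb, int $backups): int
{
    return $available_gb * $backups;
}

$available_gb = 8;
$backups = 5;
printf(
    "%d backups x %d GB = %d GB requested on a %d GB host\n",
    $backups,
    $available_gb,
    worst_case_demand_gb($available_gb, $backups),
    $available_gb
);
// Prints: 5 backups x 8 GB = 40 GB requested on a 8 GB host
```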

Solution:
Global lock ensures backups run sequentially. Each backup gets accurate
RAM reading and full system resources.

Implementation details:
- Uses PHP flock() for atomic, race-condition-free locking
- Waiting backups poll every 30 seconds with status messages
- Maximum wait time of 2 hours before timeout (error code 5001)
- Lock file stores current backup site URL and PID for debugging
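- Shutdown handler releases the lock on normal exit and on errors; if the process is killed outright (e.g., SIGKILL), the kernel drops the flock when the file descriptor closes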
- release_global_backup_lock() is idempotent (safe to call multiple times)
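
The lock-file metadata, shutdown handler, and idempotent release listed above could look roughly like this. The BackupLock class and its methods are hypothetical stand-ins for whatever sits around release_global_backup_lock() in the PR, and the JSON layout of the lock file is an assumption.

```php
<?php
/**
 * Minimal holder for the global lock handle. Hypothetical class; only the
 * behaviours (site URL + PID in the file, idempotent release, shutdown
 * handler) are taken from the commit message.
 */
final class BackupLock
{
    /** @var resource|null */
    private $handle = null;

    public function acquire(string $lock_file, string $site_url): bool
    {
        // 'c+' opens read/write, creates the file if missing, never truncates.
        $handle = @fopen($lock_file, 'c+');
        if ($handle === false) {
            return false;
        }
        if (!flock($handle, LOCK_EX | LOCK_NB)) {
            fclose($handle);
            return false;
        }

        // Record the current holder, for debugging only (assumed JSON format).
        ftruncate($handle, 0);
        fwrite($handle, json_encode(['site' => $site_url, 'pid' => getmypid()]));
        fflush($handle);

        $this->handle = $handle;
        // Runs on normal exit, exit(), and fatal errors; it cannot run on SIGKILL.
        register_shutdown_function([$this, 'release']);
        return true;
    }

    public function release(): void
    {
        if ($this->handle === null) {
            return; // already released, so repeated calls are harmless
        }
        flock($this->handle, LOCK_UN);
        fclose($this->handle);
        $this->handle = null;
    }
}
```

Nulling the handle after the first release is what makes a second call, including the one from the shutdown handler, a no-op.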

Lock file location: EE_ROOT_DIR/services/backup-global.lock

Error codes:
- 5001: Timeout waiting for another backup to complete
- 5002: Cannot create backup lock file
Move backup-global.lock from EE_ROOT_DIR/services/ to EE_BACKUP_DIR/
to keep all backup-related files in one location.

Lock file location: EE_BACKUP_DIR/backup-global.lock
Copilot AI review requested due to automatic review settings December 26, 2025 09:52
Copilot AI left a comment

Pull request overview

This PR adds a global file-based locking mechanism to serialize concurrent site backups and prevent OOM crashes, while also improving security by escaping shell arguments in rclone commands.

Key Changes:

  • Implements flock-based global backup lock with configurable 24-hour timeout and 60-second polling interval
  • Adds escapeshellarg() to all rclone command invocations to prevent shell injection vulnerabilities
  • Introduces new error codes (5001, 5002) and ERROR_TYPE_LOCK constant for lock-related failures

Error code 5001 was already used for PHP Fatal Errors in
dash_shutdown_handler. Changed lock timeout to 5003 to avoid collision.
@mrrobot47 merged commit 877a78c into EasyEngine:develop on Dec 26, 2025
1 of 5 checks passed