# RFC 192 - `minCapacity` ensures workers do not get unnecessarily killed

* Comments: [#192](https://github.com/taskcluster/taskcluster-rfcs/pull/192)
* Proposed by: [@JohanLorenzo](https://github.com/JohanLorenzo)

# Summary

Optimize worker pools with `minCapacity >= 1` by introducing minimum-capacity workers that avoid unnecessary shutdown/restart cycles, preserving caches and reducing task wait times.

## Motivation

Currently, workers in pools with `minCapacity >= 1` exhibit wasteful behavior:

1. **Cache Loss**: Workers shut down after the idle timeout (600 seconds for decision pools), losing valuable caches:
   - VCS repositories and history
   - Package manager caches (npm, pip, cargo, etc.)
   - Container images and layers

2. **Provisioning Delays**: New worker provisioning [takes ~75 seconds on average for decision pools](https://taskcluster.github.io/mozilla-history/worker-metrics), during which tasks must wait.

   > **Review comment:** IMO the average is not the issue here, it's that whenever something goes slightly wrong, or it's a slightly less often used pool, it can take tens of minutes.

3. **Resource Waste**: The current cycle of shutdown → detection → spawn → provision → register wastes compute resources and increases task latency.

4. **Violation of `minCapacity` Intent**: `minCapacity >= 1` suggests these pools should always have capacity available, but the current implementation allows temporary capacity gaps.

# Details

## Current Behavior Analysis

**Affected Worker Pools:**

- Direct `minCapacity: 1`: [`infra/build-decision`](https://github.com/mozilla-releng/fxci-config/blob/43c18aab0826244e369b16a964637b6c411c7760/worker-pools.yml#L220), [`code-review/bot-gcp`](https://github.com/mozilla-releng/fxci-config/blob/43c18aab0826244e369b16a964637b6c411c7760/worker-pools.yml#L3320)
- Keyed `minCapacity: 1`: [`gecko-1/decision-gcp`, `gecko-3/decision-gcp`](https://github.com/mozilla-releng/fxci-config/blob/43c18aab0826244e369b16a964637b6c411c7760/worker-pools.yml#L976), and [pools matching `(app-services|glean|mozillavpn|mobile|mozilla|translations)-1`](https://github.com/mozilla-releng/fxci-config/blob/43c18aab0826244e369b16a964637b6c411c7760/worker-pools.yml#L1088)

**Current Implementation Issues:**

- Worker-manager enforces `minCapacity` by spawning new workers when capacity drops below the threshold
- Generic-worker shuts down after `idleTimeoutSecs` regardless of `minCapacity` requirements
- A gap exists between worker shutdown and its replacement being detected and provisioned
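
For orientation, the settings involved can be sketched as a simplified pool definition. The top-level fields (`minCapacity`, `maxCapacity`, `lifecycle`, `launchConfigs`) follow worker-manager's pool configuration, but the nesting of `idleTimeoutSecs` under `workerConfig` and all values are illustrative assumptions:

```js
// Illustrative only: a trimmed-down worker pool definition showing the settings
// this RFC touches. Values are examples, not a real pool's configuration.
const examplePool = {
  minCapacity: 1,                 // worker-manager keeps at least this much capacity
  maxCapacity: 10,
  lifecycle: {
    queueInactivityTimeout: 7200, // worker-manager-side idle limit (assumed seconds)
  },
  launchConfigs: [
    {
      // cloud-specific fields (machine type, region, ...) omitted
      workerConfig: {
        genericWorker: {
          config: {
            idleTimeoutSecs: 600, // generic-worker self-terminates after 10 idle minutes
          },
        },
      },
    },
  ],
};
```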

## Proposed Solution

### Core Concept: `minCapacity` Workers

Workers fulfilling `minCapacity >= 1` requirements should never self-terminate. This is achieved through a two-phase implementation: Phase 1 prevents minCapacity workers from self-terminating, and Phase 2 makes worker-manager the central authority for all worker termination decisions.

### Phase 1: Prevent MinCapacity Worker Self-Termination

#### 1. Automatic Activation

**Trigger:** Automatically enabled when `minCapacity >= 1` (no configuration flag needed).

Workers spawned when `runningCapacity < minCapacity` receive `idleTimeoutSecs=0` and never self-terminate.

#### 2. Worker Config Injection

Worker-manager sets `idleTimeoutSecs=0` when spawning minCapacity workers. This makes [generic-worker never terminate by itself](https://github.com/taskcluster/taskcluster/blob/754938c53ba34aea5a50ce610272e7a275c11911/workers/generic-worker/main.go#L536-L542).
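
A minimal sketch of what this injection could look like at spawn time; `buildWorkerConfig`, the `runningCapacity < minCapacity` check, and the `genericWorker.config` nesting are assumptions rather than the actual worker-manager code:

```js
// Sketch only: force idleTimeoutSecs to 0 for workers spawned to satisfy minCapacity.
// The nesting of idleTimeoutSecs under workerConfig depends on the deployment.
function buildWorkerConfig({ launchConfig, runningCapacity, minCapacity }) {
  const workerConfig = structuredClone(launchConfig.workerConfig ?? {});
  const backsMinCapacity = runningCapacity < minCapacity;

  if (backsMinCapacity) {
    workerConfig.genericWorker ??= {};
    workerConfig.genericWorker.config ??= {};
    workerConfig.genericWorker.config.idleTimeoutSecs = 0; // never self-terminate
  }

  return { workerConfig, backsMinCapacity };
}
```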

#### 3. Capacity Management

**Removing Excess Capacity:**

When `runningCapacity > minCapacity`, the [worker-manager scanner identifies](https://github.com/taskcluster/taskcluster/blob/754938c53ba34aea5a50ce610272e7a275c11911/services/worker-manager/src/worker-scanner.js#L69-L76) and terminates excess workers.

> **Review comment (Contributor):** As it currently stands, worker-scanner doesn't remove workers when capacity changes, it only updates db state. Are we talking here about the case when […]

**Termination logic:**

- Query the [Queue API client](https://github.com/taskcluster/taskcluster/blob/754938c53ba34aea5a50ce610272e7a275c11911/services/worker-manager/src/main.js#L176-L178) to check whether the worker's latest task is running

  > **Review comment (Contributor):** this will likely not guarantee anything (workers currently polling for work), but would increase load on the queue service api. Assuming we want to implement this active workers termination routines already, I would propose to keep it really simple, and only select the desired amount of oldest workers to be terminated

- Select the oldest idle workers first (by `worker.created` timestamp)
- Use the existing `removeWorker()` methods to terminate the worker ([Google `removeWorker()`](https://github.com/taskcluster/taskcluster/blob/754938c53ba34aea5a50ce610272e7a275c11911/services/worker-manager/src/providers/google.js#L168-L192), [AWS `removeWorker()`](https://github.com/taskcluster/taskcluster/blob/754938c53ba34aea5a50ce610272e7a275c11911/services/worker-manager/src/providers/aws.js#L382-L418), [Azure `removeWorker()`](https://github.com/taskcluster/taskcluster/blob/754938c53ba34aea5a50ce610272e7a275c11911/services/worker-manager/src/providers/azure/index.js#L1221-L1233))
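
A minimal sketch of this trimming pass, assuming hypothetical helpers `listRunningWorkers()` and `isWorkerIdle()`; `removeWorker()` refers to the provider methods linked above, though the exact call signature may differ:

```js
// Sketch only: trim running capacity down to minCapacity, oldest idle workers first.
// listRunningWorkers() and isWorkerIdle() are hypothetical helpers; removeWorker()
// refers to the existing provider methods linked above.
async function trimExcessCapacity({ provider, workerPool }) {
  const workers = await listRunningWorkers(workerPool.workerPoolId);
  const runningCapacity = workers.reduce((sum, w) => sum + w.capacity, 0);
  let excess = runningCapacity - workerPool.config.minCapacity;
  if (excess <= 0) return;

  // Oldest first, per the ordering proposed above (worker.created timestamp).
  const byAge = [...workers].sort(
    (a, b) => new Date(a.created) - new Date(b.created),
  );

  for (const worker of byAge) {
    if (excess <= 0) break;
    if (!(await isWorkerIdle(worker))) continue; // never terminate a running task
    await provider.removeWorker({ worker, reason: 'capacity above minCapacity' });
    excess -= worker.capacity;
  }
}
```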

## Error Handling and Edge Cases

**Worker Lifecycle Management:**

- **Pool reconfiguration**: Capacity changes trigger worker replacement, not reconfiguration
- **Graceful transitions**: When possible, only terminate idle workers to preserve active caches
- **Resource allocation**: minCapacity workers are mixed with other workers on the same infrastructure

**Launch Configuration Changes:**

When a launch configuration is changed, removed, or archived, all workers created from the old configuration must be terminated and replaced (see the sketch after this list):

- If a launch configuration is archived (i.e. not present in the new configuration), identify all long-running workers created from it

  > **Review comment (Contributor):** this is actually a good point, and probably should be solved separately. Launch configs will be archived if something changes, and it's a good question if they should be killed immediately or not. Because if there was a minor change, like zone removed, do we want to let it work until it idles and shuts down, or we want to kill all such workers right away

- Terminate these workers via cloud provider APIs after checking for running tasks

  > **Review comment:** How does that work? Do we have a way to signal to the worker that it shouldn't claim any more tasks? Or signal to the queue that it shouldn't give it any? Or do we risk killing a running task?
  >
  > **Author:** Worker-manager is able to terminate cloud instances (e.g.). In its current state, the RFC risks killing a running task. In a previous revision, we thought of quarantining the worker. However @petemoore raised the concern that it would bend the purpose of the quarantine mechanism.
  >
  > **Review comment (Contributor):** So I think what we are missing here is the guarantee that worker is not running any tasks and the queue will not give out new tasks. Could be a short-lived queue (~20min same as claim timeout)

- Worker-manager will spawn new workers using the updated launch configuration
- This ensures workers always run with the current configuration and prevents indefinite use of outdated configurations
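
A minimal sketch of that replacement pass, assuming the scanner knows which launch configuration each worker came from (`worker.launchConfigId`) and reusing the hypothetical helpers from the capacity-trimming sketch above:

```js
// Sketch only: terminate idle workers whose launch configuration is no longer active.
// activeLaunchConfigIds, listRunningWorkers() and isWorkerIdle() are assumptions.
async function replaceOutdatedWorkers({ provider, workerPool, activeLaunchConfigIds }) {
  const workers = await listRunningWorkers(workerPool.workerPoolId);

  for (const worker of workers) {
    if (activeLaunchConfigIds.has(worker.launchConfigId)) continue; // still current
    if (!(await isWorkerIdle(worker))) continue; // re-check on the next scanner pass
    await provider.removeWorker({ worker, reason: 'launch configuration archived' });
    // Worker-manager's regular provisioning loop then spawns a replacement from
    // the updated launch configuration.
  }
}
```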

## Compatibility Considerations

- Automatic activation when `minCapacity >= 1` (no opt-in flag needed)
- Existing pools continue their current behavior; minCapacity workers automatically stop self-terminating
- No changes to generic-worker's idle timeout mechanism

### Phase 2: Centralized Termination Authority

**Goal:** Make worker-manager the sole authority for all worker termination decisions by removing worker self-termination entirely.

#### Implementation Changes

**1. Remove Worker Idle Timeout Code**

Remove the [idle timeout mechanism](https://github.com/taskcluster/taskcluster/blob/754938c53ba34aea5a50ce610272e7a275c11911/workers/generic-worker/main.go#L536-L542) from generic-worker.

Workers run indefinitely until worker-manager terminates them. This means worker-manager stops sending `idleTimeoutSecs` to workers at spawn time.

**2. Centralized Idle Enforcement**

Worker-manager enforces the idle timeout using the existing [`queueInactivityTimeout` from the lifecycle configuration](https://github.com/taskcluster/taskcluster/blob/754938c53ba34aea5a50ce610272e7a275c11911/services/worker-manager/schemas/v1/worker-lifecycle.yml#L33-L50) and through `idleTimeoutSecs` (as specified in Phase 1).

The [scanner polls](https://github.com/taskcluster/taskcluster/blob/754938c53ba34aea5a50ce610272e7a275c11911/services/worker-manager/src/worker-scanner.js#L69-L76) the Queue API to track worker idle time:

> **Review comment:** Do we expect this to scale well? (We already have issues with scanner loops taking too long.)
>
> **Author:** @lotas, what do you think?
>
> **Review comment (Contributor):** yeah, we probably don't want to kill queue api with those increased loads of queries. Instead, I think we should build some state inside w-m service. […] this way we'll simply leverage existing rabbitmq exchanges, and would know the approximate state of the worker without ddosing queue service

- Get the latest task from `worker.recentTasks`
- Call `queue.status(taskId)` to check if the task is running
- Calculate idle time from the task's `resolved` timestamp
- Terminate when idle time exceeds `queueInactivityTimeout`
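
A minimal sketch of this check, assuming the scanner has a Queue client and the worker's `recentTasks` at hand, and that `queueInactivityTimeout` is expressed in seconds; the termination call reuses the provider `removeWorker()` linked earlier:

```js
// Sketch only: worker-manager-side idle enforcement using the Queue API.
// Assumes queueInactivityTimeout is in seconds and worker.recentTasks holds
// the most recently claimed tasks; neither detail is guaranteed here.
async function enforceIdleTimeout({ queue, provider, workerPool, worker }) {
  const timeoutMs = workerPool.config.lifecycle.queueInactivityTimeout * 1000;

  const latest = (worker.recentTasks ?? []).at(-1);
  if (!latest) return; // never claimed anything; registration timeout covers this case

  const { status } = await queue.status(latest.taskId);
  if (status.state === 'running') return; // worker is busy, not idle

  const lastRun = status.runs.at(-1);
  if (!lastRun?.resolved) return; // task not resolved yet; treat as not idle

  const idleMs = Date.now() - new Date(lastRun.resolved).getTime();
  if (idleMs > timeoutMs) {
    await provider.removeWorker({ worker, reason: 'queueInactivityTimeout exceeded' });
  }
}
```

Keeping this check in the scanner is what the review discussion above pushes back on; an event-driven alternative (listening to existing queue pulse exchanges, as suggested there) would avoid the per-worker `queue.status` calls.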

**3. Termination Decision Factors**

Worker-manager terminates workers when:

- Idle timeout exceeded (`queueInactivityTimeout`)
- Capacity exceeds `maxCapacity`
- Capacity exceeds `minCapacity` (terminate oldest first)

  > **Review comment (Contributor):** wait, wouldn't it contradict the whole point of min capacity workers? If we'll always be killing oldest, those with caches would be gone first

- Launch configuration changed/archived
- Worker is unhealthy (provider-specific check)

All terminations check for running tasks before proceeding (a combined sketch follows below).

> **Review comment:** As in phase 1, care will be needed here to avoid races between termination and claimWork, it would be good to flesh out that process more.
>
> **Author:** @jcristau What would you recommend?
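
To tie the factors together, here is a minimal sketch of a single decision function; the helpers (`isWorkerIdle`, `isWorkerHealthy`, `idleLongerThan`) and the inputs are hypothetical, and the `claimWork` race raised above is deliberately left open:

```js
// Sketch only: one place that decides whether a worker should be terminated.
// Returns a reason string, or null to keep the worker. All helpers are hypothetical.
async function terminationReason({
  worker, workerPool, activeLaunchConfigIds, runningCapacity, isOldestExcessWorker,
}) {
  const { minCapacity, maxCapacity } = workerPool.config;
  const { queueInactivityTimeout } = workerPool.config.lifecycle;

  if (!(await isWorkerIdle(worker))) return null; // never terminate a running task

  if (!activeLaunchConfigIds.has(worker.launchConfigId)) {
    return 'launch configuration changed or archived';
  }
  if (!(await isWorkerHealthy(worker))) return 'worker unhealthy';
  if (runningCapacity > maxCapacity) return 'capacity above maxCapacity';
  if (runningCapacity > minCapacity && isOldestExcessWorker) {
    return 'capacity above minCapacity';
  }
  if (await idleLongerThan(worker, queueInactivityTimeout)) {
    return 'queueInactivityTimeout exceeded';
  }
  return null;
}
```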

**Migration:**

Deploy as a breaking change requiring simultaneous worker-manager and generic-worker updates.

> **Review comment:** This seems like a fairly risky strategy, best avoided...
>
> **Author:** @petemoore, could we have a way for generic-worker to support both idle-termination and termination from worker-manager for a transition period?

# Implementation

<Once the RFC is decided, these links will provide readers a way to track the
implementation through to completion, and to know if they are running a new
enough version to take advantage of this change. It's fine to update this
section using short PRs or pushing directly to master after the RFC is
decided>

* <link to tracker bug, issue, etc.>
* <...>
* Implemented in Taskcluster version ...