
Conversation

@MasterPtato
Contributor

No description provided.

@vercel

vercel bot commented Nov 27, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Preview | Comments | Updated (UTC) |
| --- | --- | --- | --- | --- |
| rivetkit-serverless | Ready | Preview | Comment | Dec 2, 2025 7:12pm |

3 Skipped Deployments

| Project | Deployment | Preview | Comments | Updated (UTC) |
| --- | --- | --- | --- | --- |
| rivet-cloud | Ignored | | | Dec 2, 2025 7:12pm |
| rivet-inspector | Ignored | Preview | | Dec 2, 2025 7:12pm |
| rivet-site | Ignored | Preview | | Dec 2, 2025 7:12pm |

Contributor Author

MasterPtato commented Nov 27, 2025

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more


How to use the Graphite Merge Queue

Add the label merge-queue to this PR to add it to the merge queue.

You must have a Graphite account in order to use the merge queue. Sign up using this link.

An organization admin has enabled the Graphite Merge Queue in this repository.

Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.

This stack of pull requests is managed by Graphite. Learn more about stacking.

@pkg-pr-new

pkg-pr-new bot commented Nov 27, 2025

More templates

@rivetkit/cloudflare-workers

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/cloudflare-workers@3550

@rivetkit/db

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/db@3550

@rivetkit/framework-base

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/framework-base@3550

@rivetkit/next-js

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/next-js@3550

@rivetkit/react

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/react@3550

rivetkit

pnpm add https://pkg.pr.new/rivet-dev/rivet/rivetkit@3550

@rivetkit/sql-loader

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/sql-loader@3550

@rivetkit/engine-runner

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/engine-runner@3550

@rivetkit/engine-runner-protocol

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/engine-runner-protocol@3550

commit: 5439e53

@claude

claude bot commented Nov 27, 2025

Code Review

Thank you for this PR! This adds important shutdown observability and graceful termination handling. Here is my feedback:

Strengths

  1. Good observability improvements - The periodic shutdown progress messages will help operators understand what is happening during deployments
  2. Consistent pattern - You applied the same monitoring pattern across worker.rs and server.rs
  3. Proper WebSocket signaling - The new GoingAway error with structured error handling follows the existing pattern nicely
  4. Atomic operations - Correct use of Ordering::Acquire and Ordering::Release for reading/writing the hyper_shutdown flag

Issues and Suggestions

1. Missing SIGTERM handling in tunnel_to_ws_task.rs

ws_to_tunnel_task.rs now handles SIGTERM and sends the GoingAway error, but tunnel_to_ws_task.rs does not have the same logic. This creates an asymmetry in how the two directions handle graceful shutdown.

Suggestion: Add the same SIGTERM handling to tunnel_to_ws_task.rs for consistency. The runner should evict gracefully from both directions.
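For illustration, here is a minimal sketch of what that handling might look like, assuming `term_signal` is delivered as a tokio watch channel that flips to `true` on SIGTERM and reusing the existing `WsError::GoingAway` variant. The message types and channel wiring are placeholders, not the actual `tunnel_to_ws_task.rs` signature:

```rust
use tokio::sync::{mpsc, watch};

// Placeholder for the repository's custom error type; only the variant
// relevant to this suggestion is shown.
#[derive(Debug)]
enum WsError {
    GoingAway,
}

// Hypothetical tunnel -> websocket forwarding loop with the same SIGTERM
// handling as ws_to_tunnel_task.rs. `term_signal` is assumed to start as
// `false` and flip to `true` exactly once, when SIGTERM is received.
async fn tunnel_to_ws_task(
    mut tunnel_rx: mpsc::Receiver<Vec<u8>>,
    mut term_signal: watch::Receiver<bool>,
) -> Result<(), WsError> {
    loop {
        tokio::select! {
            // On SIGTERM, stop forwarding and surface GoingAway so the
            // websocket is closed with the structured error, mirroring the
            // ws -> tunnel direction.
            _ = term_signal.changed() => {
                return Err(WsError::GoingAway);
            }
            msg = tunnel_rx.recv() => {
                match msg {
                    // Forward the payload to the websocket here.
                    Some(payload) => drop(payload),
                    // Tunnel closed normally.
                    None => return Ok(()),
                }
            }
        }
    }
}
```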

2. Missing newline in JSON error file

The file engine/artifacts/errors/ws.going_away.json is missing a trailing newline, which violates common conventions and may cause issues with some tools.

Suggestion: Add a newline at the end of the JSON file.

3. Memory ordering is correct

You are using Ordering::Release for the store and Ordering::Acquire for the load on hyper_shutdown. This is correct for ensuring visibility of the write to readers. Good job!
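As a small, self-contained illustration of that pairing (generic, not the PR's actual code): a Release store published by the shutdown path is guaranteed to be visible to any later Acquire load in the progress logger.

```rust
use std::sync::{
    atomic::{AtomicBool, Ordering},
    Arc,
};

fn main() {
    let hyper_shutdown = Arc::new(AtomicBool::new(false));

    // Writer side (shutdown task): Release publishes the flag, and everything
    // written before it, to any thread that later performs an Acquire load.
    let flag = hyper_shutdown.clone();
    let writer = std::thread::spawn(move || {
        flag.store(true, Ordering::Release);
    });
    writer.join().unwrap();

    // Reader side (progress logger): Acquire observes the Release store and
    // the writes that happened before it.
    if hyper_shutdown.load(Ordering::Acquire) {
        println!("hyper graceful shutdown complete");
    }
}
```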

4. Consider extracting the shutdown interval constant

Both gasoline/worker.rs and guard-core/server.rs define SHUTDOWN_PROGRESS_INTERVAL as Duration::from_secs(7).

Suggestion: If you expect this to be consistent across services, consider extracting it to a shared configuration or runtime module.
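A sketch of what such a shared definition could look like; the crate/module name below is hypothetical:

```rust
// In a shared crate or module (name is hypothetical), e.g. rivet_runtime:
use std::time::Duration;

/// How often services log shutdown progress while draining remaining work.
pub const SHUTDOWN_PROGRESS_INTERVAL: Duration = Duration::from_secs(7);

// gasoline/worker.rs and guard-core/server.rs would then both import this
// constant instead of defining their own copies.
```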

Test Coverage

I did not see any tests added for the new shutdown progress logging, the GoingAway error path in websocket handling, or the remaining_tasks method.

Suggestion: Consider adding integration tests that verify SIGTERM triggers graceful shutdown with progress messages and WebSocket connections receive the proper GoingAway close code.

Minor Observations

  1. Log message consistency: You changed requests to tasks in guard shutdown logs, which is more accurate given you are tracking task groups. Nice improvement!

  2. Interval tick pattern: The pattern of calling progress_interval.tick().await immediately after creating the interval is correct - it consumes the first immediate tick so the first real tick happens after the interval duration.

Priority Fixes

  1. High: Add SIGTERM handling to tunnel_to_ws_task.rs for consistency
  2. Low: Add trailing newline to ws.going_away.json
  3. Optional: Extract SHUTDOWN_PROGRESS_INTERVAL to shared constant

Overall, this is a solid improvement to shutdown observability! The main concern is the missing SIGTERM handling in the tunnel-to-ws direction.


Review generated with Claude Code

@MasterPtato MasterPtato force-pushed the 11-26-fix_runtime_evict_runners_on_sigterm_add_periodic_shutdown_progress_messages branch from 5d5591a to ef06c0e on November 27, 2025 00:52
@vercel vercel bot temporarily deployed to Preview – rivetkit-serverless on November 27, 2025 00:52 (Inactive)
@claude

claude bot commented Nov 27, 2025

PR Review: fix(runtime): evict runners on SIGTERM, add periodic shutdown progress messages

This PR improves graceful shutdown behavior by evicting runners when receiving SIGTERM signals and adding periodic progress logging during shutdown. Overall, the implementation is solid and follows good practices. Here are my detailed findings:


Strengths

  1. Consistent shutdown progress reporting: The 7-second interval (SHUTDOWN_PROGRESS_INTERVAL) is used consistently across all three modified components (gasoline worker, guard server, and pegboard runner).

  2. Good observability: The progress messages provide valuable insight into shutdown status with structured logging fields like remaining_workflows, remaining_tasks, and hyper_shutdown.

  3. Proper use of atomic operations: The AtomicBool for hyper_shutdown tracking uses appropriate memory ordering (Release/Acquire).

  4. Clean error handling: The new WsError::GoingAway error follows the repository's custom error pattern correctly.

  5. Non-breaking changes: The additions are backward-compatible and don't modify existing behavior significantly.


🔍 Code Quality Observations

guard-core/src/server.rs (lines 255-311)

Good:

  • The hyper_shutdown flag provides clear visibility into the two-phase shutdown (hyper graceful shutdown → task cleanup)
  • Cloning factories and using separate variables (http_factory2, https_factory2) for the async block is clean

Minor suggestion:

  • Line 261: The initial log shows hyper_shutdown=%false which is always false at startup. Consider whether this adds value, or if it's only useful in the progress logs (line 304).

gasoline/src/worker.rs (lines 301-314)

Good:

  • The progress interval is correctly initialized with an immediate tick to avoid logging before the first interval
  • The message "worker still shutting down" with remaining_workflows is helpful

Observation:

  • The shutdown logic is well-structured with clear phases

pegboard-runner/src/ws_to_tunnel_task.rs (lines 57-106)

Good:

  • Adding term_signal to the recv_msg function is the right approach
  • Returning WsError::GoingAway error on SIGTERM is semantically correct

Note:

  • The error propagation chain is correct - the error will bubble up and close the WebSocket with the appropriate error message

🐛 Potential Issues

1. Memory Ordering Consistency (Minor)

In guard-core/src/server.rs:

  • Line 272: hyper_shutdown2.store(true, Ordering::Release)
  • Line 302: hyper_shutdown.load(Ordering::Acquire)

This is correct; noting it here for the record: the Release/Acquire pairing ensures the reader observes the writer's updates. Good job here.

2. Task Group Counting (Observation)

In task_group.rs:58-60, the new remaining_tasks() method uses Ordering::Acquire. This is consistent with wait_idle() at line 52, which is good. However, note that there is an inherent read-then-log gap: the count can change between when you read it and when you log it. This is acceptable for logging/observability purposes, but worth documenting.

Suggestion: Add a comment in task_group.rs that the count is eventually consistent and suitable for observability but not for synchronization.
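A sketch of the suggested documentation on a simplified, counter-based task group (the field layout is illustrative, not the actual task_group.rs):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

struct TaskGroup {
    active: AtomicUsize,
}

impl TaskGroup {
    /// Number of tasks still running.
    ///
    /// The returned count is eventually consistent: it may change immediately
    /// after it is read, so it is suitable for logging/observability but must
    /// not be used for synchronization (use `wait_idle` for that).
    fn remaining_tasks(&self) -> usize {
        self.active.load(Ordering::Acquire)
    }
}
```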

3. Progress Interval Initialization Pattern

All three files use:

let mut progress_interval = tokio::time::interval(SHUTDOWN_PROGRESS_INTERVAL);
progress_interval.tick().await;

The immediate .tick().await consumes the first tick. This is correct to prevent immediate logging, but it means the first progress message appears after 7 seconds, not immediately. This is likely intentional, but worth confirming the behavior aligns with expectations.
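A self-contained sketch of the overall pattern, with the real drain work replaced by a sleep; because the immediate first tick is consumed up front, the first "still shutting down" line prints one full interval after shutdown begins:

```rust
use std::time::Duration;

const SHUTDOWN_PROGRESS_INTERVAL: Duration = Duration::from_secs(7);

#[tokio::main]
async fn main() {
    // Stand-in for the real drain future (waiting on workflows/tasks).
    let shutdown_work = tokio::time::sleep(Duration::from_secs(20));
    tokio::pin!(shutdown_work);

    let mut progress_interval = tokio::time::interval(SHUTDOWN_PROGRESS_INTERVAL);
    // A tokio interval's first tick completes immediately; consuming it here
    // means the first progress message appears only after a full interval.
    progress_interval.tick().await;

    loop {
        tokio::select! {
            _ = &mut shutdown_work => {
                // The real code logs structured fields via tracing here.
                println!("shutdown complete");
                break;
            }
            _ = progress_interval.tick() => {
                println!("still shutting down");
            }
        }
    }
}
```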


🔒 Security Considerations

No security concerns identified. The changes are focused on observability and graceful shutdown, which actually improve reliability.


Performance Considerations

  1. Minimal overhead: The 7-second progress interval is reasonable and won't cause performance issues
  2. Atomic operations: The use of atomics in task_group.rs is efficient
  3. No blocking operations: All shutdown logic uses async/await properly

🧪 Testing Recommendations

Consider adding tests or manual verification for:

  1. Shutdown progress messages: Verify logs appear every 7 seconds during a long shutdown
  2. WebSocket error handling: Ensure clients receive the GoingAway error and can reconnect
  3. Race conditions: Test rapid SIGTERM signals (first graceful, second abort)
  4. Timeout behavior: Verify that shutdown times out correctly if tasks don't complete

📝 Minor Style/Convention Notes

Per CLAUDE.md conventions:

  1. Logging: Lowercase messages are used correctly (e.g., "worker still shutting down")
  2. Structured logging: Uses tracing's structured fields (%remaining_tasks, %remaining_workflows)
  3. Imports: All imports are at the top of files (no inline imports)
  4. Error handling: Custom error types follow the repository pattern

🎯 Suggestions for Improvement

Optional Enhancement 1: Shutdown Metrics

Consider emitting metrics for:

  • Shutdown duration
  • Number of tasks/workflows remaining at timeout
  • Whether shutdown completed gracefully vs. aborted

Example in guard-core/src/server.rs after line 311:

let shutdown_elapsed = shutdown_start.elapsed();
metrics::SHUTDOWN_DURATION.record(shutdown_elapsed.as_secs_f64(), &[]);

Optional Enhancement 2: Configurable Progress Interval

The 7-second interval is hardcoded. Consider making it configurable in rivet_config::Config if users might want different verbosity levels.
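If it were made configurable, a sketch of the shape it could take in a serde-based config struct (the struct, field name, and default below are hypothetical, not an existing rivet_config field):

```rust
use serde::Deserialize;
use std::time::Duration;

#[derive(Deserialize)]
pub struct ShutdownConfig {
    /// Seconds between "still shutting down" progress messages.
    #[serde(default = "default_progress_interval_secs")]
    pub progress_interval_secs: u64,
}

fn default_progress_interval_secs() -> u64 {
    7
}

impl ShutdownConfig {
    pub fn progress_interval(&self) -> Duration {
        Duration::from_secs(self.progress_interval_secs)
    }
}
```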


Summary

This is a well-implemented PR that improves observability during shutdown. The code quality is high, follows repository conventions, and adds valuable operational visibility. The main suggestions are minor enhancements rather than fixes for bugs.

Recommendation: ✅ Approve with minor suggestions

Great work on improving the graceful shutdown experience!

@MasterPtato MasterPtato force-pushed the 11-26-fix_runtime_evict_runners_on_sigterm_add_periodic_shutdown_progress_messages branch from ef06c0e to f5a5102 on November 27, 2025 00:58
@vercel vercel bot temporarily deployed to Preview – rivetkit-serverless on November 27, 2025 00:58 (Inactive)
@graphite-app
Contributor

graphite-app bot commented Dec 3, 2025

Merge activity

  • Dec 3, 12:33 AM UTC: NathanFlurry added this pull request to the Graphite merge queue.
  • Dec 3, 12:34 AM UTC: CI is running for this pull request on a draft pull request (#3577) due to your merge queue CI optimization settings.
  • Dec 3, 12:34 AM UTC: Merged by the Graphite merge queue via draft PR: #3577.

graphite-app bot pushed a commit that referenced this pull request Dec 3, 2025
@graphite-app graphite-app bot closed this Dec 3, 2025
@graphite-app graphite-app bot deleted the 11-26-fix_runtime_evict_runners_on_sigterm_add_periodic_shutdown_progress_messages branch December 3, 2025 00:34