
Conversation

@MasterPtato
Contributor

No description provided.

vercel bot temporarily deployed to Preview – rivetkit-serverless, December 3, 2025 00:16 (Inactive)
@vercel

vercel bot commented Dec 3, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

Project              Deployment  Preview  Comments  Updated (UTC)
rivetkit-serverless  Ready       Preview  Comment   Dec 3, 2025 0:27am

3 Skipped Deployments

Project          Deployment  Preview  Updated (UTC)
rivet-cloud      Ignored              Dec 3, 2025 0:27am
rivet-inspector  Ignored     Preview  Dec 3, 2025 0:27am
rivet-site       Ignored     Preview  Dec 3, 2025 0:27am

Contributor Author

MasterPtato commented Dec 3, 2025

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more


How to use the Graphite Merge Queue

Add the label merge-queue to this PR to add it to the merge queue.

You must have a Graphite account in order to use the merge queue. Sign up using this link.

An organization admin has enabled the Graphite Merge Queue in this repository.

Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.

This stack of pull requests is managed by Graphite. Learn more about stacking.

@pkg-pr-new

pkg-pr-new bot commented Dec 3, 2025


@rivetkit/cloudflare-workers

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/cloudflare-workers@3576

@rivetkit/db

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/db@3576

@rivetkit/framework-base

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/framework-base@3576

@rivetkit/next-js

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/next-js@3576

@rivetkit/react

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/react@3576

rivetkit

pnpm add https://pkg.pr.new/rivet-dev/rivet/rivetkit@3576

@rivetkit/sql-loader

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/sql-loader@3576

@rivetkit/engine-runner

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/engine-runner@3576

@rivetkit/engine-runner-protocol

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/engine-runner-protocol@3576

commit: 0e5dce8

@claude

claude bot commented Dec 3, 2025

PR Review: fix(serverless): misc pr fixes

This PR makes several improvements to the serverless infrastructure, including fixing KV metadata naming, refactoring the actor event demuxer, and renaming components for clarity. Overall, the changes are solid with good improvements to code organization.

✅ Positive Changes

1. KV Metadata Field Name Fix (actor-kv, api-peer, pegboard-runner)

  • Correctly renames create_ts → update_ts in KV metadata (see the sketch after this list)
  • This is a critical bug fix - the field was semantically incorrect
  • Protocol schema properly updated in v4.bare (line 19: updateTs: i64)
  • Removes misleading comment in api-peer/src/actors/kv_get.rs:62-64
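
For reference, a minimal sketch of what the renamed field might look like on the Rust side; only update_ts: i64 is confirmed by the v4.bare line quoted above, the rest of the struct is assumed for illustration:

// Assumed shape for illustration; not copied from the source.
pub struct KvMetadata {
    /// Timestamp of the last write to this key (previously named `create_ts`).
    pub update_ts: i64,
}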

2. Actor Event Demuxer Refactoring (pegboard-runner/src/actor_event_demuxer.rs)

  • Extracts inline closure to standalone channel_handler function (lines 104-122); a rough sketch follows this list
  • Improves code readability and testability
  • Correctly updates last_seen timestamp when ingesting events (line 42)
  • Good separation of concerns
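
A rough sketch of the extracted handler's shape; the actual signature, channel type, and event handling in actor_event_demuxer.rs may differ (ActorId, Event, and the body below are placeholders):

use tokio::sync::mpsc::UnboundedReceiver;

// Placeholder types for illustration only.
type ActorId = u64;
type Event = Vec<u8>;

// Hypothetical standalone handler replacing the former inline closure.
async fn channel_handler(actor_id: ActorId, mut rx: UnboundedReceiver<Event>) {
    // Drain events for a single actor on its dedicated channel until the
    // sender side is dropped.
    while let Some(event) = rx.recv().await {
        tracing::debug!(?actor_id, len = event.len(), "handling actor event");
    }
}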

3. Terminology Improvements

  • Renames serverless::pool → runner_pool (more accurate, not serverless-specific)
  • Renames serverless::runner → serverless::receiver (better describes role)
  • Renames serverless::connection → serverless::conn (more concise)
  • affects_autoscaler() → affects_pool() (better naming)
  • serverless_runner_created → pool_created (clearer intent)

4. Comment Improvements

  • Clarifies "pending actors" comments (e.g., runner.rs:134, runner2.rs:55)
  • Adds context: "which happen when there is not enough runner capacity"

⚠️ Issues & Concerns

1. Potential Race Condition in runner_pool.rs

Lines 48-57: Draining outdated runners

for runner in outdated {
    // TODO: Spawn sub wf to process these so this is not blocking the loop
    ctx.signal(serverless::receiver::Drain {})
        .to_workflow_id(runner.receiver_wf_id)
        .send()
        .await?;
}

Problem: The loop blocks on each send().await sequentially. If you have many outdated runners, this could delay the workflow significantly.

Recommendation: Consider spawning concurrent tasks or using join_all:

let drain_futures = outdated.iter().map(|runner| {
    ctx.signal(serverless::receiver::Drain {})
        .to_workflow_id(runner.receiver_wf_id)
        .send()
});
futures::future::try_join_all(drain_futures).await?;

Same issue appears in lines 67-72 and 78-89.

2. Removed Completed Runner Cleanup (runner_pool.rs:44-48)

The PR removes this logic:

let completed_runners = ctx
    .activity(GetCompletedInput {
        runners: state.runners.iter().map(|r| r.runner_wf_id).collect(),
    })
    .await?;

state.runners.retain(|r| !completed_runners.contains(&r.runner_wf_id));

Question: How are completed receivers now removed from state.runners? The only removal I see is:

  1. When they're outdated (line 48)
  2. When OutboundConnDrainStarted signal is received (line 101)

If a receiver completes without sending OutboundConnDrainStarted, it may remain in the state indefinitely. Is this intentional? Does the signal always get sent on completion?
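
If the signal is not guaranteed, one option is to keep a periodic sweep similar to the removed logic. A sketch only, assuming GetCompletedInput still exists and the state field is now receiver_wf_id (adapted from the removed code above, not the actual source):

// Re-run the completed check each loop iteration so stale entries
// cannot accumulate.
let completed = ctx
    .activity(GetCompletedInput {
        runners: state.runners.iter().map(|r| r.receiver_wf_id).collect(),
    })
    .await?;
state.runners.retain(|r| !completed.contains(&r.receiver_wf_id));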

3. Potential Memory Leak in Event Demuxer (actor_event_demuxer.rs:40-42)

if let Some(channel) = self.channels.get_mut(&actor_id) {
    let _ = channel.tx.send(event);
    channel.last_seen = Instant::now();
}

Issue: If tx.send(event) fails (receiver dropped), the error is silently ignored with let _ =. The channel remains in the HashMap but is non-functional.

Recommendation: Check the send result and remove dead channels:

if let Some(channel) = self.channels.get_mut(&actor_id) {
    if channel.tx.send(event).is_err() {
        // Channel closed, will be cleaned up by GC
        tracing::warn!(?actor_id, "channel closed, event dropped");
    }
    channel.last_seen = Instant::now();
}
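
A further option, assuming the channels map and fields shown above: drop the dead entry at the point of failure instead of waiting for the GC pass (sketch only).

// Remove the entry as soon as the send fails instead of relying on
// the periodic GC sweep.
let dead = match self.channels.get_mut(&actor_id) {
    Some(channel) => {
        if channel.tx.send(event).is_err() {
            true
        } else {
            channel.last_seen = Instant::now();
            false
        }
    }
    None => false,
};
if dead {
    tracing::warn!(?actor_id, "channel closed, removing from demuxer");
    self.channels.remove(&actor_id);
}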

4. Database Query Optimization (runner_pool.rs:136-157)

The read_desired activity now runs two queries in parallel using tokio::try_join!:

  • op(runner_config::get)
  • UDB read for ServerlessDesiredSlotsKey

Good: Parallel execution reduces latency.

Concern: The UDB query returns desired_slots.unwrap_or_default() (line 155). If the key doesn't exist, it returns 0. Later logic checks desired_slots < 0 (line 176) - this condition will never trigger if the value is missing.

Recommendation: Consider distinguishing between "key missing" vs "value is 0" if this matters for your logic.
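
A minimal sketch of preserving that distinction, assuming the activity can keep the raw Option from the UDB read (the function and types below are illustrative, not the actual code):

// Handle "key missing" and "negative value" explicitly instead of
// collapsing both into 0 via unwrap_or_default().
fn resolve_desired_slots(raw: Option<i64>) -> u32 {
    match raw {
        Some(n) if n < 0 => {
            tracing::warn!(n, "negative desired slots in UDB, clamping to 0");
            0
        }
        Some(n) => n as u32,
        None => {
            tracing::debug!("desired slots key missing, treating as unconfigured");
            0
        }
    }
}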

5. TypeScript SDK Breaking Change (runner-protocol/src/index.ts)

The TypeScript runner-protocol types rename createTs to updateTs in KvMetadata, mirroring the schema change. Impact: this is a breaking change for TypeScript SDK consumers; any code reading metadata.createTs will break.

Recommendation:

  • Document this in release notes
  • Consider deprecation path if this SDK is public-facing
  • Verify all TypeScript consumers are updated

6. Protocol Version Backward Compatibility

The v4.bare schema changes createTs → updateTs in KvMetadata. This affects runners using older protocol versions.

Question: Are there migration steps for runners on older protocol versions? The versioned conversion in runner-protocol/src/versioned.rs still uses create_ts (lines 1536, 1543) - is this intentional for backward compat?
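
If backward compatibility is intended, the versioned conversion presumably has to map the old field forward. A hypothetical illustration only; the types and field names are assumed, not taken from versioned.rs:

// Hypothetical v3 -> v4 conversion: an old record's best available value
// for `update_ts` is its creation time.
struct KvMetadataV3 {
    create_ts: i64,
}

struct KvMetadataV4 {
    update_ts: i64,
}

impl From<KvMetadataV3> for KvMetadataV4 {
    fn from(old: KvMetadataV3) -> Self {
        KvMetadataV4 {
            update_ts: old.create_ts,
        }
    }
}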

🔍 Minor Issues

7. Incomplete TODO Comments

Multiple new TODO comments added:

  • runner_pool.rs:52: "Spawn sub wf to process these so this is not blocking the loop"
  • runner_pool.rs:67: Same TODO
  • runner_pool.rs:78: Same TODO

These indicate known performance issues but no timeline for resolution. Consider creating tracking issues.

8. Error Handling in conn.rs:338-346

The comment says "This is ok, because we only send DrainStarted once" but the code handles retry in conn.rs:101-108. The retry logic sends the signal again if !drain_sent. This seems correct, but the comment is misleading.

9. Inconsistent Signal Handling (runner_pool.rs:96-106)

for sig in ctx.listen_n::<Main>(1024).await? {
    match sig {
        Main::OutboundConnDrainStarted(sig) => {
            state
                .runners
                .retain(|r| r.receiver_wf_id != sig.receiver_wf_id);
        }
        Main::Bump(_) => {}
    }
}

Issue: listen_n(1024) processes up to 1024 signals. If you receive 1024 Bump signals, they're silently dropped (empty handler). Is this intentional?

Expected: Either handle Bump meaningfully or document why it's a no-op.
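
If Bump is only meant to wake the loop (an assumption, not confirmed by the diff), the cheapest fix is to say so at the match arm. A sketch reusing the loop above:

for sig in ctx.listen_n::<Main>(1024).await? {
    match sig {
        Main::OutboundConnDrainStarted(sig) => {
            state
                .runners
                .retain(|r| r.receiver_wf_id != sig.receiver_wf_id);
        }
        // Intentional no-op: Bump only wakes this loop so the scaling
        // logic afterwards re-evaluates desired slots (assumed rationale;
        // confirm against the workflow's actual intent).
        Main::Bump(_) => {}
    }
}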

📋 Testing Recommendations

  1. Test KV metadata migration: Verify old data with create_ts is handled correctly
  2. Test runner pool scaling: Verify outdated runners are drained properly
  3. Test event demuxer: Verify channels are cleaned up when actors disappear
  4. Load test: Verify the TODOs about blocking loops don't cause issues under load
  5. Test receiver lifecycle: Verify all completed receivers are removed from state

🎯 Summary

Critical Issues: 2 (Runner cleanup removal, TypeScript breaking change)
Performance Concerns: 3 (Blocking loops, signal batching)
Potential Bugs: 2 (Event send error handling, missing slots key)
Code Quality: ✅ Good refactoring and naming improvements

Overall Assessment: Good refactoring work with important bug fixes, but needs attention to the removed cleanup logic and potential performance bottlenecks. The blocking TODOs should be addressed before this ships to production.


Review generated by Claude Code at 2025-12-03

@graphite-app
Contributor

graphite-app bot commented Dec 3, 2025

Merge activity

  • Dec 3, 12:33 AM UTC: NathanFlurry added this pull request to the Graphite merge queue.
  • Dec 3, 12:34 AM UTC: CI is running for this pull request on a draft pull request (#3577) due to your merge queue CI optimization settings.
  • Dec 3, 12:34 AM UTC: Merged by the Graphite merge queue via draft PR: #3577.

graphite-app bot pushed a commit that referenced this pull request Dec 3, 2025
graphite-app bot closed this Dec 3, 2025
graphite-app bot deleted the 12-02-fix_serverless_misc_pr_fixes branch December 3, 2025 00:34