Skip to content

Conversation

@MasterPtato
Copy link
Contributor

No description provided.

@vercel
Copy link

vercel bot commented Nov 18, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Preview Comments Updated (UTC)
rivetkit-serverless Error Error Dec 2, 2025 7:11pm
3 Skipped Deployments
Project Deployment Preview Comments Updated (UTC)
rivet-cloud Ignored Ignored Dec 2, 2025 7:11pm
rivet-inspector Ignored Ignored Preview Dec 2, 2025 7:11pm
rivet-site Ignored Ignored Preview Dec 2, 2025 7:11pm

Copy link
Contributor Author

MasterPtato commented Nov 18, 2025

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more


How to use the Graphite Merge Queue

Add the label merge-queue to this PR to add it to the merge queue.

You must have a Graphite account in order to use the merge queue. Sign up using this link.

An organization admin has enabled the Graphite Merge Queue in this repository.

Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.

This stack of pull requests is managed by Graphite. Learn more about stacking.

@pkg-pr-new
Copy link

pkg-pr-new bot commented Nov 18, 2025

More templates

@rivetkit/cloudflare-workers

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/cloudflare-workers@3482

@rivetkit/db

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/db@3482

@rivetkit/framework-base

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/framework-base@3482

@rivetkit/next-js

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/next-js@3482

@rivetkit/react

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/react@3482

rivetkit

pnpm add https://pkg.pr.new/rivet-dev/rivet/rivetkit@3482

@rivetkit/sql-loader

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/sql-loader@3482

@rivetkit/engine-runner

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/engine-runner@3482

@rivetkit/engine-runner-protocol

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/engine-runner-protocol@3482

commit: 1a4f15a

@claude
Copy link

claude bot commented Nov 18, 2025

PR Review: fix(gas): fix loop forgotten bug due to concurrency

Summary

This PR fixes a critical concurrency bug in the gasoline workflow engine where loop event history was being incorrectly cleared, potentially losing events from concurrent iterations.


Core Fix Analysis (engine/packages/gasoline/src/db/kv/mod.rs)

The Problem:
The original code used clear_subspace_range(&loop_events_subspace) which cleared all events under the loop location, regardless of iteration. In concurrent scenarios where multiple loop iterations could be in flight, this could inadvertently delete events from the current or future iterations.

The Solution:
The fix correctly bounds the range to only clear events from iterations 0 through iteration - 1:

  • Start: {loop location, 0, ...}
  • End: {loop location, iteration - 1, ...}

This ensures only past iterations are moved to forgotten history, preserving the current iteration's events.

Code Quality: ✅ Excellent

  • Clear variable names (loop_events_subspace_start, loop_events_subspace_end)
  • Helpful comments explaining the range bounds
  • Uses saturating_sub(1) to safely handle iteration 0 edge case
  • Matches the same pattern used for cleaning old forgotten events (lines 2638-2655)

Other Changes Review

1. pegboard-runner/src/conn.rs - Workflow migration

-workflow(pegboard::workflows::runner::Input {
+workflow(pegboard::workflows::runner2::Input {

Concern: ⚠️ This appears to be switching from runner to runner2 workflow.

Questions:

  • Is this migration intentional or accidentally included?
  • Are there backward compatibility implications?
  • Should there be a deprecation path for the old workflow?

2. pegboard/src/workflows/runner2.rs - Init signal handling

-if init.is_none() {
-    init = Some(init_sig);
-    check_queue = true;
-}
+init = Some(init_sig);
+check_queue = true;

Concern: ⚠️ This removes the guard that prevented overwriting an existing init value.

Potential Issue:

  • If ToServerInit can be received multiple times, this will now blindly overwrite previous init data
  • Could this cause race conditions or lost initialization state?

Recommendation: If this change is intentional (e.g., supporting re-initialization), add a comment explaining why. Otherwise, consider keeping the guard or at least logging when init is being overwritten.

3. Cargo.toml - Commented release profile options

+# strip = true
+# panic = "abort"
+# overflow-checks = false
+# debug-assertions = false

Issue: ❌ These commented-out options should not be in the PR

Recommendation: Remove these lines. They appear to be experimental/debugging configurations that were accidentally included.

4. scripts/docker/build-push.sh - Commented build arg

+# --build-arg CARGO_BUILD_MODE=release \

Issue: ❌ Commented code in scripts

Recommendation: Remove if not needed, or uncomment if intended to be active.

5. scripts/run/docker/engine-rocksdb.sh - Trailing newline

Issue: ⚠️ Minor - removes trailing newline

This is inconsequential but typically files should end with a newline per POSIX standards.


Security Considerations

✅ No security concerns identified. The fix actually improves data integrity by preventing unintended data loss.


Performance Considerations

Improved Performance: The bounded range query is more efficient than the original unbounded subspace range, as it scans fewer keys.


Test Coverage

⚠️ Missing Tests

Concerns:

  1. No test coverage for the concurrent loop iteration scenario
  2. Existing loop_test.rs only tests basic sequential loop functionality
  3. The bug being fixed (concurrent iteration event deletion) is not covered by any test

Recommendations:

  1. Add a test that:
    • Runs a loop with slow iterations
    • Triggers concurrent signals/events during loop execution
    • Verifies that events from all iterations are preserved correctly
  2. Add a test specifically for the forgotten history bounds to ensure iteration N-1 events are moved but iteration N events remain

Example test structure:

#[workflow(ConcurrentLoopTestWorkflow)]
pub async fn concurrent_loop_test(ctx: &mut WorkflowCtx, input: &Input) -> Result<()> {
    ctx.loope(0, move |ctx, state| {
        async move {
            // Trigger some events during iteration
            ctx.activity(...).await?;
            
            if *state >= 5 {
                return Ok(Loop::Break(()));
            }
            *state += 1;
            Ok(Loop::Continue)
        }.boxed()
    }).await?;
    
    // Verify all iteration events are in forgotten history
    Ok(())
}

Summary of Recommendations

Must Fix:

  1. ❌ Remove commented Cargo.toml release profile options
  2. ❌ Remove or uncomment the Docker build script changes

Should Address:

  1. ⚠️ Clarify the runner → runner2 migration intent
  2. ⚠️ Review the init signal handling change - document or revert
  3. ⚠️ Add test coverage for concurrent loop scenarios

Nice to Have:

  • Add trailing newline to engine-rocksdb.sh

Overall Assessment

The core bug fix is excellent and addresses a real concurrency issue with a correct, performant solution. However, the PR includes several unrelated or questionable changes that should be addressed before merging.

Recommendation: Request changes to clean up the non-essential modifications and clarify the intentional behavior changes in runner2.rs.

@claude
Copy link

claude bot commented Nov 19, 2025

Code Review

Summary

This PR fixes a critical concurrency bug in the workflow loop history management where events from all loop iterations were being moved to forgotten history, not just events from completed iterations. The fix correctly bounds the range query to only include events up to the current iteration.


✅ Positives

  1. Critical Bug Fix: The main change correctly addresses the concurrency issue by:

    • Changing from a full subspace query to a bounded range query
    • Using iteration.saturating_sub(1) as the upper bound to exclude the current iteration
    • Properly clearing only the bounded range instead of the entire subspace
  2. Clear Documentation: The comments clearly explain the intent ("Move all events under this loop up to the current iteration to the forgotten history")

  3. Safe Arithmetic: Using saturating_sub(1) prevents underflow if iteration is 0

  4. Consistent Signal Handling: The runner2 workflow fix allows re-initialization signals to be processed correctly by removing the guard condition


🔍 Issues & Concerns

Critical: Missing Key Deletion (engine/packages/gasoline/src/db/kv/mod.rs:2631)

The current code moves entries to forgotten history but doesn't delete the original keys:

tx.set(&forgotten_key, entry.value());
// Missing: tx.clear(entry.key());

Impact: This will leave duplicate data in the active history, potentially causing:

  • Incorrect workflow replay behavior
  • Wasted storage space
  • Query performance degradation

Recommendation: Add tx.clear(entry.key()); after line 2631.


Medium: Potential Off-by-One Issue (engine/packages/gasoline/src/db/kv/mod.rs:2598)

Using iteration.saturating_sub(1) as the exclusive upper bound means events from iteration N-1 will be included in the range. This may be correct, but needs verification:

  • If the current iteration is 5, the range will include iterations 0-4
  • Is this the intended behavior, or should it exclude iteration 4 (the immediately previous iteration)?

Recommendation: Verify with the workflow loop semantics whether the previous iteration's events should be moved to forgotten history or remain active.


Low: Commented Out Production Code (Cargo.toml:437-440, scripts/docker/build-push.sh:29)

Multiple production configuration options are commented out:

# strip = true
# panic = "abort"
# overflow-checks = false
# debug-assertions = false
# --build-arg CARGO_BUILD_MODE=release \

Impact:

  • These look like debugging/development changes that shouldn't be committed
  • May indicate the PR includes unintended changes
  • Could confuse future developers about why these are commented

Recommendation: Remove these commented lines or document why they're included.


Low: Unnecessary File Change (scripts/run/docker/engine-rocksdb.sh:16)

The removal of a trailing newline appears unrelated to the bug fix.

Recommendation: Revert this cosmetic change to keep the PR focused.


🧪 Testing Concerns

No Test Coverage Evident: The PR doesn't appear to include:

  • Unit tests for the bounded range query logic
  • Integration tests verifying the loop iteration boundary behavior
  • Tests ensuring events from the current iteration remain in active history

Recommendation: Add tests that verify:

  1. Events from completed iterations are moved to forgotten history
  2. Events from the current iteration remain in active history
  3. The edge case where iteration = 0 is handled correctly
  4. Concurrent workflow loops don't interfere with each other

🔒 Security Considerations

No security vulnerabilities identified. The use of saturating_sub prevents integer underflow.


⚡ Performance Considerations

Positive: The bounded range query is more efficient than querying the entire subspace, especially for long-running loops with many iterations.

Note: The loop that moves entries to forgotten history processes them sequentially, which is appropriate for maintaining consistency.


📝 Style & Conventions

According to CLAUDE.md:

✅ Uses structured logging with tracing
✅ Comments are lowercase
✅ Uses anyhow Result types correctly
✅ Follows existing code patterns


🎯 Recommendations Summary

  1. MUST FIX: Add tx.clear(entry.key()); after moving entries to forgotten history
  2. SHOULD VERIFY: Confirm the off-by-one semantics for iteration boundaries
  3. SHOULD REMOVE: Commented out code in Cargo.toml and build scripts
  4. SHOULD ADD: Test coverage for the loop iteration boundary logic

Overall Assessment

The core bug fix is sound and addresses a real concurrency issue, but the missing key deletion is a critical oversight that must be addressed before merging. Once that's fixed and the commented code is cleaned up, this will be a solid improvement to the workflow history management system.

@claude
Copy link

claude bot commented Nov 19, 2025

PR Review: Fix loop forgotten bug due to concurrency

Summary

This PR addresses a critical bug in the workflow engine's loop event history management. The fix ensures that when moving loop events to forgotten history, only events from iterations 0 through (current iteration - 1) are moved, preventing concurrent loop iterations from incorrectly archiving each other's events.

Code Quality & Best Practices ✅

Main Fix (gasoline/src/db/kv/mod.rs:2579-2634)

Good:

  • The fix correctly calculates precise range boundaries using saturating_sub(1) to prevent underflow
  • Clear, descriptive comments explain the intent ("Start is {loop location, 0, ...}" and "End is {loop location, iteration - 1, ...}")
  • Properly uses .range().0 and .range().1 to extract start/end boundaries from the subspace

Observation:

  • The use of saturating_sub(1) on line 2598 is defensive programming for iteration == 0, though logically this function shouldn't be called with iteration == 0. This is good defensive coding.

Workflow Fix (pegboard/src/workflows/runner2.rs:103-108)

Good:

  • Simplifies the initialization logic by removing the unnecessary if init.is_none() guard
  • The change allows re-initialization, which makes sense for handling reconnections or state recovery

Potential Issue:
The removal of the if init.is_none() check means init can be overwritten multiple times if multiple ToServerInit signals arrive in the same batch.

Question: Is this the intended behavior? Consider these scenarios:

  1. If multiple init packets arrive (network retransmission, reconnection), should we always take the latest one?
  2. Could this mask bugs where multiple init packets indicate an unexpected state?

Suggestion: Consider either:

  • Adding a warning log when overwriting an existing init: if init.is_some() { tracing::warn!("overwriting existing init signal"); }
  • Or documenting why multiple inits are expected and safe

Potential Bugs & Issues

1. Edge Case: iteration == 0 ⚠️

When iteration == 0, we have:

  • loop_events_subspace_start: points to iteration 0
  • loop_events_subspace_end: points to iteration 0 (via saturating_sub(1))

This creates a range where start >= end. Need to verify that clear_range handles this gracefully (likely a no-op, which is correct).

2. Missing Entry Deletion ❓

On line 2631, entries are copied to forgotten_history_subspace but the original entries are only deleted via the clear_range at line 2634. Verify that if an error occurs between copying (2631) and clearing (2634), we don't leave duplicate entries or partial state.

Recommendation: Consider transaction semantics - ensure this entire operation is atomic.

Performance Considerations

Range Query Optimization ✅

The fix significantly improves performance by limiting the range query to only the relevant iterations instead of querying all events under the loop location. This prevents O(total iterations) work on each loop update.

Stream Processing ✅

The existing streaming approach (get_ranges_keyvalues with StreamingMode::WantAll) is appropriate for handling potentially large event sets.

Security Concerns

No Major Issues ✅

The changes don't introduce security vulnerabilities. The range calculation is safe from integer overflow due to saturating_sub.

Test Coverage

Missing Tests ⚠️

This PR doesn't include tests. Given the bug was described as "forgotten due to concurrency", recommend adding:

  1. Concurrency test: Multiple concurrent loop iterations to verify isolation
  2. Boundary test: Loop with iteration == 0, iteration == 1
  3. Forgotten history test: Verify events are moved correctly and old events (> 100 iterations) are pruned

Example test scenario:

// Pseudo-code
#[test]
async fn test_concurrent_loop_iterations_dont_interfere() {
    // Create workflow with loop at iteration 5
    // Concurrently:
    //   - Thread A updates loop to iteration 6
    //   - Thread B updates loop to iteration 6
    // Verify: Events from iterations 0-5 are in forgotten
    // Verify: Events from iteration 6 are in active
    // Verify: No events are lost or duplicated
}

Additional Changes Review

Cargo.toml

The commented-out release optimizations appear to be development/debugging leftovers:

# strip = true
# panic = "abort"
# overflow-checks = false
# debug-assertions = false

Recommendation: Remove these commented lines or explain why they're being kept for future reference.

scripts/docker/build-push.sh

Similarly, commented build arg:

# --build-arg CARGO_BUILD_MODE=release \

Recommendation: Clean up commented code or document the intent.

scripts/run/docker/engine-rocksdb.sh

Removed trailing newline - this is fine, just a formatting cleanup.

Overall Assessment

Status:Approve with minor recommendations

The core fix correctly addresses the concurrency bug by precisely limiting the range of events moved to forgotten history. The logic is sound and the implementation properly uses the database range primitives.

Required before merge:

  • Add tests for concurrent loop iterations
  • Verify behavior when iteration == 0
  • Clean up commented code in Cargo.toml and build scripts

Recommended:

  • Add logging when init is overwritten in runner2.rs
  • Document expected behavior for multiple ToServerInit signals

Great work identifying and fixing this subtle concurrency issue!

@MasterPtato MasterPtato force-pushed the 11-17-fix_pb_rewrite_runner_wf_to_handle_batch_signals branch from e68ccc5 to 9a87fd1 Compare November 20, 2025 03:07
@MasterPtato MasterPtato force-pushed the 11-18-fix_gas_fix_loop_forgotten_bug_due_to_concurrency branch from 4294775 to 5458e50 Compare November 20, 2025 03:07
@claude
Copy link

claude bot commented Nov 20, 2025

Code Review

Summary

This PR fixes a critical bug in the workflow loop event history management where events from all iterations (including future ones) were being moved to forgotten history instead of only past iterations. The fix also includes some related cleanup in the pegboard runner workflow.


Critical Bug Fix ✅

Location: engine/packages/gasoline/src/db/kv/mod.rs:2579-2634

The Problem:
The original code used EventHistorySubspaceKey::entire() which creates a range that includes all iterations of a loop (0 to infinity). This meant that when moving events to forgotten history, it would incorrectly include events from the current and future iterations due to concurrent workflow execution.

The Solution:
The fix correctly bounds the range to only include iterations 0 through iteration - 1:

  • Start: {loop location, 0, ...} (iteration 0)
  • End: {loop location, iteration - 1, ...} (one before current)

This ensures only past iterations are moved to forgotten history, which is the correct behavior.

Code Quality: ⭐⭐⭐⭐⭐

  • Uses saturating_sub(1) to safely handle edge case when iteration = 0
  • Clear, descriptive comments explain the range boundaries
  • Proper use of .range() tuple extraction for start/end bounds

Secondary Changes

1. Pegboard Runner Init Logic (runner2.rs:103-108)

Change: Removed the if init.is_none() guard when handling ToServerInit signals.

Analysis:

  • Before: Only accepted the first init signal
  • After: Accepts and overwrites with any new init signal

Potential Concern: ⚠️
Is it expected/safe for a runner to re-initialize multiple times? If this is intentional (e.g., for reconnection scenarios), this is fine. However, if ToServerInit should only be sent once per runner lifecycle, this could mask bugs where duplicate init signals are sent.

Recommendation: Consider adding a log message when overwriting an existing init:

if init.is_some() {
    tracing::warn!("received duplicate ToServerInit signal");
}
init = Some(init_sig);
check_queue = true;

2. Workflow Reference Update (conn.rs:128)

Change: pegboard::workflows::runner::Inputpegboard::workflows::runner2::Input

Analysis: This appears to be migrating to a new runner2 workflow implementation. Straightforward refactor.

Question: Is the old runner workflow being phased out? If so, consider:

  • Adding a deprecation comment to the old workflow
  • Ensuring all callsites are updated in this or future PRs

3. Build Script Changes (scripts/docker/build-push.sh:26-30)

Changes:

  • Multi-line formatting for readability
  • Commented-out --build-arg CARGO_BUILD_MODE=release

Analysis: The commented-out release mode arg suggests this might be work-in-progress or debugging configuration. The formatting improvement is good.


Performance Considerations

Range Query Optimization:
The fix changes from a subspace range query to an explicit bounded range query. This is more efficient because:

  • Avoids scanning events from current/future iterations
  • Database can better optimize with explicit bounds
  • Reduces unnecessary data transfer and processing

Database Operations:

  • tx.clear_range() instead of tx.clear_subspace_range() - both are appropriate, but the explicit range version is more precise given the bounded range

Security Considerations

No security concerns identified. The changes are internal workflow state management operations.


Test Coverage

Recommendation: ⚠️
This bug involved concurrency issues (events from concurrent iterations being incorrectly processed). Consider adding:

  1. Unit test that verifies the range bounds:

    • Create a workflow with loop iterations 0, 1, 2, 3
    • Trigger forgotten history migration at iteration 3
    • Verify only iterations 0-2 are moved to forgotten history
    • Verify iteration 3 remains in active history
  2. Integration test for concurrent execution:

    • Simulate concurrent loop iterations
    • Verify forgotten history is correctly isolated per iteration

Additional Issues Found

Cargo.toml (Cargo.toml:436-440)

Issue: Commented-out release profile configurations:

# strip = true
# panic = "abort"
# overflow-checks = false
# debug-assertions = false

Recommendation:
These should either be:

  1. Removed entirely if not needed
  2. Uncommented and committed if they're intended configuration
  3. Moved to a separate .cargo/config.toml if they're developer-specific

Leaving commented-out configuration in version control creates confusion about intended settings. According to CLAUDE.md, commented code/config should be avoided.

Script Newline Issue (scripts/run/docker/engine-rocksdb.sh:15)

Issue: Removed trailing newline at EOF

Recommendation: Minor - POSIX compliance suggests files should end with a newline. This is a trivial change but consider preserving it.


Overall Assessment

Code Quality: ⭐⭐⭐⭐ (4/5)
Bug Fix Correctness: ⭐⭐⭐⭐⭐ (5/5)
Test Coverage: ⭐⭐⭐ (3/5 - needs concurrent test)

Approval Status: ✅ APPROVE (with minor suggestions)

The core bug fix is excellent and addresses a real concurrency issue correctly. The secondary changes are reasonable but could use clarification on the init signal handling behavior. Clean up the commented-out Cargo.toml changes before merging.


Action Items Before Merge

  1. Must Fix:

    • Remove or uncomment the Cargo.toml release profile settings
  2. Should Consider:

    • Add logging for duplicate ToServerInit signals
    • Add test coverage for the range boundary fix
    • Clarify the commented-out Docker build arg
  3. Nice to Have:

    • Document the migration from runner to runner2 workflow

@claude
Copy link

claude bot commented Nov 20, 2025

PR Review: Fix loop forgotten bug due to concurrency

Overview

This PR fixes a critical bug in the workflow engine's loop history management where concurrent operations could cause events from ALL iterations of a loop to be moved to forgotten history, rather than just the events from completed iterations up to the current one.

Code Quality & Implementation

Main Fix (gasoline/src/db/kv/mod.rs)

The core fix correctly addresses the concurrency issue by:

  1. Precise Range Queries: Changed from using a subspace (which would match ALL iterations) to explicit start/end ranges that only target iterations 0 through iteration - 1.

    • Start: {loop location, 0, ...}
    • End: {loop location, iteration - 1, ...}
  2. Proper Range Clear: Uses clear_range() with explicit bounds instead of clear_subspace_range(), ensuring only the intended range is cleared.

  3. Edge Case Handling: Uses saturating_sub(1) to safely handle iteration = 0 case without underflow.

The comments are also improved, making the intent clearer ("up to the current iteration" instead of just "current events").

Potential Issues & Concerns

1. Minor: Iteration = 0 Edge Case

When iteration = 0, the end range becomes:

iteration.saturating_sub(1) // = 0 when iteration is 0

This means the range would be {location, 0} ... {location, 0}. Depending on how range() works, this might:

  • Return an empty range (correct behavior)
  • Include iteration 0 (incorrect - would try to forget the current iteration's events)

Recommendation: Add a guard condition:

if iteration > 0 {
    // Only move events to forgotten if we have completed iterations
    let loop_events_subspace_start = ...
    // ... rest of the code
}

2. Configuration Changes (Cargo.toml)

The commented-out release profile optimizations are concerning:

# strip = true
# panic = "abort"
# overflow-checks = false
# debug-assertions = false
  • These should either be removed entirely or have a clear comment explaining why they're included but commented
  • Leaving them suggests they might have been used for debugging this issue
  • overflow-checks = false is particularly dangerous given the .saturating_sub(1) usage in the fix

Recommendation: Remove these lines or add a comment explaining their purpose.

3. Unrelated Changes

Several changes appear unrelated to the loop forgotten bug:

  • pegboard-runner/src/conn.rs: Changed from runner::Input to runner2::Input
  • pegboard/src/workflows/runner2.rs: Removed conditional check for init.is_none(), now always replaces init
  • scripts/docker/build-push.sh: Added commented CARGO_BUILD_MODE argument
  • scripts/run/docker/engine-rocksdb.sh: Removed trailing newline

Recommendation: Consider splitting these changes into separate commits or PRs for clearer history. If they're related to reproducing/testing the bug, please document that relationship.

Performance Considerations

Good: The fix is more efficient than the original - it uses precise range bounds instead of iterating over all loop events.

Security Concerns

⚠️ Medium Priority: The iteration handling needs careful review:

  • Integer underflow is handled with saturating_sub
  • However, the behavior at iteration = 0 should be explicitly tested
  • Consider adding debug assertions or tracing to validate the range is correct

Test Coverage

Missing: No test changes are included in this PR.

Recommendations:

  1. Add a test that creates a loop with multiple iterations and verifies:

    • Only completed iterations' events are moved to forgotten history
    • Current iteration's events remain in active history
    • Concurrent updates don't cause data loss
  2. Add a test for the iteration = 0 edge case

  3. Test the "last 100 events" retention logic (lines 2636-2656)

Additional Observations

Logging: Consider adding structured logging around the forgotten history logic:

tracing::debug!(
    ?from_workflow_id,
    ?location, 
    iteration,
    "moving loop events to forgotten history"
);

This would help diagnose similar issues in the future.

Summary

Strengths:

  • Correctly identifies and fixes the core concurrency bug
  • More efficient implementation
  • Better comments

Action Items:

  1. 🔴 Required: Verify iteration = 0 behavior or add guard condition
  2. 🔴 Required: Add test coverage for this bug fix
  3. 🟡 Recommended: Remove or document commented Cargo.toml changes
  4. 🟡 Recommended: Separate unrelated changes into different commits/PRs
  5. 🟡 Recommended: Add debug logging for troubleshooting

Overall, this is a good fix for a critical bug, but needs test coverage before merging.

@MasterPtato MasterPtato force-pushed the 11-18-fix_gas_fix_loop_forgotten_bug_due_to_concurrency branch from 5458e50 to 8431b53 Compare November 21, 2025 19:57
@MasterPtato MasterPtato force-pushed the 11-17-fix_pb_rewrite_runner_wf_to_handle_batch_signals branch from 9a87fd1 to 0560d9b Compare November 21, 2025 19:57
@claude
Copy link

claude bot commented Nov 26, 2025

PR Review: fix(gas): fix loop forgotten bug due to concurrency

Summary

This PR addresses a critical bug in the gasoline workflow engine's loop event handling where concurrent loop iterations could cause events from the current iteration to be incorrectly moved to forgotten history.

Code Quality & Best Practices

Strengths

  1. Correct Range Construction - The fix properly constructs bounded ranges to limit which events get archived:

    • loop_events_subspace_start: Points to iteration 0
    • loop_events_subspace_end: Points to iteration - 1
    • This ensures only events from previous iterations are moved to forgotten history, not the current one
  2. Good Use of saturating_sub - Using saturating_sub(1) prevents underflow when iteration = 0, though this case is already handled by the if iteration == 0 branch

  3. Improved Comments - Comments clearly explain the intent:

    • "Move all events under this loop up to the current iteration to the forgotten history"
    • Clear start/end range documentation

⚠️ Areas for Improvement

1. Inconsistent Workflow Reference Path (engine/packages/pegboard-runner/src/conn.rs:131)

.workflow(pegboard::workflows::runner2::Input {

This change from pegboard::workflows::runner::Input to pegboard::workflows::runner2::Input appears to be switching to a new runner implementation. However:

  • Is this intentional for this PR? The commit message mentions a gasoline bug fix, not a runner migration
  • Are both implementations compatible? Need to verify this doesn't break existing runners
  • Should this be a separate PR? Mixing feature changes with bug fixes makes rollback harder

Recommendation: If runner2 is the correct implementation, ensure:

  • This is documented in the PR description
  • All callers have been updated
  • The old runner workflow can be safely deprecated

2. Init Signal Handling Change (engine/packages/pegboard/src/workflows/runner2.rs:106-108)

The change removes the if init.is_none() guard:

// BEFORE
if init.is_none() {
    init = Some(init_sig);
    check_queue = true;
}

// AFTER  
init = Some(init_sig);
check_queue = true;

Concerns:

  • Lost Idempotency: Previously, multiple init signals were ignored. Now they'll overwrite each other
  • Intentional Behavior?: Is it valid for a runner to receive multiple init signals?
  • Race Conditions: Could concurrent init signals cause issues?

Recommendation:

  • Add a comment explaining why reinit is now allowed
  • Or add a warning log when overwriting existing init: if init.is_some() { tracing::warn!("overwriting existing init signal"); }
  • Consider if this could mask bugs where duplicate inits shouldn't happen

3. Commented-Out Build Configurations

In Cargo.toml and scripts/docker/build-push.sh, several release optimizations are commented out:

# strip = true
# panic = "abort"
# overflow-checks = false
# debug-assertions = false

And in the build script:

# --build-arg CARGO_BUILD_MODE=release \

Questions:

  • Why are these commented instead of removed? Suggests they're being tested or are WIP
  • Should these be in this PR? They don't relate to the loop forgotten bug fix
  • Performance Impact: Commented optimizations could affect production builds

Recommendation:

  • If these are exploratory changes, move to a separate PR
  • If keeping, add comments explaining why they're disabled
  • Consider if this is affecting the bug fix testing

Potential Bugs & Issues

🔴 Critical: Missing Delete Operation (engine/packages/gasoline/src/db/kv/mod.rs:2628)

In the loop that moves events to forgotten history:

loop {
    let Some(entry) = stream.try_next().await? else {
        break;
    };
    
    // ... validation ...
    
    let forgotten_key = [forgotten_history_subspace.bytes(), truncated_key].concat();
    tx.set(&forgotten_key, entry.value());
    // ❌ MISSING: tx.delete(entry.key())
}

tx.clear_range(&loop_events_subspace_start, &loop_events_subspace_end);

Analysis:
The code iterates through events, copies them to forgotten history with tx.set, but doesn't delete the originals in the loop. It relies on tx.clear_range afterwards to delete them.

Is this a bug?

  • If clear_range is called after the loop completes, this is correct (batch delete is more efficient)
  • However, if the transaction fails/retries between the copy and the clear, this could cause:
    • Safe: Keys exist in both places temporarily (duplication, not data loss)
    • ⚠️ Inefficient: Transaction retries might re-copy already copied data

Recommendation:

  • Current approach is acceptable if transactions are atomic
  • Consider adding a comment: // Delete all copied events in batch (more efficient than per-key deletion)
  • Verify that clear_range properly executes even if the transaction retries

⚠️ Edge Case: What if iteration = 0?

When iteration = 0, the code takes a different path (keys::history::insert::loop_event), so the range-based archiving doesn't run. This is correct, but worth documenting.

Recommendation: Add a comment to clarify:

if iteration == 0 {
    // First iteration: no previous events to archive
    keys::history::insert::loop_event(/*...*/)?;
} else {
    // Archive events from iterations 0 to iteration-1
    // ...
}

Performance Considerations

Improved Efficiency

The change from subspace() (which would scan all loop iterations) to explicit range() with start/end boundaries is a significant performance improvement:

// BEFORE: Would scan ALL iterations
let loop_events_subspace = self.subspace.subspace(&EventHistorySubspaceKey::entire(...));

// AFTER: Only scans iterations 0 to iteration-1  
let loop_events_subspace_start = /* iteration 0 */
let loop_events_subspace_end = /* iteration - 1 */

This prevents the bug AND improves performance by limiting the scan range.

⚠️ Potential Concern: Large Iteration Counts

The code keeps the last 100 iterations in forgotten history:

if iteration > 100 {
    // Clean up old forgotten events
}

Questions:

  • For long-running loops with many events per iteration, could forgotten history grow unbounded?
  • Is there a limit on events per iteration that should be enforced?

Recommendation: Consider adding metrics to monitor:

  • Average events per iteration
  • Total forgotten history size
  • Cleanup frequency

Security Concerns

No security issues identified

The changes deal with internal workflow state management and don't expose new attack vectors. The use of bounded ranges actually improves safety by preventing accidental over-deletion.

Test Coverage

Missing: Test Updates

The PR doesn't include any test changes or additions. Given the critical nature of this bug (data being incorrectly archived), this needs tests:

Required Tests:

  1. Concurrent Loop Iterations: Verify that iteration N doesn't archive its own events
  2. Boundary Conditions:
    • iteration = 0 (first iteration)
    • iteration = 1 (should only archive iteration 0)
    • iteration > 100 (forgotten history cleanup)
  3. Range Correctness: Verify events from iterations [0, iteration-1] are archived, not [0, iteration]

Recommendation:

#[test]
async fn test_loop_event_concurrent_iteration_archival() {
    // Setup: Create loop with iteration 5
    // Insert events for iterations 0-5
    // Call upsert_workflow_loop_event with iteration=5
    // Assert: Events for 0-4 are in forgotten, events for 5 are in active
}

Final Recommendations

🔴 Must Address Before Merge:

  1. Clarify the runnerrunner2 change - Is this intentional? Should it be separate?
  2. Add tests for the loop event archival fix
  3. Document or remove the commented-out build configurations
  4. Review the init signal handling change - Is duplicate init allowed?

⚠️ Should Address:

  1. Add inline comments explaining the archival range logic
  2. Add logging for key operations (especially init overwrite)
  3. Consider adding metrics for forgotten history size

Good to Merge After Addressing Above:

The core bug fix (range-based event archival) is solid and addresses a real concurrency issue. Once the peripheral changes are clarified and tests are added, this should be safe to merge.


Overall Assessment: The main bug fix is well-reasoned and correct, but the PR includes several unrelated changes that need clarification. The lack of tests for a critical bug fix is concerning.

@claude
Copy link

claude bot commented Nov 26, 2025

Code Review

I've reviewed this PR which fixes a loop forgotten bug due to concurrency in the gasoline workflow engine. Here's my analysis:

Core Fix: Loop Event Range Selection (engine/packages/gasoline/src/db/kv/mod.rs)

Positive:

  • Critical bug fix: The change correctly addresses a concurrency issue where all loop events were being moved to forgotten history, not just events from previous iterations
  • Precise range selection: The new implementation uses specific start/end ranges:
    • Start: {loop location, 0, ...}
    • End: {loop location, iteration - 1, ...}
  • Overflow safety: Uses saturating_sub(1) to handle the edge case when iteration is 0
  • Improved comments: Better documentation explaining the range bounds

Issues & Suggestions:

  1. Potential edge case: When iteration == 0, the range end becomes iteration.saturating_sub(1) which is 0. This means the range is [0, 0) - an empty range. This is correct behavior (no events to forget on first iteration), but worth verifying this is intentional.

  2. Missing transaction error handling: Line 2628 sets a value in the transaction (tx.set(&forgotten_key, entry.value())), but there's no explicit error handling if this fails. Consider whether this could leave the DB in an inconsistent state if it fails after some entries are moved.

  3. Performance consideration: The code does a stream read + individual tx.set() calls + a final clear_range(). For large iteration counts with many events, this could be slow. Consider if batch operations would be more efficient.

Pegboard Runner Changes (engine/packages/pegboard/src/workflows/runner2.rs)

Line 105-108:

init = Some(init_sig);
check_queue = true;

Issue: The removal of the if init.is_none() check means that if multiple ToServerInit signals arrive, the init will be overwritten.

  • Question: Is this intentional? Can multiple init signals arrive due to concurrency/retries?
  • If multiple inits can arrive, should we log a warning when overwriting?
  • If only one init should arrive, should we error/warn on duplicate inits?

Recommendation: Add a check or comment explaining why it's safe to overwrite init, or add error handling if this shouldn't happen.

Commented Code Issues

Cargo.toml (lines 436-439):

# strip = true
# panic = "abort"
# overflow-checks = false
# debug-assertions = false

Critical: These commented-out profile settings should not be committed to the codebase:

  • If they're temporary debugging settings, remove them
  • If they're intended for future use, document them with a TODO and context
  • Commented code in version control is an anti-pattern

scripts/docker/build-push.sh (line 29):

# --build-arg CARGO_BUILD_MODE=release \

Issue: Commented build arg in the middle of a command. This suggests:

  • Either this was used for debugging and should be removed
  • Or there's an incomplete feature that should be finished or documented
  • The script has a syntax error risk if uncommented without proper formatting

scripts/run/docker/engine-rocksdb.sh (line 18):

  • Removed trailing newline - this is fine but seems unrelated to the PR

Security & Best Practices

No security concerns identified - The changes are internal database operations and workflow state management.

Best practices observations:

  • ✅ Good use of structured logging with tracing
  • ✅ Proper error propagation with ? operator
  • ✅ Range safety with saturating_sub
  • ⚠️ Commented code should be cleaned up

Test Coverage

Recommendation: This PR fixes a concurrency bug. Consider adding:

  1. A test that simulates concurrent loop iterations to verify events are properly scoped
  2. Edge case test for iteration == 0
  3. Test verifying the 100-event retention limit still works correctly

Summary

Approval recommendation: ⚠️ Approve with changes

The core bug fix is sound and addresses a real concurrency issue. However:

  1. Must fix: Remove commented code from Cargo.toml and scripts/docker/build-push.sh
  2. Should clarify: The init overwrite behavior in runner2.rs - is this intentional?
  3. Consider: Add tests for the concurrency scenario this fixes

Once the commented code is cleaned up and the init overwrite is clarified, this should be good to merge.

@claude
Copy link

claude bot commented Nov 27, 2025

Code Review - PR #3482: fix(gas): fix loop forgotten bug due to concurrency

Summary

This PR fixes a critical concurrency bug in the loop event history management system where events from all loop iterations were being moved to forgotten history, rather than just events up to the current iteration.


Code Quality & Best Practices

Strengths

  1. Clear bug fix: The change correctly addresses the issue by using a bounded range instead of clearing all events under a loop
  2. Good comments: Updated comments accurately reflect the new behavior ("up to the current iteration")
  3. Appropriate method usage: Changed from clear_subspace_range to clear_range with explicit bounds
  4. Safe arithmetic: Uses saturating_sub(1) to prevent underflow on iteration 0

⚠️ Areas for Improvement

  1. Commented code in Cargo.toml (lines 436-439):

    # strip = true
    # panic = "abort"
    # overflow-checks = false
    # debug-assertions = false

    These commented-out release profile options should either be:

    • Removed if not needed
    • Uncommented if they're part of the fix
    • Moved to a separate PR if they're exploratory changes

    Recommendation: Remove these lines to keep the PR focused on the bug fix.

  2. Commented code in scripts/docker/build-push.sh (line 29):

    # --build-arg CARGO_BUILD_MODE=release \

    Similar issue - this commented line should be removed or moved to a separate PR.

  3. Missing newline at EOF: scripts/run/docker/engine-rocksdb.sh now ends without a trailing newline (line deleted). This is a minor style issue but worth noting.


Correctness & Logic

Core Fix is Sound

Before: The code was clearing ALL events under a loop location:

let loop_events_subspace = self.subspace.subspace(&keys::history::EventHistorySubspaceKey::entire(...));
tx.clear_subspace_range(&loop_events_subspace);

After: The code now clears only events from iteration 0 to (current iteration - 1):

let loop_events_subspace_start = /* iteration 0 */
let loop_events_subspace_end = /* iteration - 1 */
tx.clear_range(&loop_events_subspace_start, &loop_events_subspace_end);

This ensures that events from the current iteration and beyond are preserved.

⚠️ Potential Edge Cases

  1. Iteration 0 behavior: When iteration = 0, saturating_sub(1) returns 0, so both start and end will point to iteration 0. This means:

    • The range will be [0, 0) (exclusive end)
    • No events will be cleared
    • This appears correct, but should be verified in testing
  2. Range semantics: Verify that clear_range uses exclusive end semantics. The code assumes that range().1 gives an exclusive upper bound, which should be validated.


Performance Considerations

Good Performance Characteristics

  1. Bounded range queries: The change from clearing entire subspaces to bounded ranges is more efficient and precise
  2. No additional database roundtrips: The logic still uses a single transaction

💡 Potential Optimization

The code still iterates through all matching events to move them to forgotten history, then clears the range. Consider whether this could be optimized with a single range operation if the database supports atomic move/rename operations.


Security Concerns

No Security Issues Identified

  • The change doesn't introduce any new attack vectors
  • Proper key construction using the existing tuple packing system
  • No user input directly influences the range bounds

Test Coverage

⚠️ Testing Concerns

  1. No test changes in this PR: This is a critical bug fix but doesn't include new tests or updates to existing tests
  2. Recommended test scenarios:
    • Loop event cleanup at iteration 0
    • Loop event cleanup at iteration 1
    • Loop event cleanup after 100+ iterations (to verify the cleanup logic)
    • Concurrent loop iterations to verify the race condition is fixed
    • Verify events from current and future iterations are NOT moved to forgotten history

Recommendation: Add integration tests that verify:

// Test case: Events from current iteration should remain in active history
// Test case: Events from previous iterations should move to forgotten history  
// Test case: Cleanup should work correctly at iteration boundaries (0, 1, 100, 101)

Additional Changes Review

pegboard/src/workflows/runner2.rs

Change: Removed conditional check for init.is_none()

// Before:
if init.is_none() {
    init = Some(init_sig);
    check_queue = true;
}

// After:
init = Some(init_sig);
check_queue = true;

Analysis: This allows re-initialization, which may be intentional behavior to handle reconnection scenarios. However, this seems like a significant behavior change that should be:

  1. Mentioned in the PR description
  2. Verified as safe (won't cause issues with existing initialized states)
  3. Potentially separated into its own PR

pegboard-runner/src/conn.rs

Change: Updated workflow reference from runner to runner2

This appears to be switching to a new workflow implementation. Should verify:

  • Is runner2 fully tested and ready for production?
  • Is there a migration plan for existing runner workflows?
  • Should this be in a separate PR?

Recommendations

  1. High Priority: Remove commented code from Cargo.toml and build-push.sh
  2. High Priority: Add test coverage for the loop forgotten event fix
  3. Medium Priority: Clarify/separate the runner2 migration and init behavior changes
  4. Low Priority: Add trailing newline to engine-rocksdb.sh

Verdict

Approve with suggested changes

The core bug fix is sound and addresses a real concurrency issue. The main concerns are:

  • Lack of test coverage for the fix
  • Unrelated changes (commented code, runner2 migration) that should be cleaned up or separated

The logic change correctly bounds the range of events to move to forgotten history, preventing future iterations' events from being prematurely archived.

@claude
Copy link

claude bot commented Nov 27, 2025

Code Review - PR #3482: fix(gas): fix loop forgotten bug due to concurrency

Summary

This PR fixes a critical bug in the workflow engine's loop event management where concurrent loop iterations could cause events from all iterations to be moved to forgotten history, rather than just events from previous iterations. The fix properly bounds the range query to only affect events up to iteration - 1.


Main Changes Analysis

1. Core Bug Fix: Loop Event Range Limiting (gasoline/src/db/kv/mod.rs)

Issue Fixed:
The original code used subspace() which creates a range that matches all events under a loop location regardless of iteration number. When moving events to forgotten history on iteration N, this incorrectly included events from the current iteration N.

Solution:

  • Changed from using a single subspace range to explicitly constructed start/end bounds
  • Start: {loop location, 0, ...} (iteration 0)
  • End: {loop location, iteration - 1, ...} (one before current)
  • This ensures only events from completed iterations (0 to N-1) are moved to forgotten history

Code Quality: Excellent

  • The fix correctly uses .range().0 and .range().1 to get precise byte boundaries
  • Uses saturating_sub(1) to safely handle iteration 0 edge case
  • Clear comments explain the range boundaries
  • Correctly updates both the query range (get_ranges_keyvalues) and clear range (clear_range)

Correctness: Verified
The logic is sound:

  1. When iteration == 0: No previous iterations exist, so the range query returns empty (start = end after saturating_sub)
  2. When iteration >= 1: Range includes only iterations 0 to (iteration - 1)

2. Runner Init Signal Handling (pegboard/src/workflows/runner2.rs)

Change: Removed the if init.is_none() guard when processing ToServerInit signals.

Impact: Potentially Problematic

Concerns:

  1. Duplicate Init Signals: If a runner sends multiple init signals (reconnection, retries, bugs), the workflow will now process all of them rather than ignoring duplicates
  2. State Reset Risk: Re-initializing could potentially reset important state that should be preserved
  3. check_queue Trigger: Setting check_queue = true on every init could cause unnecessary queue processing

Questions:

  • Is this intentional to handle runner reconnections?
  • What happens if a runner sends init twice (network retry, bug, etc.)?
  • Should there be validation that subsequent inits match the first one?

Recommendation: Add a comment explaining why duplicate inits should be allowed, or add validation.


3. Workflow Reference Update (pegboard-runner/src/conn.rs)

Change: pegboard::workflows::runner to pegboard::workflows::runner2

Impact: Looks correct but raises questions:

  • Is this part of a migration from old runner to new runner2?
  • Are both implementations running in production?
  • Should old runner be deprecated/removed?

4. Commented-Out Cargo.toml Changes

Impact: Leftover debug code

These commented lines should either be removed entirely if not needed for this PR, uncommented if they're intended changes, or moved to a separate PR if they're unrelated optimizations.

Concerns:

  • overflow-checks = false and debug-assertions = false can hide bugs in production
  • panic = abort changes error handling behavior significantly

Recommendation: Remove these commented lines from this PR unless they're directly related to the bug fix.


5. Docker Build Script Changes (scripts/docker/build-push.sh)

Change: Commented-out CARGO_BUILD_MODE=release argument

Impact: Leftover debug code

Similar to Cargo.toml - should be removed unless intentional for this PR.


Security Considerations

No security concerns identified

  • The fix properly bounds database queries, preventing data leakage
  • No new external inputs or untrusted data handling

Performance Considerations

Performance Improved

  • The range query is now more precise, scanning fewer keys
  • Moving to clear_range with explicit bounds is more efficient than clearing an entire subspace

Testing Recommendations

  1. Critical: Test loop iterations to verify events are correctly partitioned:

    • Run a loop workflow for 5+ iterations
    • Verify forgotten history only contains events from previous iterations
    • Verify active history contains current iteration events
  2. Edge Cases:

    • Test iteration 0 (no previous iterations)
    • Test iteration 1 (first move to forgotten)
    • Test the 100-iteration cleanup threshold
  3. Runner2 Init:

    • Test duplicate init signals behavior
    • Test runner reconnection scenarios
    • Verify queue processing isn't triggered unnecessarily

Code Style and Conventions

Follows repository conventions:

  • Uses structured logging (tracing)
  • Proper error handling with ? operator
  • Clear comments explaining complex logic
  • Hard tabs for indentation (per rustfmt.toml)

Summary and Recommendations

Approval Status: Approve with reservations

Must Address:

  1. Remove commented-out Cargo.toml and build script changes (or explain why they're needed)
  2. Clarify the runner2 init signal handling change - is duplicate processing intentional?

Should Address:
3. Add validation or comments for the init signal handling change
4. Document the runner to runner2 migration if that's ongoing

Core Bug Fix: Excellent - The main fix for the loop forgotten bug is correct and well-implemented.


Great work on identifying and fixing this concurrency bug! The core fix is solid. The main concerns are the ancillary changes that seem like debug/development artifacts rather than intentional changes for this PR.

@MasterPtato MasterPtato force-pushed the 11-17-fix_pb_rewrite_runner_wf_to_handle_batch_signals branch from 6eecd1a to 8c26dca Compare November 27, 2025 00:58
@MasterPtato MasterPtato force-pushed the 11-18-fix_gas_fix_loop_forgotten_bug_due_to_concurrency branch from 9893845 to 7dad582 Compare November 27, 2025 00:58
@graphite-app
Copy link
Contributor

graphite-app bot commented Dec 3, 2025

Merge activity

  • Dec 3, 12:33 AM UTC: NathanFlurry added this pull request to the Graphite merge queue.
  • Dec 3, 12:34 AM UTC: CI is running for this pull request on a draft pull request (#3577) due to your merge queue CI optimization settings.
  • Dec 3, 12:34 AM UTC: Merged by the Graphite merge queue via draft PR: #3577.

@graphite-app graphite-app bot closed this Dec 3, 2025
@graphite-app graphite-app bot deleted the 11-18-fix_gas_fix_loop_forgotten_bug_due_to_concurrency branch December 3, 2025 00:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants