
Conversation

@vicLin8712 (Collaborator) commented Nov 19, 2025

O(1) scheduler: Complete implementation

This PR provides the complete O(1) scheduler implementation and serves as the final part of the 3-part stacked PR series.
It integrates all components introduced in the earlier patches and replaces the legacy O(n) linear scheduler with the new ready-queue–based, RR-cursor-based, bitmap-assisted O(1) design.

Features of the O(1) scheduler

  • Priority-indexed ready queues
    Each priority level maintains an independent ready queue.

  • Bitmap + De Bruijn–based highest-priority lookup
    The scheduler locates the next runnable task in constant time using priority bitmaps and De Bruijn table lookup.

  • RR cursor for fair round-robin scheduling
    Each priority queue maintains a cursor to provide O(1) fair scheduling among tasks of the same priority.

  • Full integration into the scheduler execution path
    The legacy O(n) priority scanning algorithm is completely replaced by the new O(1) logic; the iteration limit IMAX=500 is removed.

  • Idle task fully integrated into new design
    System execution starts in the idle task, which serves as the initial execution context.
    Whenever the idle task yields, control deterministically transitions to the highest-priority runnable task.
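
The bitmap lookup described above can be sketched as follows. This is a minimal illustration assuming an 8-bit bitmap where bit 0 is the highest priority; the kernel's actual table, constants, and priority convention may differ.

```c
#include <assert.h>
#include <stdint.h>

/* De Bruijn sequence B(2,3) = 0x17 maps each isolated bit to a unique
 * 3-bit window, giving the bit index without a loop or branch chain. */
static const uint8_t debruijn_lut[8] = {0, 1, 2, 4, 7, 3, 6, 5};

/* Return the index of the lowest set bit of a non-zero 8-bit ready
 * bitmap (here: the highest-priority runnable level). */
static int highest_ready_prio(uint8_t bitmap)
{
    uint8_t isolated = bitmap & (uint8_t)(-bitmap); /* keep lowest set bit */
    return debruijn_lut[(uint8_t)(isolated * 0x17u) >> 5];
}
```

The multiply shifts the De Bruijn sequence so the top three bits encode the bit position, which the small table then decodes, giving constant-time lookup.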

Unit tests

Commit: Add unit test suite for RR-cursor scheduler

make sched_cmp
make run

Approach

  • A dedicated controller task is created with priority TASK_PRIO_CRIT to orchestrate the entire test process and enforce deterministic sequencing.
  • After each state change of the test tasks, the unit test verifies both bitmap correctness and per-priority task-count consistency, ensuring alignment with the ready-queue and priority-bitmask invariants maintained by the O(1) scheduler.

Task types

  • Controller task: Coordinates the test flow, triggers all state transitions, and validates ready-queue invariants after each step.
  • Delay task: A runnable task that transitions into TASK_BLOCKED through mo_task_delay().
    Used to verify dequeue behavior and correct clearing of priority bits when a task leaves the schedulable state set.
  • Normal task: A simple infinite-loop runnable task that remains schedulable unless externally suspended or cancelled.
    Serves as the primary subject for testing state transitions and enqueue/dequeue correctness.
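
The bitmap/task-count consistency check described above can be sketched roughly as follows; `sched_state_t` and `invariants_hold()` are illustrative names for this sketch, not the kernel's actual API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PRIO_LEVELS 8

/* Hypothetical mirror of the state the unit tests inspect. */
typedef struct {
    uint8_t ready_bitmap;         /* bit i set => level i has tasks */
    int ready_count[PRIO_LEVELS]; /* tasks queued per priority level */
} sched_state_t;

/* The invariant: a priority bit is set iff its queue is non-empty. */
static bool invariants_hold(const sched_state_t *s)
{
    for (int prio = 0; prio < PRIO_LEVELS; prio++) {
        bool bit_set = (s->ready_bitmap >> prio) & 1u;
        bool has_tasks = s->ready_count[prio] > 0;
        if (bit_set != has_tasks)
            return false;
    }
    return true;
}
```

Running this check after every state transition is what turns each transition in the list below into two PASS lines (bitmap consistency and task-count consistency).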

Verified state points
The following state transitions are validated by checking both ready-queue task counts and bitmap updates after each operation:

  • Normal task state transitions

    • Creation (TASK_READY) – initial enqueue and priority bit set.
    • Priority change – priority migration updates the queue placement and the corresponding bitmap bit.
    • Suspension (TASK_READY → TASK_SUSPENDED) – dequeued from the ready queue and priority bit cleared.
    • Resumption (TASK_SUSPENDED → TASK_READY) – re-enqueued with correct priority placement.
    • Cancellation (TASK_READY → TASK_CANCELLED) – removed from ready queues and all bitmap bits fully cleared.
  • Blocked task behavior (TASK_RUNNING → TASK_BLOCKED)

    • The delay task is created and its priority is promoted to match the controller task’s priority (TASK_READY).
    • After the controller yields, the delay task becomes the running task, invokes mo_task_delay(), and transitions to TASK_BLOCKED.
    • Control returns to the controller task, and the test verifies:
      • the delay task is completely removed from the ready queue
      • its priority bit is cleared from the bitmap
      • scheduler selection falls back to the highest remaining runnable task

Results

Linmo kernel is starting...
Heap initialized, 130005992 bytes available
idle id 1: entry=80001900 stack=80004488 size=4096
task 2: entry=80000788 stack=80005508 size=4096 prio_level=4 time_slice=5
Scheduler mode: Preemptive
Starting RR-cursor based scheduler test suits...

=== Testing Bitmap and Task Count Consistency ===
task 3: entry=80000168 stack=80006634 size=4096 prio_level=4 time_slice=5
PASS: Bitmap is consistent when TASK_READY
PASS: Task count is consistent when TASK_READY
PASS: Bitmap is consistent when priority migration
PASS: Task count is consistent when priority migration
PASS: Bitmap is consistent when TASK_SUSPENDED
PASS: Task count is consistent when TASK_SUSPENDED
PASS: Bitmap is consistent when TASK_READY from TASK_SUSPENDED
PASS: Task count is consistent when TASK_READY from TASK_SUSPENDED
PASS: Bitmap is consistent when task canceled
PASS: Task count is consistent when task canceled
task 4: entry=80000178 stack=80006634 size=4096 prio_level=4 time_slice=5
PASS: Task count is consistent when task canceled
PASS: Task count is consistent when task blocked

=== Test Results ===
Tests passed: 12
Tests failed: 0
Total tests: 12
All tests PASSED!
RR-cursor based scheduler tests completed successfully.

Note

  1. The term TASK_CANCELLED in this document is used only for explanation. It is not an actual state in the task state machine, but represents the condition where a task has been removed from all scheduling structures and no longer exists in the system.
  2. The task states shown in parentheses (e.g., (TASK_READY)) refer to the state of the test tasks being created or manipulated, not the state of the controller task.

Benchmark

Commit: Add benchmarking files

python3 bench.py

Approach

  1. Spawn N=500 normal tasks to populate the scheduling domain.
    All tasks begin in the TASK_READY state, ensuring the ready queues and bitmap are fully populated.
  2. Scenario configuration (active ratio)
    For each benchmark scenario, suspend a portion of tasks to reach the desired active-ratio load:
  • 2% active
  • 4% active
  • 20% active
  • 50% active
  • 100% active
  3. Benchmark execution
    To compare the legacy O(n) scheduler with the new O(1) scheduler, a compile-time flag OLD is passed to select which scheduling algorithm is active.
    The original linear-search scheduler is preserved in task.c for baseline measurement.
    For each benchmark scenario, the scheduler is executed 20 times to obtain stable timing data.
    The average and maximum scheduling latencies are collected, and the performance improvement is computed as the ratio between the old and new scheduler times (e.g., 1.5× faster).

  4. Metrics collected
    The benchmark collects the following metrics for each scenario:

    • Mean improvement
      Average speedup factor computed as (old_latency / new_latency) across 20 runs.

    • Standard deviation of improvement
      Measures the variability of speedup across repeated runs.

    • Minimum / maximum improvement
      Best and worst observed speedup factors among the 20 runs.

    • 95% confidence interval (CI)
      Statistical confidence bounds for the mean improvement.

    • Mean scheduling latency (old / new)
      Average schedule-selection time for both the legacy O(n) scheduler and the new O(1) scheduler.

    • Maximum scheduling latency (old / new)
      Worst-case schedule-selection time observed for each scheduler.

Results

Scenario 'Minimal Active':
  mean improvement        = 2.68x faster
  std dev of improvement  = 0.34x
  min / max improvement   = 1.75x  /  3.35x
  95% CI of improvement   = [2.54x, 2.83x]
  mean old sched time     = 5616.25 us
  mean new sched time     = 2119.0 us
  max  old sched time     = 47.0 us
  max  new sched time     = 37.0 us

Scenario 'Moderate Active':
  mean improvement        = 1.80x faster
  std dev of improvement  = 0.27x
  min / max improvement   = 1.27x  /  2.51x
  95% CI of improvement   = [1.68x, 1.92x]
  mean old sched time     = 3887.6 us 
  mean new sched time     = 2179.45 us 
  max  old sched time     = 40.0 us 
  max  new sched time     = 23.0 us 

Scenario 'Heavy Active':
  mean improvement        = 1.02x faster
  std dev of improvement  = 0.08x
  min / max improvement   = 0.84x  /  1.17x
  95% CI of improvement   = [0.98x, 1.06x]
  mean old sched time     = 2150.15 us 
  mean new sched time     = 2119.1 us 
  max  old sched time     = 73.0 us 
  max  new sched time     = 33.0 us 

Scenario 'Stress Test':
  mean improvement        = 0.93x (slower than OLD)
  std dev of improvement  = 0.11x
  min / max improvement   = 0.65x  /  1.20x
  95% CI of improvement   = [0.88x, 0.98x]
  mean old sched time     = 1874.35 us 
  mean new sched time     = 2032.55 us 
  max  old sched time     = 23.0 us 
  max  new sched time     = 20.0 us 

Scenario 'Full Load Test':
  mean improvement        = 0.89x (slower than OLD)
  std dev of improvement  = 0.11x
  min / max improvement   = 0.63x  /  1.07x
  95% CI of improvement   = [0.84x, 0.94x]
  mean old sched time     = 1798.8 us 
  mean new sched time     = 2048.55 us 
  max  old sched time     = 33.0 us 
  max  new sched time     = 52.0 us

Reference

#23 - Draft discussion
#36 - Infrastructure
#37 - Task state transitions APIs
ae35c84 - unit test suite
11e9ee6 - benchmark


Summary by cubic

Complete O(1) scheduler with priority queues, bitmap selection, and RR cursors, replacing the legacy O(n) scan. Adds an idle task and updates task lifecycle to use ready queues; up to ~2.7x faster under light load.

  • New Features

    • Priority-indexed ready queues with O(1) highest-priority selection via bitmap.
    • Per-priority round-robin cursors for fair rotation without list churn.
    • Scheduler state in kcb (ready_bitmap, ready_queues[], rr_cursors[]) and a dedicated idle task as the safe fallback.
    • Intrusive ready-queue design: TCB embeds rq_node; helpers list_pushback_node() and list_remove_node() manage nodes safely.
    • Unit tests validate bitmap/queue invariants; benchmarks show strong gains at low/moderate activity.
  • Refactors

    • Tasks explicitly enqueue/dequeue on READY/RUNNING transitions (spawn, delay/block, suspend/resume, cancel, priority change).
    • Blocking paths use _sched_block_dequeue() and _sched_block_enqueue() for mutex/cond/semaphore to reinsert tasks correctly.
    • Priority changes migrate tasks between queues and yield if the running task changes its priority.
    • Startup launches into the idle task (idle_task_init) and removes the IMAX scan limit.

Written for commit 77bb7ae. Summary will update automatically on new commits.

@jserv jserv changed the title [3/3] O(1) scheduler: Complete implementation O(1) scheduler: Complete implementation Nov 19, 2025
@jserv (Contributor) commented Nov 19, 2025

Do not include numbers in pull-request titles.

@vicLin8712 force-pushed the o1-sched-lauch branch 2 times, most recently from b18ebac to 0d8c856 on November 19, 2025
This commit extends the core scheduler data structures to support
the new O(1) scheduler design.

Adds in tcb_t:

 - rq_node: embedded list node for ready-queue membership used
   during task state transitions. This avoids redundant malloc/free
   for per-enqueue/dequeue nodes by tying the node's lifetime to
   the task control block.

Adds in kcb_t:

 - ready_bitmap: 8-bit bitmap tracking which priority levels have
   runnable tasks.
 - ready_queues[]: per-priority ready queues for O(1) task
   selection.
 - rr_cursors[]: round-robin cursor per priority level to support
   fair selection within the same priority.

These additions are structural only and prepare the scheduler for
O(1) ready-queue operations; they do not change behavior yet.
Previously, list_pushback() and list_remove() were the only list APIs
available; they inserted data by allocating a new linkage node with
malloc and removed data by freeing the target node.

Now that the new data structure, rq_node, is embedded as the linkage
node for ready-queue operations, there is no need to malloc and free
a node on each operation.

This commit adds the insertion and removal list operations without
malloc and free on the linkage node.

- list_pushback_node(): append an existing node to the end of the
   list in O(n) time without allocating memory.

- list_remove_node(): remove a node from the list without freeing it.

Both helper functions run in O(n) time using a linear search and will
be used by the upcoming task enqueue/dequeue operations on the ready
queue.
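A minimal sketch of these allocation-free helpers, assuming a simple singly linked list with a sentinel head; the real kernel list layout and signatures may differ.

```c
#include <assert.h>
#include <stddef.h>

typedef struct list_node {
    struct list_node *next;
} list_node_t;

typedef struct {
    list_node_t head; /* sentinel; head.next is the first element */
} list_t;

/* Append an existing node: O(n) walk to the tail, no malloc. */
static void list_pushback_node(list_t *l, list_node_t *n)
{
    list_node_t *p = &l->head;
    while (p->next)
        p = p->next;
    p->next = n;
    n->next = NULL;
}

/* Unlink a node without freeing it: O(n) search for its predecessor. */
static void list_remove_node(list_t *l, list_node_t *n)
{
    for (list_node_t *p = &l->head; p->next; p = p->next) {
        if (p->next == n) {
            p->next = n->next;
            n->next = NULL;
            return;
        }
    }
}
```

Because the node's storage lives inside the TCB, enqueue/dequeue become pure pointer manipulation with no allocator involvement.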
Previously, `sched_enqueue_task()` only marked the task state as
TASK_READY to indicate that the task had been enqueued, because the
original scheduler selected the next task from the global list, where
all tasks are kept.

Now that the new data structure, ready_queue[], holds the runnable
tasks, the enqueue API must push the embedded linkage node, rq_node,
into the corresponding ready queue.

This commit uses the list_pushback_node() helper to enqueue the
embedded list node of the tcb into the ready queue and to set up the
cursor and bitmap of the corresponding priority queue.
Previously, sched_dequeue_task() was a no-op stub, which was sufficient
when the scheduler selected tasks from the global list. Since the new
ready_queue structure now holds the runnable tasks, a dequeue path is
required to remove tasks from the ready queue so that it always holds
only runnable tasks.

This commit adds the dequeue path to sched_dequeue_task(), using
list_remove_node() helper to remove the existing linkage node from the
corresponding ready queue and update the RR cursor and priority bitmap
accordingly.
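The bookkeeping these two commits describe can be illustrated with a simplified model that tracks only per-priority counts and the bitmap; the kernel manipulates rq_node lists and RR cursors rather than bare counters, so treat this as a sketch of the invariant, not the implementation.

```c
#include <assert.h>
#include <stdint.h>

#define PRIO_LEVELS 8

typedef struct {
    uint8_t ready_bitmap;
    int ready_count[PRIO_LEVELS];
} ready_queues_t;

/* Enqueue: a level with at least one task must have its bit set. */
static void enqueue(ready_queues_t *rq, int prio)
{
    rq->ready_count[prio]++;
    rq->ready_bitmap |= (uint8_t)(1u << prio);
}

/* Dequeue: clear the bit only when the last task leaves the level. */
static void dequeue(ready_queues_t *rq, int prio)
{
    if (--rq->ready_count[prio] == 0)
        rq->ready_bitmap &= (uint8_t)~(1u << prio);
}
```

This is exactly the invariant the unit tests probe: the bit for a priority level is set iff its queue is non-empty.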
Previously, task operation APIs such as sched_wakeup_task() only updated
the task state, which was sufficient when scheduling relied on the global
task list. With the scheduler now selecting runnable tasks from
ready_queue[] per priority level, state changes alone are insufficient.

To support the new scheduler and to prevent selection of tasks that have
already left the runnable set, explicit enqueue and dequeue paths are
required when task state transitions cross the runnable boundary:

    In ready-queue set: {TASK_RUNNING, TASK_READY}
    Not in ready-queue set: {all other states}

This commit updates task operation APIs to include queue insertion and
removal logic according to their semantics. In general, queue operations
are performed by invoking existing helper functions mo_enqueue_task()
and mo_dequeue_task().

The modified APIs include:

  - sched_wakeup_task(): avoid enqueueing a task that is already running
    by treating TASK_RUNNING as part of the runnable set complement.

  - mo_task_cancel(): dequeue TASK_READY tasks from ready_queue[] before
    cancelling, ensuring removed tasks are not scheduled again.

  - mo_task_delay(): runnable boundary transition only ("TASK_RUNNING →
    TASK_BLOCKED"), no queue insertion for non-runnable tasks.

  - mo_task_suspend(): supports both TASK_RUNNING and TASK_READY
    ("TASK_RUNNING/TASK_READY → TASK_SUSPENDED"), dequeue before suspend
    when necessary.

  - mo_task_resume(): only for suspended tasks ("TASK_SUSPENDED →
    TASK_READY"), enqueue into ready_queue[] on resume.

  - _sched_block(): runnable boundary transition only ("TASK_RUNNING →
    TASK_BLOCKED"), dequeue without memory free.
Currently, mo_mutex_lock() calls mutex_block_atomic() to mark the
running task as TASK_BLOCKED so that it won't be selected by the old
scheduler. To preserve ready-queue consistency, which requires that the
queues hold only runnable tasks, a dequeue path must be included when
mutex_block_atomic() is called.

This commit adds _sched_blocked_dequeue() helper and will be applied in
mutex_block_atomic() in the following commit.
Previously, mutex_block_atomic() only marked the running task as
TASK_BLOCKED, which was sufficient when scheduling selected tasks
by scanning the global task list.

Since the new scheduler is designed to select only runnable tasks
from ready_queue[], mutex blocking now also requires removing the
task’s rq_node from the corresponding ready queue, preventing the
scheduler from selecting a blocked (non-runnable/dequeued) task again.
Currently, there is no enqueueing API that can be invoked from other files,
especially in mutex and semaphore operations which include task state
transition from TASK_BLOCKED to TASK_READY when a held resource is released.

This change introduces the _sched_blocked_enqueue() helper, which will be
used by mutex/semaphore unblocking paths to insert the task’s existing
linkage node into the corresponding per-priority ready queue, keeping
scheduler visibility and ready-queue consistency.
This commit replaces unblocking state transitions (TASK_BLOCKED->TASK_READY)
in mutex and semaphore paths with the _sched_block_enqueue() helper to
ensure scheduler visibility and preserve ready-queue invariants.
Previously, mo_task_priority() only updated the task’s time slice and
priority level. With the new scheduler design, tasks are kept in
per-priority ready queues, so mo_task_priority() must also handle
migrating tasks between these queues.

This commit adds dequeue/enqueue logic for tasks in TASK_RUNNING or
TASK_READY state, as such tasks must reside in a ready queue and a
priority change implies ready-queue migration.

The priority fields are still updated as part of the migration path:
sched_dequeue_task() relies on the current priority, while the enqueue
operation needs the new priority. Therefore, the priority update is
performed between the dequeue and enqueue steps.

If the priority change happens while the task is running, it must yield
immediately to preserve the scheduler’s strict task-ordering policy.
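The dequeue → update priority → enqueue ordering described above can be sketched with illustrative names (the counter model stands in for the real per-priority queues):

```c
#include <assert.h>
#include <stdint.h>

#define PRIO_LEVELS 8

typedef struct {
    uint8_t ready_bitmap;
    int ready_count[PRIO_LEVELS];
} ready_queues_t;

typedef struct {
    int prio;
} tcb_t;

static void rq_del(ready_queues_t *rq, int p)
{
    if (--rq->ready_count[p] == 0)
        rq->ready_bitmap &= (uint8_t)~(1u << p);
}

static void rq_add(ready_queues_t *rq, int p)
{
    rq->ready_count[p]++;
    rq->ready_bitmap |= (uint8_t)(1u << p);
}

/* Dequeue under the old priority, update the field, re-enqueue under
 * the new one. The ordering matters: the dequeue step keys off the
 * task's current priority, while the enqueue needs the new one. */
static void change_priority(ready_queues_t *rq, tcb_t *task, int new_prio)
{
    rq_del(rq, task->prio);
    task->prio = new_prio;
    rq_add(rq, task->prio);
}
```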
This commit refactors mo_task_spawn() to align with the new O(1) scheduler
design. The task control block (tcb_t) embeds its list node during task
creation.

The enqueue operation is moved inside a critical section to guarantee
consistent enqueuing process during task creation.

The “first task assignment” logic is removed because the first task is
now the system idle task, as introduced in a previous commit.
Previously, the scheduler performed an O(N) scan of the global task list
(kcb->tasks) to locate the next TASK_READY task. This resulted in
non-deterministic selection latency and unstable round-robin rotation
under heavy load or frequent task state transitions.

This change introduces a strict O(1) scheduler based on per-priority
ready queues and round-robin (RR) cursors. Each priority level maintains
its own ready queue and cursor, enabling constant-time selection of the
next runnable task while preserving fairness within the same priority.
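A toy model of this selection path, assuming level 0 is the highest priority; the bounded linear bit scan stands in for the De Bruijn lookup, and the fixed array ring stands in for the rq_node list advanced by the RR cursor.

```c
#include <assert.h>
#include <stdint.h>

#define PRIO_LEVELS 8
#define MAX_PER_LEVEL 4

typedef struct {
    uint8_t ready_bitmap;
    int tasks[PRIO_LEVELS][MAX_PER_LEVEL]; /* task ids per level */
    int count[PRIO_LEVELS];                /* tasks at each level */
    int cursor[PRIO_LEVELS];               /* next slot to run */
} sched_t;

/* Pick the next task: highest non-empty priority level, then rotate
 * that level's round-robin cursor. Assumes the bitmap invariant holds
 * (bit set iff count > 0). */
static int pick_next(sched_t *s)
{
    if (!s->ready_bitmap)
        return -1; /* nothing runnable: fall back to the idle task */
    int prio = 0;
    while (!((s->ready_bitmap >> prio) & 1u))
        prio++; /* <= 8 iterations; a De Bruijn lookup makes this O(1) */
    int id = s->tasks[prio][s->cursor[prio]];
    s->cursor[prio] = (s->cursor[prio] + 1) % s->count[prio];
    return id;
}
```

Selection cost no longer depends on the total task count, only on the fixed number of priority levels, which is what removes the need for the IMAX scan limit.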
@vicLin8712 force-pushed the o1-sched-lauch branch 2 times, most recently from 7e130fd to 164ad8d on November 29, 2025
