O(1) scheduler #36
Conversation
jserv left a comment:

Do not include numbers in pull-request subjects.
Rebase the latest 'main' branch to resolve unintended CI/CD regressions.
include/sys/task.h (outdated):

```c
/* Hart-Specific Data */
uint8_t hart_id; /* RISC-V hart identifier */
...
} sched_t;
```
IIRC, I’ve asked several questions about per-hart data structures before, but I didn't get a definitive answer at the time. If we want to split out per-hart data structures to facilitate future SMP integration, how do we distinguish which queues should go into the per-hart structure and which should remain in the global one in a multi-hart scenario? You retained the sched_t design, but I'm still unclear if this is a reasonable approach.
Additionally, the description for Patch 1 doesn't seem quite right to me. It states that sched_t is introduced to support the O(1) scheduler. However, IIUC, we could achieve the exact same behavior by placing everything in kcb_t if we disregard SMP support for now. Therefore, the existence of sched_t seems to be more about preparing for future SMP support rather than enabling the O(1) scheduler itself. Did I miss something? Is there actually a direct connection between the two?
I agree with your point. I originally separated the scheduler-related data structures for readability, but it’s not essential for this PR.
I will embed the separated section back into kcb_t to keep this patch focused on the functional changes.
Thanks for your suggestions.
I'm also a bit confused as to why the original PR was closed and split into three separate ones. It seems all three are related to O(1) scheduler support and are interdependent. This actually makes the review process more difficult because the subsequent PRs include commits and changes from the previous ones. It also makes it much harder for me to track down previous discussions.
Thanks for the feedback, and I understand the confusion. The original PR contained too many changes at once - new data structures, new APIs, logic refactoring, and the introduction of the O(1) scheduler. I split the work into three PRs (data structures, task state transitions, and final scheduler activation) so the series would be easier to review. Each PR remains compilable and all applications in apps/ continue to run with the old scheduler. However, I understand that the current split causes later PRs to include commits from earlier ones, which complicates the review process. Could you share your thoughts on what approach would be most appropriate for this project? For example, would you prefer a single PR with a clean commit history, or a small series of self-contained PRs? I'd be happy to reorganize the work to match the project's preferred workflow.
@jserv, the following atomic operation in mutex_block_atomic():

```c
/* Atomic block operation with enhanced error checking */
static void mutex_block_atomic(list_t *waiters)
{
    ...
    /* Block and yield atomically */
    self->state = TASK_BLOCKED;
    _yield(); /* This releases NOSCHED when we context switch */
}
```

In this commit series, the dequeuing path has been added in …
2 issues found across 6 files (see the inline comments below):

1. kernel/task.c:914
2. include/lib/list.h:106
kernel/task.c (around line 914):

```c
tcb->rq_node.data = tcb;

/* Push node to ready queue */
sched_enqueue_task(tcb);
```
First task creation no longer initializes kcb->task_current, so the new scheduler immediately panics on startup because sched_select_next_task still requires a non-null current-task pointer.
include/lib/list.h (around line 106):

```c
/* Pushback list node into list */
static inline void list_pushback_node(list_t *list, list_node_t *target)
{
    if (unlikely(!list || !target || target->next))
```
list_pushback_node dereferences target->next even though new rq_node instances never initialise that field, so the guard invokes undefined behaviour and can prevent the first enqueue.
Extends the core scheduler data structures to support the new O(1) scheduler design.

Adds in tcb_t:
- rq_node: embedded list node for ready-queue membership used during task state transitions. This avoids redundant malloc/free for per-enqueue/dequeue nodes by tying the node's lifetime to the task control block.

Adds in kcb_t:
- ready_bitmap: 8-bit bitmap tracking which priority levels have runnable tasks.
- ready_queues[]: per-priority ready queues for O(1) task selection.
- rr_cursors[]: round-robin cursor per priority level to support fair selection within the same priority.

These additions are structural only and prepare the scheduler for O(1) ready-queue operations.
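For concreteness, a minimal sketch of what these additions could look like in C. The field and array names come from the commit message; the list types, state constants, and surrounding fields are assumptions, not the project's actual layout. Later sketches in this series build on these definitions.

```c
#include <stddef.h>
#include <stdint.h>

#define TASK_PRIORITY_LEVELS 8 /* matches the 8-bit ready_bitmap */

typedef struct list_node {
    struct list_node *next;
    void *data;
} list_node_t;

typedef struct list {
    list_node_t *head, *tail;
    size_t length;
} list_t;

/* Assumed task states; only READY/RUNNING belong to the ready queue. */
enum { TASK_STOPPED, TASK_READY, TASK_RUNNING, TASK_BLOCKED, TASK_SUSPENDED };

typedef struct tcb {
    /* ... existing TCB fields ... */
    uint16_t id;         /* task identifier (assumed) */
    uint8_t state;       /* one of the states above (assumed) */
    uint8_t prio_level;  /* index into the per-priority arrays (assumed) */
    list_node_t rq_node; /* embedded ready-queue linkage; lives as long as the TCB */
} tcb_t;

typedef struct kcb {
    /* ... existing KCB fields ... */
    list_node_t *task_current;                     /* node of the task on the CPU (assumed form) */
    uint8_t ready_bitmap;                          /* bit i set => level i has runnable tasks */
    list_t *ready_queues[TASK_PRIORITY_LEVELS];    /* one ready queue per priority level */
    list_node_t *rr_cursors[TASK_PRIORITY_LEVELS]; /* next pick at each level */
} kcb_t;

extern kcb_t *kcb; /* global kernel control block (assumed to exist) */
```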
Previously, list_pushback() and list_remove() were the only list APIs for inserting data into and removing it from a list, and each call malloc'd a new linkage node or freed the target one. Now that the new rq_node data structure serves as the embedded linkage node for ready-queue operations, there is no need to malloc and free on every transition. Add insertion and removal list helpers that neither allocate nor free the linkage node. Both APIs will be applied in the enqueue and dequeue paths.
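A plausible shape for the two non-allocating helpers, using the singly linked list types sketched above (the project's real list may differ, e.g. in tail handling or sentinel use):

```c
/* Append a caller-owned node; no allocation. Requires target->next == NULL,
 * i.e. the node must not already be linked into a list. */
static inline void list_pushback_node(list_t *list, list_node_t *target)
{
    if (!list || !target || target->next)
        return;
    if (list->tail)
        list->tail->next = target;
    else
        list->head = target;
    list->tail = target;
    list->length++;
}

/* Unlink a node without freeing it, so it can be reused by the next
 * enqueue. Resets target->next to satisfy the pushback guard above. */
static inline void list_remove_node(list_t *list, list_node_t *target)
{
    if (!list || !target)
        return;
    list_node_t *prev = NULL, *cur = list->head;
    while (cur && cur != target) {
        prev = cur;
        cur = cur->next;
    }
    if (!cur)
        return; /* node is not in this list */
    if (prev)
        prev->next = cur->next;
    else
        list->head = cur->next;
    if (list->tail == cur)
        list->tail = prev;
    cur->next = NULL;
    list->length--;
}
```

Resetting target->next on removal is what makes the target->next guard in the pushback helper a cheap "already linked?" check.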
Previously, `sched_enqueue_task()` only marked the task state as TASK_READY to represent that the task had been enqueued, because the original scheduler selected the next task from the global list, where all tasks are kept. Now that the new ready_queue[] data structure holds runnable tasks, the enqueue API must push the embedded linkage node, rq_node, into the corresponding ready queue. This change aligns with the new ready-queue-based task selection.
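Under the structures sketched earlier, the enqueue path could look roughly like this (the body is an assumption; only the API name and its duties come from the commit message):

```c
/* Sketch: make a task runnable by linking its embedded node into the
 * ready queue for its priority level and publishing that level in the
 * bitmap so the selector can find it in constant time. */
static void sched_enqueue_task(tcb_t *task)
{
    uint8_t level = task->prio_level;
    task->state = TASK_READY;
    list_pushback_node(kcb->ready_queues[level], &task->rq_node);
    kcb->ready_bitmap |= (uint8_t) (1u << level);
    if (!kcb->rr_cursors[level])
        kcb->rr_cursors[level] = &task->rq_node; /* first runnable at this level */
}
```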
Previously, sched_dequeue_task() was a no-op stub, which was sufficient when the scheduler selected tasks from the global list. Now that the new ready_queue data structure keeps all runnable tasks, a dequeue path is required to remove tasks from the ready queue so that it only ever holds runnable tasks. This change enforces the invariant that non-runnable tasks must not remain in the ready queue.
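The matching dequeue path, again as a sketch with assumed details (in particular, how the RR cursor is repaired when it points at the departing node):

```c
/* Sketch: remove a task from its ready queue, stepping the RR cursor off
 * the departing node first, and clear the bitmap bit when the level drains. */
static void sched_dequeue_task(tcb_t *task)
{
    uint8_t level = task->prio_level;
    list_t *rq = kcb->ready_queues[level];

    if (kcb->rr_cursors[level] == &task->rq_node)
        kcb->rr_cursors[level] = task->rq_node.next;

    list_remove_node(rq, &task->rq_node);

    if (rq->length == 0)
        kcb->ready_bitmap &= (uint8_t) ~(1u << level); /* level is now empty */
    else if (!kcb->rr_cursors[level])
        kcb->rr_cursors[level] = rq->head; /* cursor fell off the tail: wrap */
}
```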
Previously, task operation APIs such as sched_wakeup_task() only updated
the task state, which was sufficient when scheduling relied on the global
task list. With the scheduler now selecting runnable tasks from
ready_queue[] per priority level, state changes alone are insufficient.
To support the new scheduler and to prevent selection of tasks that have
already left the runnable set, explicit enqueue and dequeue paths are
required when task state transitions cross the runnable boundary:
In ready-queue set: {TASK_RUNNING, TASK_READY}
Not in ready-queue set: {all other states}
This change updates task operation APIs to include queue insertion and
removal logic according to their semantics. In general, queue operations
are performed by invoking existing helper functions sched_enqueue_task()
and sched_dequeue_task().
The modified APIs include:
- sched_wakeup_task(): avoid enqueueing a task that is already running,
  since TASK_RUNNING already belongs to the runnable set.
- mo_task_cancel(): dequeue TASK_READY tasks from ready_queue[] before
cancelling, ensuring removed tasks are not scheduled again.
- mo_task_delay(): runnable boundary transition only ("TASK_RUNNING →
TASK_BLOCKED"), no queue insertion for non-runnable tasks.
- mo_task_suspend(): supports both TASK_RUNNING and TASK_READY
("TASK_RUNNING/TASK_READY → TASK_SUSPENDED"), dequeue before suspend
when necessary.
- mo_task_resume(): only for suspended tasks ("TASK_SUSPENDED →
TASK_READY"), enqueue into ready_queue[] on resume.
- _sched_block(): runnable boundary transition only ("TASK_RUNNING →
TASK_BLOCKED"), dequeue without memory free.
This change keeps task state transitions consistent with the ready-queue
semantics.
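As one concrete instance of the boundary rule, a sketch of what mo_task_suspend() might look like after this change. The lookup and current-task helpers are hypothetical; the dequeue-before-suspend flow is what the commit describes:

```c
tcb_t *find_task_by_id(uint16_t id); /* hypothetical lookup helper */
tcb_t *current_task(void);           /* hypothetical accessor */
void _yield(void);                   /* existing kernel primitive */

int32_t mo_task_suspend(uint16_t id)
{
    tcb_t *task = find_task_by_id(id);
    if (!task)
        return -1;
    /* Only TASK_RUNNING and TASK_READY sit in the ready queue. */
    if (task->state != TASK_RUNNING && task->state != TASK_READY)
        return -1;

    sched_dequeue_task(task);     /* leave the runnable set... */
    task->state = TASK_SUSPENDED; /* ...and cross the boundary */
    if (task == current_task())
        _yield();                 /* a running task must give up the CPU */
    return 0;
}
```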
Previously, task operations lacked public enqueue/dequeue helpers usable from outside the task core, which made it impossible to keep the ready queue synchronized when tasks crossed the runnable boundary. This change introduces two helpers for the mutex and semaphore resource-wait and wakeup paths, ensuring their state transitions update the ready queue explicitly and remain aligned with ready-queue semantics. Both helpers are prefixed with _sched_block to emphasize their role in TASK_BLOCKED-related transitions. This keeps the mutex and semaphore APIs consistent with ready-queue semantics.
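The helper pair might be as thin as the following; only the _sched_block prefix and their role come from the commit message, the bodies are assumptions:

```c
/* Sketch: called from mutex/semaphore wait paths before yielding. */
void _sched_block_dequeue(tcb_t *task)
{
    sched_dequeue_task(task);   /* leave the runnable set... */
    task->state = TASK_BLOCKED; /* ...and record why */
}

/* Sketch: called from signal/unlock wakeup paths. */
void _sched_block_enqueue(tcb_t *task)
{
    sched_enqueue_task(task); /* re-enter the runnable set as TASK_READY */
}
```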
Invoke the _sched_block_enqueue() and _sched_block_dequeue() helpers for all transitions into or out of the TASK_BLOCKED state. This keeps the scheduler's ready queue and the mutex/semaphore semantics aligned and consistent.
Once a task wakes up from a timed mutex lock, its state is already TASK_RUNNING rather than TASK_BLOCKED. To determine whether the wakeup was triggered by the mutex or by a timeout, check whether the woken task is still present in the mutex waiter list. This change removes the incorrect TASK_BLOCKED-based condition check and replaces it with a waiter-list check for the timed mutex lock.
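A sketch of the membership test this commit describes; list_contains_task and the surrounding names are hypothetical:

```c
#include <stdbool.h>

/* Hypothetical helper: is this task still queued on a waiter list?
 * Waiter nodes are assumed to carry the TCB in their data field. */
static bool list_contains_task(const list_t *waiters, const tcb_t *task)
{
    for (const list_node_t *it = waiters->head; it; it = it->next) {
        if (it->data == (const void *) task)
            return true;
    }
    return false;
}

/* Sketch of the tail of mo_mutex_timedlock() after the timed wait returns.
 * The task is already TASK_RUNNING in both cases, so state alone cannot
 * tell the two wakeup causes apart; waiter-list membership can. */
int32_t timedlock_wakeup_cause(list_t *waiters, tcb_t *self)
{
    if (list_contains_task(waiters, self)) {
        /* Still on the waiter list: the timer woke us, not the unlock
         * path. Remove our waiter entry here, then report the timeout. */
        return -1; /* e.g. ERR_TIMEOUT */
    }
    return 0; /* the unlock path dequeued us and handed over the lock */
}
```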
Previously, mo_task_priority() only updated the task's time slice and priority level. With the new scheduler design, tasks are kept in per-priority ready queues, so mo_task_priority() must also migrate tasks between these queues. This change moves a task from its original ready queue to the ready queue of its new priority.
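The migration flow might look like this sketch (prio_to_level and the lookup helper are hypothetical; preserving TASK_RUNNING across the re-enqueue is an assumed detail):

```c
uint8_t prio_to_level(uint16_t priority); /* hypothetical priority-to-level mapping */

int32_t mo_task_priority(uint16_t id, uint16_t priority)
{
    tcb_t *task = find_task_by_id(id); /* hypothetical lookup helper */
    if (!task)
        return -1;

    bool queued = (task->state == TASK_READY || task->state == TASK_RUNNING);
    uint8_t prev_state = task->state;

    if (queued)
        sched_dequeue_task(task); /* leave the old level's queue */

    task->prio_level = prio_to_level(priority);

    if (queued) {
        sched_enqueue_task(task); /* join the new level's queue */
        task->state = prev_state; /* keep RUNNING if it was running */
    }
    return 0;
}
```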
Move sched_enqueue_task() into a critical section to protect ready_queue[] integrity, as the API modifies shared scheduler resources. Initialize the embedded rq_node when a task spawns and set its next pointer to NULL to ensure deterministic linkage for ready-queue insertion. Bind the initial task slot using rq_node instead of the global task list, matching the new ready-queue selection model. This aligns task-spawn behavior with rq_node-based scheduling semantics.
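Put together, the spawn-time initialization could look like the fragment below. CRITICAL_ENTER/CRITICAL_LEAVE and the two rq_node assignments appear in the diff context earlier in this thread; binding task_current through rq_node is described by the commit, though its exact form here is an assumption:

```c
/* Sketch: inside mo_task_spawn(), after the TCB is allocated. */
CRITICAL_ENTER();           /* ready_queues[] and the bitmap are shared */
tcb->rq_node.data = tcb;    /* node resolves back to its TCB */
tcb->rq_node.next = NULL;   /* deterministic: satisfies the pushback guard */
sched_enqueue_task(tcb);    /* first insertion into the ready queue */
if (!kcb->task_current)     /* first task ever: bind the slot via rq_node */
    kcb->task_current = &tcb->rq_node;
CRITICAL_LEAVE();
```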
Previously, the scheduler performed an O(N) scan of the global task list (kcb->tasks) to locate the next TASK_READY task. This resulted in non-deterministic selection latency and unstable round-robin rotation under heavy load or frequent task state transitions. This change introduces a strict O(1) scheduler based on per-priority ready queues and round-robin (RR) cursors. Each priority level maintains its own ready queue and cursor, enabling constant-time selection of the next runnable task while preserving fairness within the same priority.
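A sketch of the selection fast path this describes. panic() and ERR_NO_TASKS appear in the review discussion above; the bit-scan direction (lowest set bit = highest priority) and the return value are assumptions:

```c
void panic(int32_t code); /* existing kernel primitive */
#define ERR_NO_TASKS (-1) /* placeholder; the real value is defined elsewhere */

uint16_t sched_select_next_task(void)
{
    if (!kcb->ready_bitmap)
        panic(ERR_NO_TASKS); /* nothing runnable anywhere */

    /* Lowest set bit = highest priority; count-trailing-zeros (or an
     * 8-entry lookup table) keeps this constant-time. */
    uint8_t top_prio_level = (uint8_t) __builtin_ctz(kcb->ready_bitmap);

    list_t *rq = kcb->ready_queues[top_prio_level];
    list_node_t **cursor = &kcb->rr_cursors[top_prio_level];
    if (!rq || !*cursor)
        panic(ERR_NO_TASKS); /* bitmap and queues out of sync */

    tcb_t *task = (*cursor)->data; /* cursor always points at the next pick */
    *cursor = (*cursor)->next ? (*cursor)->next : rq->head; /* RR rotation */

    kcb->task_current = &task->rq_node;
    task->state = TASK_RUNNING;
    return task->id;
}
```

Every step is a constant number of pointer and bit operations, independent of how many tasks exist: that is the O(1) claim.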
These unit tests cover the task.c, mutex.c, and semaphore.c APIs that are related to task state transitions.
1 issue found across 7 files (see the inline comment below):

1. kernel/task.c:506
kernel/task.c, sched_select_next_task() (around line 506):

```diff
-    /* Skip non-ready tasks */
-    if (task->state != TASK_READY)
-        continue;
+    list_node_t **cursor = &kcb->rr_cursors[top_prio_level];
```
P1: Array out-of-bounds access when ready_bitmap is 0. The loop exits with top_prio_level = 8, but arrays are sized 0-7. Add a bounds check or early panic when bitmap is empty.
Suggested change:

```c
if (top_prio_level >= TASK_PRIORITY_LEVELS)
    panic(ERR_NO_TASKS);
list_node_t **cursor = &kcb->rr_cursors[top_prio_level];
```
Summary
This PR introduces new data structures and a new O(1) scheduler to replace the current O(n) RR scheduler: task ordering becomes deterministic, tasks execute in strict priority order, and context switching runs in constant time.
The new scheduler selects the next task in strict priority order, based on the per-priority RR cursor, which always points to the next task in each priority's ready queue. The new list node rq_node is embedded in tcb to support task state transitions; it accompanies the task for its whole lifecycle, which avoids further malloc/free cost.

Modified APIs that passed the unit tests:

task.c
- mo_task_spawn()
- mo_task_priority()
- mo_task_cancel()
- mo_task_suspend()
- sched_wakeup_task(): inner API but never used.

kernel/semaphore.c
- mo_sem_signal()
- mo_sem_wait()

kernel/mutex.c
- mo_mutex_lock()
- mo_mutex_unlock()
- mo_mutex_timedlock()
- mutex_cond_wait()
- mutex_cond_signal()
- mutex_cond_broadcast()
- mutex_cond_timedwait()

Refer: #13
Changes
Section 1. New data structures for the new scheduler

- rq_node: embedded list node, avoiding malloc/free on task state transitions.
- ready_bitmap: 8-bit bitmap tracking which priority levels have runnable tasks.
- ready_queues[]: per-priority ready queue, keeping all tasks in state TASK_RUNNING or TASK_READY.
- rr_cursors[]: round-robin cursor per priority level to support fair selection within the same priority; always points to the next task of the ready queue.

Section 2. Ready-queue maintenance in task state transition APIs
Task state transitions are now classified into two categories:

- In the ready-queue set: {TASK_RUNNING, TASK_READY}
- Not in the ready-queue set: all other states

State-transition APIs now invoke enqueue/dequeue only when crossing between these two groups. The API modifications follow the state transition diagram below.

Section 3. Refactor new scheduler logic
The new scheduler logic is refactored as:

- Update kcb->task_current to the RR cursor of the scanned ready queue.

Unit tests
The unit tests cover the following APIs and new data structures and can be divided into the following scopes:
1. Inner queue operation APIs
Check that the sched_dequeue_task() and sched_enqueue_task() helpers operate properly on the ready queue and bitmap. Verification is done by calling the following APIs, which include the above helpers:
- mo_task_spawn(): check the bitmap bit is set.

2. Ready queue cursor
Check that the per-priority cursor behaves consistently across the whole system lifecycle with the following rules:
3. Task transition APIs (Refer to section 2. state transition diagram)
Validate general task state transition APIs, including suspend, resume, spawn, and delay, with three paths. Each state transition verifies whether the task is in the ready queue and that the task count matches the rules of section 2.
4. Semaphore
Validate the task blocking path when using the semaphore API.
The task counts in the ready queue and the waiting list are confirmed at points 4 and 9 in the following diagram:
Task count pair def: (in the ready queue, in the semaphore waiting list)
```mermaid
sequenceDiagram
    autonumber
    participant ControllerTask
    participant SemTask
    ControllerTask ->> ControllerTask: Take semaphore
    ControllerTask ->> SemTask: Yield CPU to sem_task
    SemTask ->> SemTask: Try to take the semaphore
    SemTask -->> ControllerTask: Fail and block itself (RUNNING->BLOCKED)
    ControllerTask ->> ControllerTask: Release semaphore
    ControllerTask ->> ControllerTask: Wake up sem_task (BLOCKED->RUNNING) and check metadata
```

Covered APIs:
- mo_sem_signal()
- mo_sem_wait()

5. Mutex
Validate the mutex lock operation APIs, separated into timeout and normal cases:
Task count pair def: (in the ready queue, in the mutex waiter list)
The following diagram illustrates the verified transition path:
Normal case:

```mermaid
sequenceDiagram
    autonumber
    participant ControllerTask
    participant MutexTask
    ControllerTask ->> ControllerTask: Lock mutex
    ControllerTask ->> MutexTask: Yield CPU
    MutexTask -->> ControllerTask: Lock fails (RUNNING->BLOCKED)
    ControllerTask ->> ControllerTask: Scheduler selects Controller
    ControllerTask ->> ControllerTask: Unlock mutex: wake up mutex task (TASK_BLOCKED -> TASK_READY)
```

Timeout case:

```mermaid
sequenceDiagram
    autonumber
    participant ControllerTask
    participant MutexTask
    ControllerTask ->> MutexTask: Yield CPU
    MutexTask ->> ControllerTask: Timed mutex lock 10 ticks (RUNNING->BLOCKED), yield to the controller
    ControllerTask ->> ControllerTask: Check state
    ControllerTask -->> MutexTask: Controller suspends itself
    MutexTask ->> MutexTask: Timeout and wake up itself (BLOCKED->READY)
    MutexTask ->> ControllerTask: Yield to the controller
    ControllerTask ->> ControllerTask: Check state
```

Covered APIs:
- mo_mutex_lock()
- mo_mutex_unlock()
- mo_mutex_timedlock()

6. Condition (used with Mutex)
Validate the condition variable APIs used with the mutex, separated into timeout and normal cases:
Task count pair def: (in the ready queue, in the mutex condition waiter list)
Normal case:

```mermaid
sequenceDiagram
    autonumber
    participant ControllerTask
    participant condtasks
    ControllerTask ->> ControllerTask: Create 3 mutex_cond tasks
    ControllerTask ->> condtasks: Controller yields to mutex_cond tasks
    condtasks ->> condtasks: Mutex lock task 1
    condtasks ->> condtasks: Task 1 waits for condition variable (releases mutex)
    condtasks ->> condtasks: Mutex lock task 2
    condtasks ->> condtasks: Task 2 waits for condition variable (releases mutex)
    condtasks ->> condtasks: Mutex lock task 3
    condtasks ->> condtasks: Task 3 waits for condition variable (releases mutex)
    condtasks ->> ControllerTask: All cond tasks are in the condition signal waiting list
    ControllerTask ->> ControllerTask: Check state
    ControllerTask ->> ControllerTask: Signal a condition variable
    ControllerTask ->> ControllerTask: Check state
    ControllerTask ->> ControllerTask: Broadcast
    ControllerTask ->> ControllerTask: Check state
```

Timeout case:

```mermaid
sequenceDiagram
    autonumber
    participant ControllerTask
    participant timed_cond
    ControllerTask ->> timed_cond: Yield to timed_cond task
    timed_cond ->> timed_cond: timed_cond blocks (RUNNING->BLOCKED)
    timed_cond ->> ControllerTask: Yield to the controller task
    ControllerTask ->> ControllerTask: Check state
    ControllerTask -->> timed_cond: Suspend self and wait for timed_cond wakeup
    timed_cond ->> timed_cond: timed_cond times out and wakes up (BLOCKED->READY)
    timed_cond ->> ControllerTask: Yield to the controller for the state check
    ControllerTask ->> ControllerTask: Check state
```

Covered APIs:
- mutex_cond_wait()
- mutex_cond_signal()
- mutex_cond_broadcast()
- mutex_cond_timedwait()

Benchmark
Approach
Start recording the maximum schedule time after 3000 ms; wake the end task (highest priority) after 40000 ms (40 s).
Result
Static data
Comparison of new and original scheduler

New scheduler performance

Reference
Benchmark - 45406d1
#13