
Friday, May 01, 2026

QRV v0.28–v0.33: Breaking Up the Big Kernel Lock

Two ambitious project branches emerged from the v0.27 milestone. The first is removing the Big Kernel Lock. The second — lifting taskman to user mode — gets its own post. This one covers v0.28 through v0.33: the foundation work, the first measurements, and the first kercall to run without holding the BKL.


v0.28 — Phase 0: Infrastructure and Baselines

Before any lock can be removed, you need to know what you're removing. Phase 0 landed two things: the structural plumbing for fine-grained locking, and the instrumentation to measure the current BKL cost.

Per-CPU infrastructure. kernel/include/percpu.h introduces DECLARE_PER_CPU / DEFINE_PER_CPU / this_cpu_{ptr,read,write,inc,dec} — thin macros over plain [PROCESSORS_MAX] arrays indexed by RISC-V hart ID. Alongside it: kernel/include/klock.h defining kspinlock_t, a lock-class rank ordering (KLOCK_CLASS_SCHED, KLOCK_CLASS_SYNC, KLOCK_CLASS_CHANNEL, etc.) for a future lockdep, and a per-CPU preempt_count word.
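The shape of those macros is simple enough to sketch on a host. Everything below is an illustration, not the real percpu.h: the PROCESSORS_MAX value is a guess, and cpu_id() is a stub standing in for reading the RISC-V hart ID.

```c
#include <assert.h>

/* Hypothetical sketch of the percpu.h macros described above.
 * Per-CPU variables are plain arrays indexed by hart ID. */
#define PROCESSORS_MAX 4

#define DEFINE_PER_CPU(type, name)  type name[PROCESSORS_MAX]
#define DECLARE_PER_CPU(type, name) extern type name[PROCESSORS_MAX]

/* cpu_id() would read the hart ID (e.g. from tp or mhartid);
 * stubbed to 0 for a host build. */
static int cpu_id(void) { return 0; }

#define this_cpu_ptr(name)      (&(name)[cpu_id()])
#define this_cpu_read(name)     ((name)[cpu_id()])
#define this_cpu_write(name, v) ((name)[cpu_id()] = (v))
#define this_cpu_inc(name)      ((name)[cpu_id()]++)
#define this_cpu_dec(name)      ((name)[cpu_id()]--)

/* the per-CPU preempt_count word mentioned above */
DEFINE_PER_CPU(unsigned long, preempt_count);
```

The appeal of the plain-array approach is that there is no linker magic and no per-CPU memory region to set up early in boot; the cost is false sharing unless each slot is cacheline-padded.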

am_inkernel() triage. The twelve call sites of am_inkernel() turned out to express two different things. bkl_held() means this CPU holds the BKL. in_kernel_context() means this CPU is in a kernel critical section of any kind. Today both have the same body — but once Phase 1 introduces per-subsystem locks they will genuinely diverge.
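A minimal sketch of that split, with bodies that are pure guesses about the real state: today both predicates consult overlapping per-CPU state and agree, but once a CPU can be inside, say, a sched_slock section without the BKL, only the second returns true.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative state only; the real QRV bookkeeping differs. */
static int bkl_owner = -1;    /* hart ID holding the BKL, -1 if free */
static int kernel_depth[4];   /* per-CPU critical-section nesting */

static int cpu_id(void) { return 0; }   /* host stub for the hart ID */

/* "this CPU holds the BKL" */
static bool bkl_held(void) { return bkl_owner == cpu_id(); }

/* "this CPU is in a kernel critical section of any kind" */
static bool in_kernel_context(void) { return kernel_depth[cpu_id()] > 0; }
```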

bkl-stress and the first baseline. A synthetic stresser that spawns one pthread per hart and hammers one of three workloads: mutex (every lock/unlock hits SyncMutexLock/SyncMutexUnlock), sleep (1 ms nanosleep loops), or nop (pure user-mode, control). Results from a five-second QEMU -smp 4 run:

boot baseline:              34.6 % BKL contention rate
bkl-stress sleep 5s:        58.6 % contention, spread across 4 harts
bkl-stress mutex 5s:        71 %  contention, ~2.17 M iter/s

Those numbers are the target; Phase 1 results will be quoted against them.
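The mutex workload can be sketched in plain pthreads. The real bkl-stress binary's structure is an assumption here, and it runs for wall-clock time rather than an iteration budget; under multi-hart contention each lock/unlock pair crosses into the kernel via the Sync* kercalls.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>

static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static _Atomic unsigned long iters;   /* total iterations across workers */

/* One worker per hart in the real stresser; arg carries an iteration
 * budget (the real tool runs for a fixed wall-clock duration instead). */
static void *mutex_worker(void *arg) {
    unsigned long n = (unsigned long)(uintptr_t)arg;
    for (unsigned long i = 0; i < n; i++) {
        pthread_mutex_lock(&m);     /* contended case: SyncMutexLock kercall */
        pthread_mutex_unlock(&m);   /* SyncMutexUnlock kercall */
        atomic_fetch_add(&iters, 1);
    }
    return NULL;
}
```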

BKL instrumentation. kernel/bkl_stats.c adds per-CPU counters (acquires, contended acquires, spin cycles, held cycles, log2 histogram of spin durations) around acquire_kernel / release_kernel, all gated behind CONFIG_BKL_STATS (default off). The taskman sysfs grew a synth() callback so /sys/bkl can snapshot the kernel counters at any open. When the config is off, the hot path is byte-identical to v0.27.

First measured baseline on a fresh boot: ~28,800 BKL acquisitions across four harts, 34.6 % contention rate, spin distribution peaking in the 2^14–2^17 cycle range. That is roughly 9.8 kcyc per BKL critical section on average.
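The counter layout and the log2 bucketing can be sketched as follows; the struct fields are assumptions about kernel/bkl_stats.c, not its actual layout, but the histogram math is what any log2 histogram does: bucket index = floor(log2(cycles)).

```c
#include <assert.h>
#include <stdint.h>

/* Assumed shape of the per-CPU counter block gated by CONFIG_BKL_STATS. */
struct bkl_stats {
    uint64_t acquires;
    uint64_t contended;       /* acquires that had to spin */
    uint64_t spin_cycles;
    uint64_t held_cycles;
    uint64_t spin_hist[64];   /* log2 histogram of spin durations */
};

/* Bucket index is floor(log2(cycles)); 0 and 1 cycles land in bucket 0. */
static unsigned spin_bucket(uint64_t cycles) {
    unsigned b = 0;
    while (cycles >>= 1)
        b++;
    return b;
}
```

With this bucketing, the "peak in the 2^14–2^17 range" from the baseline corresponds to spin waits between roughly 16 k and 131 k cycles.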


v0.29 — Phase 1.1–1.2: Fine-Grained Locks Declared

Two structural locks declared, neither yet contended — that is the deliberate pattern.

sync_slock (Phase 1.1) wraps every read and write of sync_hash and the tSync chains hanging off it. Every sync kercall still holds the BKL around the entire operation, so sync_slock is uncontended today and effectively redundant. It exists to tag every sync-hash touchpoint so that Phase 1.5 can lift the BKL from sync kercalls without re-auditing every call site. One notable exception: synchash_rem was left BKL-only, because its inner unlock_kernel/lock_kernel cycle to safely probe a user-space sync->owner would leave a naive spinlock wrap held for an unbounded time.

sched_slock (Phase 1.2) declared alongside sync_slock. The scheduler primitives (force_ready, ready_default, block, unready) compose more aggressively than the sync primitives — force_ready calls ready, unready calls block, block calls select_thread via FIND_HIGHEST — and ready_default alone has five early-return paths. Threading a lock through all of that without knowing exactly which paths will run BKL-less is premature. The lock is there; the wrapping waits for Phase 1.5.

CONFIG_KLOCK_DEBUG. An embryonic lockdep: kspinlock_t gains a class field; kspinlock_lock checks that the new class ranks strictly above every class on the per-CPU held-stack. On violation: kprintf + crash(). Default off; when on, a bkl-stress run over Phase 1.1 produces zero violations.
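The rank check itself fits in a few lines. This is a host-side sketch, not the real klock.h: the class values and the held-stack representation are assumptions, and where the kernel would kprintf and crash(), the sketch just returns false.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed rank values; only their relative order matters (SCHED < SYNC). */
enum klock_class {
    KLOCK_CLASS_SCHED   = 1,
    KLOCK_CLASS_SYNC    = 2,
    KLOCK_CLASS_CHANNEL = 3,
};

#define HELD_MAX 8
static int held_stack[HELD_MAX];   /* per-CPU in the real kernel */
static int held_top;

/* A new lock may only be taken if its class ranks strictly above
 * every class already held on this CPU. */
static bool klock_rank_ok(int cls) {
    return held_top == 0 || cls > held_stack[held_top - 1];
}

static bool klock_push(int cls) {
    if (!klock_rank_ok(cls))
        return false;              /* real kernel: kprintf + crash() */
    held_stack[held_top++] = cls;
    return true;
}

static void klock_pop(void) { held_top--; }
```

Note how the ordering encodes the rule Phase 1.5 relies on later: SCHED ranks below SYNC, so attempting to take sched_slock while holding sync_slock is flagged as an inversion.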


v0.30 — Phase 1.5a–e: First BKL Lift

The distance between "lock declared" and "BKL removed from a kercall" is a series of preparatory steps that each need to be correct before the next one is safe.

P1.5a. synchash_rem wrapped in sync_slock, dropping the lock around the user-space sync->owner probe. The prerequisite that had been deferred from Phase 1.1.

P1.5b. sync_wakeup, mutex_holdlist_add, and mutex_holdlist_rem now take sync_slock internally around their wait-list mutations. They drop the lock across ready() calls — SCHED < SYNC in the rank order, so calling a scheduler primitive with a sync lock held would be a rank inversion.

P1.5c. ready_default and adjust_priority_default wrapped in sched_slock. Five exit paths, drop-and-retake around calls that recurse.

P1.5d. force_ready's STATE_MUTEX block and adjust_priority's STATE_CONDVAR/SEM/MUTEX branches now take sync_slock around their wait-list mutations. A _locked variant of mutex_holdlist_rem avoids recursive acquire.

P1.5e — the first lift. __KER_SYNC_MUTEX_UNLOCK marked in kercall_no_bkl. __kercall_dispatch calls release_kernel() after entry-time bookkeeping, runs the handler, and re-acquires before the tail. The body of sync_mutex_unlock was restructured: *owp = 0 stores switched to _smp_xchg_ul(owp, ...) (amoswap.d.aqrl) for acquire+release semantics; every wait-list read under sync_slock.

A bug caught in passing: atomic_set_64(owp, value) emits amoor.d (bit-set, not assign). ORing zero into *owp is a no-op — the mutex was never released and the system hung. The correct primitive is _smp_xchg_ul. This discovery also led to a tree-wide rename: atomic_set → atomic_or everywhere, to match the actual semantics.
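The bug reproduces in a few lines of C11 atomics. Here atomic_fetch_or models the old amoor.d behavior and atomic_exchange models amoswap.d.aqrl; the helper names are illustrative, not the QRV primitives.

```c
#include <stdatomic.h>
#include <stdint.h>

static _Atomic uint64_t owner;   /* stand-in for the mutex owner word */

/* What atomic_set_64 actually did: a fetch-OR. "Setting" to 0 ORs in
 * nothing, so the owner word never changes. */
static uint64_t broken_set(uint64_t v)  { return atomic_fetch_or(&owner, v); }

/* What _smp_xchg_ul does: an unconditional swap (amoswap.d.aqrl on
 * RISC-V), which really assigns the new value. */
static uint64_t correct_set(uint64_t v) { return atomic_exchange(&owner, v); }
```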

Results from P1.5e:


                     P1.5d (BKL)      P1.5e (lifted)
Acquires             1.94 M           1.42 M
Contended            1.55 M (80 %)    1.35 M (95 %)
Spin cycles          8.51 Gcyc        4.18 Gcyc (−51 %)
Held cycles          23.5 Gcyc        12.25 Gcyc (−48 %)
Held cyc/acquire     12.1 kcyc        8.6 kcyc (−29 %)

Half the BKL spin time and half the BKL held time. That is the first real measurement of what BKL removal is worth.


v0.30 (continued) — Phases 1.5f–m: All Sync Kercalls Lifted

The remaining six sync kercalls followed in sequence.

P1.5f lifted __KER_SYNC_CONDVAR_SIGNAL. Body is a single sync_wakeup call, which was already self-locking since P1.5b. Straightforward.

P1.5g wrapped block, unready, and block_and_ready in sched_slock. These three are the primitives the upcoming condvar/sem/mutex lifts need.

P1.5h wrapped the standalone pril_add/pril_rem call sites in ker_sync.c in sync_slock. Prerequisite for the block-path lifts.

P1.5i introduced QRV_ITF_WAIT_PENDING — a per-thread flag set during the transition window between adding a thread to a wait list and completing the unready. Wakers on other CPUs spin-wait until the flag clears. This is the primitive that makes the remaining lifts safe: without it, a thread can be on the wait list while still STATE_RUNNING, and a concurrent ready() would CRASHCHECK.
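The flag protocol can be sketched with C11 atomics. The flag name follows the post; everything else (the thread struct, the helper names) is an assumption. The blocking thread raises the flag before touching the wait list and lowers it only after unready completes; a waker that sees the flag spins until the window closes.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define QRV_ITF_WAIT_PENDING 0x1u

struct thread_sketch {
    _Atomic unsigned itf;   /* thread flags word */
};

/* Blocking side: mark the transition window, then do the wait-list
 * add and unready() (elided here). */
static void begin_wait(struct thread_sketch *t) {
    atomic_fetch_or(&t->itf, QRV_ITF_WAIT_PENDING);
    /* ... pril_add(wait_list, t); unready(t); ... */
}

/* Blocking side: transition complete, wakers may act on us. */
static void finish_wait(struct thread_sketch *t) {
    atomic_fetch_and(&t->itf, ~QRV_ITF_WAIT_PENDING);
}

/* Waker side: in the real kernel this is polled in a spin loop
 * (with a hardware fence per iteration) until it returns true. */
static bool waker_may_proceed(struct thread_sketch *t) {
    return !(atomic_load(&t->itf) & QRV_ITF_WAIT_PENDING);
}
```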

P1.5j–k lifted sem_post and sem_wait. One bug surfaced in sem_wait: block() (called inside unready) resets thp->blocked_on to NULL in its default case. The first draft of the body assigned blocked_on = syp before unready, which got clobbered. Fix: assign after unready, with WAIT_PENDING bridging the window.

P1.5l lifted condvar_wait. The coupling between kmutex_t and condvar was the hard part: kmutex_t never enters sync_hash (libc's cmpxchg fast-path short-circuits the kernel), so sync_lookup's auto-init failed after a wakeup on a system without the BKL. Fix: force-register the bound mutex in sync_hash from the condvar_wait body before releasing it.

P1.5m lifted mutex_lock and mutex_revive — the last two. The priority-inheritance walk through mutex_lock is the most complex body in the sync subsystem. It stays "best-effort" under contention: stale reads in the walk degrade priority inheritance but don't break correctness. The canonical waiter pattern from P1.5i applies here too.

Hardware fences. An important correction mid-lift: the inline __asm__ volatile("" ::: "memory") barriers in the WAIT_PENDING handshake were compiler-only. RVWMO permits store-store and load-store reordering at hardware level. Replaced with cpu_smp_mb() — a portable abstraction defined per-arch as fence rw, rw on RISC-V and mfence on x86_64.
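A portable definition might look like the following. The two asm strings come straight from the post; the generic fallback to a C11 sequentially-consistent fence is my assumption about how other targets would be handled.

```c
/* Full memory barrier visible to other CPUs, not just the compiler. */
#if defined(__riscv)
#  define cpu_smp_mb() __asm__ volatile("fence rw, rw" ::: "memory")
#elif defined(__x86_64__)
#  define cpu_smp_mb() __asm__ volatile("mfence" ::: "memory")
#else
#  include <stdatomic.h>
#  define cpu_smp_mb() atomic_thread_fence(memory_order_seq_cst)
#endif
```

The distinction matters exactly because of RVWMO: a `""` asm with a memory clobber pins the compiler's ordering but emits no instruction, so the hart is still free to reorder the stores and loads the WAIT_PENDING handshake depends on.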

Phase 1.5 exits with all seven sync kercalls running BKL-less.


v0.31 — PRIL/LINK3 Alias Corruption

After Phase 1.5 completed, a rare but real crash surfaced during devb-nvme bringup:

*** PRIL CORRUPTION: HEAD prev.prio_tail priority mismatch ***
CRASH: nano/nano_misc.c:328

Root cause: PRIL_ENTRY_FIELDS puts a union at the start of tThread. The same first 16 bytes serve as either a PRIL wait-list (next.pril, prev.prio_tail) or a LINK3 dispatch list (next.thread, prev.thread). A thread cannot legally be on both simultaneously.

The P1.5l-lifted ker_sync_condvar_wait does: pril_add (puts act on wait list), then sync_mutex_unlock (which calls ready(waiter) → ready_default, which in the SMP "can replace this CPU" path calls LINK3_BEG(dispatch_queue, act, tThread) — writing act->next.thread and act->prev.thread, the same bytes as the PRIL fields just set). The prev.prio_tail that should equal act (for a HEAD+TAIL single-element list) now holds a stale dispatch-queue backlink.
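A toy model makes the aliasing concrete. The field names follow the post; the real tThread layout has more to it, but the essence is that the two list headers share the same 16 bytes.

```c
#include <stddef.h>

/* Toy model of the PRIL_ENTRY_FIELDS union at the head of tThread:
 * the same two pointer slots serve as either a PRIL wait-list entry
 * or a LINK3 dispatch-list entry. */
struct thread_model {
    union {
        struct thread_model *pril;        /* PRIL: next in priority list */
        struct thread_model *thread;      /* LINK3: next in dispatch queue */
    } next;
    union {
        struct thread_model *prio_tail;   /* PRIL: backlink / tail pointer */
        struct thread_model *thread;      /* LINK3: prev in dispatch queue */
    } prev;
};

static struct thread_model act, other;
```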

Fix: in ready_default, before bumping act, check QRV_ITF_WAIT_PENDING. If set, fall through — act will finish its own transition and ker_exit will pick it up. The bump is skipped.

Verification: 20/20 clean boots, zero PRIL corruption.


v0.32–v0.33 — Channel Lock and Trace Ring

v0.32 began Phase 2 of BKL removal: the IPC path. channel_slock declared (rank between INTREVENT and CONNECT). The wait-list primitives in nano_message.c (remove_unblock, net_send2, net_sendmsg) and the channel-destroy and pulse paths in nano_pulse.c and ker_channel.c now take channel_slock around their queue mutations. All of this is still uncontended — the BKL is still held by every caller. The four target IPC kercalls (MsgSend, MsgSendPulse, MsgReceive, MsgReply) got their queue-op sites wrapped and WAIT_PENDING spin-waits inserted in preparation for the eventual BKL lift.

v0.33 added a lockless per-CPU function-entry trace ring: 32 entries per CPU, a TRACE_FN(id, ctx, arg) macro, and a trace_ring_dump(cpu) called from the crash path. During the BKL hunt this was the tool that revealed how the __KER_MSG_RECEIVEV canonical waiter pattern interacted with the timer-trap-handler's in_kernel_context() routing to leak BSS globals into active thread register save areas. The ring stays in the tree — it will be useful again.
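The ring is small enough to sketch whole. TRACE_FN matches the macro name in the post; the entry layout and cursor handling are assumptions. It needs no lock because each CPU only ever writes its own ring.

```c
#include <stdint.h>

#define TRACE_RING_SIZE 32
#define PROCESSORS_MAX  4

struct trace_entry {
    uint32_t  id;    /* caller-chosen function/event ID */
    uintptr_t ctx;   /* e.g. thread or channel pointer */
    uintptr_t arg;
};

static struct trace_entry trace_ring[PROCESSORS_MAX][TRACE_RING_SIZE];
static uint32_t trace_pos[PROCESSORS_MAX];   /* monotonic write cursor */

static int cpu_id(void) { return 0; }        /* host stub for the hart ID */

/* Append one entry to this CPU's ring, overwriting the oldest slot.
 * The cursor keeps counting so a dump can tell how much was lost. */
#define TRACE_FN(id_, ctx_, arg_) do {                                  \
        int c_ = cpu_id();                                              \
        struct trace_entry *e_ =                                        \
            &trace_ring[c_][trace_pos[c_]++ % TRACE_RING_SIZE];         \
        e_->id  = (id_);                                                \
        e_->ctx = (uintptr_t)(ctx_);                                    \
        e_->arg = (uintptr_t)(arg_);                                    \
    } while (0)
```

A trace_ring_dump(cpu) on the crash path would walk the last TRACE_RING_SIZE entries backward from trace_pos[cpu], which is exactly the "what ran just before the crash" view that cracked the MSG_RECEIVEV bug.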


Where Things Stand

Seven sync kercalls run BKL-less. Phase 2 infrastructure is in place for the IPC path. The BKL is not gone — that was never the claim — but the structural work to remove it section by section is well underway, documented, and measured.
