Two ambitious project branches emerged from the v0.27 milestone. The first is removing the Big Kernel Lock. The second — lifting taskman to user mode — gets its own post. This one covers v0.28 through v0.33: the foundation work, the first measurements, and the first kercall to run without holding the BKL.
v0.28 — Phase 0: Infrastructure and Baselines
Before any lock can be removed, you need to know what you're removing. Phase 0 landed two things: the structural plumbing for fine-grained locking, and the instrumentation to measure the current BKL cost.
Per-CPU infrastructure. kernel/include/percpu.h introduces
DECLARE_PER_CPU / DEFINE_PER_CPU / this_cpu_{ptr,read,write,inc,dec} —
thin macros over plain [PROCESSORS_MAX] arrays indexed by RISC-V hart ID.
Alongside it: kernel/include/klock.h defining kspinlock_t, a lock-class
rank ordering (KLOCK_CLASS_SCHED, KLOCK_CLASS_SYNC, KLOCK_CLASS_CHANNEL,
etc.) for a future lockdep, and a per-CPU preempt_count word.
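A minimal sketch of what those macros can look like over plain arrays; cpu_id() as the hart-ID accessor is an assumption here, PROCESSORS_MAX is the kernel's existing CPU-count constant:

```c
/* Sketch of the percpu.h style: plain arrays indexed by hart ID.
 * cpu_id() is assumed to return the current hart number. */
#define DECLARE_PER_CPU(type, name)  extern type name[PROCESSORS_MAX]
#define DEFINE_PER_CPU(type, name)   type name[PROCESSORS_MAX]

#define this_cpu_ptr(name)      (&(name)[cpu_id()])
#define this_cpu_read(name)     ((name)[cpu_id()])
#define this_cpu_write(name, v) ((name)[cpu_id()] = (v))
#define this_cpu_inc(name)      ((name)[cpu_id()]++)
#define this_cpu_dec(name)      ((name)[cpu_id()]--)

DEFINE_PER_CPU(unsigned, preempt_count);  /* the per-CPU word from klock.h */
```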
am_inkernel() triage. The twelve call sites of am_inkernel() turned
out to express two different things. bkl_held() means this CPU holds the
BKL. in_kernel_context() means this CPU is in a kernel critical section of
any kind. Today both have the same body — but once Phase 1 introduces
per-subsystem locks they will genuinely diverge.
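A sketch of how the two predicates read once they diverge; bkl_owner is an assumed per-CPU flag, not a real field:

```c
/* Illustrative bodies only. Today both predicates coincide. */
static inline int bkl_held(void)
{
    return this_cpu_read(bkl_owner) != 0;     /* this CPU owns the BKL */
}

static inline int in_kernel_context(void)
{
    /* After Phase 1, any per-subsystem critical section counts too. */
    return bkl_held() || this_cpu_read(preempt_count) > 0;
}
```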
bkl-stress and the first baseline. A synthetic stresser that spawns one
pthread per hart and hammers one of three workloads: mutex (every
lock/unlock hits SyncMutexLock/SyncMutexUnlock), sleep (1 ms nanosleep
loops), or nop (pure user-mode, control). Results from a five-second QEMU
-smp 4 run:
boot baseline: 34.6 % BKL contention rate
bkl-stress sleep 5s: 58.6 % contention, spread across 4 harts
bkl-stress mutex 5s: 71 % contention, ~2.17 M iter/s
Those numbers are the target; the Phase 1 results will be quoted against them.
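For flavor, the mutex workload can be as small as this sketch (illustrative, not the actual bkl-stress source): a single shared, contended pthread_mutex_t defeats libc's user-mode fast path, so nearly every iteration makes the SyncMutexLock/SyncMutexUnlock kercall pair.

```c
#include <pthread.h>
#include <stdatomic.h>

static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static _Atomic unsigned long iters;

/* One of these runs per hart. */
static void *mutex_worker(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&m);     /* contended: SyncMutexLock kercall */
        pthread_mutex_unlock(&m);   /* waiters queued: SyncMutexUnlock kercall */
        atomic_fetch_add_explicit(&iters, 1, memory_order_relaxed);
    }
    return NULL;
}
```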
BKL instrumentation. kernel/bkl_stats.c adds per-CPU counters
(acquires, contended acquires, spin cycles, held cycles, log2 histogram of
spin durations) around acquire_kernel / release_kernel, all gated behind
CONFIG_BKL_STATS (default off). The taskman sysfs grew a synth() callback
so /sys/bkl can snapshot the kernel counters at any open. When the config
is off, the hot path is byte-identical to v0.27.
First measured baseline on a fresh boot: ~28,800 BKL acquisitions across four harts, 34.6 % contention rate, spin distribution peaking in the 2^14–2^17 cycle range. That is roughly 9.8 kcyc per BKL critical section on average.
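The counter shape is roughly this sketch; the struct and helper names are assumptions, only the counter list and the CONFIG_BKL_STATS gate come from the tree:

```c
#ifdef CONFIG_BKL_STATS
struct bkl_stats {
    unsigned long acquires;
    unsigned long contended;
    unsigned long spin_cycles;
    unsigned long held_cycles;
    unsigned long spin_hist[64];    /* log2 histogram of spin durations */
};

DEFINE_PER_CPU(struct bkl_stats, bkl_stats);

/* Called from the contended path of acquire_kernel. */
static inline void bkl_stats_spin(unsigned long cycles)
{
    struct bkl_stats *s = this_cpu_ptr(bkl_stats);
    s->contended++;
    s->spin_cycles += cycles;
    s->spin_hist[cycles ? 63 - __builtin_clzl(cycles) : 0]++;
}
#else
#define bkl_stats_spin(c) ((void)(c))  /* hot path byte-identical when off */
#endif
```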
v0.29 — Phase 1.1–1.2: Fine-Grained Locks Declared
Two structural locks declared, neither yet contended — that is the deliberate pattern.
sync_slock (Phase 1.1) wraps every read and write of sync_hash and
the tSync chains hanging off it. Every sync kercall still holds the BKL
around the entire operation, so sync_slock is uncontended and effectively redundant
today. It exists to tag every sync-hash touchpoint so that Phase 1.5 can
lift the BKL from sync kercalls without re-auditing every call site. One
notable exception: synchash_rem was left BKL-only, because its inner
unlock_kernel/lock_kernel cycle to safely probe a user-space sync->owner
makes a naive spinlock wrap unbounded.
sched_slock (Phase 1.2) declared alongside sync_slock. The scheduler
primitives (force_ready, ready_default, block, unready) compose more
aggressively than the sync primitives — force_ready calls ready, unready
calls block, block calls select_thread via FIND_HIGHEST — and
ready_default alone has five early-return paths. Threading a lock through all
of that without knowing exactly which paths will run BKL-less is premature.
The lock is there; the wrapping waits for Phase 1.5.
CONFIG_KLOCK_DEBUG. An embryonic lockdep: kspinlock_t gains a class
field; kspinlock_lock checks that the new class ranks strictly above every
class on the per-CPU held-stack. On violation: kprintf + crash(). Default
off; when on, a bkl-stress run over Phase 1.1 produces zero violations.
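The rank check itself fits in a few lines. A sketch, assuming a fixed-depth held-stack (KLOCK_HELD_MAX and the stack layout are illustrative; kprintf and crash are the kernel's own):

```c
#define KLOCK_HELD_MAX 8

static unsigned char klock_held[PROCESSORS_MAX][KLOCK_HELD_MAX];
static unsigned      klock_depth[PROCESSORS_MAX];

static void klock_rank_check(unsigned cls)
{
    unsigned cpu = cpu_id();

    for (unsigned i = 0; i < klock_depth[cpu]; i++) {
        if (cls <= klock_held[cpu][i]) {   /* must rank strictly above */
            kprintf("klock: class %u acquired after class %u\n",
                    cls, klock_held[cpu][i]);
            crash();
        }
    }
    klock_held[cpu][klock_depth[cpu]++] = cls;  /* pop on unlock elided */
}
```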
v0.30 — Phase 1.5a–e: First BKL Lift
The distance between "lock declared" and "BKL removed from a kercall" is a series of preparatory steps that each need to be correct before the next one is safe.
P1.5a. synchash_rem wrapped in sync_slock, dropping the lock around
the user-space sync->owner probe. The prerequisite that had been deferred
from Phase 1.1.
P1.5b. sync_wakeup, mutex_holdlist_add, and mutex_holdlist_rem now
take sync_slock internally around their wait-list mutations. They drop the
lock across ready() calls — SCHED < SYNC in the rank order, so calling a
scheduler primitive with a sync lock held would be a rank inversion.
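The drop-and-retake shape, sketched with an illustrative wakeup helper (pril_first is a stand-in for the real wait-list accessor):

```c
static void sync_wakeup_one(tSync *syp)
{
    tThread *thp;

    kspinlock_lock(&sync_slock);
    thp = pril_first(&syp->waiting);       /* wait-list reads and... */
    if (thp)
        pril_rem(&syp->waiting, thp);      /* ...mutations under sync_slock */
    kspinlock_unlock(&sync_slock);

    if (thp)
        ready(thp);   /* SCHED < SYNC: must not hold sync_slock here */
}
```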
P1.5c. ready_default and adjust_priority_default wrapped in
sched_slock. Five exit paths, drop-and-retake around calls that recurse.
P1.5d. force_ready's STATE_MUTEX block and adjust_priority's
STATE_CONDVAR/SEM/MUTEX branches now take sync_slock around their
wait-list mutations. A _locked variant of mutex_holdlist_rem avoids
recursive acquire.
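The _locked split follows the usual pattern; a sketch, with the actual list unlink elided:

```c
static void mutex_holdlist_rem_locked(tThread *thp, tSync *syp)
{
    /* sync_slock held by caller: safe to unlink syp from thp's hold list */
    (void)thp; (void)syp;   /* real list surgery elided in this sketch */
}

void mutex_holdlist_rem(tThread *thp, tSync *syp)
{
    kspinlock_lock(&sync_slock);
    mutex_holdlist_rem_locked(thp, syp);   /* no recursive acquire */
    kspinlock_unlock(&sync_slock);
}
```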
P1.5e — the first lift. __KER_SYNC_MUTEX_UNLOCK marked in
kercall_no_bkl. __kercall_dispatch calls release_kernel() after
entry-time bookkeeping, runs the handler, and re-acquires before the tail.
The body of sync_mutex_unlock was restructured: the *owp = 0 stores switched
to _smp_xchg_ul(owp, ...) (amoswap.d.aqrl) for acquire+release semantics, and
every wait-list read now happens under sync_slock.
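The dispatch shape, sketched under the assumption that kercall_no_bkl is a per-kercall table; the bookkeeping helper names are stand-ins:

```c
static void __kercall_dispatch(int nr, void *args)
{
    kercall_entry_bookkeeping(nr, args);   /* still under the BKL */

    int lifted = kercall_no_bkl[nr];
    if (lifted)
        release_kernel();                  /* handler runs BKL-less */

    kercall_table[nr](args);

    if (lifted)
        acquire_kernel();                  /* tail still expects the BKL */
}
```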
A bug caught in passing: atomic_set_64(owp, value) emits amoor.d
(bit-set, not assign). ORing zero into *owp is a no-op — the mutex never
released, system hung. The correct primitive is _smp_xchg_ul. This
discovery also led to a tree-wide rename: atomic_set → atomic_or
everywhere, to match actual semantics.
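A sketch of the correct primitive in the common RISC-V GCC inline-asm idiom (the constraint style is standard; the exact kernel source may differ):

```c
/* amoswap.d assigns and returns the old value; .aqrl gives the
 * acquire+release ordering. */
static inline unsigned long _smp_xchg_ul(volatile unsigned long *p,
                                         unsigned long v)
{
    unsigned long old;

    __asm__ volatile("amoswap.d.aqrl %0, %2, %1"
                     : "=r"(old), "+A"(*p)
                     : "r"(v)
                     : "memory");
    return old;
}
/* The buggy path was effectively *owp |= 0 (amoor.d): a no-op, so the
 * owner word never cleared and the mutex never released. */
```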
Results from P1.5e:
| Metric | P1.5d (BKL) | P1.5e (lifted) |
|---|---|---|
| Acquires | 1.94 M | 1.42 M |
| Contended | 1.55 M (80 %) | 1.35 M (95 %) |
| Spin cycles | 8.51 Gcyc | 4.18 Gcyc (−51 %) |
| Held cycles | 23.5 Gcyc | 12.25 Gcyc (−48 %) |
| Held cyc/acquire | 12.1 kcyc | 8.6 kcyc (−29 %) |
Half the BKL spin time and half the BKL held time. That is the first real measurement of what BKL removal is worth.
v0.30 (continued) — Phases 1.5f–m: All Sync Kercalls Lifted
The remaining six sync kercalls followed in sequence.
P1.5f lifted __KER_SYNC_CONDVAR_SIGNAL. Body is a single sync_wakeup
call, which was already self-locking since P1.5b. Straightforward.
P1.5g wrapped block, unready, and block_and_ready in sched_slock.
These three are the primitives the upcoming condvar/sem/mutex lifts need.
P1.5h wrapped the standalone pril_add/pril_rem call sites in
ker_sync.c in sync_slock. Prerequisite for the block-path lifts.
P1.5i introduced QRV_ITF_WAIT_PENDING — a per-thread flag set during
the transition window between adding a thread to a wait list and completing
the unready. Wakers on other CPUs spin-wait until the flag clears. This is
the primitive that makes the remaining lifts safe: without it, a thread can
be on the wait list while still STATE_RUNNING, and a concurrent ready()
would CRASHCHECK.
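The handshake, sketched with illustrative helpers; only QRV_ITF_WAIT_PENDING, pril_add, unready, ready, and cpu_smp_mb() come from the post, and where the flag is actually cleared in the tree may differ:

```c
/* Waiter side, on its own CPU (the kercall already runs BKL-less). */
static void waiter_side(tSync *syp, tThread *act)
{
    act->flags |= QRV_ITF_WAIT_PENDING;
    cpu_smp_mb();                          /* publish before list insert */

    kspinlock_lock(&sync_slock);
    pril_add(&syp->waiting, act);          /* visible to wakers from here */
    kspinlock_unlock(&sync_slock);

    unready(act);                          /* complete the block */
    cpu_smp_mb();
    act->flags &= ~QRV_ITF_WAIT_PENDING;   /* transition window closed */
}

/* Waker side, on another CPU, after taking thp off the wait list. */
static void waker_side(tThread *thp)
{
    while (thp->flags & QRV_ITF_WAIT_PENDING)
        cpu_smp_mb();                      /* spin until the unready lands */
    ready(thp);                            /* no STATE_RUNNING race left */
}
```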
P1.5j–k lifted sem_post and sem_wait. One bug surfaced in sem_wait:
block() (called inside unready) clears thp->blocked_on = NULL in its
default case. The first draft of the body assigned blocked_on = syp before
unready, which got clobbered. Fix: assign after unready, with
WAIT_PENDING bridging the window.
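A before/after sketch of that ordering fix (the fragment is illustrative; blocked_on, unready, and the clobber in block()'s default case are from the post):

```c
static void sem_wait_block(tSync *syp, tThread *act)
{
    /* First draft: act->blocked_on = syp came BEFORE unready(), and
     * block()'s default case reset it to NULL. */

    unready(act);              /* runs block(); may clobber blocked_on */
    act->blocked_on = syp;     /* fix: assign after, while WAIT_PENDING
                                * still keeps wakers off this thread */
}
```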
P1.5l lifted condvar_wait. The coupling between kmutex_t and condvar
was the hard part: kmutex_t never enters sync_hash (libc's cmpxchg
fast-path short-circuits the kernel), so sync_lookup's auto-init failed
after a wakeup on a system without the BKL. Fix: force-register the bound
mutex in sync_hash from the condvar_wait body before releasing it.
P1.5m lifted mutex_lock and mutex_revive — the last two. The
priority-inheritance walk through mutex_lock is the most complex body in
the sync subsystem. It stays "best-effort" under contention: stale reads in
the walk degrade priority inheritance but don't break correctness. The
canonical waiter pattern from P1.5i applies here too.
Hardware fences. An important correction mid-lift: the inline
__asm__ volatile("" ::: "memory") barriers in the WAIT_PENDING handshake
were compiler-only. RVWMO permits store-store and load-store reordering at
the hardware level. They were replaced with cpu_smp_mb(), a portable
abstraction defined per-arch as fence rw, rw on RISC-V and mfence on x86_64.
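The per-arch definition as described:

```c
#if defined(__riscv)
#define cpu_smp_mb()  __asm__ volatile("fence rw, rw" ::: "memory")
#elif defined(__x86_64__)
#define cpu_smp_mb()  __asm__ volatile("mfence" ::: "memory")
#else
#error "cpu_smp_mb: port me"
#endif
```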
Phase 1.5 exits with all seven sync kercalls running BKL-less.
v0.31 — PRIL/LINK3 Alias Corruption
After Phase 1.5 completed, a rare but real crash surfaced during devb-nvme bringup:
*** PRIL CORRUPTION: HEAD prev.prio_tail priority mismatch ***
CRASH: nano/nano_misc.c:328
Root cause: PRIL_ENTRY_FIELDS puts a union at the start of tThread. The
same first 16 bytes serve as either a PRIL wait-list (next.pril,
prev.prio_tail) or a LINK3 dispatch list (next.thread, prev.thread). A
thread cannot legally be on both simultaneously.
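An illustrative layout of those aliased 16 bytes; the union shape is an assumption, the field names are from the crash analysis:

```c
typedef struct sThread tThread;
struct sThread {
    union {
        tThread *pril;        /* PRIL: next entry on the wait list */
        tThread *thread;      /* LINK3: next thread on dispatch_queue */
    } next;
    union {
        tThread *prio_tail;   /* PRIL: HEAD's tail backlink */
        tThread *thread;      /* LINK3: previous thread on dispatch_queue */
    } prev;
    /* ...rest of tThread... */
};
```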
The P1.5l-lifted ker_sync_condvar_wait first calls pril_add, putting act on
the wait list, then calls sync_mutex_unlock, which reaches ready(waiter) →
ready_default. In the SMP "can replace this CPU" path, ready_default calls
LINK3_BEG(dispatch_queue, act, tThread), writing act->next.thread and
act->prev.thread: the very bytes the pril_add just set. The prev.prio_tail
that should equal act (for a HEAD+TAIL single-element list) now holds a stale
dispatch-queue backlink.
Fix: in ready_default, before bumping act, check QRV_ITF_WAIT_PENDING.
If set, fall through — act will finish its own transition and ker_exit
will pick it up. The bump is skipped.
Verification: 20/20 clean boots, zero PRIL corruption.
v0.32–v0.33 — Channel Lock and Trace Ring
v0.32 began Phase 2 of BKL removal: the IPC path. channel_slock
declared (rank between INTREVENT and CONNECT). The wait-list primitives
in nano_message.c (remove_unblock, net_send2, net_sendmsg) and the
channel-destroy and pulse paths in nano_pulse.c and ker_channel.c now
take channel_slock around their queue mutations. All of this is still
uncontended — the BKL is still held by every caller. The four target IPC
kercalls (MsgSend, MsgSendPulse, MsgReceive, MsgReply) got their
queue-op sites wrapped and WAIT_PENDING spin-waits inserted in preparation
for the eventual BKL lift.
v0.33 added a lockless per-CPU function-entry trace ring: 32 entries per
CPU, a TRACE_FN(id, ctx, arg) macro, and a trace_ring_dump(cpu) called
from the crash path. During the BKL hunt this was the tool that revealed how
the __KER_MSG_RECEIVEV canonical waiter pattern interacted with the
timer-trap-handler's in_kernel_context() routing to leak BSS globals into
active thread register save areas. The ring stays in the tree — it will be
useful again.
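The ring is small enough to sketch in full; the entry layout and cpu_id() accessor are assumptions, the TRACE_FN(id, ctx, arg) shape and trace_ring_dump(cpu) come from the post:

```c
#define TRACE_RING_SIZE 32

struct trace_entry { unsigned id; unsigned long ctx, arg; };

static struct {
    unsigned head;                         /* free-running write index */
    struct trace_entry e[TRACE_RING_SIZE];
} trace_ring[PROCESSORS_MAX];

/* One writer per CPU, so no lock and no atomics needed. */
#define TRACE_FN(id_, ctx_, arg_) do {                                   \
        unsigned c_ = cpu_id();                                          \
        struct trace_entry *te_ =                                        \
            &trace_ring[c_].e[trace_ring[c_].head++ % TRACE_RING_SIZE];  \
        te_->id = (id_);                                                 \
        te_->ctx = (unsigned long)(ctx_);                                \
        te_->arg = (unsigned long)(arg_);                                \
    } while (0)

void trace_ring_dump(unsigned cpu)         /* called from the crash path */
{
    for (unsigned i = 0; i < TRACE_RING_SIZE; i++) {
        struct trace_entry *te =
            &trace_ring[cpu].e[(trace_ring[cpu].head + i) % TRACE_RING_SIZE];
        kprintf("cpu%u[%02u] id=%u ctx=%lx arg=%lx\n",
                cpu, i, te->id, te->ctx, te->arg);
    }
}
```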
Where Things Stand
Seven sync kercalls run BKL-less. Phase 2 infrastructure is in place for the IPC path. The BKL is not gone — that was never the claim — but the structural work to remove it section by section is well underway, documented, and measured.