System log: QRV v0.41–v0.43: The Lock-Free Era — and a Final Lesson from the Hardware

This is the last development post for QRV. The project set out to port QNX Neutrino 6.4 to RISC-V 64-bit, run it on real hardware, and explore what it would take to bring a clean microkernel architecture to a modern open ISA. Those goals are met. v0.43 is the final release.

v0.41: Intermediate Ground

v0.41 was never really a release — it was a tag on a working branch that happened to have the right shape at a particular moment. Many things didn't work. It served mainly as the foundation for the work that followed, so it gets a brief mention here for completeness.

The headline work in progress at that point: BKL removal on the IPC path. The IPC kercalls (MsgSend, MsgReceive, MsgReply, MsgSendPulse) are the kernel's most frequently executed code; the entire message-passing architecture flows through them. Getting them out from under the Big Kernel Lock required a different class of care than the sync primitives that Phase 1 had covered.

The infrastructure assembled during the v0.41 sprint:

Per-object locks. tChannel grew channel_slock, tConnect grew connect_slock, and each tProcess got a vec_slock for its fdcons/chancons vector operations. klock_debug, which previously tracked lock classes, was extended to track (class, instance) pairs. That became necessary once a single kercall legitimately touches two processes (ConnectAttach is the canonical case: client prpc and server prps) and needs to acquire both their vector locks without AB/BA deadlock. Same-instance recursive acquires remain forbidden; same-class different-instance acquires are now legal.
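One common way to make same-class, different-instance acquisition safe is to order the two instances by address. The sketch below illustrates that idea only; the post does not say which rule QRV or klock_debug actually uses, and every demo_ name is a hypothetical stand-in for the real types.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-ins for QRV's slock and tProcess; not the real types. */
typedef struct {
    atomic_flag held;
} demo_slock_t;

typedef struct {
    int          pid;
    demo_slock_t vec_slock;   /* guards this process's connection vectors */
} demo_process_t;

static void demo_process_init(demo_process_t *p, int pid)
{
    p->pid = pid;
    atomic_flag_clear(&p->vec_slock.held);
}

static void demo_slock_acquire(demo_slock_t *l)
{
    /* Spin until the flag was previously clear; acquire ordering on success. */
    while (atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire)) {
        /* busy-wait */
    }
}

static void demo_slock_release(demo_slock_t *l)
{
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}

/*
 * Lock the vector locks of two processes. Both locks belong to the same
 * class, so the order is decided per instance: lowest address first
 * (an assumption of this sketch). Any two callers then agree on the
 * order, whichever process they call "client" and which "server".
 */
static void demo_vec_lock_pair(demo_process_t *a, demo_process_t *b)
{
    if (a == b) {                       /* same instance: take it once */
        demo_slock_acquire(&a->vec_slock);
        return;
    }
    if ((uintptr_t)a > (uintptr_t)b) {  /* normalise to address order */
        demo_process_t *tmp = a;
        a = b;
        b = tmp;
    }
    demo_slock_acquire(&a->vec_slock);
    demo_slock_acquire(&b->vec_slock);
}

static void demo_vec_unlock_pair(demo_process_t *a, demo_process_t *b)
{
    demo_slock_release(&a->vec_slock);
    if (a != b)
        demo_slock_release(&b->vec_slock);
}
```

Calling demo_vec_lock_pair(client, server) on one CPU and demo_vec_lock_pair(server, client) on another takes the two locks in the same order, which is what rules out the AB/BA case.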

Safe Memory Reclamation (SMR). A minimal port of FreeBSD's smr(9) (BSD-2-Clause, Jeffrey Roberson): per-CPU sequence counters, lock-free read sections via smr_enter/smr_exit, and smr_synchronize to drain in-flight readers before reclaiming memory. It is applied to kerext_process_destroy: after vector_rem removes the process from process_vector, smr_synchronize(&process_smr) waits for every in-flight reader that captured a prp pointer before the removal. Only then does the destroy path mutate or recycle prp's memory. Five kercalls (CLOCK_TIME, CLOCK_ID, SCHED_INFO, CONNECT_SERVER_INFO, CONNECT_FLAGS) were converted to use lookup_pid_smr and to open read sections around every prp dereference.
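The sequence-counter idea can be sketched in a few lines of C11. This is a heavily simplified illustration, not FreeBSD's smr(9) and not the QRV port: it assumes a fixed CPU count, uses sequentially consistent atomics throughout, ignores counter wraparound, and every demo_ name is hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define DEMO_NCPU 8   /* assumption: fixed CPU count for the sketch */

typedef struct {
    atomic_ulong wr_seq;              /* advanced by writers; starts at 1 */
    atomic_ulong cpu_seq[DEMO_NCPU];  /* 0 = CPU is outside any read section */
} demo_smr_t;

static void demo_smr_init(demo_smr_t *s)
{
    atomic_init(&s->wr_seq, 1);
    for (int i = 0; i < DEMO_NCPU; i++)
        atomic_init(&s->cpu_seq[i], 0);
}

/* Reader side: publish the write sequence observed on entry. */
static void demo_smr_enter(demo_smr_t *s, int cpu)
{
    atomic_store(&s->cpu_seq[cpu], atomic_load(&s->wr_seq));
}

/* Reader side: mark this CPU as holding no protected pointers. */
static void demo_smr_exit(demo_smr_t *s, int cpu)
{
    atomic_store(&s->cpu_seq[cpu], 0);
}

/* Writer side: call after unlinking the object; returns the goal to wait for. */
static unsigned long demo_smr_advance(demo_smr_t *s)
{
    return atomic_fetch_add(&s->wr_seq, 1) + 1;
}

/* True once no CPU is still inside a read section entered before `goal`. */
static bool demo_smr_poll(demo_smr_t *s, unsigned long goal)
{
    for (int i = 0; i < DEMO_NCPU; i++) {
        unsigned long seen = atomic_load(&s->cpu_seq[i]);
        if (seen != 0 && seen < goal)
            return false;
    }
    return true;
}

/* Writer side: block until every pre-existing reader has left. */
static void demo_smr_synchronize(demo_smr_t *s)
{
    unsigned long goal = demo_smr_advance(s);
    while (!demo_smr_poll(s, goal)) {
        /* spin; a real kernel would relax or yield here */
    }
}
```

In the destroy path described above, the order would be: unlink the object from the lookup structure, call the synchronize step, and only then reuse the memory. Readers that entered after the unlink publish a sequence at or above the goal and are not waited for.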

kprintf re-entrancy. A synchronous trap inside a kprintf call body would re-enter the trap handler, which itself calls kprintf. The inner kprintf_lock() would spin forever on the slock the outer call already held — a deadlock in the diagnostic path, which is the worst possible place for a deadlock. Fixed via a per-CPU in_kprintf[] nesting counter: the outermost caller acquires kprintf_slock; nested callers fall through and interleave their output. Acceptable; the alternative is silence.
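The nesting-counter pattern can be sketched as follows. The names are hypothetical and the lock is a plain C11 atomic_flag rather than QRV's slock; the sketch assumes each counter slot is only ever touched by its own CPU.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define DEMO_NCPU 8   /* assumption: fixed CPU count for the sketch */

static atomic_flag demo_kprintf_slock = ATOMIC_FLAG_INIT;
static unsigned    demo_in_kprintf[DEMO_NCPU];  /* per-CPU nesting depth */

/* Returns true if this call took the lock (outermost caller on this CPU). */
static bool demo_kprintf_lock(int cpu)
{
    if (demo_in_kprintf[cpu]++ > 0)
        return false;   /* nested via a trap: do not spin on our own lock */
    while (atomic_flag_test_and_set_explicit(&demo_kprintf_slock,
                                             memory_order_acquire)) {
        /* busy-wait for another CPU's kprintf to finish */
    }
    return true;
}

/* Drops one nesting level; only the outermost caller releases the lock. */
static void demo_kprintf_unlock(int cpu)
{
    if (--demo_in_kprintf[cpu] == 0)
        atomic_flag_clear_explicit(&demo_kprintf_slock, memory_order_release);
}
```

A nested caller prints without owning the lock, so its output can interleave with the interrupted call's output, which is the trade-off the paragraph above accepts.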

SBI DBCN console. The legacy sbi_ecall_console_putc was a single-byte ECALL. Concurrent writers interleaved at byte granularity — [4] !UFLT!sc=c<binary>a)<binary>X... was a real crash log output, not a display artifact. Switched to SBI v2.0 DBCN_CONSOLE_WRITE, which writes an entire buffer in one ECALL, atomic at OpenSBI's M-mode handler. OpenSBI v1.6+ is now a hard requirement.
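For reference, a buffer write through the SBI Debug Console extension looks roughly like the sketch below. The extension and function IDs come from the SBI specification; the demo_ names, the DEMO_IN_SMODE guard, and the host stub are assumptions added so the sketch also compiles outside an S-mode RISC-V kernel. It assumes RV64 and a physical buffer address.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_SBI_EXT_DBCN    0x4442434EUL  /* "DBCN" extension ID */
#define DEMO_SBI_DBCN_WRITE  0UL           /* sbi_debug_console_write */

struct demo_sbiret { long error; long value; };

/* One ECALL: hand the firmware a physical buffer and a length. */
static struct demo_sbiret demo_sbi_dbcn_write(uintptr_t buf_phys, size_t len)
{
    struct demo_sbiret ret;
#if defined(__riscv) && defined(DEMO_IN_SMODE)
    register unsigned long a0 __asm__("a0") = len;       /* num_bytes */
    register unsigned long a1 __asm__("a1") = buf_phys;  /* base_addr_lo */
    register unsigned long a2 __asm__("a2") = 0;         /* base_addr_hi (RV64) */
    register unsigned long a6 __asm__("a6") = DEMO_SBI_DBCN_WRITE;
    register unsigned long a7 __asm__("a7") = DEMO_SBI_EXT_DBCN;
    __asm__ volatile("ecall"
                     : "+r"(a0), "+r"(a1)
                     : "r"(a2), "r"(a6), "r"(a7)
                     : "memory");
    ret.error = (long)a0;   /* SBI error code, 0 on success */
    ret.value = (long)a1;   /* number of bytes actually written */
#else
    (void)buf_phys;
    ret.error = 0;          /* host stub: pretend everything was written */
    ret.value = (long)len;
#endif
    return ret;
}

/* Retry until the whole buffer is out or the firmware reports an error. */
static long demo_console_write(uintptr_t buf_phys, size_t len)
{
    size_t done = 0;
    while (done < len) {
        struct demo_sbiret r = demo_sbi_dbcn_write(buf_phys + done, len - done);
        if (r.error != 0)
            return r.error;             /* negative SBI error code */
        done += (size_t)r.value;        /* the call may write fewer bytes */
    }
    return (long)done;
}
```

The SBI specification describes the write as non-blocking and allows it to accept fewer bytes than requested, which is why the sketch loops. If a retry is ever needed, the line is no longer written in a single call.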

Four IPC kercalls BKL-less. After all the preparation: all four main IPC kercalls now run without holding the BKL. The races that the preparation closed:

A LINKPRIL_REM store-to-NULL in ker_msg_sendv when two CPUs probed the same thp from receive_queue in the gap between the probe and the START_SMP_XFER claim.
The same shape in ker_msg_replyv. Both fixed by claiming thp via QRV_ITF_MSG_DELIVERY atomically under channel_slock at the peek, holding the claim through LINKPRIL_REM, clearing it inside the channel lock window after removal.
WAIT_PENDING clear in ker_msg_receivev using a compiler barrier instead of cpu_smp_mb() — correct for x86 (total store order), wrong for RISC-V (RVWMO permits store-store reordering). Replaced with fence rw, rw.

v0.42: The Lock-Free Era

v0.42 is the squash of the full bkl-removal branch onto main. The per-commit history is preserved on the branch; the headline is what actually works.

The Big Kernel Lock is gone. Not reduced, not bypassed on a subset of kercalls — gone. The architecture that replaced it:

Fine-grained per-object locking: sync_slock, sched_slock, channel_slock, per-process vec_slock, per-channel chp->lock, per-connect cop->lock, intrevent_slock, clock_slock, alloc_slock. Each lock protects exactly its own subsystem's state. The lock-rank ordering enforced by klock_debug prevents deadlock by construction.

Lock-free message passing: MsgSend, MsgReceive, MsgReply, MsgSendPulse all run BKL-less. The channel queues are protected by channel_slock; the WAIT_PENDING flag bridges the transition window between queue insertion and state change; MSG_DELIVERY claims prevent concurrent wakers from touching the same thread simultaneously.

SMR-retired tThread / tConnect / tChannel: the Safe Memory Reclamation mechanism from v0.41 now covers process lookups. A thread can hold a reference to a tProcess across a context switch; the destroy path synchronizes before reclaiming, so no dangling pointer is possible.

Chain-keyed sync (FreeBSD umtx-style): the sync hash is keyed by (memobj, offset), partitioned per process. Priority-inheritance walks are best-effort under lock-free contention — correctness is preserved, priority boost quality degrades gracefully under high contention.

Self-reap thread/process teardown: dying threads and processes clean up their own resources rather than delegating to a separate terminator thread. The terminator-deadlock class that plagued early versions is structurally impossible.

Per-CPU ready queues: the scheduler dispatch path no longer touches a global queue on every context switch.

Atomic tThread.internal_flags: the final race, found at v0.42-rc12.

The final race. 8 CPUs, 300 pidin processes spawning and exiting in a loop. 15 boots clean. One stalls: the heartbeat goes silent, gdb shows a sender spinning forever at ker_fastmsg.c:445 on thp->internal_flags & QRV_ITF_WAIT_PENDING.

The value of internal_flags in the stuck thread: 0x21. That is MSG_DELIVERY (bit 0, set by the sender as a claim) plus WAIT_PENDING (bit 5, which should have been cleared by the receiver).

What happened: both harts loaded internal_flags while it still held 0x20. The receiver computed flags & ~0x20 and was about to store 0x00. The sender computed flags | 0x01 and was about to store 0x21. The receiver's store landed first; the sender's store of 0x21 landed second, resurrecting the exact bit the sender was spinning on. A classic lost update: one word, two harts, no atomic.

The BKL had been providing the atomicity that the plain |=/&=~ relied on. Removing it didn't introduce the race. It revealed it.

Fix: ITF_SET/ITF_CLR macros wrapping atomic_or/atomic_clr (RISC-V amoor.w/amoand.w). 55 RMW sites converted across ker_fastmsg.c, ker_message.c, ker_net.c, ker_channel.c, and the nano subsystems. Read-only tests (& bit) remain plain aligned loads.
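The interleaving can be replayed deterministically on one thread, which makes the lost update easy to see next to the atomic version. This is a sketch using C11 atomics in place of QRV's atomic_or/atomic_clr; the bit values come from the post, while the DEMO_ macros and function names are hypothetical.

```c
#include <stdatomic.h>
#include <stdint.h>

#define DEMO_ITF_MSG_DELIVERY  0x01u  /* bit 0, per the post */
#define DEMO_ITF_WAIT_PENDING  0x20u  /* bit 5, per the post */

/* Replay of the racy interleaving using plain loads and stores. */
static uint32_t demo_replay_lost_update(void)
{
    uint32_t flags = DEMO_ITF_WAIT_PENDING;   /* 0x20 in memory */

    uint32_t receiver_tmp = flags;            /* receiver loads 0x20 */
    uint32_t sender_tmp   = flags;            /* sender loads 0x20 */

    receiver_tmp &= ~DEMO_ITF_WAIT_PENDING;   /* receiver computes 0x00 */
    sender_tmp   |=  DEMO_ITF_MSG_DELIVERY;   /* sender computes 0x21 */

    flags = receiver_tmp;                     /* receiver's store lands first */
    flags = sender_tmp;                       /* sender's store overwrites it */
    return flags;                             /* WAIT_PENDING is set again */
}

/* The same two updates as single atomic read-modify-write operations. */
#define DEMO_ITF_SET(p, bit) atomic_fetch_or((p), (uint32_t)(bit))
#define DEMO_ITF_CLR(p, bit) atomic_fetch_and((p), ~(uint32_t)(bit))

static uint32_t demo_atomic_updates(void)
{
    _Atomic uint32_t flags;
    atomic_init(&flags, DEMO_ITF_WAIT_PENDING);
    DEMO_ITF_SET(&flags, DEMO_ITF_MSG_DELIVERY);  /* sender's claim */
    DEMO_ITF_CLR(&flags, DEMO_ITF_WAIT_PENDING);  /* receiver's clear */
    return atomic_load(&flags);
}
```

The replay ends at 0x21, the value seen in the stuck thread. With atomic RMWs each update applies to whatever value is in memory at that instant, so the result is 0x01 in either order.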

Result: 16/16 -P8 boots reach PASS: 300 pidins + login:, zero stalls.

Taskman in U-mode on real hardware. Three deterministic fixes to get U-mode taskman running on the SiFive FU740:

The modpkg_elf.c ELF loader built segment and stack PTEs without PTE_A and PTE_D. The U74 uses trap-based A/D management: it raises a page fault on any access through a PTE with PTE_A clear, and on a store through a PTE with PTE_D clear. QEMU updates these bits itself during the page-table walk, as an implementation with hardware-managed A/D bits would, so the bug was invisible there. U-mode taskman faulted on its very first instruction fetch on real silicon. Fix: set PTE_A on every U-mode leaf PTE and PTE_D on writable ones (5 sites in the loader), mirroring the startup MMU code that had the same fix applied in v0.21.
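A leaf-PTE builder that follows this rule might look like the sketch below. The bit positions are those of the RISC-V privileged specification; the function name and signature are hypothetical and do not reflect the actual loader code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Sv39/Sv48 PTE flag bits, as defined by the RISC-V privileged spec. */
#define DEMO_PTE_V (1ull << 0)   /* valid */
#define DEMO_PTE_R (1ull << 1)   /* readable */
#define DEMO_PTE_W (1ull << 2)   /* writable */
#define DEMO_PTE_X (1ull << 3)   /* executable */
#define DEMO_PTE_U (1ull << 4)   /* accessible from U-mode */
#define DEMO_PTE_A (1ull << 6)   /* accessed */
#define DEMO_PTE_D (1ull << 7)   /* dirty */
#define DEMO_PTE_PPN_SHIFT 10

/*
 * Build a U-mode leaf PTE for a 4 KiB page. PTE_A is always preset and
 * PTE_D is preset on writable pages, so a core that traps on clear A/D
 * bits never faults on the first access. Callers should not request
 * w without r: that combination is reserved by the spec.
 */
static uint64_t demo_user_leaf_pte(uint64_t phys_addr, bool r, bool w, bool x)
{
    uint64_t pte = ((phys_addr >> 12) << DEMO_PTE_PPN_SHIFT)
                 | DEMO_PTE_V | DEMO_PTE_U | DEMO_PTE_A;
    if (r)
        pte |= DEMO_PTE_R;
    if (w)
        pte |= DEMO_PTE_W | DEMO_PTE_D;
    if (x)
        pte |= DEMO_PTE_X;
    return pte;
}
```

For example, a read-execute text page gets V, R, X, U, and A, while a read-write data or stack page additionally gets W and D.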

The rsrcdbmgr HWINFO bug: reserve_ranges() was passing hwi_location.addrspace (a HWI_ADDRSPACE_* kind enum: 0=io, 1=memory) into get_as_type(), which expected a byte offset into the asinfo section. A kind of 1 dereferenced a phantom entry at asinfo_base+1; the garbage offset fed a wild strcmp. On QEMU the garbage at base+1 happened to be a valid string offset. On real hardware it faulted. Every UART and PCI device range was silently skipped in rsrcdbmgr; the fix makes them correctly reserved as MEMORY resources.

A dbgprintf PLT relocation: dbgprintf was in the loader's ELF as a PLT stub (an imported symbol). During early U-mode taskman boot, before the dynamic linker had resolved the PLT, any call to dbgprintf jumped into uninitialized memory. Made it self-contained — no PLT, direct reference, same pattern as every other early-boot diagnostic function.

After all three: U-mode taskman on the Unmatched boots through the syscall conformance suite (89/89), enumerates the Samsung NVMe, mounts /usr from the QRV partition, and reaches the login prompt.

v0.43: The Final Lesson

v0.43 is QRV's last release. It closes a bug that had been open since the U-mode taskman work began on real hardware, and it closes it with a lesson worth writing down.

"The crash that wore many faces"

After v0.42 was working on the Unmatched, a stress test was run: 30 consecutive pidin spawns and exits. The results were consistent: clean for a few rounds, then a fault of some kind. But not the same fault. Sometimes an illegal instruction. Sometimes a load page fault at a user address. Sometimes the process jumped to what looked like garbage code. Sometimes sepc pointed into the middle of a function, with no plausible call path to get there.

The initial hypothesis was memory corruption: a write escaping one address space and landing in another's code pages. This was a reasonable guess, since the system had been through a complex U-mode migration and the page table code had had bugs before. The Porting Story chapter documenting the bug hunt describes the investigation in detail, including the places it confidently looked and didn't find the cause.

The actual cause was simpler and more fundamental.

RISC-V I-cache Coherence

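RISC-V does not guarantee that a hart's instruction fetches observe earlier data stores. When you write bytes to a memory region and then execute them, you must explicitly synchronize the instruction stream with those stores. The relevant instruction is fence.i, and it only covers the hart that executes it: a store made on one hart is not guaranteed to be visible to another hart's instruction fetches until that other hart executes its own fence.i.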

QRV was not doing this. When the ELF loader wrote a program's instructions into freshly allocated pages, it performed no cache maintenance. The pages were clean in the data cache; the instruction cache on each hart still held lines from whatever had previously occupied those physical pages. A newly spawned process — running on a hart that had recently executed something else in those pages — would fetch stale instructions, branch to garbage addresses, and produce exactly the random-looking fault signatures that had been appearing.

The reason this was invisible on QEMU: QEMU models no instruction cache. Every fetch sees the current memory contents.

The reason -P 1 was clean: with a single hart, the loader and the new process share the same hart. The fence.i implicit in the context switch (which flushes the pipeline) is sufficient for single-hart coherence.

The reason the fault appeared "after a few rounds" rather than immediately: page reuse. The allocator returns fresh pages the first time; recycled pages carry the previous tenant's cache lines.

The Fix

cpu_icache_sync_all(): execute a local fence.i and then broadcast an SBI remote fence.i to all other harts. Called from two sites: mm_reference when an executable page is initialized, and pageman_mprotect at the writable→exec transition that every ELF loader uses.

The call goes through a new kerext, __KER_TM_ICACHE_SYNC. Taskman now runs in U-mode, and while fence.i itself is unprivileged, the remote fence is an SBI call that only the S-mode kernel can issue.
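The kernel-side half of that fix can be sketched as a local fence.i followed by the SBI RFENCE call. The extension and function IDs are from the SBI specification; the demo_ names, the DEMO_IN_SMODE guard, and the host stub are assumptions so the sketch compiles anywhere, and this is not QRV's actual cpu_icache_sync_all().

```c
#include <stdint.h>

#define DEMO_SBI_EXT_RFENCE       0x52464E43UL  /* "RFNC" extension ID */
#define DEMO_SBI_RFENCE_REMOTE_I  0UL           /* sbi_remote_fence_i */

/* Ask the SBI firmware to execute fence.i on every other hart. */
static long demo_sbi_remote_fence_i_all(void)
{
#if defined(__riscv) && defined(DEMO_IN_SMODE)
    register unsigned long a0 __asm__("a0") = 0;                  /* hart_mask, ignored */
    register unsigned long a1 __asm__("a1") = (unsigned long)-1;  /* base -1: all harts */
    register unsigned long a6 __asm__("a6") = DEMO_SBI_RFENCE_REMOTE_I;
    register unsigned long a7 __asm__("a7") = DEMO_SBI_EXT_RFENCE;
    __asm__ volatile("ecall"
                     : "+r"(a0), "+r"(a1)
                     : "r"(a6), "r"(a7)
                     : "memory");
    return (long)a0;   /* SBI error code, 0 on success */
#else
    return 0;          /* host stub: nothing to synchronize */
#endif
}

/* Make freshly written instructions visible to fetch on all harts. */
static long demo_icache_sync_all(void)
{
#if defined(__riscv)
    __asm__ volatile("fence.i" ::: "memory");   /* this hart */
#endif
    return demo_sbi_remote_fence_i_all();        /* every other hart */
}
```

The sync has to run after the last store to the code page and before any hart is allowed to fetch from it, which matches the two call sites named above.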

Result on the Unmatched: ~640 consecutive spawns without a fault. The stress test that previously failed within 30 iterations now runs to completion.

A companion fix: __cpu_membarrier, defined in cpu_inline.h, had been a bare nop on RISC-V while x86_64 emitted mfence. This was a latent inconsistency in the portable memory-ordering abstraction — the name implied a hardware barrier, but RISC-V was getting a compiler barrier only. Fixed to emit fence rw, rw on RISC-V, consistent with the intent and with the existing cpu_smp_mb().
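The difference between the two kinds of barrier is easy to show side by side. The sketch below is an illustration with hypothetical names, not the contents of cpu_inline.h; it falls back to a C11 fence on architectures other than the two the post mentions.

```c
#include <stdatomic.h>

/* Compiler barrier only: stops compiler reordering, emits no instruction. */
static inline void demo_compiler_barrier(void)
{
    __asm__ volatile("" ::: "memory");
}

/* Hardware barrier: orders earlier loads/stores before later ones. */
static inline void demo_cpu_membarrier(void)
{
#if defined(__riscv)
    __asm__ volatile("fence rw, rw" ::: "memory");
#elif defined(__x86_64__)
    __asm__ volatile("mfence" ::: "memory");
#else
    atomic_thread_fence(memory_order_seq_cst);   /* portable fallback */
#endif
}

static int        demo_payload;
static atomic_int demo_ready;

/* Publisher: the payload must be visible before the flag is. */
static void demo_publish(int value)
{
    demo_payload = value;
    demo_cpu_membarrier();
    atomic_store_explicit(&demo_ready, 1, memory_order_relaxed);
}

/* Consumer: read the payload only after observing the flag. */
static int demo_consume(void)
{
    while (!atomic_load_explicit(&demo_ready, memory_order_relaxed)) {
        /* spin until published */
    }
    demo_cpu_membarrier();
    return demo_payload;
}
```

With only demo_compiler_barrier() in the publisher, a weakly ordered core may make the flag visible before the payload. Pairing inline-assembly fences with relaxed atomics is kernel practice rather than something the C11 memory model formally covers, so treat it as a sketch of the intent.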

Closing

This is the last release of QRV. The goals set at the start are met:

A QNX Neutrino 6.4 microkernel, ported from the original 32-bit ILP32 sources to 64-bit LP64, running on RISC-V.
SMP with fine-grained locking and no global kernel lock.
Taskman in U-mode with hardware privilege separation.
Booting from a real Samsung NVMe drive on a real RISC-V workstation.
Multi-user interactive login with SHA-512 authentication.
89 syscall conformance tests passing on real silicon.

From a blank slate on February 27, 2026 — or from a deeper beginning in December 2020, or from RadiOS in 1998, depending on where you count the start — to a running system. The partition that was left empty in July 2021 has been /usr for a while now.

The work QRV began does not stop here. A new project is underway, built on the same architectural foundations, aimed at providing a fully free and open QNX-compatible operating system for RISC-V. Its name will be announced in due course.


Tuesday, May 26, 2026
