This is the last development post for QRV. The project set out to port QNX Neutrino 6.4 to RISC-V 64-bit, run it on real hardware, and explore what it would take to bring a clean microkernel architecture to a modern open ISA. Those goals are met. v0.43 is the final release.
v0.41: Intermediate Ground
v0.41 was never really a release — it was a tag on a working branch that happened to have the right shape at a particular moment. Many things didn't work. It served mainly as the foundation for the work that followed, so it gets a brief mention here for completeness.
The headline work in progress at that point: BKL removal on the IPC path.
The IPC kercalls (MsgSend, MsgReceive, MsgReply, MsgSendPulse) are
the kernel's most frequently executed code: the entire message-passing
architecture flows through them. Getting them out from under the Big Kernel
Lock required a different class of care than the sync primitives that Phase 1
had covered.
The infrastructure assembled during the v0.41 sprint:
Per-object locks. tChannel grew channel_slock, tConnect grew
connect_slock, and each tProcess got a vec_slock for its
fdcons/chancons vector operations. klock_debug, which previously tracked
lock classes, was extended to track (class, instance) pairs. This became
necessary once a single kercall legitimately touches two processes
(ConnectAttach is the canonical case: client prpc and server prps) and needs
to acquire both their vector locks without AB/BA deadlock. Same-instance
recursive acquires remain forbidden; same-class different-instance acquires
are now legal.
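One common way to take two locks of the same class without AB/BA deadlock is to always acquire them in address order. The post does not say which ordering rule QRV applies, so the address-order rule, the handling of the same-instance case, and every name below (toy_slock, toy_lock_pair, and so on) are assumptions made for illustration, not QRV code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a kernel spinlock; not QRV's real slock type. */
typedef struct {
    atomic_bool held;
} toy_slock;

static void toy_lock(toy_slock *l) {
    /* Spin until the previous value was "not held". */
    while (atomic_exchange_explicit(&l->held, true, memory_order_acquire)) {
    }
}

static bool toy_trylock(toy_slock *l) {
    return !atomic_exchange_explicit(&l->held, true, memory_order_acquire);
}

static void toy_unlock(toy_slock *l) {
    /* Releasing a lock that is not held is a bug. */
    assert(atomic_load_explicit(&l->held, memory_order_relaxed));
    atomic_store_explicit(&l->held, false, memory_order_release);
}

/* Acquire two locks of the same class. Assumed rule: lower address first,
 * so two CPUs locking the same pair always agree on the order. */
static void toy_lock_pair(toy_slock *a, toy_slock *b) {
    if (a == b) {            /* same instance: take it once, never twice */
        toy_lock(a);
        return;
    }
    if ((uintptr_t)a > (uintptr_t)b) {
        toy_slock *tmp = a;
        a = b;
        b = tmp;
    }
    toy_lock(a);
    toy_lock(b);
}

static void toy_unlock_pair(toy_slock *a, toy_slock *b) {
    toy_unlock(a);
    if (a != b)
        toy_unlock(b);
}
```

With this rule, toy_lock_pair(&client, &server) on one CPU and toy_lock_pair(&server, &client) on another both take the lower-addressed lock first, so neither can hold one lock while waiting for the other in the opposite order.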
Safe Memory Reclamation (SMR). A minimal port of FreeBSD's smr(9)
(BSD-2-Clause, Jeffrey Roberson). Per-CPU sequence counters, lock-free
read sections via smr_enter/smr_exit, smr_synchronize to drain
in-flight readers before reclaiming memory. Applied to
kerext_process_destroy: after vector_rem removes the process from
process_vector, smr_synchronize(&process_smr) waits for every in-flight
reader that captured a prp pointer before the removal. Only then does the
destroy path mutate or recycle prp's memory. Five kercalls
(CLOCK_TIME, CLOCK_ID, SCHED_INFO, CONNECT_SERVER_INFO,
CONNECT_FLAGS) converted to use lookup_pid_smr and open read sections
around every prp dereference.
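The enter/exit/synchronize protocol described above can be sketched with a global write sequence and one read-sequence slot per CPU. This is a heavily simplified toy, not FreeBSD's smr(9) and not QRV's port of it: all toy_smr_* names are hypothetical, sequence wraparound is ignored, and everything uses sequentially consistent atomics for simplicity.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_NCPU 8

typedef struct {
    atomic_uint_fast64_t write_seq;          /* advanced by writers, starts at 1 */
    atomic_uint_fast64_t read_seq[TOY_NCPU]; /* 0 = CPU is not in a read section */
} toy_smr;

static void toy_smr_init(toy_smr *s) {
    atomic_init(&s->write_seq, 1);
    for (int i = 0; i < TOY_NCPU; i++)
        atomic_init(&s->read_seq[i], 0);
}

/* Reader: publish the sequence observed at entry before touching any
 * protected pointer. Read sections do not nest in this toy. */
static void toy_smr_enter(toy_smr *s, int cpu) {
    assert(cpu >= 0 && cpu < TOY_NCPU);
    assert(atomic_load(&s->read_seq[cpu]) == 0);
    atomic_store(&s->read_seq[cpu], atomic_load(&s->write_seq));
}

static void toy_smr_exit(toy_smr *s, int cpu) {
    atomic_store(&s->read_seq[cpu], 0);
}

/* Writer: call after unlinking the object. Returns the goal sequence. */
static uint_fast64_t toy_smr_advance(toy_smr *s) {
    return atomic_fetch_add(&s->write_seq, 1) + 1;
}

/* True once no CPU is still inside a read section that began before goal. */
static bool toy_smr_poll(toy_smr *s, uint_fast64_t goal) {
    for (int i = 0; i < TOY_NCPU; i++) {
        uint_fast64_t seq = atomic_load(&s->read_seq[i]);
        if (seq != 0 && seq < goal)
            return false;
    }
    return true;
}

/* Writer: block until every reader that could still hold the unlinked
 * pointer has left its read section. Only then is reclamation safe. */
static void toy_smr_synchronize(toy_smr *s) {
    uint_fast64_t goal = toy_smr_advance(s);
    while (!toy_smr_poll(s, goal)) {
        /* spin; a real kernel would relax or yield the CPU here */
    }
}
```

The ordering that matters is the one the post describes for kerext_process_destroy: unlink first, synchronize second, and mutate or recycle the memory only after synchronize returns. Readers that enter after the advance observe the newer sequence and never delay the writer.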
kprintf re-entrancy. A synchronous trap inside a kprintf call body
would re-enter the trap handler, which itself calls kprintf. The inner
kprintf_lock() would spin forever on the slock the outer call already
held — a deadlock in the diagnostic path, which is the worst possible
place for a deadlock. Fixed via a per-CPU in_kprintf[] nesting counter:
the outermost caller acquires kprintf_slock; nested callers fall through
and interleave their output. Acceptable; the alternative is silence.
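A minimal sketch of the nesting-counter idea follows. It assumes the caller cannot migrate to another CPU between enter and exit, and all toy_* names are hypothetical stand-ins rather than QRV's actual kprintf code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define TOY_NCPU 8

static atomic_bool toy_kprintf_slock;   /* stand-in for kprintf_slock */
static int toy_in_kprintf[TOY_NCPU];    /* per-CPU nesting depth */

/* Returns true if this call took the lock, i.e. it is the outermost
 * kprintf on this CPU. The counter is bumped BEFORE the lock is taken:
 * a trap that hits while we spin or hold the lock then sees depth > 0
 * and skips the lock instead of deadlocking on it. */
static bool toy_kprintf_enter(int cpu) {
    assert(cpu >= 0 && cpu < TOY_NCPU);
    if (toy_in_kprintf[cpu]++ > 0)
        return false;                   /* nested: print without the lock */
    while (atomic_exchange_explicit(&toy_kprintf_slock, true,
                                    memory_order_acquire)) {
    }
    return true;
}

/* Only the outermost caller, the one that took the lock, releases it. */
static void toy_kprintf_exit(int cpu) {
    assert(cpu >= 0 && cpu < TOY_NCPU);
    assert(toy_in_kprintf[cpu] > 0);
    if (--toy_in_kprintf[cpu] == 0)
        atomic_store_explicit(&toy_kprintf_slock, false,
                              memory_order_release);
}
```

The nested caller's output is unserialized against other CPUs, which is the interleaving the post accepts in exchange for never losing a diagnostic.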
SBI DBCN console. The legacy sbi_ecall_console_putc was a single-byte
ECALL. Concurrent writers interleaved at byte granularity — [4] !UFLT!sc=c<binary>a)<binary>X... was a real crash log output,
not a display artifact. Switched to SBI v2.0 DBCN_CONSOLE_WRITE, which
writes an entire buffer in one ECALL, atomic at OpenSBI's M-mode handler.
OpenSBI v1.6+ is now a hard requirement.
Four IPC kercalls BKL-less. After all the preparation: all four main IPC kercalls now run without holding the BKL. The races that the preparation closed:
- A LINKPRIL_REM store-to-NULL in ker_msg_sendv when two CPUs probed the same thp from receive_queue in the gap between the probe and the START_SMP_XFER claim.
- The same shape in ker_msg_replyv. Both fixed by claiming thp via QRV_ITF_MSG_DELIVERY atomically under channel_slock at the peek, holding the claim through LINKPRIL_REM, clearing it inside the channel lock window after removal.
- WAIT_PENDING clear in ker_msg_receivev using a compiler barrier instead of cpu_smp_mb(): correct for x86 (total store order), wrong for RISC-V (RVWMO permits store-store reordering). Replaced with fence rw, rw.
v0.42: The Lock-Free Era
v0.42 is the squash of the full bkl-removal branch onto main. The
per-commit history is preserved on the branch; the headline is what
actually works.
The Big Kernel Lock is gone. Not reduced, not bypassed on a subset of kercalls — gone. The architecture that replaced it:
Fine-grained per-object locking: sync_slock, sched_slock,
channel_slock, per-process vec_slock, per-channel chp->lock,
per-connect cop->lock, intrevent_slock, clock_slock, alloc_slock.
Each lock protects exactly its own subsystem's state. The lock-rank ordering
enforced by klock_debug prevents deadlock by construction.
BKL-free message passing: MsgSend, MsgReceive, MsgReply,
MsgSendPulse all run BKL-less. The channel queues are protected by
channel_slock; the WAIT_PENDING flag bridges the transition window
between queue insertion and state change; MSG_DELIVERY claims prevent
concurrent wakers from touching the same thread simultaneously.
SMR-retired tThread / tConnect / tChannel: the Safe Memory Reclamation
mechanism from v0.41 now covers process lookups. A thread can hold a
reference to a tProcess across a context switch; the destroy path
synchronizes before reclaiming, so no dangling pointer is possible.
Chain-keyed sync (FreeBSD umtx-style): the sync hash is keyed by
(memobj, offset), partitioned per process. Priority-inheritance walks are
best-effort under lock-free contention — correctness is preserved, priority
boost quality degrades gracefully under high contention.
Self-reap thread/process teardown: dying threads and processes clean up their own resources rather than delegating to a separate terminator thread. The terminator-deadlock class that plagued early versions is structurally impossible.
Per-CPU ready queues: the scheduler dispatch path no longer touches a global queue on every context switch.
Atomic tThread.internal_flags: the final race, found at v0.42-rc12.
The final race. 8 CPUs, 300 pidin processes spawning and exiting in a
loop. 15 boots clean. One stalls: the heartbeat goes silent, gdb shows a
sender spinning forever at ker_fastmsg.c:445 on
thp->internal_flags & QRV_ITF_WAIT_PENDING.
The value of internal_flags in the stuck thread: 0x21. That is
MSG_DELIVERY (bit 0, set by the sender as a claim) plus
WAIT_PENDING (bit 5, which should have been cleared by the receiver).
What happened: the receiver read internal_flags as 0x20, computed
flags & ~0x20, and was about to store 0x00. The sender concurrently read
0x20 (before the receiver's clear), computed flags | 0x01, and stored 0x21.
The receiver's store of 0x00 arrived first; the sender's store of 0x21 landed
second, resurrecting the exact bit the sender was spinning on. A classic
lost update: one word, two harts, no atomic.
The BKL had been providing the atomicity that the plain |=/&=~ relied on.
Removing it didn't introduce the race. It revealed it.
Fix: ITF_SET/ITF_CLR macros wrapping atomic_or/atomic_clr (RISC-V
amoor.w/amoand.w). 55 RMW sites converted across ker_fastmsg.c,
ker_message.c, ker_net.c, ker_channel.c, and the nano subsystems.
Read-only tests (& bit) remain plain aligned loads.
Result: 16/16 -P8 boots reach PASS: 300 pidins + login:, zero stalls.
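The interleaving can be replayed deterministically on a single thread, which makes the lost update easy to see. The sketch below uses C11 atomics as a stand-in for the ITF_SET/ITF_CLR macros; the function names and the ITF_* constants are illustrative (the bit positions are the ones given above), and this is not the code from ker_fastmsg.c.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define ITF_MSG_DELIVERY 0x01u  /* bit 0: sender's claim */
#define ITF_WAIT_PENDING 0x20u  /* bit 5: cleared by the receiver */

/* Plain |= and &= ~ split into load, compute, store, then interleaved
 * the way the two harts interleaved them. */
static uint32_t replay_plain_rmw(void) {
    uint32_t flags = ITF_WAIT_PENDING;

    uint32_t receiver_tmp = flags & ~ITF_WAIT_PENDING; /* receiver: load, compute */
    uint32_t sender_tmp = flags | ITF_MSG_DELIVERY;    /* sender: load, compute */

    flags = receiver_tmp; /* receiver's store lands first */
    flags = sender_tmp;   /* sender's store overwrites it with a stale view */
    return flags;
}

/* The same two updates as indivisible read-modify-write operations.
 * Whichever order they execute in, neither can overwrite the other. */
static uint32_t replay_atomic_rmw(int sender_first) {
    _Atomic uint32_t flags = ITF_WAIT_PENDING;

    if (sender_first) {
        atomic_fetch_or(&flags, ITF_MSG_DELIVERY);
        atomic_fetch_and(&flags, ~ITF_WAIT_PENDING);
    } else {
        atomic_fetch_and(&flags, ~ITF_WAIT_PENDING);
        atomic_fetch_or(&flags, ITF_MSG_DELIVERY);
    }
    return atomic_load(&flags);
}
```

replay_plain_rmw() ends with WAIT_PENDING set again, which is the value found in the stuck thread; replay_atomic_rmw() ends with only MSG_DELIVERY set in either order. On RISC-V targets with the A extension, compilers typically lower these fetch-or and fetch-and calls to AMO instructions.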
Taskman in U-mode on real hardware. Three deterministic fixes to get U-mode taskman running on the SiFive FU740:
The modpkg_elf.c ELF loader built segment and stack PTEs without PTE_A
and PTE_D. The U74 uses trap-based A/D management: it raises a page fault on
the first access to a PTE with PTE_A clear, and on the first store to one
with PTE_D clear. QEMU updates these bits itself during the page-table walk,
so the bug was invisible there. U-mode taskman faulted on its very first
instruction fetch on real silicon. Fix: set PTE_A on every U-mode leaf
PTE and PTE_D on writable ones (5 sites in the loader), mirroring the
startup MMU code that had the same fix applied in v0.21.
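As a sketch of what pre-setting the bits looks like, the helper below builds a U-mode leaf PTE with PTE_A always set and PTE_D set on writable mappings. The flag bit positions and the PPN shift are the Sv39 values from the RISC-V privileged specification; the function name and signature are hypothetical and do not come from modpkg_elf.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Sv39 leaf PTE flag bits (RISC-V privileged specification). */
#define PTE_V (1ull << 0)  /* valid */
#define PTE_R (1ull << 1)  /* readable */
#define PTE_W (1ull << 2)  /* writable */
#define PTE_X (1ull << 3)  /* executable */
#define PTE_U (1ull << 4)  /* accessible from U-mode */
#define PTE_A (1ull << 6)  /* accessed */
#define PTE_D (1ull << 7)  /* dirty */
#define PTE_PPN_SHIFT 10

/* Build a U-mode leaf PTE for a 4 KiB page. A is always pre-set and D is
 * pre-set on writable pages, so an implementation that traps on clear
 * A/D bits never faults just to have software set them. */
static uint64_t toy_user_leaf_pte(uint64_t paddr, bool writable,
                                  bool executable) {
    assert((paddr & 0xfffull) == 0);    /* must be page-aligned */

    uint64_t pte = ((paddr >> 12) << PTE_PPN_SHIFT)
                 | PTE_V | PTE_U | PTE_R | PTE_A;
    if (writable)
        pte |= PTE_W | PTE_D;           /* W without R is reserved, R is set above */
    if (executable)
        pte |= PTE_X;
    return pte;
}
```

A text segment would use toy_user_leaf_pte(pa, false, true) and a stack page toy_user_leaf_pte(pa, true, false); both come back with PTE_A set, and only the stack page carries PTE_D.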
The rsrcdbmgr HWINFO bug: reserve_ranges() was passing
hwi_location.addrspace (a HWI_ADDRSPACE_* kind enum: 0=io, 1=memory)
into get_as_type(), which expected a byte offset into the asinfo section.
A kind of 1 dereferenced a phantom entry at asinfo_base+1; the garbage
offset fed a wild strcmp. On QEMU the garbage at base+1 happened to be
a valid string offset. On real hardware it faulted. Every UART and PCI
device range was silently skipped in rsrcdbmgr; the fix makes them
correctly reserved as MEMORY resources.
A dbgprintf PLT relocation: dbgprintf was in the loader's ELF as a
PLT stub (an imported symbol). During early U-mode taskman boot, before
the dynamic linker had resolved the PLT, any call to dbgprintf jumped
into uninitialized memory. Made it self-contained — no PLT, direct
reference, same pattern as every other early-boot diagnostic function.
After all three: U-mode taskman on the Unmatched boots through the syscall
conformance suite (89/89), enumerates the Samsung NVMe, mounts /usr from
the QRV partition, and reaches the login prompt.
v0.43: The Final Lesson
v0.43 is QRV's last release. It closes a bug that had been open since the U-mode taskman work began on real hardware, and it closes it with a lesson worth writing down.
"The crash that wore many faces"
After v0.42 was working on the Unmatched, a stress test was run: 30
consecutive pidin spawns and exits. The results were consistent: clean
for a few rounds, then a fault of some kind. But not the same fault.
Sometimes an illegal instruction. Sometimes a load page fault at a user
address. Sometimes the process jumped to what looked like garbage code.
Sometimes sepc pointed into the middle of a function, with no plausible
call path to get there.
The initial hypothesis was memory corruption: a write escaping one address space and landing in another's code pages. This was a reasonable guess, since the system had been through a complex U-mode migration and the page table code had had bugs before. The Porting Story chapter documenting the bug hunt describes the investigation in detail, including the places it confidently looked and didn't find the cause.
The actual cause was simpler and more fundamental.
RISC-V I-cache Coherence
RISC-V does not guarantee coherence between data stores and instruction
fetch. After writing bytes to memory that will later be executed, software
must explicitly synchronize the instruction stream with those stores. The
relevant instruction is fence.i, and it only covers the hart that executes
it: every other hart that may run the new code needs its own fence.i.
QRV was not doing this. When the ELF loader wrote a program's instructions into freshly allocated pages, it performed no cache maintenance. The pages were clean in the data cache; the instruction cache on each hart still held lines from whatever had previously occupied those physical pages. A newly spawned process — running on a hart that had recently executed something else in those pages — would fetch stale instructions, branch to garbage addresses, and produce exactly the random-looking fault signatures that had been appearing.
The reason this was invisible on QEMU: QEMU models no instruction cache. Every fetch sees the current memory contents.
The reason -P 1 was clean: with a single hart, the loader and the new
process share the same hart. The fence.i implicit in the context switch
(which flushes the pipeline) is sufficient for single-hart coherence.
The reason the fault appeared "after a few rounds" rather than immediately: page reuse. The allocator returns fresh pages the first time; recycled pages carry the previous tenant's cache lines.
The Fix
cpu_icache_sync_all(): execute a local fence.i and then broadcast an
SBI remote fence.i to all other harts. Called from two sites:
mm_reference when an executable page is initialized, and
pageman_mprotect at the writable→exec transition that every ELF loader
uses.
The call goes through a new kerext __KER_TM_ICACHE_SYNC (since taskman
now runs in U-mode and cannot execute privileged instructions directly).
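The local-plus-remote shape of the fix can be sketched as below. The SBI RFENCE extension ID (0x52464E43) and function ID 0 for remote FENCE.I come from the SBI specification, as does the convention that a hart_mask_base of -1 selects all harts. Everything else is an assumption: the toy_* names, the TOY_S_MODE build flag that gates the real ecall, and the counters, which exist only so the sketch can be exercised on a hosted build. It is not QRV's cpu_icache_sync_all.

```c
#include <assert.h>

#define SBI_EXT_RFENCE            0x52464E43L /* "RFNC" */
#define SBI_RFENCE_REMOTE_FENCE_I 0L

static int toy_local_syncs;   /* instrumentation for the sketch only */
static int toy_remote_syncs;

/* Synchronize this hart's instruction fetch with its prior stores. */
static void toy_local_fence_i(void) {
#if defined(__riscv)
    __asm__ volatile ("fence.i" ::: "memory");
#else
    __asm__ volatile ("" ::: "memory"); /* hosted fallback: compiler barrier */
#endif
    toy_local_syncs++;
}

/* Ask the SBI firmware to run fence.i on every other hart. The ecall is
 * only meaningful from S-mode, so it is compiled in only when the
 * hypothetical TOY_S_MODE flag is defined. Returns the SBI error code. */
static long toy_sbi_remote_fence_i_all(void) {
    toy_remote_syncs++;
#if defined(__riscv) && defined(TOY_S_MODE)
    register long a0 __asm__("a0") = 0;   /* hart_mask, ignored when base is -1 */
    register long a1 __asm__("a1") = -1;  /* hart_mask_base = -1: all harts */
    register long a6 __asm__("a6") = SBI_RFENCE_REMOTE_FENCE_I;
    register long a7 __asm__("a7") = SBI_EXT_RFENCE;
    __asm__ volatile ("ecall"
                      : "+r"(a0), "+r"(a1)
                      : "r"(a6), "r"(a7)
                      : "memory");
    return a0;                            /* 0 means SBI_SUCCESS */
#else
    return 0;                             /* hosted fallback: nothing to do */
#endif
}

/* Local fence first, then the broadcast. A failed broadcast would leave
 * stale instructions visible on some hart, so treat it as fatal. */
static void toy_icache_sync_all(void) {
    toy_local_fence_i();
    long rc = toy_sbi_remote_fence_i_all();
    assert(rc == 0);
    (void)rc;
}
```

The important property is that the call happens after the last store into the page and before any hart is allowed to fetch from it, which is why the mprotect transition from writable to executable is a natural hook.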
Result on the Unmatched: ~640 consecutive spawns without a fault. The stress test that previously failed within 30 iterations now runs to completion.
A companion fix: __cpu_membarrier, defined in cpu_inline.h, had been a
bare nop on RISC-V while x86_64 emitted mfence. This was a latent
inconsistency in the portable memory-ordering abstraction — the name implied
a hardware barrier, but RISC-V was getting a compiler barrier only. Fixed to
emit fence rw, rw on RISC-V, consistent with the intent and with the
existing cpu_smp_mb().
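The difference between a compiler barrier and a hardware barrier is easiest to see side by side. The sketch below shows one plausible shape for a portable full barrier plus a publish/consume pair that depends on it; the toy_* names are hypothetical and this is not the contents of cpu_inline.h.

```c
#include <assert.h>
#include <stdatomic.h>

/* Compiler barrier only: stops the compiler from reordering memory
 * accesses across it, but emits no instruction for the CPU. */
static inline void toy_compiler_barrier(void) {
    __asm__ volatile ("" ::: "memory");
}

/* Full hardware barrier: earlier loads and stores are ordered before
 * later ones as observed by other harts/CPUs. */
static inline void toy_cpu_membarrier(void) {
#if defined(__riscv)
    __asm__ volatile ("fence rw, rw" ::: "memory");
#elif defined(__x86_64__)
    __asm__ volatile ("mfence" ::: "memory");
#else
    atomic_thread_fence(memory_order_seq_cst);
#endif
}

static _Atomic int toy_payload;
static _Atomic int toy_ready;

/* Producer: the payload store must be visible before the flag store.
 * With only toy_compiler_barrier() here, a weakly ordered machine may
 * still make the flag visible first. */
static void toy_publish(int value) {
    atomic_store_explicit(&toy_payload, value, memory_order_relaxed);
    toy_cpu_membarrier();
    atomic_store_explicit(&toy_ready, 1, memory_order_relaxed);
}

/* Consumer: wait for the flag, then order the payload load after it. */
static int toy_consume(void) {
    while (!atomic_load_explicit(&toy_ready, memory_order_relaxed)) {
    }
    toy_cpu_membarrier();
    return atomic_load_explicit(&toy_payload, memory_order_relaxed);
}
```

On x86 the two stores in toy_publish stay in order even without a hardware fence, which is how code of this shape can pass years of testing there and still be wrong under RVWMO.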
Closing
This is the last release of QRV. The goals set at the start are met:
- A QNX Neutrino 6.4 microkernel, ported from the original 32-bit ILP32 sources to 64-bit LP64, running on RISC-V.
- SMP with fine-grained locking and no global kernel lock.
- Taskman in U-mode with hardware privilege separation.
- Booting from a real Samsung NVMe drive on a real RISC-V workstation.
- Multi-user interactive login with SHA-512 authentication.
- 89 syscall conformance tests passing on real silicon.
From a blank slate on February 27, 2026 — or from a deeper beginning in
December 2020, or from RadiOS in 1998, depending on where you count the
start — to a running system. The partition that was left empty in July 2021
has been /usr for a while now.
The work QRV began does not stop here. A new project is underway, built on the same architectural foundations, aimed at providing a fully free and open QNX-compatible operating system for RISC-V. Its name will be announced in due course.