System log

In this blog I share my observations, thoughts and experience about computers, linguistics, philosophy and many other things that interest me.

Tuesday, May 26, 2026

QRV v0.41–v0.43: The Lock-Free Era — and a Final Lesson from the Hardware

This is the last development post for QRV. The project set out to port QNX Neutrino 6.4 to RISC-V 64-bit, run it on real hardware, and explore what it would take to bring a clean microkernel architecture to a modern open ISA. Those goals are met. v0.43 is the final release.


v0.41: Intermediate Ground

v0.41 was never really a release — it was a tag on a working branch that happened to have the right shape at a particular moment. Many things didn't work. It served mainly as the foundation for the work that followed, so it gets a brief mention here for completeness.

The headline work in progress at that point: BKL removal on the IPC path. The sync kercalls (MsgSend, MsgReceive, MsgReply, MsgSendPulse) are the kernel's most frequently executed code — the entire message-passing architecture flows through them. Getting them out from under the Big Kernel Lock required a different class of care than the sync primitives that Phase 1 had covered.

The infrastructure assembled during the v0.41 sprint:

Per-object locks. tChannel grew channel_slock, tConnect grew connect_slock, and each tProcess got a vec_slock for its fdcons/chancons vector operations. A klock_debug that previously tracked lock classes was extended to track (class, instance) pairs — necessary once a single kercall legitimately touches two processes (ConnectAttach is the canonical case: client prpc and server prps) and needs to acquire both their vector locks without AB/BA deadlock. Same-instance recursive acquires remain forbidden; same-class different-instance acquires are now legal.
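One common way to take two locks of the same class without AB/BA deadlock is to acquire them in ascending address order. The post does not say which ordering rule QRV uses, so the following is only a sketch of that general technique; demo_slock_t and the demo_* helpers are hypothetical stand-ins, not QRV's kspinlock_t API.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical spinlock used only for this illustration. */
typedef struct { atomic_flag held; } demo_slock_t;

void demo_lock(demo_slock_t *l)
{
    while (atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire))
        ; /* spin until the holder releases */
}

void demo_unlock(demo_slock_t *l)
{
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}

/*
 * Lock two distinct instances of the same class in ascending address
 * order, so every CPU agrees on the order regardless of which process
 * is "client" and which is "server". Returns the lock taken first.
 */
demo_slock_t *demo_lock_pair(demo_slock_t *a, demo_slock_t *b)
{
    assert(a != b); /* same-instance recursion stays forbidden */
    if ((uintptr_t)a > (uintptr_t)b) {
        demo_slock_t *tmp = a;
        a = b;
        b = tmp;
    }
    demo_lock(a);
    demo_lock(b);
    return a;
}

void demo_unlock_pair(demo_slock_t *a, demo_slock_t *b)
{
    demo_unlock(a);
    demo_unlock(b);
}
```

Because the order depends only on the two addresses, a caller passing (client, server) and another passing (server, client) still acquire the locks in the same sequence.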

Safe Memory Reclamation (SMR). A minimal port of FreeBSD's smr(9) (BSD-2-Clause, Jeffrey Roberson). Per-CPU sequence counters, lock-free read sections via smr_enter/smr_exit, smr_synchronize to drain in-flight readers before reclaiming memory. Applied to kerext_process_destroy: after vector_rem removes the process from process_vector, smr_synchronize(&process_smr) waits for every in-flight reader that captured a prp pointer before the removal. Only then does the destroy path mutate or recycle prp's memory. Five kercalls (CLOCK_TIME, CLOCK_ID, SCHED_INFO, CONNECT_SERVER_INFO, CONNECT_FLAGS) converted to use lookup_pid_smr and open read sections around every prp dereference.
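The sequence-counter idea can be modelled in a few lines of C11. This is a deliberately simplified sketch, not FreeBSD's smr(9) code and not QRV's port: the demo_* names, the explicit cpu argument, and the "0 means not in a read section" convention are all assumptions made for the illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_NCPU 4

struct demo_smr {
    _Atomic uint64_t write_seq;          /* advanced by writers */
    _Atomic uint64_t cpu_seq[DEMO_NCPU]; /* 0 = not in a read section */
};

void demo_smr_init(struct demo_smr *s)
{
    atomic_init(&s->write_seq, 1);
    for (int i = 0; i < DEMO_NCPU; i++)
        atomic_init(&s->cpu_seq[i], 0);
}

/* Reader: publish the sequence observed on entry. */
void demo_smr_enter(struct demo_smr *s, int cpu)
{
    atomic_store(&s->cpu_seq[cpu], atomic_load(&s->write_seq));
}

void demo_smr_exit(struct demo_smr *s, int cpu)
{
    atomic_store_explicit(&s->cpu_seq[cpu], 0, memory_order_release);
}

/* Writer: call after unlinking the object; returns the goal sequence. */
uint64_t demo_smr_advance(struct demo_smr *s)
{
    return atomic_fetch_add(&s->write_seq, 1) + 1;
}

/* True once no reader that entered before the goal is still active. */
bool demo_smr_poll(struct demo_smr *s, uint64_t goal)
{
    for (int i = 0; i < DEMO_NCPU; i++) {
        uint64_t seq = atomic_load(&s->cpu_seq[i]);
        if (seq != 0 && seq < goal)
            return false;
    }
    return true;
}

/* Block until every pre-existing reader has left its read section. */
void demo_smr_synchronize(struct demo_smr *s)
{
    uint64_t goal = demo_smr_advance(s);
    while (!demo_smr_poll(s, goal))
        ; /* spin; a kernel would relax or yield here */
}
```

A destroy path in this model would unlink the object first, call demo_smr_synchronize, and only then recycle the memory. Readers that enter after the advance publish a sequence at or above the goal and therefore do not hold the writer back.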

kprintf re-entrancy. A synchronous trap inside a kprintf call body would re-enter the trap handler, which itself calls kprintf. The inner kprintf_lock() would spin forever on the slock the outer call already held — a deadlock in the diagnostic path, which is the worst possible place for a deadlock. Fixed via a per-CPU in_kprintf[] nesting counter: the outermost caller acquires kprintf_slock; nested callers fall through and interleave their output. Acceptable; the alternative is silence.
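The nesting-counter scheme can be sketched as follows. The names are hypothetical (the post only names in_kprintf[] and kprintf_slock), and the CPU index is passed explicitly here because the sketch runs in user space.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define DEMO_NCPU 8

static atomic_flag demo_kprintf_slock = ATOMIC_FLAG_INIT;
static int demo_in_kprintf[DEMO_NCPU]; /* per-CPU nesting depth */

/* Returns true if this call took the lock (outermost call on this CPU). */
bool demo_kprintf_lock(int cpu)
{
    if (demo_in_kprintf[cpu]++ > 0)
        return false; /* nested (trap inside kprintf): do not spin */
    while (atomic_flag_test_and_set_explicit(&demo_kprintf_slock,
                                             memory_order_acquire))
        ; /* another CPU is printing */
    return true;
}

void demo_kprintf_unlock(int cpu)
{
    assert(demo_in_kprintf[cpu] > 0);
    if (--demo_in_kprintf[cpu] == 0)
        atomic_flag_clear_explicit(&demo_kprintf_slock,
                                   memory_order_release);
}
```

Only the outermost unlock releases the spinlock, so a nested call can never deadlock against its own CPU; the cost is that nested output may interleave with the outer message.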

SBI DBCN console. The legacy sbi_ecall_console_putc was a single-byte ECALL. Concurrent writers interleaved at byte granularity — [4] !UFLT!sc=c<binary>a)<binary>X... was a real crash log output, not a display artifact. Switched to SBI v2.0 DBCN_CONSOLE_WRITE, which writes an entire buffer in one ECALL, atomic at OpenSBI's M-mode handler. OpenSBI v1.6+ is now a hard requirement.

Four IPC kercalls BKL-less. After all the preparation: all four main IPC kercalls now run without holding the BKL. The races that the preparation closed:

  • A LINKPRIL_REM store-to-NULL in ker_msg_sendv when two CPUs probed the same thp from receive_queue in the gap between the probe and the START_SMP_XFER claim.
  • The same shape in ker_msg_replyv. Both fixed by claiming thp via QRV_ITF_MSG_DELIVERY atomically under channel_slock at the peek, holding the claim through LINKPRIL_REM, clearing it inside the channel lock window after removal.
  • WAIT_PENDING clear in ker_msg_receivev using a compiler barrier instead of cpu_smp_mb() — correct for x86 (total store order), wrong for RISC-V (RVWMO permits store-store reordering). Replaced with fence rw, rw.

v0.42: The Lock-Free Era

v0.42 is the squash of the full bkl-removal branch onto main. The per-commit history is preserved on the branch; the headline is what actually works.

The Big Kernel Lock is gone. Not reduced, not bypassed on a subset of kercalls — gone. The architecture that replaced it:

Fine-grained per-object locking: sync_slock, sched_slock, channel_slock, per-process vec_slock, per-channel chp->lock, per-connect cop->lock, intrevent_slock, clock_slock, alloc_slock. Each lock protects exactly its own subsystem's state. The lock-rank ordering enforced by klock_debug prevents deadlock by construction.

Lock-free message passing: MsgSend, MsgReceive, MsgReply, MsgSendPulse all run BKL-less. The channel queues are protected by channel_slock; the WAIT_PENDING flag bridges the transition window between queue insertion and state change; MSG_DELIVERY claims prevent concurrent wakers from touching the same thread simultaneously.

SMR-retired tThread / tConnect / tChannel: the Safe Memory Reclamation mechanism from v0.41 now covers process lookups. A thread can hold a reference to a tProcess across a context switch; the destroy path synchronizes before reclaiming, so no dangling pointer is possible.

Chain-keyed sync (FreeBSD umtx-style): the sync hash is keyed by (memobj, offset), partitioned per process. Priority-inheritance walks are best-effort under lock-free contention — correctness is preserved, priority boost quality degrades gracefully under high contention.

Self-reap thread/process teardown: dying threads and processes clean up their own resources rather than delegating to a separate terminator thread. The terminator-deadlock class that plagued early versions is structurally impossible.

Per-CPU ready queues: the scheduler dispatch path no longer touches a global queue on every context switch.

Atomic tThread.internal_flags: the final race, found at v0.42-rc12.

The final race. 8 CPUs, 300 pidin processes spawning and exiting in a loop. 15 boots clean. One stalls: the heartbeat goes silent, gdb shows a sender spinning forever at ker_fastmsg.c:445 on thp->internal_flags & QRV_ITF_WAIT_PENDING.

The value of internal_flags in the stuck thread: 0x21. That is MSG_DELIVERY (bit 0, set by the sender as a claim) plus WAIT_PENDING (bit 5, which should have been cleared by the receiver).

What happened: the receiver read internal_flags (0x20), computed flags & ~0x20, and was about to store 0x00. The sender concurrently read 0x20 (before the receiver's clear), computed flags | 0x01, and stored 0x21. The receiver's store of 0x00 arrived first; the sender's store of 0x21 landed second, resurrecting the exact bit the sender was spinning on. A classic lost update: one word, two harts, no atomic.

The BKL had been providing the atomicity that the plain |=/&=~ relied on. Removing it didn't introduce the race. It revealed it.

Fix: ITF_SET/ITF_CLR macros wrapping atomic_or/atomic_clr (RISC-V amoor.w/amoand.w). 55 RMW sites converted across ker_fastmsg.c, ker_message.c, ker_net.c, ker_channel.c, and the nano subsystems. Read-only tests (& bit) remain plain aligned loads.
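The lost update and its fix can be replayed deterministically in user space. This is a sketch using C11 atomics in place of QRV's atomic_or/atomic_clr; the DEMO_* macro names and the choice of memory order are assumptions, and the "racy" function simply performs the two plain read-modify-write sequences in the interleaving described above.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define DEMO_ITF_MSG_DELIVERY 0x01u /* bit 0 */
#define DEMO_ITF_WAIT_PENDING 0x20u /* bit 5 */

/* Atomic set/clear: each is a single indivisible read-modify-write. */
#define DEMO_ITF_SET(p, bits) \
    atomic_fetch_or_explicit((p), (uint32_t)(bits), memory_order_acq_rel)
#define DEMO_ITF_CLR(p, bits) \
    atomic_fetch_and_explicit((p), ~(uint32_t)(bits), memory_order_acq_rel)

/* Plain |= and &= ~ split into load/compute/store, interleaved badly. */
uint32_t demo_replay_plain(void)
{
    uint32_t flags = DEMO_ITF_WAIT_PENDING;

    uint32_t rx = flags;                  /* receiver loads */
    uint32_t tx = flags;                  /* sender loads the same value */
    flags = rx & ~DEMO_ITF_WAIT_PENDING;  /* receiver's store lands first */
    flags = tx | DEMO_ITF_MSG_DELIVERY;   /* sender's store brings bit 5 back */
    return flags;
}

/* Same two logical operations, each done as one atomic RMW. */
uint32_t demo_replay_atomic(void)
{
    _Atomic uint32_t flags = DEMO_ITF_WAIT_PENDING;

    DEMO_ITF_CLR(&flags, DEMO_ITF_WAIT_PENDING);
    DEMO_ITF_SET(&flags, DEMO_ITF_MSG_DELIVERY);
    return atomic_load(&flags);
}
```

With the atomic versions the result is 0x01 whichever operation runs first, because neither can overwrite a bit it did not name.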

Result: 16/16 -P8 boots reach PASS: 300 pidins + login:, zero stalls.

Taskman in U-mode on real hardware. Three deterministic fixes to get U-mode taskman running on the SiFive FU740:

The modpkg_elf.c ELF loader built segment and stack PTEs without PTE_A and PTE_D. The U74 uses trap-based A/D management: it raises a page fault on the first access to a PTE with PTE_A clear (or on the first store with PTE_D clear) instead of setting the bit itself. QEMU's emulated MMU updates these bits automatically, so the bug was invisible there. U-mode taskman faulted on its very first instruction fetch on real silicon. Fix: set PTE_A on every U-mode leaf PTE and PTE_D on writable ones (5 sites in the loader), mirroring the startup MMU code that had the same fix applied in v0.21.
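A minimal sketch of the rule "PTE_A on every U-mode leaf, PTE_D on writable ones". The flag bit positions are the standard Sv39 ones from the RISC-V privileged specification; the helper name and its boolean parameters are hypothetical, not the loader's actual interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Sv39 PTE flag bits (RISC-V privileged specification). */
#define PTE_V (1ull << 0) /* valid */
#define PTE_R (1ull << 1) /* readable */
#define PTE_W (1ull << 2) /* writable */
#define PTE_X (1ull << 3) /* executable */
#define PTE_U (1ull << 4) /* accessible from U-mode */
#define PTE_A (1ull << 6) /* accessed */
#define PTE_D (1ull << 7) /* dirty */

/*
 * Build a user leaf PTE for a 4 KiB page at physical address pa.
 * A is always pre-set and D is pre-set for writable pages, so hardware
 * with trap-based A/D management never faults just to set them.
 */
uint64_t demo_user_leaf_pte(uint64_t pa, bool r, bool w, bool x)
{
    uint64_t pte = ((pa >> 12) << 10) | PTE_V | PTE_U | PTE_A;

    if (r)
        pte |= PTE_R;
    if (w)
        pte |= PTE_W | PTE_D;
    if (x)
        pte |= PTE_X;
    return pte;
}
```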

The rsrcdbmgr HWINFO bug: reserve_ranges() was passing hwi_location.addrspace (a HWI_ADDRSPACE_* kind enum: 0=io, 1=memory) into get_as_type(), which expected a byte offset into the asinfo section. A kind of 1 dereferenced a phantom entry at asinfo_base+1; the garbage offset fed a wild strcmp. On QEMU the garbage at base+1 happened to be a valid string offset. On real hardware it faulted. Every UART and PCI device range was silently skipped in rsrcdbmgr; the fix makes them correctly reserved as MEMORY resources.

A dbgprintf PLT relocation: dbgprintf was in the loader's ELF as a PLT stub (an imported symbol). During early U-mode taskman boot, before the dynamic linker had resolved the PLT, any call to dbgprintf jumped into uninitialized memory. Made it self-contained — no PLT, direct reference, same pattern as every other early-boot diagnostic function.

After all three: U-mode taskman on the Unmatched boots through the syscall conformance suite (89/89), enumerates the Samsung NVMe, mounts /usr from the QRV partition, and reaches the login prompt.


v0.43: The Final Lesson

v0.43 is QRV's last release. It closes a bug that had been open since the U-mode taskman work began on real hardware, and it closes it with a lesson worth writing down.

"The crash that wore many faces"

After v0.42 was working on the Unmatched, a stress test was run: 30 consecutive pidin spawns and exits. The results were consistent: clean for a few rounds, then a fault of some kind. But not the same fault. Sometimes an illegal instruction. Sometimes a load page fault at a user address. Sometimes the process jumped to what looked like garbage code. Sometimes sepc pointed into the middle of a function, with no plausible call path to get there.

The initial hypothesis was memory corruption — a write escaping one address space and landing in another's code pages. This was a reasonable guess: the system had been through a complex U-mode migration, the page table code had bugs before. The Porting Story chapter documenting the bug hunt described the investigation in detail, including the places it confidently looked and didn't find the cause.

The actual cause was simpler and more fundamental.

RISC-V I-cache Coherence

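RISC-V provides no coherence between data stores and instruction fetch. When you write bytes to a memory region and then execute them, you must explicitly flush them from the data cache and invalidate the instruction cache. The relevant instruction is fence.i, which orders instruction fetch after prior stores only on the hart that executes it; every other hart that may run the new code needs its own fence.i.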

QRV was not doing this. When the ELF loader wrote a program's instructions into freshly allocated pages, it performed no cache maintenance. The pages were clean in the data cache; the instruction cache on each hart still held lines from whatever had previously occupied those physical pages. A newly spawned process — running on a hart that had recently executed something else in those pages — would fetch stale instructions, branch to garbage addresses, and produce exactly the random-looking fault signatures that had been appearing.

The reason this was invisible on QEMU: QEMU models no instruction cache. Every fetch sees the current memory contents.

The reason -P 1 was clean: with a single hart, the loader and the new process share the same hart. The fence.i implicit in the context switch (which flushes the pipeline) is sufficient for single-hart coherence.

The reason the fault appeared "after a few rounds" rather than immediately: page reuse. The allocator returns fresh pages the first time; recycled pages carry the previous tenant's cache lines.

The Fix

cpu_icache_sync_all(): execute a local fence.i and then broadcast an SBI remote fence.i to all other harts. Called from two sites: mm_reference when an executable page is initialized, and pageman_mprotect at the writable→exec transition that every ELF loader uses.

The call goes through a new kerext __KER_TM_ICACHE_SYNC (since taskman now runs in U-mode and cannot execute privileged instructions directly).
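The shape of such a "local fence.i plus remote broadcast" helper might look like the sketch below. The remote step is passed in as a function pointer so the example runs on any host; in a kernel it would be a wrapper around the SBI RFENCE remote fence.i call, which takes a hart mask and a mask base. All demo_* names are hypothetical, and the recording hook exists only so the sketch can be exercised in user space.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for an SBI remote fence.i wrapper: (hart_mask, hart_mask_base). */
typedef void (*demo_remote_fence_i_fn)(uint64_t hart_mask,
                                       uint64_t hart_mask_base);

static inline void demo_local_fence_i(void)
{
#if defined(__riscv)
    __asm__ volatile ("fence.i" ::: "memory");
#else
    __asm__ volatile ("" ::: "memory"); /* placeholder on other hosts */
#endif
}

/* Synchronize instruction fetch on this hart, then on all the others. */
void demo_icache_sync_all(unsigned self, unsigned nharts,
                          demo_remote_fence_i_fn remote)
{
    uint64_t all, others;

    assert(nharts >= 1 && nharts <= 64 && self < nharts);
    all = (nharts == 64) ? ~0ull : ((1ull << nharts) - 1);
    others = all & ~(1ull << self);

    demo_local_fence_i();
    if (others != 0)
        remote(others, 0);
}

/* Test double: records what would have been sent to the firmware. */
static uint64_t demo_last_mask;
static unsigned demo_remote_calls;

void demo_record_remote(uint64_t hart_mask, uint64_t hart_mask_base)
{
    (void)hart_mask_base;
    demo_last_mask = hart_mask;
    demo_remote_calls++;
}
```

On a single-hart configuration the mask of other harts is empty and no remote call is made.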

Result on the Unmatched: ~640 consecutive spawns without a fault. The stress test that previously failed within 30 iterations now runs to completion.

A companion fix: __cpu_membarrier, defined in cpu_inline.h, had been a bare nop on RISC-V while x86_64 emitted mfence. This was a latent inconsistency in the portable memory-ordering abstraction — the name implied a hardware barrier, but RISC-V was getting a compiler barrier only. Fixed to emit fence rw, rw on RISC-V, consistent with the intent and with the existing cpu_smp_mb().
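The difference between a compiler barrier and a hardware barrier can be sketched like this. The function names are hypothetical; on RISC-V the hardware barrier emits fence rw, rw as described above, and on other hosts the sketch falls back to a C11 sequentially consistent fence so it still builds and runs.

```c
#include <assert.h>
#include <stdatomic.h>

/* Stops the compiler from reordering, emits no instruction. */
static inline void demo_compiler_barrier(void)
{
    __asm__ volatile ("" ::: "memory");
}

/* Orders earlier loads/stores before later loads/stores in hardware. */
static inline void demo_cpu_membarrier(void)
{
#if defined(__riscv)
    __asm__ volatile ("fence rw, rw" ::: "memory");
#else
    atomic_thread_fence(memory_order_seq_cst);
#endif
}

static int demo_payload;
static atomic_int demo_ready;

/* Producer: the payload must be visible before the ready flag. */
void demo_publish(int value)
{
    demo_payload = value;
    demo_cpu_membarrier();
    atomic_store_explicit(&demo_ready, 1, memory_order_relaxed);
}

/* Consumer: returns -1 until the flag is seen, then the payload. */
int demo_consume(void)
{
    if (!atomic_load_explicit(&demo_ready, memory_order_relaxed))
        return -1;
    demo_cpu_membarrier();
    return demo_payload;
}
```

On a weakly ordered machine, replacing demo_cpu_membarrier with demo_compiler_barrier in demo_publish would allow another hart to observe the flag before the payload.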


Closing

This is the last release of QRV. The goals set at the start are met:

A QNX Neutrino 6.4 microkernel, ported from the original 32-bit ILP32 sources to 64-bit LP64, running on RISC-V. SMP with fine-grained locking and no global kernel lock. Taskman in U-mode with hardware privilege separation. Booting from a real Samsung NVMe drive on a real RISC-V workstation. Multi-user interactive login with SHA-512 authentication. 89 syscall conformance tests passing on real silicon.

From a blank slate on February 27, 2026 — or from a deeper beginning in December 2020, or from RadiOS in 1998, depending on where you count the start — to a running system. The partition that was left empty in July 2021 has been /usr for a while now.

The work QRV began does not stop here. A new project is underway, built on the same architectural foundations, aimed at providing a fully free and open QNX-compatible operating system for RISC-V. Its name will be announced in due course.

Wednesday, May 06, 2026

QRV v0.34–v0.40: Taskman Moves to User Mode

Since the project began, taskman has run in S-mode — the RISC-V supervisor mode, shared with the kernel. It was kerlinked into the kernel address space at boot, called kernel functions by name, and reached kernel data structures directly. This was inherited from QNX's procnto-as-kernel-module model and it worked, but it was architecturally wrong: there is no meaningful trust boundary between the kernel and taskman when both run in the same address space under the same privilege level.

v0.40 completes the migration of taskman to U-mode. Taskman now runs as an ordinary user-space process at address 0xC0000000, with its own page tables, communicating with the kernel exclusively through the syscall_tm_priv mechanism. The kernel heap has PTE_U=0. A bug in taskman can no longer corrupt kernel data by accident.

This took about a week of intensive work spread across versions 0.34 through 0.40. Here is how it went.


v0.34–v0.35 — Groundwork: Relocating Bodies, Retiring __Ring0

The first problem was naming. The historical __Ring0(func_ptr, arg) primitive — kernel calls a function pointer supplied by taskman, executed with kernel privilege — was an x86 concept with x86 terminology. RISC-V has no rings. And beyond the name, the mechanism was architecturally dangerous: the kernel jumped to whatever address taskman supplied, with no whitelist beyond a capability flag.

syscall_tm_priv replaces __Ring0. The new mechanism: <sys/kercalls.h> defines a __KER_TM_* sub-namespace of 53 enum constants. syscall_tm_priv(id, arg) is a plain ecall that dispatches through a static kernel-owned table (ker_tm_priv.c), gated on QRV_FLG_PROC_TM_PRIV. The table is a whitelist. A _Static_assert catches enum/table drift at compile time. Every EXPORT_SYMBOL(kerext_*) entry was removed from the kernel's symbol table, so the kerext bodies are truly kernel-internal now.
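A whitelist dispatch of this shape can be sketched in a few lines. The enum members, handlers, flag value, and error codes below are invented for the illustration; only the overall pattern (enum-indexed static table, privilege gate, _Static_assert against drift) follows the description above.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical operation IDs; the last member counts the entries. */
enum demo_tm_id { DEMO_TM_GET_ANSWER, DEMO_TM_ADD_ONE, DEMO_TM_COUNT };

typedef long (*demo_tm_handler)(uintptr_t arg);

static long demo_get_answer(uintptr_t arg) { (void)arg; return 42; }
static long demo_add_one(uintptr_t arg)    { return (long)arg + 1; }

/* Kernel-owned table: only functions listed here can ever be called. */
static const demo_tm_handler demo_tm_table[] = {
    [DEMO_TM_GET_ANSWER] = demo_get_answer,
    [DEMO_TM_ADD_ONE]    = demo_add_one,
};

/* Fails the build if an ID is appended without a matching table slot. */
_Static_assert(sizeof demo_tm_table / sizeof demo_tm_table[0] == DEMO_TM_COUNT,
               "enum/table drift");

#define DEMO_FLG_PROC_TM_PRIV 0x1u /* hypothetical privilege flag */

long demo_syscall_tm_priv(unsigned proc_flags, unsigned id, uintptr_t arg)
{
    if (!(proc_flags & DEMO_FLG_PROC_TM_PRIV))
        return -EPERM;  /* caller is not the privileged process */
    if (id >= DEMO_TM_COUNT || demo_tm_table[id] == NULL)
        return -ENOSYS; /* not on the whitelist */
    return demo_tm_table[id](arg);
}
```

The caller supplies only an index, never an address, which is the property that the function-pointer style of __Ring0 lacked.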

This rename was a tree-wide sweep: ~90 call sites, one sed pass.

Phase 0a–0c: kerext bodies moved to kernel. Before taskman text can be marked PTE_U=1, every function that the kernel calls through a function pointer must live in kernel text. S-mode instruction fetch from PTE_U=1 pages is illegal regardless of SUM. Three phases relocated the bodies:

  • Phase 0a: kerext_reparent, SignalKill, and the five exit-path bodies (kerext_pulse_deliver, exit_destroy_threads, channel_destroy, timer_destroy, connect_detach) moved to kernel/kext/.
  • Phase 0b: sysaddr_map, ext_vaddrinfo, ker_manipulate moved via a cross-boundary helper table — the bodies called back into taskman helpers (do_manipulation, pte_map) that still lived in taskman text.
  • Phase 0c: The entire pa allocator cluster (kerext_pa_alloc, _alloc_given, _free, _free_info) moved alongside a new kerext_mm_xboundary table registering 10 function pointers to taskman-side helpers.

Phase 0 ends with the condition: every __Ring0 callback body lives in kernel text. The SPP-flip preconditions are in place.

Phase 1a–1c: arch helpers moved to kernel. pageman_aspace — called on every context switch via memmgr_p->aspace — moved to kernel/arch/riscv/pageman_aspace.c. The Sv39 PTE engine (cpu_pte_manipulate, cpu_pte_merge, cpu_pte_split, sv39_walk, prot_to_pte_bits) moved to kernel/arch/riscv/cpu_pte.c. cpu_sysvaddr_find and cpu_pageman_vaddrinfo moved alongside. A kx_* indirection that was now pointing kernel symbol to kernel symbol was collapsed to direct calls.


v0.35 — pa.c Migrates to the Kernel

The physical-quantum allocator had always been semantically wrong in taskman: physical memory belongs in the kernel by every microkernel convention. After Phase 0c moved the kerext bodies, moving the allocator itself followed naturally.

kernel/kext/pa.c (~1900 lines) now carries all of it: the global state (blk_head[], mem_free_size, mem_reserved_size, restriction chain, quantum pool), the pure helpers (pa_carve, _pa_free, pa_quantum_to_paddr, enqueue_run, dequeue_run), init code, and six new __KER_TM_PA_* slots.

Taskman's pa.c was replaced with a thin syscall-wrapper file (~380 lines). The parts that stayed taskman-side: pa_alloc_fake / pa_free_fake (fake quantum allocator using taskman heap), the high-level pa_alloc retry/purger loop, and pa_free_info's restriction-filter walk.

Six kernel-data globals that taskman had been reaching into directly (mem_free_size, blk_head, etc.) now live on the kernel side and are accessed only through kerext calls. Net leak from kernel data into taskman binary: zero.


v0.36 — kernel/mm/ Promoted to First Class

The kernel needed its own permanent memory manager — one that lives in S-mode kernel text and is callable without indirecting through taskman. Pre-v0.36 the kernel had emm.c ("early memory manager"), a bootstrap stub that handed off to taskman's pageman table at startup. After the U-mode flip, memmgr_p->FOO calls could no longer land in taskman text.

kernel/emm.c → kernel/mm/mm.c, symbols emm_* → mm_*. init_memmgr wires the full table (mmap, munmap, vaddr_to_memobj, vaddrinfo, mcreate, mdestroy, aspace, pagesize) to kernel-resident symbols:

  • mm_vaddrinfo (new): read-only Sv39 walk returning the leaf PTE, using cpu_pte_lookup. Correct and conservative; no memobj/mm_map walk.
  • mm_mcreate / mm_mdestroy (new): process-aspace setup via cpu_pgdir_create. Allocates pgdir + tAddress. Deliberately omits the mm_map list, rwlock, rlimit — taskman owns that bookkeeping.

kerext_register_memmgr and kerext_register_procmgr defanged to no-ops. The kernel never indirects through taskman text again.


v0.37–v0.38 — First U-Mode Boot and Its Bugs

With the groundwork in place, the Phase 4 work began: actually running taskman in U-mode. This was not a clean single commit — it was a debugging session across multiple evenings, each commit fixing the current crash and revealing the next one.

Phase 4a (v0.34): The predicate for S→U-mode thread entry changed from !QRV_FLG_PROC_RING0 to !QRV_FLG_PROC_LOADING. The distinction matters: taskman has RING0 set permanently (it's a privileged process), but its pool threads should run U-mode. Loader threads — which run in the child process's context during spawn — should stay S-mode until ProcessStartup promotes them.

Phase 4b: Marking taskman text PTE_U=1 in kerlink. This made taskman reachable from U-mode but also made it readable from every user process (the kernel L2 is shared). Acknowledged as a transitional step; Phase 5's ET_DYN taskman in its own address space fixes the isolation.

The first boot to message_start: Eight orthogonal fixes in one commit that together carried taskman through all init phases in U-mode: PTE_U=1 on the kernel heap, SUM set early, a kerext for reading satp from U-mode, CPIO ramdisk PA → high VA conversion, rlimits initialised for taskman, PROC_LOADING cleared after the first taskman thread, pool stacks pre-allocated (96 × 8 KiB via pageman_mmap before thread_pool_start), and aspace_prp = prp set uniformly for all pool threads.

main() must not return (v0.38). Taskman is now a real ET_EXEC. When taskman_main returns, libc's crt0 calls exit() → _exit() → _PROC_EXIT → procmgr → ProcessThreadsDestroy → tear down all of taskman's threads. Fix: pthread_exit(NULL) instead of return.

The NOZOMBIE race (v0.38). A subtle spawn-ordering bug: every process is created with QRV_FLG_PROC_NOZOMBIE set. Before the U-mode flip, proc_start cleared it for normal spawns. After the flip, kerext_register_procmgr is a no-op, so procmgr_p->process_start is NULL, and NOZOMBIE stays set permanently. Every parent's waitpid returns ECHILD immediately. The init script became fire-and-forget instead of synchronous. Fix: replicate the NOZOMBIE clearing in proc_loader.c.

mm_map_xfer wired (v0.37). memmgr_p->map_xfer had been NULL since the kernel/mm/ promotion — left as a stub with an explicit "deferred" comment. The first time a pool thread's msg_replyv had a cross-aspace destination, the kernel called NULL. The fault chain was interesting: the NULL jalr → PC=0 → xfer fault recovery path → _longjmp into a stale jmp_buf whose bytes held kprintf output → PC = 0xffffffc0000a3036 → second instruction page fault → "XFER BUG" banner. mm_map_xfer is now a proper Sv39 walk.

LP64 intrinfo truncation (v0.38). init_intrinfo runs before switch_to_high_va. GCC under -mcmodel=medany resolves symbol addresses PC-relative at the PA, not the linked VA. The stored mask/unmask function pointers were PA values (0x80207b88) instead of kernel VAs (0xffffffc080207b88). Every subsequent ilp->info->unmask() call jumped to low-half user space, which isn't mapped. Fix: KVA(sym) — OR the symbol address with KERNEL_VIRT_BASE.
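The fix-up is a single OR, which the two addresses quoted above make easy to check. The base value below is inferred from those two addresses and is an assumption about QRV's KERNEL_VIRT_BASE; the DEMO_ prefix marks the macros as illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed kernel base, inferred from 0x80207b88 -> 0xffffffc080207b88. */
#define DEMO_KERNEL_VIRT_BASE 0xffffffc000000000ull

/* Convert an early-boot physical symbol address to its kernel VA. */
#define DEMO_KVA(addr) ((uint64_t)(addr) | DEMO_KERNEL_VIRT_BASE)

uint64_t demo_fixup_handler(uint64_t early_addr)
{
    return DEMO_KVA(early_addr);
}
```

Because the operation is an OR rather than an addition, applying it to an address that is already in the kernel half leaves it unchanged.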


v0.39–v0.40 — Phase 5: The Real Architecture

Phase 5 was the fundamental ABI redesign. All of the Phase 4 work had taskman running in U-mode while still touching kernel heap data directly (the heap was PTE_U=1). Phase 5 flips the heap to PTE_U=0 and builds the full structure to make U-mode taskman correct.

tTMprocess: taskman-side process state. The kernel-allocated tProcess contains kernel-managed fields. But it also carried POSIX-policy fields (pgrp, sid, session, umask, root, cwd, guardian, siginfo, events, resource lists, conf table) that belong in taskman. Under PTE_U=0, taskman can no longer dereference a tProcess * for any of these.

The solution: taskman/include/tm_process.h defines tTMprocess — a fixed-size BSS array of TM_PROCESS_MAX slots indexed by PINDEX(pid). Every POSIX-policy field migrated from kernel-half tProcess to tTMprocess. The locking state for the address space (rwlock_lock, fault_owner) migrated from tAddress to tTMprocess. The kernel retains only what it genuinely owns: scheduling state, IPC vectors, trap/fault state, credentials.

KERNEL_INTERNAL struct gating. Every kernel-private struct body (tProcess, tThread, tChannel, tConnect, and 15 others) is now gated behind #ifdef KERNEL_INTERNAL. Kernel sources define KERNEL_INTERNAL and see full struct bodies. Taskman sees only the typedef pointer-to-incomplete. Any prp->X deref in taskman code is now a compile error — by design.

pa_pq accessors. struct pa_quantum lives on the kernel heap (PTE_U=0). Direct field access from U-mode taskman faults. The new pa_pq_flags() / pa_pq_blk() / pa_pq_run() / pa_pq_modify_flags() inline wrappers route through __KER_TM_PA_PQ_OP.

80 TM_PRIV slots. The dispatch table grew from 53 to 80 entries during Phase 5. 25 new slots cover everything taskman needs to do that involves kernel data: PA_ALLOC_FULL, PA_PQ_OP, ASPACE_MEMCPY, KPAGE_ZERO, REGISTER_POOL_STACK, PROC_QUERY, PROC_SET_SESSION, CHANNEL_HAS_RECEIVERS, MMAP_PHYS_USER, MMAP_ANON_USER, and more.

taskman linked as ET_EXEC at 0xC0000000. The old ET_DYN-against-libc.qrl build required libc.qrl to live in the kernel heap (PTE_U=0 after Phase 5 → taskman can't reach it). The new build: -nostdlib -static -Ttext-segment=0xC0000000, libc.a inside the link group. Every libc symbol now lives inside taskman's own image. No DT_NEEDED, no PLT/GOT for libc.

0xC0000000 is Sv39 L2[3]. cpu_pgdir_create copies only L2[1] and L2[256..511] into every user pgdir. L2[3] stays zero in every user process — taskman's image is invisible to other processes.
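The slot numbers follow directly from the Sv39 address layout: the top-level index is bits 38..30 of the virtual address (VPN[2] in the privileged specification, "L2" in this post). A small sketch with a hypothetical helper name:

```c
#include <assert.h>
#include <stdint.h>

/* Top-level Sv39 page-table index: VA bits 38..30, 512 slots of 1 GiB. */
unsigned demo_sv39_l2_index(uint64_t va)
{
    return (unsigned)((va >> 30) & 0x1ff);
}
```

For taskman's link address, 0xC0000000 >> 30 is 3, so the image occupies slot 3. An address at the start of the upper half, such as 0xffffffc000000000, yields slot 256, the first of the 256..511 range that is copied into every user page directory.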

The loader thread problem. During spawn, a loader thread runs in the child process's context but needs to fetch taskman text. The solution: mm_taskman_view_install(child_prp) temporarily copies taskman's L1 page pointer into the child's pgdir at L2[3]. ProcessStartup calls mm_taskman_view_remove(child_prp) before the first U-mode sret into the child — at that point the child is fully isolated.

Pool stacks registered by taskman. The kernel's procmgr_stack_alloc fallback to memmgr_p->mmap was removed — the kernel heap is PTE_U=0, unusable as a user thread stack. Taskman pre-allocates 96 user-VA stacks via TaskmanMmapAnonUser and pushes each into the kernel's free list via __KER_TM_REGISTER_POOL_STACK. The kernel contributes only the free-list structure; taskman contributes the memory.

826 → 0 warnings. The final commit of the cycle drove the warning count from 826 to zero. 12 were genuine QRV-proper issues fixed at the root. 814 were -Wconversion in the mksh fork — suppressed at the Makefile level with an explicit rationale: the mksh fork is diverging and will eventually be renamed qsh; fixing 800 upstream conversion sites would cost effort that belongs in the architectural work instead.


What This Means

Taskman is now an ordinary user-space process. It communicates with the kernel through a defined, whitelist-gated ecall interface. A bug in taskman — a corrupted pointer, an out-of-bounds write — cannot reach kernel data. The KERNEL_INTERNAL guard makes this a compile-time property, not just a runtime one: the compiler rejects any attempt to dereference a kernel struct from taskman code.

The BKL work and the U-mode work ran on sequential branches. They are now merged into a single tree. The system boots, logs in, and runs on real hardware with both changes in place.

Friday, May 01, 2026

QRV v0.28–v0.33: Breaking Up the Big Kernel Lock

Two ambitious project branches emerged from the v0.27 milestone. The first is removing the Big Kernel Lock. The second — lifting taskman to user mode — gets its own post. This one covers v0.28 through v0.33: the foundation work, the first measurements, and the first kercall to run without holding the BKL.


v0.28 — Phase 0: Infrastructure and Baselines

Before any lock can be removed, you need to know what you're removing. Phase 0 landed two things: the structural plumbing for fine-grained locking, and the instrumentation to measure the current BKL cost.

Per-CPU infrastructure. kernel/include/percpu.h introduces DECLARE_PER_CPU / DEFINE_PER_CPU / this_cpu_{ptr,read,write,inc,dec} — thin macros over plain [PROCESSORS_MAX] arrays indexed by RISC-V hart ID. Alongside it: kernel/include/klock.h defining kspinlock_t, a lock-class rank ordering (KLOCK_CLASS_SCHED, KLOCK_CLASS_SYNC, KLOCK_CLASS_CHANNEL, etc.) for a future lockdep, and a per-CPU preempt_count word.
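Macros of this kind can be very thin. The sketch below shows one plausible shape over plain arrays; the DEMO_/demo_ names, the array-naming scheme, and the way the current hart is obtained are assumptions, since the post does not show percpu.h itself.

```c
#include <assert.h>

#define DEMO_PROCESSORS_MAX 8

/* One array slot per hart; the variable name is mangled to avoid clashes. */
#define DEMO_DEFINE_PER_CPU(type, name) \
    type demo_percpu_##name[DEMO_PROCESSORS_MAX]

/* Stand-in for reading the hart ID (a kernel would keep it in a register). */
static unsigned demo_current_hart;

void demo_set_hart(unsigned hart)
{
    assert(hart < DEMO_PROCESSORS_MAX);
    demo_current_hart = hart;
}

#define demo_this_cpu_ptr(name)      (&demo_percpu_##name[demo_current_hart])
#define demo_this_cpu_read(name)     (*demo_this_cpu_ptr(name))
#define demo_this_cpu_write(name, v) (*demo_this_cpu_ptr(name) = (v))
#define demo_this_cpu_inc(name)      (++*demo_this_cpu_ptr(name))
#define demo_this_cpu_dec(name)      (--*demo_this_cpu_ptr(name))

/* Example per-CPU variable, loosely modelled on a preempt counter. */
DEMO_DEFINE_PER_CPU(int, preempt_count);

void demo_preempt_disable(void) { demo_this_cpu_inc(preempt_count); }
void demo_preempt_enable(void)  { demo_this_cpu_dec(preempt_count); }
int  demo_preempt_depth(void)   { return demo_this_cpu_read(preempt_count); }
```

Each hart only ever touches its own slot, so these accessors need no locking as long as the caller cannot migrate between the index computation and the access.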

am_inkernel() triage. The twelve call sites of am_inkernel() turned out to express two different things. bkl_held() means this CPU holds the BKL. in_kernel_context() means this CPU is in a kernel critical section of any kind. Today both have the same body — but once Phase 1 introduces per-subsystem locks they will genuinely diverge.

bkl-stress and the first baseline. A synthetic stresser that spawns one pthread per hart and hammers one of three workloads: mutex (every lock/unlock hits SyncMutexLock/SyncMutexUnlock), sleep (1 ms nanosleep loops), or nop (pure user-mode, control). Results from a five-second QEMU -smp 4 run:

boot baseline:              34.6 % BKL contention rate
bkl-stress sleep 5s:        58.6 % contention, spread across 4 harts
bkl-stress mutex 5s:        71 %  contention, ~2.17 M iter/s

Those numbers are the target. Phase 1 will quote against them.

BKL instrumentation. kernel/bkl_stats.c adds per-CPU counters (acquires, contended acquires, spin cycles, held cycles, log2 histogram of spin durations) around acquire_kernel / release_kernel, all gated behind CONFIG_BKL_STATS (default off). The taskman sysfs grew a synth() callback so /sys/bkl can snapshot the kernel counters at any open. When the config is off, the hot path is byte-identical to v0.27.
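A log2 histogram of spin durations needs only a bucket index per sample. The struct layout and function names below are hypothetical and not taken from bkl_stats.c; the sketch just shows the counting scheme.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_SPIN_BUCKETS 64

/* One instance per CPU in a real kernel; a single one suffices here. */
struct demo_bkl_stats {
    uint64_t acquires;
    uint64_t contended;
    uint64_t spin_cycles;
    uint64_t spin_hist[DEMO_SPIN_BUCKETS]; /* bucket b: 2^b <= spin < 2^(b+1) */
};

/* floor(log2(cycles)) for cycles >= 1; returns 0 for 0 as well. */
unsigned demo_log2_bucket(uint64_t cycles)
{
    unsigned bucket = 0;

    while (cycles >>= 1)
        bucket++;
    return bucket;
}

/* Record one lock acquisition that spun for spin cycles (0 = uncontended). */
void demo_record_acquire(struct demo_bkl_stats *s, uint64_t spin)
{
    s->acquires++;
    if (spin == 0)
        return;
    s->contended++;
    s->spin_cycles += spin;
    s->spin_hist[demo_log2_bucket(spin)]++;
}
```

The contention rate reported in the measurements is then contended divided by acquires, and the histogram shows where the spin durations cluster.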

First measured baseline on a fresh boot: ~28,800 BKL acquisitions across four harts, 34.6 % contention rate, spin distribution peaking in the 2^14–2^17 cycle range. That is roughly 9.8 kcyc per BKL critical section on average.


v0.29 — Phase 1.1–1.2: Fine-Grained Locks Declared

Two structural locks declared, neither yet contended — that is the deliberate pattern.

sync_slock (Phase 1.1) wraps every read and write of sync_hash and the tSync chains hanging off it. Every sync kercall still holds the BKL around the entire operation, so sync_slock is uncontended and idempotent today. It exists to tag every sync-hash touchpoint so that Phase 1.5 can lift the BKL from sync kercalls without re-auditing every call site. One notable exception: synchash_rem was left BKL-only, because its inner unlock_kernel/lock_kernel cycle to safely probe a user-space sync->owner makes a naive spinlock wrap unbounded.

sched_slock (Phase 1.2) declared alongside sync_slock. The scheduler primitives (force_ready, ready_default, block, unready) compose more aggressively than the sync primitives — force_ready calls ready, unready calls block, block calls select_thread via FIND_HIGHEST — and ready_default alone has five early-return paths. Threading a lock through all of that without knowing exactly which paths will run BKL-less is premature. The lock is there; the wrapping waits for Phase 1.5.

CONFIG_KLOCK_DEBUG. An embryonic lockdep: kspinlock_t gains a class field; kspinlock_lock checks that the new class ranks strictly above every class on the per-CPU held-stack. On violation: kprintf + crash(). Default off; when on, a bkl-stress run over Phase 1.1 produces zero violations.
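The rank check itself is a short loop over the held-stack. In this sketch the class values and the stack depth are made up (the post only establishes that SCHED ranks below SYNC), and a violation returns false where the kernel would kprintf and crash.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical rank values; a higher number means "acquire later". */
enum demo_klock_class {
    DEMO_CLASS_SCHED = 1,
    DEMO_CLASS_SYNC = 2,
    DEMO_CLASS_CHANNEL = 3
};

#define DEMO_HELD_MAX 8

/* Per-CPU stack of the classes currently held. */
struct demo_held_stack {
    int depth;
    enum demo_klock_class cls[DEMO_HELD_MAX];
};

/* A new lock must rank strictly above every class already held. */
bool demo_klock_may_acquire(const struct demo_held_stack *hs,
                            enum demo_klock_class c)
{
    for (int i = 0; i < hs->depth; i++)
        if (c <= hs->cls[i])
            return false;
    return true;
}

/* Returns false on a rank violation or a full stack. */
bool demo_klock_push(struct demo_held_stack *hs, enum demo_klock_class c)
{
    if (hs->depth == DEMO_HELD_MAX || !demo_klock_may_acquire(hs, c))
        return false;
    hs->cls[hs->depth++] = c;
    return true;
}

void demo_klock_pop(struct demo_held_stack *hs)
{
    assert(hs->depth > 0);
    hs->depth--;
}
```

Taking SCHED and then SYNC passes; taking SYNC and then SCHED is rejected, and so is taking the same class twice.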


v0.30 — Phase 1.5a–e: First BKL Lift

The distance between "lock declared" and "BKL removed from a kercall" is a series of preparatory steps that each need to be correct before the next one is safe.

P1.5a. synchash_rem wrapped in sync_slock, dropping the lock around the user-space sync->owner probe. The prerequisite that had been deferred from Phase 1.1.

P1.5b. sync_wakeup, mutex_holdlist_add, and mutex_holdlist_rem now take sync_slock internally around their wait-list mutations. They drop the lock across ready() calls — SCHED < SYNC in the rank order, so calling a scheduler primitive with a sync lock held would be a rank inversion.

P1.5c. ready_default and adjust_priority_default wrapped in sched_slock. Five exit paths, drop-and-retake around calls that recurse.

P1.5d. force_ready's STATE_MUTEX block and adjust_priority's STATE_CONDVAR/SEM/MUTEX branches now take sync_slock around their wait-list mutations. A _locked variant of mutex_holdlist_rem avoids recursive acquire.

P1.5e — the first lift. __KER_SYNC_MUTEX_UNLOCK marked in kercall_no_bkl. __kercall_dispatch calls release_kernel() after entry-time bookkeeping, runs the handler, and re-acquires before the tail. The body of sync_mutex_unlock was restructured: *owp = 0 stores switched to _smp_xchg_ul(owp, ...) (amoswap.d.aqrl) for acquire+release semantics; every wait-list read under sync_slock.

A bug caught in passing: atomic_set_64(owp, value) emits amoor.d (bit-set, not assign). ORing zero into *owp is a no-op, so the mutex was never released and the system hung. The correct primitive is _smp_xchg_ul. This discovery also led to a tree-wide rename: atomic_set → atomic_or everywhere, to match actual semantics.
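The difference between the two primitives is easy to reproduce with C11 atomics. This sketch uses atomic_fetch_or and atomic_exchange as stand-ins for the kernel's OR and swap operations; the demo_* wrappers are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* OR semantics: can only turn bits on, so OR-ing 0 changes nothing. */
uint64_t demo_atomic_or_64(_Atomic uint64_t *p, uint64_t bits)
{
    return atomic_fetch_or_explicit(p, bits, memory_order_acq_rel);
}

/* Exchange semantics: unconditionally replaces the value, returns the old. */
uint64_t demo_xchg_64(_Atomic uint64_t *p, uint64_t value)
{
    return atomic_exchange_explicit(p, value, memory_order_acq_rel);
}
```

Trying to release an owner word with demo_atomic_or_64(&owner, 0) leaves the old owner in place, while demo_xchg_64(&owner, 0) clears it and hands back the previous value.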

Results from P1.5e:


Metric              P1.5d (BKL)      P1.5e (lifted)
Acquires            1.94 M           1.42 M
Contended           1.55 M (80 %)    1.35 M (95 %)
Spin cycles         8.51 Gcyc        4.18 Gcyc (−51 %)
Held cycles         23.5 Gcyc        12.25 Gcyc (−48 %)
Held cyc/acquire    12.1 kcyc        8.6 kcyc (−29 %)

Half the BKL spin time and half the BKL held time. That is the first real measurement of what BKL removal is worth.


v0.30 (continued) — Phases 1.5f–m: All Sync Kercalls Lifted

The remaining six sync kercalls followed in sequence.

P1.5f lifted __KER_SYNC_CONDVAR_SIGNAL. Body is a single sync_wakeup call, which was already self-locking since P1.5b. Straightforward.

P1.5g wrapped block, unready, and block_and_ready in sched_slock. These three are the primitives the upcoming condvar/sem/mutex lifts need.

P1.5h wrapped the standalone pril_add/pril_rem call sites in ker_sync.c in sync_slock. Prerequisite for the block-path lifts.

P1.5i introduced QRV_ITF_WAIT_PENDING — a per-thread flag set during the transition window between adding a thread to a wait list and completing the unready. Wakers on other CPUs spin-wait until the flag clears. This is the primitive that makes the remaining lifts safe: without it, a thread can be on the wait list while still STATE_RUNNING, and a concurrent ready() would CRASHCHECK.

P1.5j–k lifted sem_post and sem_wait. One bug surfaced in sem_wait: block() (called inside unready) resets thp->blocked_on to NULL in its default case. The first draft of the body assigned blocked_on = syp before unready, so the assignment got clobbered. Fix: assign after unready, with WAIT_PENDING bridging the window.

P1.5l lifted condvar_wait. The coupling between kmutex_t and condvar was the hard part: kmutex_t never enters sync_hash (libc's cmpxchg fast-path short-circuits the kernel), so sync_lookup's auto-init failed after a wakeup on a system without the BKL. Fix: force-register the bound mutex in sync_hash from the condvar_wait body before releasing it.

P1.5m lifted mutex_lock and mutex_revive — the last two. The priority-inheritance walk through mutex_lock is the most complex body in the sync subsystem. It stays "best-effort" under contention: stale reads in the walk degrade priority inheritance but don't break correctness. The canonical waiter pattern from P1.5i applies here too.

Hardware fences. An important correction mid-lift: the inline __asm__ volatile("" ::: "memory") barriers in the WAIT_PENDING handshake were compiler-only. RVWMO permits store-store and load-store reordering at hardware level. Replaced with cpu_smp_mb() — a portable abstraction defined per-arch as fence rw, rw on RISC-V and mfence on x86_64.
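A reduced model of the WAIT_PENDING handshake, written with C11 atomics, shows where the full barriers sit. Everything here is a sketch under assumptions: the struct, the state names, and the function names are hypothetical, atomic_thread_fence(memory_order_seq_cst) stands in for cpu_smp_mb(), and the real kernel additionally holds its spinlocks around the list operations.

```c
#include <assert.h>
#include <stdatomic.h>

#define ITF_WAIT_PENDING 0x1u   /* stand-in for QRV_ITF_WAIT_PENDING */

enum { ST_RUNNING, ST_BLOCKED, ST_READY };

struct thread {
    atomic_uint flags;
    atomic_int  state;
    atomic_int  on_wait_list;
};

/* Stand-in for cpu_smp_mb(): a full hardware barrier, not only a
 * compiler barrier. */
static inline void smp_mb(void)
{
    atomic_thread_fence(memory_order_seq_cst);
}

/* Waiter, step 1: announce the transition, then appear on the list. */
void waiter_enqueue(struct thread *t)
{
    atomic_fetch_or_explicit(&t->flags, ITF_WAIT_PENDING,
                             memory_order_relaxed);
    smp_mb();   /* flag must be visible before the list entry */
    atomic_store_explicit(&t->on_wait_list, 1, memory_order_relaxed);
}

/* Waiter, step 2: complete the unready, then close the window. */
void waiter_finish(struct thread *t)
{
    atomic_store_explicit(&t->state, ST_BLOCKED, memory_order_relaxed);
    smp_mb();   /* state must be visible before the flag clears */
    atomic_fetch_and_explicit(&t->flags, ~ITF_WAIT_PENDING,
                              memory_order_relaxed);
}

/* Waker: called after finding t on the wait list. Never readies a
 * thread that is still mid-transition. */
void waker_ready(struct thread *t)
{
    smp_mb();   /* order the caller's list read before the flag read */
    while (atomic_load_explicit(&t->flags, memory_order_relaxed)
           & ITF_WAIT_PENDING)
        ;       /* spin until the waiter finishes */
    smp_mb();   /* order the flag read before the state read */
    assert(atomic_load_explicit(&t->state, memory_order_relaxed)
           == ST_BLOCKED);
    atomic_store_explicit(&t->on_wait_list, 0, memory_order_relaxed);
    atomic_store_explicit(&t->state, ST_READY, memory_order_relaxed);
}
```

If smp_mb() were only a compiler barrier, a weakly ordered machine could make the list entry visible before the flag, and the waker would see a thread on the wait list that still looks STATE_RUNNING with no pending flag to wait on.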

Phase 1.5 exits with all seven sync kercalls running BKL-less.


v0.31 — PRIL/LINK3 Alias Corruption

After Phase 1.5 completed, a rare but real crash surfaced during devb-nvme bringup:

*** PRIL CORRUPTION: HEAD prev.prio_tail priority mismatch ***
CRASH: nano/nano_misc.c:328

Root cause: PRIL_ENTRY_FIELDS puts a union at the start of tThread. The same first 16 bytes serve as either a PRIL wait-list (next.pril, prev.prio_tail) or a LINK3 dispatch list (next.thread, prev.thread). A thread cannot legally be on both simultaneously.

The P1.5l-lifted ker_sync_condvar_wait does: pril_add (puts act on wait list), then sync_mutex_unlock (which calls ready(waiter) → ready_default, which in the SMP "can replace this CPU" path calls LINK3_BEG(dispatch_queue, act, tThread), writing act->next.thread and act->prev.thread, the same bytes as the PRIL fields just set). The prev.prio_tail that should equal act (for a HEAD+TAIL single-element list) now holds a stale dispatch-queue backlink.

Fix: in ready_default, before bumping act, check QRV_ITF_WAIT_PENDING. If set, fall through — act will finish its own transition and ker_exit will pick it up. The bump is skipped.
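The aliasing and the guard can be reproduced with a reduced structure. The layout and helper names below are hypothetical stand-ins for tThread, pril_add, LINK3_BEG, and the ready_default bump; only the shape of the problem is taken from the description above.

```c
#include <assert.h>
#include <stddef.h>

#define ITF_WAIT_PENDING 0x1u   /* stand-in for QRV_ITF_WAIT_PENDING */

/* Reduced thread: the first two pointers are shared between the
 * PRIL wait-list view and the LINK3 dispatch-list view. */
struct thread {
    union { struct thread *pril;      struct thread *thread; } next;
    union { struct thread *prio_tail; struct thread *thread; } prev;
    unsigned flags;
};

/* Puts t on an otherwise empty wait list: HEAD and TAIL are both t. */
void pril_add_single(struct thread *t)
{
    t->next.pril = NULL;
    t->prev.prio_tail = t;
}

/* The invariant for a single-element wait list. */
int pril_single_ok(const struct thread *t)
{
    return t->prev.prio_tail == t;
}

/* Inserts t at the head of a dispatch list, in the spirit of LINK3_BEG.
 * This overwrites the same bytes the PRIL view uses. */
void dispatch_push(struct thread **queue, struct thread *t)
{
    t->next.thread = *queue;
    t->prev.thread = NULL;
    if (*queue != NULL)
        (*queue)->prev.thread = t;
    *queue = t;
}

/* The guarded bump: skip threads that are still mid-transition.
 * Returns 1 if t was queued, 0 if the bump was skipped. */
int bump_guarded(struct thread **queue, struct thread *t)
{
    assert(queue != NULL && t != NULL);
    if (t->flags & ITF_WAIT_PENDING)
        return 0;
    dispatch_push(queue, t);
    return 1;
}
```

An unguarded dispatch_push on a thread that sits on a wait list breaks the prio_tail invariant immediately; bump_guarded leaves the wait-list fields intact while the flag is set.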

Verification: 20/20 clean boots, zero PRIL corruption.


v0.32–v0.33 — Channel Lock and Trace Ring

v0.32 began Phase 2 of BKL removal: the IPC path. channel_slock declared (rank between INTREVENT and CONNECT). The wait-list primitives in nano_message.c (remove_unblock, net_send2, net_sendmsg) and the channel-destroy and pulse paths in nano_pulse.c and ker_channel.c now take channel_slock around their queue mutations. All of this is still uncontended — the BKL is still held by every caller. The four target IPC kercalls (MsgSend, MsgSendPulse, MsgReceive, MsgReply) got their queue-op sites wrapped and WAIT_PENDING spin-waits inserted in preparation for the eventual BKL lift.

v0.33 added a lockless per-CPU function-entry trace ring: 32 entries per CPU, a TRACE_FN(id, ctx, arg) macro, and a trace_ring_dump(cpu) called from the crash path. During the BKL hunt this was the tool that revealed how the __KER_MSG_RECEIVEV canonical waiter pattern interacted with the timer-trap-handler's in_kernel_context() routing to leak BSS globals into active thread register save areas. The ring stays in the tree — it will be useful again.
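A per-CPU ring of this kind needs very little code. The sketch below is an assumption-laden illustration, not the QRV implementation: the entry layout, the explicit cpu argument, and the accessor names are hypothetical, and it ignores re-entry from interrupt context. It is lockless only in the sense that each CPU writes its own ring.

```c
#include <assert.h>
#include <stdint.h>

#define TRACE_RING_SIZE 32u   /* entries per CPU, as in the post */
#define TRACE_MAX_CPUS  4u    /* hypothetical CPU count */

struct trace_entry {
    uint32_t  id;
    uintptr_t ctx;
    uintptr_t arg;
};

struct trace_ring {
    struct trace_entry e[TRACE_RING_SIZE];
    uint32_t head;            /* total entries ever written */
};

struct trace_ring trace_rings[TRACE_MAX_CPUS];

/* Records one entry, overwriting the oldest once the ring is full. */
void trace_fn(unsigned cpu, uint32_t id, uintptr_t ctx, uintptr_t arg)
{
    assert(cpu < TRACE_MAX_CPUS);
    struct trace_ring *r = &trace_rings[cpu];
    struct trace_entry *e = &r->e[r->head % TRACE_RING_SIZE];
    e->id = id;
    e->ctx = ctx;
    e->arg = arg;
    r->head++;
}

/* Number of entries currently held for this CPU. */
uint32_t trace_ring_count(unsigned cpu)
{
    assert(cpu < TRACE_MAX_CPUS);
    uint32_t head = trace_rings[cpu].head;
    return head < TRACE_RING_SIZE ? head : TRACE_RING_SIZE;
}

/* Returns the i-th surviving entry, oldest first. */
const struct trace_entry *trace_ring_get(unsigned cpu, uint32_t i)
{
    assert(i < trace_ring_count(cpu));
    const struct trace_ring *r = &trace_rings[cpu];
    uint32_t start = r->head - trace_ring_count(cpu);
    return &r->e[(start + i) % TRACE_RING_SIZE];
}
```

A crash-path dump would walk trace_ring_get(cpu, 0) through trace_ring_get(cpu, count - 1) to print the last 32 function entries in order.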


Where Things Stand

Seven sync kercalls run BKL-less. Phase 2 infrastructure is in place for the IPC path. The BKL is not gone — that was never the claim — but the structural work to remove it section by section is well underway, documented, and measured.

Monday, April 27, 2026

QRV v0.27: Multi-User Login — Two Months

Exactly two months ago today, on February 27, the project restarted.

v0.27 is the milestone that closes that chapter. The system boots on real RISC-V hardware, enumerates a Samsung NVMe, mounts a filesystem, runs a staged init sequence, presents a login prompt, authenticates a user against a shadow password file using SHA-512 crypt(3), drops privileges, and hands off to mksh with the correct working directory and environment. The session below is from the SiFive HiFive Unmatched.

=> run bootqrv

397328 bytes read in 0 ms
4421120 bytes read in 4 ms (1 GiB/s)

Starting kernel ...

Boot command line: -Dsbi mainfs=/dev/nvme0n1p8
Board: SiFive HiFive Unmatched A00
Compatible: sifive,hifive-unmatched-a00
+------------------------------------------+
| QRV Operating System Kernel version 0.27 |
+------------------------------------------+
init_clocks
init_raminfo
init_mmu
  ram_top = 480000000
  prealloc_kernel_l2: filled 239 empty kernel L2 slots
  enable_mmu: Sv39 paging active
init_intrinfo: timer(base=5,n=1) plic(base=32,n=128)
init_cpuinfo: 4 CPUs, rv64imac, 1 MHz
hwinfo: serial sifive,fu740-c000-uart @ 10010000 irq 39
hwinfo: pci sifive,fu740-pcie ecam df0000000/10000000 irq 57 windows 4
...
[2] kerlink: boot/taskman.qkx: resolved 6190/6190 external symbols (0 unresolved)
[2] kerlink: boot/taskman.qkx: applied 10756/10756 relocations (0 skipped)
...
[2] cpu_start_others: hart 1 started
[2] cpu_start_others: hart 3 started
[2] cpu_start_others: hart 4 started
[2] bootimage: spawned /sbin/init (pid 2)

***********************************
* Welcome to QRV Operating System *
***********************************

Starting serial console (devc-sersifive)...
system console set to /dev/ser1

Probing for NVMe...
nvme:   Model:    "SAMSUNG MZVL2512HCJQ-00B00"
nvme:   Serial:   "S675NF0R487649"
nvme: ns 1  capacity=488386 MiB
devb-nvme: GPT OK, 8 partition(s)
  ...
  p6: LBA 144941056..857972735 (348160 MiB) name="Debian"
  p7: LBA 857972736..891527167 (16384 MiB)  name="swap"
  p8: LBA 891527168..895721471 (2048 MiB)   name="QRV"

Mounting /dev/nvme0n1p8 on /usr...
fs-qrv: qrvfs v2, 4096 blocks, 128 inodes
fs-qrv: mounted qrvfs at /usr (dev=/dev/nvme0n1p8)

Sysinit: level1 running.

QRV 0.27 (2026-04-28) on sifive,hifive-unmatched-a00

login: qrvuser
Password:

Welcome to QRV!

[/home/qrvuser]$ echo "Welcome to QRV on SiFive Unmatched!"
Welcome to QRV on SiFive Unmatched!
[/home/qrvuser]$

What This Required

Filesystem namespace flip

The most visible architectural change: the CPIO modpkg now mounts at / instead of /rd/, and the NVMe filesystem mounts at /usr. The init script, all binaries, and drivers live under the cpio root's bin/, sbin/, lib/. Everything that belongs on persistent storage — config, user home directories, extended tools — lives under /usr, contributed by the qrvfs partition on NVMe. /etc resolves to /usr/conf via a path-manager symlink. /home resolves to /usr/home the same way.

This is the Linux initramfs/rootfs split, applied to QRV's architecture. It means the init script can be edited on disk without rebuilding the CPIO image. It means /etc/passwd is just a file on the NVMe partition.

Staged init

The cpio /sbin/init is a minimal mksh script — about 70 lines — whose only job is to get the console up, probe NVMe, and mount /usr. If that fails, it drops to a rescue shell. If it succeeds, it hands off to /usr/sbin/sysinit/level1.sh on the mounted disk.

level1.sh is where the real system initialization lives: pci-server, devb-nvme, fs-qrv, slogger, and finally getty. Because it lives on the NVMe partition, it can be edited, extended, and structured however needed — conditionals, sourced fragments from /usr/conf/sysinit/*.sh, anything mksh can express. A BSD-style rc framework can grow from here without touching the cpio image.

Platform detection in init

The init script no longer starts both UART drivers and lets the wrong one exit silently. It reads /sys/board — a new sysfs file populated from the FDT /compatible property at boot — and picks the correct driver:

read -r BOARD < /sys/board
case "$BOARD" in
    *sifive*) serdrv=devc-sersifive ;;
    *)        serdrv=devc-ser8250   ;;
esac
$serdrv

On QEMU virt: BOARD=riscv-virtio, driver=8250. On the Unmatched: BOARD=sifive,hifive-unmatched-a00, driver=sersifive. One init script, both platforms.

/sys filesystem

A new sysfs resource manager serves /sys — a Linux-style companion to /proc for kernel self-description. Initial entries: /sys/cmdline (the kernel boot command line, verbatim) and /sys/version (version string and build date). /sys/board was added during platform detection work. The kernel snapshots the command line before its in-place token split, so /sys/cmdline always reflects exactly what U-Boot passed.

Canonical-mode line discipline

Both UART drivers (devc-ser8250 and devc-sersifive) now implement a proper POSIX line discipline in ICANON mode: input accumulates in a line buffer, a line is delivered to the reader only on \n, and backspace erases correctly — both ASCII BS (0x08) and DEL (0x7F) are accepted, because different terminal emulators send different bytes. ECHOE echoes \b \b per erase. Without this, login: and Password: prompts couldn't be corrected with backspace before pressing Enter.
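The core of a canonical-mode input path fits in one function. This is a minimal sketch with hypothetical names and a fixed buffer size, not the driver code: it accepts both erase bytes, echoes the three-byte erase sequence, and reports a line as ready only when a newline arrives.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define LINE_CAP 256   /* hypothetical line buffer size */

struct ldisc {
    char   line[LINE_CAP];
    size_t len;
};

/* Feeds one input byte. Writes up to 3 echo bytes to echo_out and
 * stores their count in *echo_n. Returns 1 when a complete line,
 * including the '\n', is available in ld->line[0..ld->len). */
int ldisc_input(struct ldisc *ld, unsigned char c,
                char echo_out[3], size_t *echo_n)
{
    assert(ld != NULL && echo_n != NULL);
    *echo_n = 0;

    if (c == 0x08 || c == 0x7f) {       /* ASCII BS or DEL */
        if (ld->len > 0) {
            ld->len--;
            memcpy(echo_out, "\b \b", 3);  /* back, blank, back */
            *echo_n = 3;
        }
        return 0;
    }
    if (c == '\n') {                    /* one slot is reserved for it */
        ld->line[ld->len++] = '\n';
        echo_out[0] = '\n';
        *echo_n = 1;
        return 1;
    }
    if (ld->len < LINE_CAP - 1) {       /* drop input when full */
        ld->line[ld->len++] = (char)c;
        echo_out[0] = (char)c;
        *echo_n = 1;
    }
    return 0;
}

/* Call after the reader has consumed a delivered line. */
void ldisc_reset(struct ldisc *ld)
{
    ld->len = 0;
}
```

A password prompt would run the same buffering with the echo bytes discarded, which is the effect of clearing ECHO.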

crypt(3) and the full authentication chain

The complete authentication chain is now in place:

  • SHA-512 implementation lifted from FreeBSD (sys/crypto/sha2/sha512c.c, stripped to exactly what crypt(3) needs)
  • SHA-crypt ($6$salt$hash) per Drepper's spec, from FreeBSD's lib/libcrypt/crypt-sha512.c
  • getpass(3) disabling ECHO via tcsetattr for password entry
  • getpwnam / getspnam reading /etc/passwd and /etc/shadow
  • login(1) comparing the entered password against the shadow hash, then calling setgid() + setuid() and exec'ing the user shell

getuid() / getgid() and their effective variants are now real: they call ConnectClientInfo(-1, &info, 0) — a kernel call that returns the calling process's own credential record directly, with no message-passing round-trip and no self-message hazard. This is the openqnx pattern and it is the right one: credentials come from the kernel's own credential store, not from a procmgr query that would deadlock if called from within taskman itself.

Getty and login

/sbin/getty opens the terminal, prints the banner (QRV <version> (<date>) on <board>), reads the login name via fgets, and spawns /bin/login. It self-respawns on shell exit, so a disconnected session automatically presents a new login prompt. /bin/login validates credentials, sets HOME, PATH, SHELL, USER, LOGNAME via setenv(3), changes to the home directory, and exec's the shell.

[/home/qrvuser]$ in the prompt is \w expanded by mksh's prompt preprocessor at every redraw — the canonical directory, through the /home → /usr/home symlink, shown correctly rather than the raw resolved path.


Two Months

February 27 to April 27. In that time:

  • Kernel booted to idle
  • First ecall, first channel, Sv39 paging
  • taskman loaded and running, 300+ symbols resolved
  • Full IPC round-trip, dynamic linking, user-mode shell
  • SMP stabilized
  • PCI server, NVMe driver, GPT partitions
  • Filesystem (qrvfs), mksh, signals, setjmp
  • Multi-user login on a RISC-V workstation with 16 GiB of RAM and a 488 GiB Samsung NVMe

The partition that was left empty in July 2021 is now /usr.

The work continues.

Sunday, April 26, 2026

QRV v0.26: mksh — A Real Shell for a Real System

esh served its purpose. It got QRV to a shell prompt, proved the IPC stack, and ran through two months of bring-up. But it has no variables, no conditionals, no scripting. What QRV has needed for a while — and now has — is a real shell.

v0.26 lands mksh, the MirBSD Korn Shell (R59, 2025-04-26), ported from scratch in a single day. It runs interactively, handles history with cursor keys, shows the current directory in the prompt, and runs sequences of commands without exiting:

MIRBSD KSH QRV-1 (mksh R59 base)
[/]# ls
bin  dev  proc  rd
[/]# cd /disk2/bin
[/disk2/bin]# ls -la
[/disk2/bin]# ./syscall_testing
...
=== Results: 89 tests, 89 passed, 0 failed ===
[/disk2/bin]#

The [/disk2/bin]# prompt is live: mksh re-evaluates $PWD on every prompt redraw. The cd you just ran is reflected immediately.

This matters beyond the convenience. A real shell means the init script that lives in modpkg.cpio — the script that starts every daemon at boot — can now have conditionals, variables, functions, and proper error handling. That is the foundation of a BSD-style rc system: hardware probing, conditional driver loading, different boot scenarios from a single init script. The modest CPIO image suddenly has real expressive power.


What the Port Required

mksh is about 37,000 lines of C. The upstream codebase carries three decades of platform accretion: EBCDIC support, OS/2, z/OS, Windows, AIX, Solaris, multiple editor modes (emacs, vi, gmacs), a German-language boolean type (Wahr/Ja/Nee/isWahr), and a 3,147-line monolithic sh.h that aggregates everything from every platform the shell has ever run on.

None of that goes into QRV. The port started with explicit disciplines: Apache-2.0 relicensing (enabled by the MirOS sublicensing right, with the original MirOS text preserved), K&R formatting, no fork() by design, poll() instead of select() as a QRV project rule, and every line earning its place.

sh.h: from 3,147 lines to 244

The original sh.h was replaced with a 244-line foundation containing only what the .c files actually use. Everything else was extracted into topical headers: env.h, shf.h, cclass.h, var.h, lex.h, proto.h, tree.h, msgs.h. Each domain in its own file, with a one-line rationale for every declaration that remains.

German booleans

931 substitutions, automated:

Wahr   → bool      (296 hits)
Ja     → true      (292)
Nee    → false     (309)
isWahr → ((bool))  (34)

These were fine in mksh. They have no place in QRV.

The line editor

mksh's edit.c is 5,832 lines: emacs mode, vi mode, gmacs mode, kill rings, undo, configurable bind tables, modal editing, vi tab-complete heuristics. QRV needs exactly none of that. A fresh line editor was written from scratch — 553 lines — covering the keystrokes that actually matter:

^A/^E/^B/^F       home / end / back / forward char
^P/^N + arrows    previous / next history
^R                reverse incremental history search
^K/^U/^W          kill-to-eol / kill-line / kill-word-back
^L                redraw
DEL/^H/^D         backspace / delete / EOF
ESC [A/B/C/D      cursor keys

Net: −5,279 lines of code the QRV shell will never need.

No fork()

QRV has no fork() by design. The original exchild() in jobs.c was 135 lines of fork-then-child-or-parent bookkeeping. The replacement is 25 lines: run the AST in this process, update job bookkeeping. External commands go through posix_spawn() + waitpid() at the TEXEC site in exec.c. The shell stays resident throughout.

This was also the source of v0.26's last bug. The fork-removal rewrite unconditionally passed XEXEC to execute(). In upstream mksh, XEXEC means "I'm a forked child — call unwind(LEXIT) when done." With no child, that unwind exited the shell after the first external command. Found and fixed: the test is simple in retrospect — run ls, then run pwd, and check whether the shell is still alive.
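The spawn-and-wait shape that replaces fork can be shown with the standard POSIX calls on any host. The helper name below is hypothetical and this is not the code at the TEXEC site; it only illustrates that the caller stays resident across external commands.

```c
#include <assert.h>
#include <stddef.h>
#include <spawn.h>
#include <sys/types.h>
#include <sys/wait.h>

extern char **environ;

/* Runs argv[0] (looked up via PATH) as a child process and waits for
 * it. Returns the child's exit status, or -1 if it could not be
 * spawned or did not exit normally. */
int run_external(char *const argv[])
{
    pid_t pid;
    int status;

    assert(argv != NULL && argv[0] != NULL);
    if (posix_spawnp(&pid, argv[0], NULL, NULL, argv, environ) != 0)
        return -1;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Calling run_external twice in a row from the same process is the equivalent of the "run ls, then run pwd" check: the second call only happens if the caller survived the first.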

libc foundations

The port surfaced three missing pieces in QRV's libc, all added as part of this release:

setjmp/longjmp. <setjmp.h> had been declared but never implemented. The RISC-V assembly saves callee-saved integer registers (ra, sp, s0–s11), mirroring the layout of the kernel's existing xfer-fault recovery path. Nine new tests: round-trips, the longjmp(env, 0) → 1 POSIX rule, callee-saved values surviving across call boundaries.
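The longjmp(env, 0) rule is simple to check from C. The test below is a hypothetical example in the spirit of those tests, not one of the nine; it uses setjmp only as a switch controlling expression, which is one of the contexts the C standard allows.

```c
#include <assert.h>
#include <setjmp.h>

static jmp_buf env;

/* Jumps back to the setjmp site from a nested call frame. */
static void jump_from_callee(int val)
{
    longjmp(env, val);
}

/* Performs longjmp(env, val) from a callee and reports what setjmp
 * returned on the second return. Only 1 and 42 are distinguished;
 * any other nonzero value is reported as -2. */
int roundtrip(int val)
{
    switch (setjmp(env)) {
    case 0:
        jump_from_callee(val);
        return -1;              /* not reached */
    case 1:
        return 1;
    case 42:
        return 42;
    default:
        return -2;
    }
}
```

roundtrip(0) reports 1, because setjmp must never appear to return 0 after a longjmp; roundtrip(42) reports 42.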

posix_spawn / posix_spawnp. The kernel side (_PROC_POSIX_SPAWN) and the spawnattr_* / file_actions_* helpers already existed. The entry point symbols were simply missing. Adapted from openqnx's reference implementation with the Adaptive Partitioning wire format removed (it was removed from QRV's taskman in an earlier release).

Signal API. POSIX signals implemented on top of QRV's existing pulse model. A per-process signal table in libc; the system thread that every QRV process already has translates _PULSE_CODE_SIGNAL pulses into sigaction-registered handlers. sigaction, sigprocmask, signal, raise, and kill all work. 19 new tests, all passing. raise(SIGUSR1) runs the handler synchronously before returning; kill(getpid(), SIGUSR2) round-trips through taskman → pulse → system thread → handler.

The test suite grew from 61 to 89 tests across this release: 9 for setjmp/longjmp, 19 for signals. All 89 pass.


qrvfs Format v2

One more thing that landed quietly: the filesystem got a format upgrade. qrvfs v1 used 9 direct + 1 single-indirect block, capping file size at ~2 MiB. That was fine for a test image. It is not fine for anything approaching real use.

v2 repartitions the same 10 address slots: 7 direct + single + double + triple indirect. Maximum file size: ~512 GiB. mkfs-qrv was rewritten around a recursive walk_indirect() that allocates index blocks lazily. A 1.1 GiB test file round-trips byte-perfect, exercising the triple indirect path.
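The size limits follow from simple arithmetic. The sketch below assumes 4 KiB blocks (consistent with the v1 boot log, which reports 524288 blocks on the 2 GiB partition) and 8-byte block pointers (inferred from the ~2 MiB v1 cap); neither constant is taken from the qrvfs sources, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define QRVFS_BLOCK_SIZE 4096u   /* assumption: 4 KiB blocks */
#define QRVFS_PTR_SIZE   8u      /* assumption: 8-byte block pointers */

/* Maximum data blocks addressable with `direct` direct slots plus one
 * indirect tree for each level from 1 to `levels`. */
uint64_t qrvfs_max_blocks(unsigned direct, unsigned levels)
{
    uint64_t per_block = QRVFS_BLOCK_SIZE / QRVFS_PTR_SIZE;  /* 512 */
    uint64_t total = direct;
    uint64_t span = 1;

    assert(levels <= 3);
    for (unsigned l = 1; l <= levels; l++) {
        span *= per_block;      /* 512, 512^2, 512^3 */
        total += span;
    }
    return total;
}

/* Maximum file size in bytes for the same layout. */
uint64_t qrvfs_max_bytes(unsigned direct, unsigned levels)
{
    return qrvfs_max_blocks(direct, levels) * QRVFS_BLOCK_SIZE;
}
```

Under these assumptions v1 (9 direct, 1 level) addresses 521 blocks, about 2 MiB, and v2 (7 direct, 3 levels) addresses 134,480,391 blocks, just over 512 GiB. Two levels top out slightly above 1 GiB, which is why a 1.1 GiB file has to reach into the triple-indirect tree.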

v2 superblocks are incompatible with v1. qrvfs_init() rejects v1 images with a clear error.


What Comes Next

Background jobs (posix_spawn with detach rather than waitpid), poll() with real readiness probing (wiring _IO_NOTIFY through the resmgr layer), and the eight libc stubs that currently back the shell's less-used features, getppid(), umask(), ioctl(), and tcflush() among them. Then the init script gains real conditionals, and the rc system starts to take shape.

Saturday, April 25, 2026

QRV v0.25: Booting from Real NVMe on Real Hardware

This is the follow-up to the earlier v0.25 post. Everything described there — the sync_hash fix, the DesignWare iATU, the INTx wiring, the mount command — was validated on QEMU. This post is about what happened when the same build ran on real silicon.

The short version: it works.


How QRV Now Boots on the Unmatched

U-Boot loads the QRV kernel binary and modpkg.cpio directly from the Debian ext4 partition (nvme 0:6) on the Samsung NVMe:

=> ext4load nvme 0:6 0x80200000 /boot/qrv/qrv-kernel.bin
397328 bytes read in 0 ms
=> ext4load nvme 0:6 0x90000000 /boot/qrv/modpkg.cpio
4688384 bytes read in 5 ms (894.2 MiB/s)
=> setenv initrd_size ${filesize}
=> booti 0x80200000 0x90000000:${initrd_size} ${fdtaddr}

QRV lives as a guest on its own machine, loading its boot files from the Debian partition's /boot/qrv/ directory. The QRV filesystem itself is on partition 8 — 2 GiB, type GUID 51525611-322e-4017-bae8-e4d9c9d4e979 — a custom GUID registered for QRV, so partition tools can identify it unambiguously.


The Boot Log

Here is the boot session.

Starting kernel ...

===============================================================================

Boot command line: -Dsbi
Board: SiFive HiFive Unmatched A00
Compatible: sifive,hifive-unmatched-a00
+------------------------------------------+
| QRV Operating System Kernel version 0.25 |
+------------------------------------------+
init_clocks
init_raminfo
asinfo: PCI IO  60080000 - 6008ffff
asinfo: PCI MEM 60090000 - 6fffffff
asinfo: PCI MEM 70000000 - 70ffffff
asinfo: PCI MEM 2000000000 - 3fffffffff
init_mmu
  ram_top = 480000000
  identity_map RAM 80000000..480000000
  ...
  prealloc_kernel_l2: filled 239 empty kernel L2 slots
  enable_mmu: Sv39 paging active
init_intrinfo: timer(base=5,n=1) plic(base=32,n=128)
init_cpuinfo: 4 CPUs, rv64imac, 1 MHz
hwinfo: serial sifive,fu740-c000-uart @ 10010000 irq 39
hwinfo: serial sifive,fu740-c000-uart @ 10011000 irq 40
hwinfo: pci sifive,fu740-pcie ecam df0000000/10000000 irq 57 windows 4
hwinfo: pci dw_pcie_msi dbi e00000000/80000000 irq 56
...
[1] kerlink: boot/taskman.qkx: resolved 6114/6114 external symbols (0 unresolved)
[1] kerlink: boot/taskman.qkx: applied 10631/10631 relocations (0 skipped)
[1] kerlink: boot/taskman.qkx: link complete
...
[1] cpu_start_others: hart 2 started
[1] cpu_start_others: hart 3 started
[1] cpu_start_others: hart 4 started

***********************************
* Welcome to QRV Operating System *
***********************************

     pid tid name               prio STATE       Blocked
       1   1 taskman             10r READY
       ...
       2   1 esh                 10r REPLY       1

Starting slogger...
Starting pci-server...

PCI devices:
00:00.0 PCI bridge [0604]: f15e:0000 (rev 00)
01:00.0 PCI bridge [0604]: 1b21:2824 (rev 01)
04:00.0 USB controller [0c03]: 1b21:1142 (rev 00)
06:00.0 NVM controller [0108]: 144d:a80a (rev 00)
07:00.0 VGA compatible controller [0300]: 10de:128b (rev a1)
07:00.1 Multimedia controller [0403]: 10de:0e0f (rev a1)

Probing for NVMe...
devb-nvme: controller at 06:00.0 BAR0 0x60400000
devb-nvme: controller enabled (CSTS.RDY=1)
nvme: controller VID:DID 144d:144d  ctrl-id 6
  Model:    "SAMSUNG MZVL2512HCJQ-00B00"
  Serial:   "S675NF0R487649"
  Firmware: "GXA7301Q"
nvme: ns 1  size=1000215216 LBAs  lba=512 B  capacity=488386 MiB
devb-nvme: interrupts on (IRQ=57 vector=89)
devb-nvme: GPT OK, 8 partition(s)
  p1: LBA 34..2081 (1 MiB)      type=5b193300-...  name=""
  p2: LBA 2082..10273 (4 MiB)   type=2e54b353-...  name=""
  p3: LBA 10274..227361 (106 MiB) type=c12a7328-... name="EFI"
  p4: LBA 227362..235553 (4 MiB) type=ebd0a0a2-...  name="CIDATA"
  p5: LBA 237568..144941055 (70656 MiB) type=516e7cb4-... name="FreeBSD"
  p6: LBA 144941056..857972735 (348160 MiB) type=0fc63daf-... name="Debian"
  p7: LBA 857972736..891527167 (16384 MiB) type=0657fd6d-... name="swap"
  p8: LBA 891527168..895721471 (2048 MiB) type=51525611-... name="QRV"
devb-nvme: ready (9 path(s), first NS 488386 MiB 512 B/LBA)
br--r--r--  1     0     0 512110190592 nvme0n1
br--r--r--  1     0     0   2147483648 nvme0n1p8

Mounting /dev/nvme0n1p8 at /disk2...
fs-qrv: qrvfs v1, 524288 blocks, 4096 inodes
fs-qrv: mounted qrvfs at /disk2 (dev=/dev/nvme0n1p8)
drwxrwxr-x  2     0     0      768 bin

# cd /disk2
# ls
bin
# cd bin
# ls -la
drwxrwxr-x  2     0     0      768 .
drwxr-xr-x  3     0     0      768 ..
-rwxrwxr-x  1     0     0    60144 syscall_testing
# echo "syscall_testing is here, on NVMe!!!"
syscall_testing is here, on NVMe!!!
# ./syscall_testing
=== QRV Syscall Conformance Test Suite ===

[timers]   25 tests ... all PASS
[sync]     23 tests ... all PASS
[printf]   13 tests ... all PASS

=== Results: 61 tests, 61 passed, 0 failed ===
# shutdown
Shutting down...

What This Means

The partition layout tells the full story without commentary: FreeBSD, Debian, swap, and — at the end — QRV. That last entry was created in advance, left empty, and waited.

The syscall_testing binary is not a toy. It tests kernel call semantics: error returns, resource exhaustion, object reuse, the _r variants that must not touch errno. Sixty-one tests, all passing, executed from a QRV filesystem on a Samsung NVMe, on a RISC-V workstation, through a microkernel IPC stack.

This is a system.


What Remains

Versions 0.26 through 1.0 are still ahead: writable filesystem, proper signals in userspace, network stack, and much more. The documentation effort is just beginning — a User Manual, a Programmer's Manual, a Kernel Calls & Taskman Message Reference, and the Porting Story that has been accumulating as LaTeX chapters throughout this sprint.

But the picture that was forming in someone's mind back in 1999, through RadiOS in assembly, through the HEIG-VD sources found during a COVID lockdown, through a restarted project on a father's birthday in February 2026 — that picture is now a running system.

The work continues.

QRV v0.25: mount -t qrv /dev/nvme0n1p8 /disk2

That mount command works. On QEMU virt, a qrvfs filesystem written by the host mkfs-qrv tool to NVMe partition 8 is mounted at /disk2, browsable with ls, readable with cat, with all four resource managers — devb-nvme, fs-qrv on /dev/nvme0n1p8, devb-virtio, fs-qrv on /dev/vblk0 — running simultaneously in user space.

On the SiFive HiFive Unmatched, the DesignWare PCIe controller enumerates the full board topology: USB hub, Samsung MZVL2512HCJQ NVMe (144d:a80a), NVIDIA VGA, NVIDIA audio. devb-nvme reads the Samsung's GPT — eight partitions, ~2 GiB at partition 8 — and registers /dev/nvme0n1 and /dev/nvme0n1p8 in the QRV namespace.

That last partition has been sitting empty since July 2021, when the board arrived and a deliberate choice was made to leave space for an operating system that didn't exist yet. Today it does.


The sync_hash Collision

Before any of the above was possible, a fundamental kernel bug had to be found. The symptom: the second instance of any resmgr binary — fs-qrv spawned for /dev/nvme0n1p5 while another fs-qrv was already mounted on /dev/vblk0 — failed with EBUSY on pthread_mutex_init during resmgr_attach. The two processes were otherwise healthy. Killing one let the other proceed. Running them sequentially worked fine.

The root cause is in RISC-V's cpu_pageman_vaddrinfo. Rather than walking the user page tables to find the physical address backing a virtual address, it does a plain arithmetic subtraction: PA = VA - KERNEL_VIRT_BASE. For kernel addresses this is correct, because the kernel is mapped at a fixed offset and every kernel VA maps to exactly one PA. For user addresses from processes loaded at the same virtual base (no ASLR, same ET_DYN layout), two processes end up at identical virtual addresses for their arena, and the subtraction produces identical "physical" addresses for both.

The kernel's sync_hash uses the physical address as the key. Two processes calling pthread_mutex_init at the same user VA produce the same (obj=_syspage_ptr, offset=VA-BIAS) key. The second one collides with the first's entry and gets EBUSY.

On x86_64 this was benign because cpu_pageman_vaddrinfo walks the actual page tables, so two processes at the same VA get genuinely different PAs. The bug was a RISC-V-specific shortcut that was harmless when only one instance of each resmgr ran, and fatal as soon as two instances of the same binary ran simultaneously.

Fix: in pageman_vaddr_to_memobj and pageman_munmap, return the calling tProcess * as the hash object for private mutexes, making the key (prp, offset) instead of (_syspage_ptr, offset). Per-process keys cannot collide across processes by construction.
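The collision and the fix reduce to how the hash key is built. The structure and function names below are hypothetical, and the bias value used in the example is arbitrary; the sketch only contrasts a key built on one shared object with a key built on the owning process.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reduced sync_hash key: an object pointer plus an offset. */
struct sync_key {
    const void *obj;
    uint64_t    offset;
};

/* Returns 1 if two keys would land on the same hash entry. */
int sync_key_equal(struct sync_key a, struct sync_key b)
{
    return a.obj == b.obj && a.offset == b.offset;
}

/* Old scheme: every process shares one object, so the key depends
 * only on the virtual address. */
struct sync_key key_shared(const void *syspage, uint64_t va, uint64_t bias)
{
    assert(syspage != NULL);
    struct sync_key k = { syspage, va - bias };
    return k;
}

/* Fixed scheme for private mutexes: the owning process is the object. */
struct sync_key key_per_process(const void *prp, uint64_t va, uint64_t bias)
{
    assert(prp != NULL);
    struct sync_key k = { prp, va - bias };
    return k;
}
```

Two processes initializing a mutex at the same user VA get equal keys under key_shared and distinct keys under key_per_process, so the second pthread_mutex_init no longer finds an existing entry.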


The DesignWare iATU

Getting PCI enumeration to work on the Unmatched required understanding something non-obvious about the FU740's PCIe controller.

QEMU's pcie-ecam-generic host bridge provides a flat ECAM memory window: a contiguous region where the address encodes bus<<20 | slot<<15 | func<<12 | register. You mmap it once and read any device's config space by computing an offset. Straightforward.

The SiFive FU740 uses a Synopsys DesignWare PCIe controller. Its "config" memory window is not flat ECAM. The DW iATU encodes CFG TLPs as bus<<24 | slot<<19 | func<<16 | register — a different bit layout, wider per-field. A single iATU mapping cannot cover the whole bus space. The canonical pattern (FreeBSD pci_dw_read_config, Linux dw_pcie_other_conf_map_bus) is to re-aim outbound iATU region 0 to the specific (bus, slot, func) before each config access.

New file servers/pci/pci_dw_atu.c implements this:

  • pci_dw_atu_init(dbi_pa) mmaps 4 MiB of the DBI register space and auto-detects whether the controller uses legacy iATU (at DBI+0x900) or unroll mode (at DBI+0x300000) by reading DW_IATU_VIEWPORT — it returns 0xFFFFFFFF in unroll mode.
  • pci_dw_atu_aim_cfg(bus, slot, func, type, cfg_pa, cfg_size) programs region 0 with TYPE_CFG0 (immediate child, bus 1) or TYPE_CFG1 (downstream of bridge, bus ≥ 2), polls CTRL2.REGION_EN for completion.
  • Bus 0 (the controller's own pseudo-bridge) reads directly from DBI; non-zero devfunc on bus 0 returns 0xFFFFFFFF (no device).
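The two address encodings and the CFG0/CFG1 choice can be written down directly from the bit layouts above. The function and enum names are hypothetical and this is not the contents of pci_dw_atu.c; programming the iATU registers themselves is left out.

```c
#include <assert.h>
#include <stdint.h>

/* Flat ECAM: offset of a config register inside the ECAM window. */
uint32_t ecam_offset(unsigned bus, unsigned slot, unsigned func,
                     unsigned reg)
{
    assert(bus < 256 && slot < 32 && func < 8 && reg < 4096);
    return ((uint32_t)bus << 20) | ((uint32_t)slot << 15) |
           ((uint32_t)func << 12) | (uint32_t)reg;
}

/* DesignWare iATU: CFG TLP address for the same device and register. */
uint32_t dw_cfg_addr(unsigned bus, unsigned slot, unsigned func,
                     unsigned reg)
{
    assert(bus < 256 && slot < 32 && func < 8 && reg < 4096);
    return ((uint32_t)bus << 24) | ((uint32_t)slot << 19) |
           ((uint32_t)func << 16) | (uint32_t)reg;
}

enum dw_cfg_type { DW_TYPE_CFG0, DW_TYPE_CFG1 };   /* hypothetical names */

/* Bus 1 is the immediate child of the root port (CFG0); anything
 * further downstream needs CFG1. Bus 0 is read from DBI instead. */
enum dw_cfg_type dw_cfg_type_for_bus(unsigned bus)
{
    assert(bus >= 1 && bus < 256);
    return bus == 1 ? DW_TYPE_CFG0 : DW_TYPE_CFG1;
}
```

For the NVMe controller at 06:00.0, ecam_offset gives 0x600000 while dw_cfg_addr gives 0x06000000, which is why the flat-ECAM offset computation cannot be reused on the FU740.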

The FDT parser was also reworked: the FU740's device tree uses reg-names and interrupt-names to distinguish the DBI window from the ECAM/config window and MSI interrupt. Previously the parser read them in order and assigned ecam_base to whichever reg entry came first — on the FU740 that is the DBI window (wrong), not the config window. The fix makes the commit block idempotent and re-reads on the second pass once reg-names is available, correcting ecam_base to reg[1].

On the Unmatched: unroll mode detected, full PCIe topology enumerated.


PCIe INTx: Three Bugs in One Session

Getting interrupt-driven NVMe working on QEMU required fixing three latent bugs in the PCIe INTx allocation path, all of which meant pci_device_read_irq returned 0.

Bug 1: libpci's pci_device_attach was missing PCI_INIT_IRQ from the attach flags sent to pci-server. Without it, the server's entire IRQ-allocation path was skipped unconditionally.

Bug 2: The server's IRQ resource pool was never seeded on RISC-V. The original QNX flow relied on BIOS-programmed Interrupt_Line registers that pci_enum could pci_reserve_irq() from. QEMU virt leaves Interrupt_Line at 0xFF. Nothing got reserved; every subsequent rsrcdbmgr_attach call failed. Fixed by seeding INTA..INTD (irq_base..irq_base+3) at ecam_attach() time.

Bug 3: ecam_avail_irq() returned the full unswizzled [INTA, INTB, INTC, INTD] list, so pci_alloc_irq picked the numerically lowest free IRQ rather than the pin-routed one. The correct result for a PCIe device is a single wire: irq_base + ((device + pin - 1) % 4). Read Interrupt_Pin from config space and return that one.

After all three: pci_device_read_irq on the QEMU NVMe at 00:02.0 with pin A returns PLIC IRQ 34 — the wire the device actually asserts.
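The swizzle is a one-line function. In the sketch below the function name is hypothetical, and irq_base = 32 in the example is inferred from the numbers above (device 2, pin A, result 34) rather than quoted from the source.

```c
#include <assert.h>
#include <stdint.h>

/* Standard INTx swizzle: pin is 1..4 for INTA..INTD, as read from the
 * Interrupt_Pin config register. Returns the single routed IRQ. */
uint32_t intx_route(uint32_t irq_base, unsigned device, unsigned pin)
{
    assert(pin >= 1 && pin <= 4);
    return irq_base + ((device + pin - 1) % 4);
}
```

intx_route(32, 2, 1) evaluates to 34, matching the IRQ reported for the QEMU NVMe at 00:02.0.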

devb-nvme then switched from polling to proper IST-driven command completion: a dedicated IST thread drains both admin and I/O CQs on each wake (INTx is shared between queues), latches (status, cid) into a per-queue mutex/condvar/done gate, signals the submitter, and unmasks the kernel vector. The polling path remains as fallback for admin commands during bringup before the IST starts.


The Synopsys DW MSI Controller (Stages 1–3)

For MSI-X support — which NVMe devices strongly prefer over INTx — the FU740's internal Synopsys DesignWare MSI controller needs to be brought up. It receives posted MSI writes from devices, sets bits in MSI_INTR_STATUS_0, and asserts one aggregate PLIC IRQ (vector 56 on the Unmatched).

Three stages landed this release:

Stage 1: FDT parsing of the DBI window and MSI IRQ number. The kernel publishes a dw_pcie_msi child device under the PCI bus, carrying the DBI window location and aggregate PLIC IRQ. On QEMU virt nothing publishes this; the existing INTx path is completely unchanged.

Stage 2: ecam.c consumes the child device when present, picking up the DBI window address and MSI IRQ for later use.

Stage 3: pci_dwmsi.c, the controller bringup driver. Maps 4 KiB of DBI (the DTB declares a 2 GiB window; mapping it in full would consume pci-server's entire address space, so it is capped), allocates an anonymous page as the MSI target, programs ADDR_LO/ADDR_HI/INTR_ENABLE/STATUS, and attaches an IST to the aggregate PLIC IRQ. Each MSI fire is logged. Per-vector pulse dispatch to waiting drivers is Stage 4, deferred.
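Since one aggregate IRQ stands in for many MSI vectors, whatever handles it has to walk the status bits. The following is a sketch of that demultiplexing step only, operating on a snapshot of the status word. It is not the pci_dwmsi.c code: msi_demux, msi_handler_t and record_vector are hypothetical names, and treating the status as a single 32-bit word with one bit per vector is an assumption.

```c
#include <assert.h>
#include <stdint.h>

#define DW_MSI_VECTORS 32u   /* assumption: one 32-bit status word */

typedef void (*msi_handler_t)(unsigned vector, void *arg);

/*
 * Hypothetical demux: call handler once for every set bit in a status
 * snapshot, lowest vector first. Returns the number of vectors dispatched.
 */
static unsigned msi_demux(uint32_t status, msi_handler_t handler, void *arg)
{
    assert(handler != 0);
    unsigned count = 0;
    for (unsigned v = 0; v < DW_MSI_VECTORS; v++) {
        if (status & (UINT32_C(1) << v)) {
            handler(v, arg);
            count++;
        }
    }
    return count;
}

/* Example handler: records which vectors fired in a caller-owned bitmask. */
static void record_vector(unsigned vector, void *arg)
{
    uint32_t *seen = arg;
    *seen |= UINT32_C(1) << vector;
}
```

In the current stage the handler would simply log; a later per-vector dispatch could replace it with something that delivers a pulse to the driver waiting on that vector.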


Two Other Bugs Worth Noting

Recursive kmutex initialization. mutex_init() with PTHREAD_RECURSIVE_ENABLE was using KMUTEX_RMUTEX_INIT, which sets owner to QRV_SYNC_INITIALIZER (0xffffffff). That sentinel is for statically-initialized user-space pthread mutexes; kernel mutexes never go through the sync subsystem and their mutex_lock() spins waiting for owner==0. 0xffffffff never becomes 0. Any lock of a recursively-initialized kmutex spun forever.

Separately, taskman/sys/support.c created its per-process lock with a NULL attr (non-recursive, error-checking), but reentrant taskman paths take the same process's mutex twice on one thread: QueryObject → handler → proc_lock_pid(same pid). The second acquire returned EDEADLK and taskman crashed. Both are fixed; together they were taking down taskman on the Unmatched during pci-server startup.
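The owner-sentinel problem is easy to model. The sketch below reduces a kernel mutex to its owner word and shows one iteration of an acquire loop that only succeeds when owner is 0. struct kmutex_model and kmutex_try_acquire are hypothetical stand-ins, and the use of a C11 compare-and-swap is an assumption about the general shape of the loop, not a copy of QRV's mutex_lock().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define QRV_SYNC_INITIALIZER 0xffffffffu   /* sentinel value quoted above */

/* Hypothetical model of a kernel mutex: only the owner word is kept. */
struct kmutex_model {
    _Atomic uint32_t owner;   /* 0 means unlocked */
};

/*
 * One iteration of the acquire loop: claim the mutex only if owner == 0.
 * A real lock would spin on this until it returns true.
 */
static bool kmutex_try_acquire(struct kmutex_model *m, uint32_t self)
{
    assert(self != 0);        /* 0 is reserved for "unlocked" */
    uint32_t expected = 0;
    return atomic_compare_exchange_strong(&m->owner, &expected, self);
}
```

A mutex whose owner starts at 0 is acquired on the first attempt. One whose owner starts at QRV_SYNC_INITIALIZER fails every attempt, because nothing on the kernel path ever rewrites the sentinel to 0, which is the infinite spin described above.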

vfprintf NUL-byte truncation. PRINT(ox, 2) and PRINT(&sign, 1) stash pointers to stack-local variables into the IOV vector for deferred __sfvwrite consumption. Both locals get reset to '\0' at the top of every new format specifier. When one format spec's PRINT'd pointer was still queued and the next spec's reset fired first, the pointed-to bytes became NUL before __sfvwrite read them — emitting a stray NUL mid-output that truncated any consumer reading the buffer as a C string.

The symptom: pci_slogf("%#lx PLIC IRQ %u", ...) on the Unmatched produced "0xe00000000/0" — the 0x prefix of the second %#lx became "0" + NUL when the trailing %u iteration wiped ox[1] before the flush. Fix: drain the IOV via FLUSH() at the end of the pforw block so the locals are consumed before the next iteration resets them. Thirteen new printf regression tests, all passing.
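The failure mode is a deferred write holding a pointer to a local that changes before the write happens. Here is a miniature model of it. None of these types are the libc's real ones: mini_iov, mini_out, mini_print, mini_flush and emit_prefixed are hypothetical, and the flush_early flag exists only to show the buggy and fixed orderings side by side.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical miniature of a deferred-write vector. */
struct mini_iov { const char *base; size_t len; };

struct mini_out {
    struct mini_iov iov[8];
    int    niov;
    char   buf[64];
    size_t used;
};

/* Queue a pointer for later; the bytes are NOT copied yet. */
static void mini_print(struct mini_out *o, const char *p, size_t n)
{
    assert(o->niov < 8);
    o->iov[o->niov].base = p;
    o->iov[o->niov].len = n;
    o->niov++;
}

/* Copy every queued span into the output buffer and empty the queue. */
static void mini_flush(struct mini_out *o)
{
    for (int i = 0; i < o->niov; i++) {
        assert(o->used + o->iov[i].len <= sizeof o->buf);
        memcpy(o->buf + o->used, o->iov[i].base, o->iov[i].len);
        o->used += o->iov[i].len;
    }
    o->niov = 0;
}

/*
 * Emit a "0x" prefix plus digits, then simulate the next format spec
 * resetting ox[1]. flush_early != 0 drains the queue before the reset.
 * Returns the length a C-string reader would see in the output buffer.
 */
static size_t emit_prefixed(struct mini_out *o, const char *digits, int flush_early)
{
    char ox[2] = { '0', 'x' };     /* stack-local, queued by pointer */
    memset(o, 0, sizeof *o);
    mini_print(o, ox, 2);
    mini_print(o, digits, strlen(digits));
    if (flush_early)
        mini_flush(o);             /* the fix: consume ox before it changes */
    ox[1] = '\0';                  /* reset at the top of the next spec */
    mini_flush(o);
    return strlen(o->buf);
}
```

Without the early flush, all five bytes are written but the second one is NUL, so a C-string consumer sees only "0". With it, the buffer holds "0xe00" intact.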


The Test Suite

61 tests across three subsystems now run on every build: 25 for TimerCreate/TimerDestroy, 23 for the sync subsystem (mutexes, condvars, semaphores — including error paths, resource exhaustion, and object reuse), and 13 for printf (including the %#lx+%u regression that caught the NUL bug). All 61 pass on QEMU virt.


The Partition That Waited

The CHANGELOG notes it plainly: devb-nvme on the Unmatched read the Samsung's GPT and found 8 partitions.

That last partition was formatted and left empty in July 2021 when the board first arrived. The intent, at the time, was to eventually fill it with a custom filesystem on a QNX-compatible microkernel ported to RISC-V. Four years and a few months later, mount -t qrv /dev/nvme0n1p8 /disk2 is a working command.