System log

In this blog I share my observations, thoughts and experience about computers, linguistics, philosophy and many other things that interest me.

Wednesday, May 06, 2026

QRV v0.34–v0.40: Taskman Moves to User Mode

Since the project began, taskman has run in S-mode — the RISC-V supervisor mode, shared with the kernel. It was kerlinked into the kernel address space at boot, called kernel functions by name, and reached kernel data structures directly. This was inherited from QNX's procnto-as-kernel-module model and it worked, but it was architecturally wrong: there is no meaningful trust boundary between the kernel and taskman when both run in the same address space under the same privilege level.

v0.40 completes the migration of taskman to U-mode. Taskman now runs as an ordinary user-space process at address 0xC0000000, with its own page tables, communicating with the kernel exclusively through the syscall_tm_priv mechanism. The kernel heap has PTE_U=0. A bug in taskman can no longer corrupt kernel data by accident.

This took about a week of intensive work spread across versions 0.34 through 0.40. Here is how it went.


v0.34–v0.35 — Groundwork: Relocating Bodies, Retiring __Ring0

The first problem was naming. The historical __Ring0(func_ptr, arg) primitive — kernel calls a function pointer supplied by taskman, executed with kernel privilege — was an x86 concept with x86 terminology. RISC-V has no rings. And beyond the name, the mechanism was architecturally dangerous: the kernel jumped to whatever address taskman supplied, with no whitelist beyond a capability flag.

syscall_tm_priv replaces __Ring0. The new mechanism: <sys/kercalls.h> defines a __KER_TM_* sub-namespace of 53 enum constants. syscall_tm_priv(id, arg) is a plain ecall that dispatches through a static kernel-owned table (ker_tm_priv.c), gated on QRV_FLG_PROC_TM_PRIV. The table is a whitelist. A _Static_assert catches enum/table drift at compile time. Every EXPORT_SYMBOL(kerext_*) entry was removed from the kernel's symbol table — the kerext bodies are truly kernel-internal now.
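The whitelist-dispatch pattern can be sketched as follows. This is a minimal illustration, not QRV's actual table: the enum values, handler names, and return conventions here are all stand-ins.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative sub-namespace: a compact enum doubles as the table index. */
enum { __KER_TM_FIRST, __KER_TM_PA_ALLOC = __KER_TM_FIRST,
       __KER_TM_PA_FREE, __KER_TM_LAST };

typedef long (*tm_priv_fn)(void *arg);

static long tm_pa_alloc(void *arg) { (void)arg; return 1; }
static long tm_pa_free(void *arg)  { (void)arg; return 0; }

/* Static and kernel-owned: taskman supplies an index, never a pointer. */
static tm_priv_fn const tm_priv_table[] = {
    [__KER_TM_PA_ALLOC] = tm_pa_alloc,
    [__KER_TM_PA_FREE]  = tm_pa_free,
};

/* Catch enum/table drift at compile time. */
_Static_assert(sizeof tm_priv_table / sizeof tm_priv_table[0] == __KER_TM_LAST,
               "tm_priv table out of sync with enum");

long syscall_tm_priv_dispatch(unsigned id, void *arg)
{
    if (id >= __KER_TM_LAST || tm_priv_table[id] == NULL)
        return -1;                  /* not on the whitelist */
    return tm_priv_table[id](arg);
}
```

The essential property: the kernel only ever jumps through its own static table, so a compromised or buggy taskman can at worst invoke a wrong-but-whitelisted operation, never an arbitrary address.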

This rename was a tree-wide sweep: ~90 call sites, one sed pass.

Phase 0a–0c: kerext bodies moved to kernel. Before taskman text can be marked PTE_U=1, every function that the kernel calls through a function pointer must live in kernel text. S-mode instruction fetch from PTE_U=1 pages is illegal regardless of SUM. Three phases relocated the bodies:

  • Phase 0a: kerext_reparent, SignalKill, and the five exit-path bodies (kerext_pulse_deliver, exit_destroy_threads, channel_destroy, timer_destroy, connect_detach) moved to kernel/kext/.
  • Phase 0b: sysaddr_map, ext_vaddrinfo, ker_manipulate moved via a cross-boundary helper table — the bodies called back into taskman helpers (do_manipulation, pte_map) that still lived in taskman text.
  • Phase 0c: The entire pa allocator cluster (kerext_pa_alloc, _alloc_given, _free, _free_info) moved alongside a new kerext_mm_xboundary table registering 10 function pointers to taskman-side helpers.

Phase 0 ends with the condition: every __Ring0 callback body lives in kernel text. The SPP-flip preconditions are in place.

Phase 1a–1c: arch helpers moved to kernel. pageman_aspace — called on every context switch via memmgr_p->aspace — moved to kernel/arch/riscv/pageman_aspace.c. The Sv39 PTE engine (cpu_pte_manipulate, cpu_pte_merge, cpu_pte_split, sv39_walk, prot_to_pte_bits) moved to kernel/arch/riscv/cpu_pte.c. cpu_sysvaddr_find and cpu_pageman_vaddrinfo moved alongside. A kx_* indirection that was now pointing kernel symbol to kernel symbol was collapsed to direct calls.


v0.35 — pa.c Migrates to the Kernel

The physical-quantum allocator had always been semantically wrong in taskman: physical memory belongs in the kernel by every microkernel convention. After Phase 0c moved the kerext bodies, moving the allocator itself followed naturally.

kernel/kext/pa.c (~1900 lines) now carries all of it: the global state (blk_head[], mem_free_size, mem_reserved_size, restriction chain, quantum pool), the pure helpers (pa_carve, _pa_free, pa_quantum_to_paddr, enqueue_run, dequeue_run), init code, and six new __KER_TM_PA_* slots.

Taskman's pa.c was replaced with a thin syscall-wrapper file (~380 lines). The parts that stayed taskman-side: pa_alloc_fake / pa_free_fake (fake quantum allocator using taskman heap), the high-level pa_alloc retry/purger loop, and pa_free_info's restriction-filter walk.

Six kernel-data globals that taskman had been reaching into directly (mem_free_size, blk_head, etc.) now live on the kernel side and are accessed only through kerext calls. Net leak from kernel data into taskman binary: zero.


v0.36 — kernel/mm/ Promoted to First Class

The kernel needed its own permanent memory manager — one that lives in S-mode kernel text and is callable without indirecting through taskman. Pre-v0.36 the kernel had emm.c ("early memory manager"), a bootstrap stub that handed off to taskman's pageman table at startup. After the U-mode flip, memmgr_p->FOO calls could no longer land in taskman text.

kernel/emm.c → kernel/mm/mm.c, symbols emm_* → mm_*. init_memmgr wires the full table (mmap, munmap, vaddr_to_memobj, vaddrinfo, mcreate, mdestroy, aspace, pagesize) to kernel-resident symbols:

  • mm_vaddrinfo (new): read-only Sv39 walk returning the leaf PTE, using cpu_pte_lookup. Correct and conservative; no memobj/mm_map walk.
  • mm_mcreate / mm_mdestroy (new): process-aspace setup via cpu_pgdir_create. Allocates pgdir + tAddress. Deliberately omits the mm_map list, rwlock, rlimit — taskman owns that bookkeeping.

kerext_register_memmgr and kerext_register_procmgr defanged to no-ops. The kernel never indirects through taskman text again.


v0.37–v0.38 — First U-Mode Boot and Its Bugs

With the groundwork in place, the Phase 4 work began: actually running taskman in U-mode. This was not a clean single commit — it was a debugging session across multiple evenings, each commit fixing the current crash and revealing the next one.

Phase 4a (v0.34): The predicate for S→U-mode thread entry changed from !QRV_FLG_PROC_RING0 to !QRV_FLG_PROC_LOADING. The distinction matters: taskman has RING0 set permanently (it's a privileged process), but its pool threads should run U-mode. Loader threads — which run in the child process's context during spawn — should stay S-mode until ProcessStartup promotes them.

Phase 4b: Marking taskman text PTE_U=1 in kerlink. This made taskman reachable from U-mode but also made it readable from every user process (the kernel L2 is shared). Acknowledged as a transitional step; Phase 5's ET_DYN taskman in its own address space fixes the isolation.

The first boot to message_start: Eight orthogonal fixes in one commit that together carried taskman through all init phases in U-mode: PTE_U=1 on the kernel heap, SUM set early, a kerext for reading satp from U-mode, CPIO ramdisk PA → high VA conversion, rlimits initialised for taskman, PROC_LOADING cleared after the first taskman thread, pool stacks pre-allocated (96 × 8 KiB via pageman_mmap before thread_pool_start), and aspace_prp = prp set uniformly for all pool threads.

main() must not return (v0.38). Taskman is now a real ET_EXEC. When taskman_main returns, libc's crt0 calls exit() → _exit() → _PROC_EXIT → procmgr → ProcessThreadsDestroy, which tears down all of taskman's threads. Fix: pthread_exit(NULL) instead of return.
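The shape of the fix, as a hedged sketch — taskman_thread here is an illustrative stand-in, not QRV's actual entry point:

```c
#include <pthread.h>
#include <stddef.h>

/* Ending the first thread with pthread_exit means crt0's exit() path
   never runs, so the process (and its pool threads) lives on. */
static void *taskman_thread(void *arg)
{
    (void)arg;
    /* ... init phases, thread_pool_start() ... */
    pthread_exit(NULL);
}
```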

The NOZOMBIE race (v0.38). A subtle spawn-ordering bug: every process is created with QRV_FLG_PROC_NOZOMBIE set. Before the U-mode flip, proc_start cleared it for normal spawns. After the flip, kerext_register_procmgr is a no-op, so procmgr_p->process_start is NULL, and NOZOMBIE stays set permanently. Every parent's waitpid returns ECHILD immediately. The init script became fire-and-forget instead of synchronous. Fix: replicate the NOZOMBIE clearing in proc_loader.c.

mm_map_xfer wired (v0.37). memmgr_p->map_xfer had been NULL since the kernel/mm/ promotion — left as a stub with an explicit "deferred" comment. The first time a pool thread's msg_replyv had a cross-aspace destination, the kernel called NULL. The fault chain was interesting: the NULL jalr → PC=0 → xfer fault recovery path → _longjmp into a stale jmp_buf whose bytes held kprintf output → PC = 0xffffffc0000a3036 → second instruction page fault → "XFER BUG" banner. mm_map_xfer is now a proper Sv39 walk.

LP64 intrinfo truncation (v0.38). init_intrinfo runs before switch_to_high_va. GCC under -mcmodel=medany resolves symbol addresses PC-relative at the PA, not the linked VA. The stored mask/unmask function pointers were PA values (0x80207b88) instead of kernel VAs (0xffffffc080207b88). Every subsequent ilp->info->unmask() call jumped to low-half user space, which isn't mapped. Fix: KVA(sym) — OR the symbol address with KERNEL_VIRT_BASE.
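The fix is a one-line macro. This sketch uses the KERNEL_VIRT_BASE value implied by the addresses quoted above; the macro name KVA matches the text, the rest is reconstruction:

```c
#include <assert.h>
#include <stdint.h>

#define KERNEL_VIRT_BASE 0xffffffc000000000ULL

/* Lift a PC-relative (physical-half) symbol address into the kernel's
   high virtual half. Valid because the kernel VA space is a fixed-offset
   alias whose low bits match the PA. */
#define KVA(pa) ((uint64_t)(pa) | KERNEL_VIRT_BASE)
```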


v0.39–v0.40 — Phase 5: The Real Architecture

Phase 5 was the fundamental ABI redesign. All of the Phase 4 work had taskman running in U-mode while still touching kernel heap data directly (the heap was PTE_U=1). Phase 5 flips the heap to PTE_U=0 and builds the full structure to make U-mode taskman correct.

tTMprocess: taskman-side process state. The kernel-allocated tProcess contains kernel-managed fields. But it also carried POSIX-policy fields (pgrp, sid, session, umask, root, cwd, guardian, siginfo, events, resource lists, conf table) that belong in taskman. Under PTE_U=0, taskman can no longer dereference a tProcess * for any of these.

The solution: taskman/include/tm_process.h defines tTMprocess — a fixed-size BSS array of TM_PROCESS_MAX slots indexed by PINDEX(pid). Every POSIX-policy field migrated from kernel-half tProcess to tTMprocess. The locking state for the address space (rwlock_lock, fault_owner) migrated from tAddress to tTMprocess. The kernel retains only what it genuinely owns: scheduling state, IPC vectors, trap/fault state, credentials.

KERNEL_INTERNAL struct gating. Every kernel-private struct body (tProcess, tThread, tChannel, tConnect, and 15 others) is now gated behind #ifdef KERNEL_INTERNAL. Kernel sources define KERNEL_INTERNAL and see full struct bodies. Taskman sees only the typedef pointer-to-incomplete. Any prp->X deref in taskman code is now a compile error — by design.
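The gating pattern in miniature — a sketch with a trimmed-down struct body, not QRV's real tProcess:

```c
#include <assert.h>
#include <stddef.h>

/* Always visible: the typedef alone, so handles can be stored and passed. */
typedef struct tProcess tProcess;

#ifdef KERNEL_INTERNAL            /* defined by kernel sources only */
struct tProcess {
    int pid;
    /* scheduling state, IPC vectors, trap/fault state, credentials ... */
};
#endif

/* Legal in taskman: pointer passes through as an opaque handle. */
static tProcess *stash;
static tProcess *prp_stash(tProcess *prp) { stash = prp; return stash; }

/* Illegal in taskman, by design: any prp->pid is a compile error,
   because the struct body is incomplete without KERNEL_INTERNAL. */
```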

pa_pq accessors. struct pa_quantum lives on the kernel heap (PTE_U=0). Direct field access from U-mode taskman faults. The new pa_pq_flags() / pa_pq_blk() / pa_pq_run() / pa_pq_modify_flags() inline wrappers route through __KER_TM_PA_PQ_OP.

80 TM_PRIV slots. The dispatch table grew from 53 to 80 entries during Phase 5. 25 new slots cover everything taskman needs to do that involves kernel data: PA_ALLOC_FULL, PA_PQ_OP, ASPACE_MEMCPY, KPAGE_ZERO, REGISTER_POOL_STACK, PROC_QUERY, PROC_SET_SESSION, CHANNEL_HAS_RECEIVERS, MMAP_PHYS_USER, MMAP_ANON_USER, and more.

taskman linked as ET_EXEC at 0xC0000000. The old ET_DYN-against-libc.qrl build required libc.qrl to live in the kernel heap (PTE_U=0 after Phase 5 → taskman can't reach it). The new build: -nostdlib -static -Ttext-segment=0xC0000000, libc.a inside the link group. Every libc symbol now lives inside taskman's own image. No DT_NEEDED, no PLT/GOT for libc.

0xC0000000 is Sv39 L2[3]. cpu_pgdir_create copies only L2[1] and L2[256..511] into every user pgdir. L2[3] stays zero in every user process — taskman's image is invisible to other processes.
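Why 0xC0000000 is L2[3]: in Sv39 the top-level page-table index is VPN[2], bits 38..30 of the virtual address. A minimal check (the helper name is illustrative):

```c
#include <assert.h>
#include <stdint.h>

/* Sv39 top-level index: 9 bits taken from VA bits 38..30. */
static unsigned sv39_l2_index(uint64_t va)
{
    return (unsigned)((va >> 30) & 0x1ff);
}
```

The same arithmetic explains the shared-kernel slots: the high kernel half starting at 0xffffffc000000000 begins at L2[256], matching the L2[256..511] range copied into every user pgdir.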

The loader thread problem. During spawn, a loader thread runs in the child process's context but needs to fetch taskman text. The solution: mm_taskman_view_install(child_prp) temporarily copies taskman's L1 page pointer into the child's pgdir at L2[3]. ProcessStartup calls mm_taskman_view_remove(child_prp) before the first U-mode sret into the child — at that point the child is fully isolated.

Pool stacks registered by taskman. The kernel's procmgr_stack_alloc fallback to memmgr_p->mmap was removed — the kernel heap is PTE_U=0, unusable as a user thread stack. Taskman pre-allocates 96 user-VA stacks via TaskmanMmapAnonUser and pushes each into the kernel's free list via __KER_TM_REGISTER_POOL_STACK. The kernel contributes only the free-list structure; taskman contributes the memory.

826 → 0 warnings. The final commit of the cycle drove the warning count from 826 to zero. 12 were genuine QRV-proper issues fixed at the root. 814 were -Wconversion in the mksh fork — suppressed at the Makefile level with an explicit rationale: the mksh fork is diverging and will eventually be renamed qsh; fixing 800 upstream conversion sites would cost effort that belongs in the architectural work instead.


What This Means

Taskman is now an ordinary user-space process. It communicates with the kernel through a defined, whitelist-gated ecall interface. A bug in taskman — a corrupted pointer, an out-of-bounds write — cannot reach kernel data. The KERNEL_INTERNAL guard makes this a compile-time property, not just a runtime one: the compiler rejects any attempt to dereference a kernel struct from taskman code.

The BKL work and the U-mode work ran on sequential branches. They are now merged into a single tree. The system boots, logs in, and runs on real hardware with both changes in place.

Friday, May 01, 2026

QRV v0.28–v0.33: Breaking Up the Big Kernel Lock

Two ambitious project branches emerged from the v0.27 milestone. The first is removing the Big Kernel Lock. The second — lifting taskman to user mode — gets its own post. This one covers v0.28 through v0.33: the foundation work, the first measurements, and the first kercall to run without holding the BKL.


v0.28 — Phase 0: Infrastructure and Baselines

Before any lock can be removed, you need to know what you're removing. Phase 0 landed two things: the structural plumbing for fine-grained locking, and the instrumentation to measure the current BKL cost.

Per-CPU infrastructure. kernel/include/percpu.h introduces DECLARE_PER_CPU / DEFINE_PER_CPU / this_cpu_{ptr,read,write,inc,dec} — thin macros over plain [PROCESSORS_MAX] arrays indexed by RISC-V hart ID. Alongside it: kernel/include/klock.h defining kspinlock_t, a lock-class rank ordering (KLOCK_CLASS_SCHED, KLOCK_CLASS_SYNC, KLOCK_CLASS_CHANNEL, etc.) for a future lockdep, and a per-CPU preempt_count word.
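The macros really are that thin. A sketch under stated assumptions — PROCESSORS_MAX, my_hart_id(), and the exact macro bodies here are stand-ins for the real percpu.h:

```c
#include <assert.h>

#define PROCESSORS_MAX 4

/* Per-CPU variable = plain array indexed by hart ID. */
#define DEFINE_PER_CPU(type, name) type name[PROCESSORS_MAX]
#define this_cpu_ptr(name)   (&(name)[my_hart_id()])
#define this_cpu_read(name)  ((name)[my_hart_id()])
#define this_cpu_inc(name)   ((name)[my_hart_id()]++)

/* Stand-in: a real kernel reads the hart ID from tp or mhartid. */
static unsigned my_hart_id(void) { return 2; }

DEFINE_PER_CPU(unsigned long, preempt_count);
```

No atomics are needed: each hart touches only its own slot, which is the whole point of the structure.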

am_inkernel() triage. The twelve call sites of am_inkernel() turned out to express two different things. bkl_held() means this CPU holds the BKL. in_kernel_context() means this CPU is in a kernel critical section of any kind. Today both have the same body — but once Phase 1 introduces per-subsystem locks they will genuinely diverge.

bkl-stress and the first baseline. A synthetic stresser that spawns one pthread per hart and hammers one of three workloads: mutex (every lock/unlock hits SyncMutexLock/SyncMutexUnlock), sleep (1 ms nanosleep loops), or nop (pure user-mode, control). Results from a five-second QEMU -smp 4 run:

boot baseline:              34.6 % BKL contention rate
bkl-stress sleep 5s:        58.6 % contention, spread across 4 harts
bkl-stress mutex 5s:        71 %  contention, ~2.17 M iter/s

Those numbers are the target. Phase 1 will be measured against them.

BKL instrumentation. kernel/bkl_stats.c adds per-CPU counters (acquires, contended acquires, spin cycles, held cycles, log2 histogram of spin durations) around acquire_kernel / release_kernel, all gated behind CONFIG_BKL_STATS (default off). The taskman sysfs grew a synth() callback so /sys/bkl can snapshot the kernel counters at any open. When the config is off, the hot path is byte-identical to v0.27.

First measured baseline on a fresh boot: ~28,800 BKL acquisitions across four harts, 34.6 % contention rate, spin distribution peaking in the 2^14–2^17 cycle range. That is roughly 9.8 kcyc per BKL critical section on average.


v0.29 — Phase 1.1–1.2: Fine-Grained Locks Declared

Two structural locks declared, neither yet contended — that is the deliberate pattern.

sync_slock (Phase 1.1) wraps every read and write of sync_hash and the tSync chains hanging off it. Every sync kercall still holds the BKL around the entire operation, so sync_slock is uncontended and idempotent today. It exists to tag every sync-hash touchpoint so that Phase 1.5 can lift the BKL from sync kercalls without re-auditing every call site. One notable exception: synchash_rem was left BKL-only, because its inner unlock_kernel/lock_kernel cycle to safely probe a user-space sync->owner makes a naive spinlock wrap unbounded.

sched_slock (Phase 1.2) declared alongside sync_slock. The scheduler primitives (force_ready, ready_default, block, unready) compose more aggressively than the sync primitives — force_ready calls ready, unready calls block, block calls select_thread via FIND_HIGHEST — and ready_default alone has five early-return paths. Threading a lock through all of that without knowing exactly which paths will run BKL-less is premature. The lock is there; the wrapping waits for Phase 1.5.

CONFIG_KLOCK_DEBUG. An embryonic lockdep: kspinlock_t gains a class field; kspinlock_lock checks that the new class ranks strictly above every class on the per-CPU held-stack. On violation: kprintf + crash(). Default off; when on, a bkl-stress run over Phase 1.1 produces zero violations.
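The rank check itself is a few lines. A sketch (names and the boolean-return shape are illustrative; the real check calls crash() on violation):

```c
#include <assert.h>
#include <stdbool.h>

enum klock_class { KLOCK_CLASS_SCHED = 1, KLOCK_CLASS_SYNC = 2,
                   KLOCK_CLASS_CHANNEL = 3 };

#define HELD_MAX 8
static enum klock_class held_stack[HELD_MAX];   /* per-CPU in the real thing */
static int held_top;

/* A new acquire must rank strictly above everything already held. */
static bool klock_check_and_push(enum klock_class c)
{
    for (int i = 0; i < held_top; i++)
        if (c <= held_stack[i])
            return false;            /* rank inversion: kprintf + crash() */
    held_stack[held_top++] = c;
    return true;
}

static void klock_pop(void) { held_top--; }
```

This encodes the SCHED < SYNC ordering used later in Phase 1.5: taking sched_slock while sync_slock is held is exactly the inversion the check rejects.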


v0.30 — Phase 1.5a–e: First BKL Lift

The distance between "lock declared" and "BKL removed from a kercall" is a series of preparatory steps that each need to be correct before the next one is safe.

P1.5a. synchash_rem wrapped in sync_slock, dropping the lock around the user-space sync->owner probe. The prerequisite that had been deferred from Phase 1.1.

P1.5b. sync_wakeup, mutex_holdlist_add, and mutex_holdlist_rem now take sync_slock internally around their wait-list mutations. They drop the lock across ready() calls — SCHED < SYNC in the rank order, so calling a scheduler primitive with a sync lock held would be a rank inversion.

P1.5c. ready_default and adjust_priority_default wrapped in sched_slock. Five exit paths, drop-and-retake around calls that recurse.

P1.5d. force_ready's STATE_MUTEX block and adjust_priority's STATE_CONDVAR/SEM/MUTEX branches now take sync_slock around their wait-list mutations. A _locked variant of mutex_holdlist_rem avoids recursive acquire.

P1.5e — the first lift. __KER_SYNC_MUTEX_UNLOCK marked in kercall_no_bkl. __kercall_dispatch calls release_kernel() after entry-time bookkeeping, runs the handler, and re-acquires before the tail. The body of sync_mutex_unlock was restructured: *owp = 0 stores switched to _smp_xchg_ul(owp, ...) (amoswap.d.aqrl) for acquire+release semantics; every wait-list read under sync_slock.

A bug caught in passing: atomic_set_64(owp, value) emits amoor.d (bit-set, not assign). ORing zero into *owp is a no-op — the mutex never released, system hung. The correct primitive is _smp_xchg_ul. This discovery also led to a tree-wide rename: atomic_set → atomic_or everywhere, to match actual semantics.
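The bug in miniature, with GCC atomic builtins standing in for the RISC-V AMOs (amoor.d and amoswap.d.aqrl); the helper names echo the text but the bodies are illustrative:

```c
#include <assert.h>

/* What atomic_set really did: an atomic OR. OR with zero changes nothing. */
static unsigned long atomic_or_ul(unsigned long *p, unsigned long v)
{
    return __atomic_fetch_or(p, v, __ATOMIC_ACQ_REL);
}

/* What the unlock path needs: an atomic exchange, an actual assignment. */
static unsigned long smp_xchg_ul(unsigned long *p, unsigned long v)
{
    return __atomic_exchange_n(p, v, __ATOMIC_ACQ_REL);
}
```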

Results from P1.5e:


                    P1.5d (BKL)      P1.5e (lifted)
Acquires            1.94 M           1.42 M
Contended           1.55 M (80 %)    1.35 M (95 %)
Spin cycles         8.51 Gcyc        4.18 Gcyc (−51 %)
Held cycles         23.5 Gcyc        12.25 Gcyc (−48 %)
Held cyc/acquire    12.1 kcyc        8.6 kcyc (−29 %)

Half the BKL spin time and half the BKL held time. That is the first real measurement of what BKL removal is worth.


v0.30 (continued) — Phases 1.5f–m: All Sync Kercalls Lifted

The remaining six sync kercalls followed in sequence.

P1.5f lifted __KER_SYNC_CONDVAR_SIGNAL. Body is a single sync_wakeup call, which was already self-locking since P1.5b. Straightforward.

P1.5g wrapped block, unready, and block_and_ready in sched_slock. These three are the primitives the upcoming condvar/sem/mutex lifts need.

P1.5h wrapped the standalone pril_add/pril_rem call sites in ker_sync.c in sync_slock. Prerequisite for the block-path lifts.

P1.5i introduced QRV_ITF_WAIT_PENDING — a per-thread flag set during the transition window between adding a thread to a wait list and completing the unready. Wakers on other CPUs spin-wait until the flag clears. This is the primitive that makes the remaining lifts safe: without it, a thread can be on the wait list while still STATE_RUNNING, and a concurrent ready() would CRASHCHECK.

P1.5j–k lifted sem_post and sem_wait. One bug surfaced in sem_wait: block() (called inside unready) clears thp->blocked_on = NULL in its default case. The first draft of the body assigned blocked_on = syp before unready, which got clobbered. Fix: assign after unready, with WAIT_PENDING bridging the window.

P1.5l lifted condvar_wait. The coupling between kmutex_t and condvar was the hard part: kmutex_t never enters sync_hash (libc's cmpxchg fast-path short-circuits the kernel), so sync_lookup's auto-init failed after a wakeup on a system without the BKL. Fix: force-register the bound mutex in sync_hash from the condvar_wait body before releasing it.

P1.5m lifted mutex_lock and mutex_revive — the last two. The priority-inheritance walk through mutex_lock is the most complex body in the sync subsystem. It stays "best-effort" under contention: stale reads in the walk degrade priority inheritance but don't break correctness. The canonical waiter pattern from P1.5i applies here too.

Hardware fences. An important correction mid-lift: the inline __asm__ volatile("" ::: "memory") barriers in the WAIT_PENDING handshake were compiler-only. RVWMO permits store-store and load-store reordering at hardware level. Replaced with cpu_smp_mb() — a portable abstraction defined per-arch as fence rw, rw on RISC-V and mfence on x86_64.
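The abstraction can be sketched as follows; the RISC-V and x86 definitions are as described above, the C11 fallback and the usage helper are illustrative additions:

```c
#include <assert.h>
#include <stdatomic.h>

#if defined(__riscv)
#define cpu_smp_mb() __asm__ volatile("fence rw, rw" ::: "memory")
#elif defined(__x86_64__)
#define cpu_smp_mb() __asm__ volatile("mfence" ::: "memory")
#else
#define cpu_smp_mb() atomic_thread_fence(memory_order_seq_cst)
#endif

/* WAIT_PENDING-style publish: the data store must be globally visible
   before the flag store, which a compiler-only barrier cannot guarantee
   under RVWMO. */
static int publish_then_flag(int *data, volatile int *flag)
{
    *data = 42;
    cpu_smp_mb();     /* real hardware fence, not just a compiler barrier */
    *flag = 1;
    return *data;
}
```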

Phase 1.5 exits with all seven sync kercalls running BKL-less.


v0.31 — PRIL/LINK3 Alias Corruption

After Phase 1.5 completed, a rare but real crash surfaced during devb-nvme bringup:

*** PRIL CORRUPTION: HEAD prev.prio_tail priority mismatch ***
CRASH: nano/nano_misc.c:328

Root cause: PRIL_ENTRY_FIELDS puts a union at the start of tThread. The same first 16 bytes serve as either a PRIL wait-list (next.pril, prev.prio_tail) or a LINK3 dispatch list (next.thread, prev.thread). A thread cannot legally be on both simultaneously.

The P1.5l-lifted ker_sync_condvar_wait does: pril_add (puts act on wait list), then sync_mutex_unlock (which calls ready(waiter) → ready_default, which in the SMP "can replace this CPU" path calls LINK3_BEG(dispatch_queue, act, tThread) — writing act->next.thread and act->prev.thread, the same bytes as the PRIL fields just set). The prev.prio_tail that should equal act (for a HEAD+TAIL single-element list) now holds a stale dispatch-queue backlink.
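The aliasing in miniature — field names are shortened from PRIL_ENTRY_FIELDS, and both list operations are reduced to the single writes that matter:

```c
#include <assert.h>
#include <stddef.h>

typedef struct thr thr;
struct thr {
    union { thr *pril;      thr *thread; } next;   /* same first 8 bytes  */
    union { thr *prio_tail; thr *thread; } prev;   /* same second 8 bytes */
};

/* PRIL: HEAD+TAIL single-element wait list points back at itself. */
static void pril_add_single(thr *t)
{
    t->next.pril = NULL;
    t->prev.prio_tail = t;
}

/* LINK3-style dispatch-queue insert: writes the very same bytes. */
static void link3_beg(thr *t, thr *q)
{
    t->next.thread = q;
    t->prev.thread = q;    /* silently clobbers prev.prio_tail */
}
```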

Fix: in ready_default, before bumping act, check QRV_ITF_WAIT_PENDING. If set, fall through — act will finish its own transition and ker_exit will pick it up. The bump is skipped.

Verification: 20/20 clean boots, zero PRIL corruption.


v0.32–v0.33 — Channel Lock and Trace Ring

v0.32 began Phase 2 of BKL removal: the IPC path. channel_slock declared (rank between INTREVENT and CONNECT). The wait-list primitives in nano_message.c (remove_unblock, net_send2, net_sendmsg) and the channel-destroy and pulse paths in nano_pulse.c and ker_channel.c now take channel_slock around their queue mutations. All of this is still uncontended — the BKL is still held by every caller. The four target IPC kercalls (MsgSend, MsgSendPulse, MsgReceive, MsgReply) got their queue-op sites wrapped and WAIT_PENDING spin-waits inserted in preparation for the eventual BKL lift.

v0.33 added a lockless per-CPU function-entry trace ring: 32 entries per CPU, a TRACE_FN(id, ctx, arg) macro, and a trace_ring_dump(cpu) called from the crash path. During the BKL hunt this was the tool that revealed how the __KER_MSG_RECEIVEV canonical waiter pattern interacted with the timer-trap-handler's in_kernel_context() routing to leak BSS globals into active thread register save areas. The ring stays in the tree — it will be useful again.


Where Things Stand

Seven sync kercalls run BKL-less. Phase 2 infrastructure is in place for the IPC path. The BKL is not gone — that was never the claim — but the structural work to remove it section by section is well underway, documented, and measured.

Monday, April 27, 2026

QRV v0.27: Multi-User Login — Two Months

Exactly two months ago today, on February 27, the project restarted.

v0.27 is the milestone that closes that chapter. The system boots on real RISC-V hardware, enumerates a Samsung NVMe, mounts a filesystem, runs a staged init sequence, presents a login prompt, authenticates a user against a shadow password file using SHA-512 crypt(3), drops privileges, and hands off to mksh with the correct working directory and environment. The session below is from the SiFive HiFive Unmatched.

=> run bootqrv

397328 bytes read in 0 ms
4421120 bytes read in 4 ms (1 GiB/s)

Starting kernel ...

Boot command line: -Dsbi mainfs=/dev/nvme0n1p8
Board: SiFive HiFive Unmatched A00
Compatible: sifive,hifive-unmatched-a00
+------------------------------------------+
| QRV Operating System Kernel version 0.27 |
+------------------------------------------+
init_clocks
init_raminfo
init_mmu
  ram_top = 480000000
  prealloc_kernel_l2: filled 239 empty kernel L2 slots
  enable_mmu: Sv39 paging active
init_intrinfo: timer(base=5,n=1) plic(base=32,n=128)
init_cpuinfo: 4 CPUs, rv64imac, 1 MHz
hwinfo: serial sifive,fu740-c000-uart @ 10010000 irq 39
hwinfo: pci sifive,fu740-pcie ecam df0000000/10000000 irq 57 windows 4
...
[2] kerlink: boot/taskman.qkx: resolved 6190/6190 external symbols (0 unresolved)
[2] kerlink: boot/taskman.qkx: applied 10756/10756 relocations (0 skipped)
...
[2] cpu_start_others: hart 1 started
[2] cpu_start_others: hart 3 started
[2] cpu_start_others: hart 4 started
[2] bootimage: spawned /sbin/init (pid 2)

***********************************
* Welcome to QRV Operating System *
***********************************

Starting serial console (devc-sersifive)...
system console set to /dev/ser1

Probing for NVMe...
nvme:   Model:    "SAMSUNG MZVL2512HCJQ-00B00"
nvme:   Serial:   "S675NF0R487649"
nvme: ns 1  capacity=488386 MiB
devb-nvme: GPT OK, 8 partition(s)
  ...
  p6: LBA 144941056..857972735 (348160 MiB) name="Debian"
  p7: LBA 857972736..891527167 (16384 MiB)  name="swap"
  p8: LBA 891527168..895721471 (2048 MiB)   name="QRV"

Mounting /dev/nvme0n1p8 on /usr...
fs-qrv: qrvfs v2, 4096 blocks, 128 inodes
fs-qrv: mounted qrvfs at /usr (dev=/dev/nvme0n1p8)

Sysinit: level1 running.

QRV 0.27 (2026-04-28) on sifive,hifive-unmatched-a00

login: qrvuser
Password:

Welcome to QRV!

[/home/qrvuser]$ echo "Welcome to QRV on SiFive Unmatched!"
Welcome to QRV on SiFive Unmatched!
[/home/qrvuser]$

What This Required

Filesystem namespace flip

The most visible architectural change: the CPIO modpkg now mounts at / instead of /rd/, and the NVMe filesystem mounts at /usr. The init script, all binaries, and drivers live under the cpio root's bin/, sbin/, lib/. Everything that belongs on persistent storage — config, user home directories, extended tools — lives under /usr, contributed by the qrvfs partition on NVMe. /etc resolves to /usr/conf via a path-manager symlink. /home resolves to /usr/home the same way.

This is the Linux initramfs/rootfs split, applied to QRV's architecture. It means the init script can be edited on disk without rebuilding the CPIO image. It means /etc/passwd is just a file on the NVMe partition.

Staged init

The cpio /sbin/init is a minimal mksh script — about 70 lines — whose only job is to get the console up, probe NVMe, and mount /usr. If that fails, it drops to a rescue shell. If it succeeds, it hands off to /usr/sbin/sysinit/level1.sh on the mounted disk.

level1.sh is where the real system initialization lives: pci-server, devb-nvme, fs-qrv, slogger, and finally getty. Because it lives on the NVMe partition, it can be edited, extended, and structured however needed — conditionals, sourced fragments from /usr/conf/sysinit/*.sh, anything mksh can express. A BSD-style rc framework can grow from here without touching the cpio image.

Platform detection in init

The init script no longer starts both UART drivers and lets the wrong one exit silently. It reads /sys/board — a new sysfs file populated from the FDT /compatible property at boot — and picks the correct driver:

read -r BOARD < /sys/board
case "$BOARD" in
    *sifive*) serdrv=devc-sersifive ;;
    *)        serdrv=devc-ser8250   ;;
esac
$serdrv

On QEMU virt: BOARD=riscv-virtio, driver=8250. On the Unmatched: BOARD=sifive,hifive-unmatched-a00, driver=sersifive. One init script, both platforms.

/sys filesystem

A new sysfs resource manager serves /sys — a Linux-style companion to /proc for kernel self-description. Initial entries: /sys/cmdline (the kernel boot command line, verbatim) and /sys/version (version string and build date). /sys/board was added during platform detection work. The kernel snapshots the command line before its in-place token split, so /sys/cmdline always reflects exactly what U-Boot passed.

Canonical-mode line discipline

Both UART drivers (devc-ser8250 and devc-sersifive) now implement a proper POSIX line discipline in ICANON mode: input accumulates in a line buffer, a line is delivered to the reader only on \n, and backspace erases correctly — both ASCII BS (0x08) and DEL (0x7F) are accepted, because different terminal emulators send different bytes. ECHOE echoes \b \b per erase. Without this, login: and Password: prompts couldn't be corrected with backspace before pressing Enter.
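The erase handling reduces to a few cases per input byte. A minimal sketch, assuming a caller that loops over received bytes; the function name and return convention are illustrative, and the real driver also does the ECHOE "\b \b" output:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Feed one byte into the canonical line buffer.
   Returns 1 when a complete line is ready for the reader. */
static int canon_byte(char *buf, size_t *len, size_t cap, int ch)
{
    if (ch == 0x08 || ch == 0x7f) {  /* BS or DEL: terminals differ */
        if (*len > 0)
            (*len)--;                /* ECHOE would emit "\b \b" here */
        return 0;
    }
    if (ch == '\n') {
        buf[*len] = '\0';
        return 1;                    /* deliver line on newline only */
    }
    if (*len + 1 < cap)
        buf[(*len)++] = (char)ch;
    return 0;
}
```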

crypt(3) and the full authentication chain

The complete authentication chain is now in place:

  • SHA-512 implementation lifted from FreeBSD (sys/crypto/sha2/sha512c.c, stripped to exactly what crypt(3) needs)
  • SHA-crypt ($6$salt$hash) per Drepper's spec, from FreeBSD's lib/libcrypt/crypt-sha512.c
  • getpass(3) disabling ECHO via tcsetattr for password entry
  • getpwnam / getspnam reading /etc/passwd and /etc/shadow
  • login(1) comparing the entered password against the shadow hash, then calling setgid() + setuid() and exec'ing the user shell
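One small piece of that chain is extracting the salt prefix that crypt(3) needs to recompute the hash from a $6$salt$hash shadow entry. A sketch — the helper name and error convention are illustrative, not QRV's libc:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy the "$6$salt$" prefix (id + salt + trailing '$') into out.
   Returns 0 on success, -1 if the entry is not SHA-crypt or malformed. */
static int shadow_salt(const char *entry, char *out, size_t outsz)
{
    if (strncmp(entry, "$6$", 3) != 0)
        return -1;
    const char *end = strchr(entry + 3, '$');
    if (end == NULL)
        return -1;
    size_t n = (size_t)(end - entry) + 1;   /* include trailing '$' */
    if (n + 1 > outsz)
        return -1;
    memcpy(out, entry, n);
    out[n] = '\0';
    return 0;
}
```

With the salt in hand, verification is the classic comparison: recompute crypt(password, salt) and compare the result against the full stored field.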

getuid() / getgid() and their effective variants are now real: they call ConnectClientInfo(-1, &info, 0) — a kernel call that returns the calling process's own credential record directly, with no message-passing round-trip and no self-message hazard. This is the openqnx pattern and it is the right one: credentials come from the kernel's own credential store, not from a procmgr query that would deadlock if called from within taskman itself.

Getty and login

/sbin/getty opens the terminal, prints the banner (QRV <version> (<date>) on <board>), reads the login name via fgets, and spawns /bin/login. It self-respawns on shell exit, so a disconnected session automatically presents a new login prompt. /bin/login validates credentials, sets HOME, PATH, SHELL, USER, LOGNAME via setenv(3), changes to the home directory, and exec's the shell.

[/home/qrvuser]$ in the prompt is \w expanded by mksh's prompt preprocessor at every redraw — the canonical directory, through the /home → /usr/home symlink, shown correctly rather than the raw resolved path.


Two Months

February 27 to April 27. In that time:

  • Kernel booted to idle
  • First ecall, first channel, Sv39 paging
  • taskman loaded and running, 300+ symbols resolved
  • Full IPC round-trip, dynamic linking, user-mode shell
  • SMP stabilized
  • PCI server, NVMe driver, GPT partitions
  • Filesystem (qrvfs), mksh, signals, setjmp
  • Multi-user login on a RISC-V workstation with 16 GiB of RAM and a 488 GiB Samsung NVMe

The partition that was left empty in July 2021 is now /usr.

The work continues.

Sunday, April 26, 2026

QRV v0.26: mksh — A Real Shell for a Real System

esh served its purpose. It got QRV to a shell prompt, proved the IPC stack, and ran through two months of bring-up. But it has no variables, no conditionals, no scripting. What QRV has needed for a while — and now has — is a real shell.

v0.26 lands mksh, the MirBSD Korn Shell (R59, 2025-04-26), ported from scratch in a single day. It runs interactively, handles history with cursor keys, shows the current directory in the prompt, and runs sequences of commands without exiting:

MIRBSD KSH QRV-1 (mksh R59 base)
[/]# ls
bin  dev  proc  rd
[/]# cd /disk2/bin
[/disk2/bin]# ls -la
[/disk2/bin]# ./syscall_testing
...
=== Results: 89 tests, 89 passed, 0 failed ===
[/disk2/bin]#

The [/disk2/bin]# prompt is live: mksh re-evaluates $PWD on every prompt redraw. The cd you just ran is reflected immediately.

This matters beyond the convenience. A real shell means the init script that lives in modpkg.cpio — the script that starts every daemon at boot — can now have conditionals, variables, functions, and proper error handling. That is the foundation of a BSD-style rc system: hardware probing, conditional driver loading, different boot scenarios from a single init script. The modest CPIO image suddenly has real expressive power.


What the Port Required

mksh is about 37,000 lines of C. The upstream codebase carries three decades of platform accretion: EBCDIC support, OS/2, z/OS, Windows, AIX, Solaris, multiple editor modes (emacs, vi, gmacs), a German-language boolean type (Wahr/Ja/Nee/isWahr), and a 3,147-line monolithic sh.h that aggregates everything from every platform the shell has ever run on.

None of that goes into QRV. The port started with explicit disciplines: Apache-2.0 relicensing (enabled by the MirOS sublicensing right, with the original MirOS text preserved), K&R formatting, no fork() by design, poll() instead of select() as a QRV project rule, and every line earning its place.

sh.h: from 3,147 lines to 244

The original sh.h was replaced with a 244-line foundation containing only what the .c files actually use. Everything else was extracted into topical headers: env.h, shf.h, cclass.h, var.h, lex.h, proto.h, tree.h, msgs.h. Each domain in its own file, with a one-line rationale for every declaration that remains.

German booleans

931 substitutions, automated:

Wahr   → bool      (296 hits)
Ja     → true      (292)
Nee    → false     (309)
isWahr → ((bool))  (34)

These were fine in mksh. They have no place in QRV.

The line editor

mksh's edit.c is 5,832 lines: emacs mode, vi mode, gmacs mode, kill rings, undo, configurable bind tables, modal editing, vi tab-complete heuristics. QRV needs exactly none of that. A fresh line editor was written from scratch — 553 lines — covering the keystrokes that actually matter:

^A/^E/^B/^F       home / end / back / forward char
^P/^N + arrows    previous / next history
^R                reverse incremental history search
^K/^U/^W          kill-to-eol / kill-line / kill-word-back
^L                redraw
DEL/^H/^D         backspace / delete / EOF
ESC [A/B/C/D      cursor keys

Net: −5,279 lines of code the QRV shell will never need.

No fork()

QRV has no fork() by design. The original exchild() in jobs.c was 135 lines of fork-then-child-or-parent bookkeeping. The replacement is 25 lines: run the AST in this process, update job bookkeeping. External commands go through posix_spawn() + waitpid() at the TEXEC site in exec.c. The shell stays resident throughout.

This was also the source of v0.26's last bug. The fork-removal rewrite unconditionally passed XEXEC to execute(). In upstream mksh, XEXEC means "I'm a forked child — call unwind(LEXIT) when done." With no child, that unwind exited the shell after the first external command. Found and fixed: the test is simple in retrospect — run ls, then run pwd, and check whether the shell is still alive.

libc foundations

The port surfaced three missing pieces in QRV's libc, all added as part of this release:

setjmp/longjmp. <setjmp.h> had been declared but never implemented. The RISC-V assembly saves the callee-saved integer registers (ra, sp, s0–s11), mirroring the layout of the kernel's existing xfer-fault recovery path. Nine new tests: round-trips, the longjmp(env, 0) → 1 POSIX rule, callee-saved values surviving across call boundaries.

posix_spawn / posix_spawnp. The kernel side (_PROC_POSIX_SPAWN) and the spawnattr_* / file_actions_* helpers already existed. The entry point symbols were simply missing. Adapted from openqnx's reference implementation with the Adaptive Partitioning wire format removed (it was removed from QRV's taskman in an earlier release).

Signal API. POSIX signals implemented on top of QRV's existing pulse model. A per-process signal table in libc; the system thread that every QRV process already has translates _PULSE_CODE_SIGNAL pulses into sigaction-registered handlers. sigaction, sigprocmask, signal, raise, and kill all work. 19 new tests, all passing. raise(SIGUSR1) runs the handler synchronously before returning; kill(getpid(), SIGUSR2) round-trips through taskman → pulse → system thread → handler.

The test suite grew from 61 to 89 tests across this release: 9 for setjmp/longjmp, 19 for signals. All 89 pass.


qrvfs Format v2

One more thing that landed quietly: the filesystem got a format upgrade. qrvfs v1 used 9 direct + 1 single-indirect block, capping file size at ~2 MiB. That was fine for a test image. It is not fine for anything approaching real use.

v2 repartitions the same 10 address slots: 7 direct + single + double + triple indirect. Maximum file size: ~512 GiB. mkfs-qrv was rewritten around a recursive walk_indirect() that allocates index blocks lazily. A 1.1 GiB test file round-trips byte-perfect, exercising the triple indirect path.

v2 superblocks are incompatible with v1. qrvfs_init() rejects v1 images with a clear error.


What Comes Next

Background jobs (posix_spawn with detach rather than waitpid), poll() with real readiness probing (wiring _IO_NOTIFY through the resmgr layer), and getppid() / umask() / ioctl() / tcflush() — the eight libc stubs that currently back the shell's less-used features. Then the init script gains real conditionals, and the rc system starts to take shape.

Saturday, April 25, 2026

QRV v0.25: Booting from Real NVMe on Real Hardware

This is the follow-up to the earlier v0.25 post. Everything described there — the sync_hash fix, the DesignWare iATU, the INTx wiring, the mount command — was validated on QEMU. This post is about what happened when the same build ran on real silicon.

The short version: it works.


How QRV Now Boots on the Unmatched

U-Boot loads the QRV kernel binary and modpkg.cpio directly from the Debian ext4 partition (nvme 0:6) on the Samsung NVMe:

=> ext4load nvme 0:6 0x80200000 /boot/qrv/qrv-kernel.bin
397328 bytes read in 0 ms
=> ext4load nvme 0:6 0x90000000 /boot/qrv/modpkg.cpio
4688384 bytes read in 5 ms (894.2 MiB/s)
=> setenv initrd_size ${filesize}
=> booti 0x80200000 0x90000000:${initrd_size} ${fdtaddr}

QRV lives as a guest on its own machine, loading its boot files from the Debian partition's /boot/qrv/ directory. The QRV filesystem itself is on partition 8 — 2 GiB, type GUID 51525611-322e-4017-bae8-e4d9c9d4e979 — a custom GUID registered for QRV, so partition tools can identify it unambiguously.


The Boot Log

Here is the boot session.

Starting kernel ...

===============================================================================

Boot command line: -Dsbi
Board: SiFive HiFive Unmatched A00
Compatible: sifive,hifive-unmatched-a00
+------------------------------------------+
| QRV Operating System Kernel version 0.25 |
+------------------------------------------+
init_clocks
init_raminfo
asinfo: PCI IO  60080000 - 6008ffff
asinfo: PCI MEM 60090000 - 6fffffff
asinfo: PCI MEM 70000000 - 70ffffff
asinfo: PCI MEM 2000000000 - 3fffffffff
init_mmu
  ram_top = 480000000
  identity_map RAM 80000000..480000000
  ...
  prealloc_kernel_l2: filled 239 empty kernel L2 slots
  enable_mmu: Sv39 paging active
init_intrinfo: timer(base=5,n=1) plic(base=32,n=128)
init_cpuinfo: 4 CPUs, rv64imac, 1 MHz
hwinfo: serial sifive,fu740-c000-uart @ 10010000 irq 39
hwinfo: serial sifive,fu740-c000-uart @ 10011000 irq 40
hwinfo: pci sifive,fu740-pcie ecam df0000000/10000000 irq 57 windows 4
hwinfo: pci dw_pcie_msi dbi e00000000/80000000 irq 56
...
[1] kerlink: boot/taskman.qkx: resolved 6114/6114 external symbols (0 unresolved)
[1] kerlink: boot/taskman.qkx: applied 10631/10631 relocations (0 skipped)
[1] kerlink: boot/taskman.qkx: link complete
...
[1] cpu_start_others: hart 2 started
[1] cpu_start_others: hart 3 started
[1] cpu_start_others: hart 4 started

***********************************
* Welcome to QRV Operating System *
***********************************

     pid tid name               prio STATE       Blocked
       1   1 taskman             10r READY
       ...
       2   1 esh                 10r REPLY       1

Starting slogger...
Starting pci-server...

PCI devices:
00:00.0 PCI bridge [0604]: f15e:0000 (rev 00)
01:00.0 PCI bridge [0604]: 1b21:2824 (rev 01)
04:00.0 USB controller [0c03]: 1b21:1142 (rev 00)
06:00.0 NVM controller [0108]: 144d:a80a (rev 00)
07:00.0 VGA compatible controller [0300]: 10de:128b (rev a1)
07:00.1 Multimedia controller [0403]: 10de:0e0f (rev a1)

Probing for NVMe...
devb-nvme: controller at 06:00.0 BAR0 0x60400000
devb-nvme: controller enabled (CSTS.RDY=1)
nvme: controller VID:DID 144d:144d  ctrl-id 6
  Model:    "SAMSUNG MZVL2512HCJQ-00B00"
  Serial:   "S675NF0R487649"
  Firmware: "GXA7301Q"
nvme: ns 1  size=1000215216 LBAs  lba=512 B  capacity=488386 MiB
devb-nvme: interrupts on (IRQ=57 vector=89)
devb-nvme: GPT OK, 8 partition(s)
  p1: LBA 34..2081 (1 MiB)      type=5b193300-...  name=""
  p2: LBA 2082..10273 (4 MiB)   type=2e54b353-...  name=""
  p3: LBA 10274..227361 (106 MiB) type=c12a7328-... name="EFI"
  p4: LBA 227362..235553 (4 MiB) type=ebd0a0a2-...  name="CIDATA"
  p5: LBA 237568..144941055 (70656 MiB) type=516e7cb4-... name="FreeBSD"
  p6: LBA 144941056..857972735 (348160 MiB) type=0fc63daf-... name="Debian"
  p7: LBA 857972736..891527167 (16384 MiB) type=0657fd6d-... name="swap"
  p8: LBA 891527168..895721471 (2048 MiB) type=51525611-... name="QRV"
devb-nvme: ready (9 path(s), first NS 488386 MiB 512 B/LBA)
br--r--r--  1     0     0 512110190592 nvme0n1
br--r--r--  1     0     0   2147483648 nvme0n1p8

Mounting /dev/nvme0n1p8 at /disk2...
fs-qrv: qrvfs v1, 524288 blocks, 4096 inodes
fs-qrv: mounted qrvfs at /disk2 (dev=/dev/nvme0n1p8)
drwxrwxr-x  2     0     0      768 bin

# cd /disk2
# ls
bin
# cd bin
# ls -la
drwxrwxr-x  2     0     0      768 .
drwxr-xr-x  3     0     0      768 ..
-rwxrwxr-x  1     0     0    60144 syscall_testing
# echo "syscall_testing is here, on NVMe!!!"
syscall_testing is here, on NVMe!!!
# ./syscall_testing
=== QRV Syscall Conformance Test Suite ===

[timers]   25 tests ... all PASS
[sync]     23 tests ... all PASS
[printf]   13 tests ... all PASS

=== Results: 61 tests, 61 passed, 0 failed ===
# shutdown
Shutting down...

What This Means

The partition layout tells the full story without commentary: FreeBSD, Debian, swap, and — at the end — QRV. That last entry was created in advance, left empty, and waited.

The syscall_testing binary is not a toy. It tests kernel call semantics: error returns, resource exhaustion, object reuse, the _r variants that must not touch errno. Sixty-one tests, all passing, executed from a QRV filesystem on a Samsung NVMe, on a RISC-V workstation, through a microkernel IPC stack.

This is a system.


What Remains

Versions 0.26 through 1.0 are still ahead: writable filesystem, proper signals in userspace, network stack, and much more. The documentation effort is just beginning — a User Manual, a Programmer's Manual, a Kernel Calls & Taskman Message Reference, and the Porting Story that has been accumulating as LaTeX chapters throughout this sprint.

But the picture that was forming in someone's mind back in 1999, through RadiOS in assembly, through the HEIG-VD sources found during a COVID lockdown, through a restarted project on a father's birthday in February 2026 — that picture is now a running system.

The work continues.

QRV v0.25: mount -t qrv /dev/nvme0n1p8 /disk2

That mount command works. On QEMU virt, a qrvfs filesystem written by the host mkfs-qrv tool to NVMe partition 8 is mounted at /disk2, browsable with ls, readable with cat, with all four resource managers — devb-nvme, fs-qrv on /dev/nvme0n1p8, devb-virtio, fs-qrv on /dev/vblk0 — running simultaneously in user space.

On the SiFive HiFive Unmatched, the DesignWare PCIe controller enumerates the full board topology: USB hub, Samsung MZVL2512HCJQ NVMe (144d:a80a), NVIDIA VGA, NVIDIA audio. devb-nvme reads the Samsung's GPT — eight partitions, ~2 GiB at partition 8 — and registers /dev/nvme0n1 and /dev/nvme0n1p8 in the QRV namespace.

That last partition has been sitting empty since July 2021, when the board arrived and a deliberate choice was made to leave space for an operating system that didn't exist yet. Today it does.


The sync_hash Collision

Before any of the above was possible, a fundamental kernel bug had to be found. The symptom: the second instance of any resmgr binary — fs-qrv spawned for /dev/nvme0n1p5 while another fs-qrv was already mounted on /dev/vblk0 — failed with EBUSY on pthread_mutex_init during resmgr_attach. The two processes were otherwise healthy. Killing one let the other proceed. Running them sequentially worked fine.

The root cause is in RISC-V's cpu_pageman_vaddrinfo. Rather than walking the user page tables to find the physical address backing a virtual address, it does a lexical subtraction: PA = VA - KERNEL_VIRT_BASE. For kernel addresses this is an identity operation — every kernel VA maps to exactly one PA. For user addresses from processes loaded at the same virtual base (no ASLR, same ET_DYN layout), two processes end up at identical virtual addresses for their arena, and the lexical subtraction produces identical "physical" addresses for both.

The kernel's sync_hash uses the physical address as the key. Two processes calling pthread_mutex_init at the same user VA produce the same (obj=_syspage_ptr, offset=VA-BIAS) key. The second one collides with the first's entry and gets EBUSY.

On x86_64 this was benign because cpu_pageman_vaddrinfo walks the actual page tables, so two processes at the same VA get genuinely different PAs. The bug was a RISC-V-specific shortcut that was harmless when only one instance of each resmgr ran, and fatal as soon as two instances of the same binary ran simultaneously.

Fix: in pageman_vaddr_to_memobj and pageman_munmap, return the calling tProcess * as the hash object for private mutexes, making the key (prp, offset) instead of (_syspage_ptr, offset). Per-process keys cannot collide across processes by construction.


The DesignWare iATU

Getting PCI enumeration to work on the Unmatched required understanding something non-obvious about the FU740's PCIe controller.

QEMU's pcie-ecam-generic host bridge provides a flat ECAM memory window: a contiguous region where the address encodes bus<<20 | slot<<15 | func<<12 | register. You mmap it once and read any device's config space by computing an offset. Straightforward.

The SiFive FU740 uses a Synopsys DesignWare PCIe controller. Its "config" memory window is not flat ECAM. The DW iATU encodes CFG TLPs as bus<<24 | slot<<19 | func<<16 | register — a different bit layout, wider per-field. A single iATU mapping cannot cover the whole bus space. The canonical pattern (FreeBSD pci_dw_read_config, Linux dw_pcie_other_conf_map_bus) is to re-aim outbound iATU region 0 to the specific (bus, slot, func) before each config access.

New file servers/pci/pci_dw_atu.c implements this:

  • pci_dw_atu_init(dbi_pa) mmaps 4 MiB of the DBI register space and auto-detects whether the controller uses legacy iATU (at DBI+0x900) or unroll mode (at DBI+0x300000) by reading DW_IATU_VIEWPORT — it returns 0xFFFFFFFF in unroll mode.
  • pci_dw_atu_aim_cfg(bus, slot, func, type, cfg_pa, cfg_size) programs region 0 with TYPE_CFG0 (immediate child, bus 1) or TYPE_CFG1 (downstream of bridge, bus ≥ 2), polls CTRL2.REGION_EN for completion.
  • Bus 0 (the controller's own pseudo-bridge) reads directly from DBI; non-zero devfunc on bus 0 returns 0xFFFFFFFF (no device).

The FDT parser was also reworked: the FU740's device tree uses reg-names and interrupt-names to distinguish the DBI window from the ECAM/config window and MSI interrupt. Previously the parser read them in order and assigned ecam_base to whichever reg entry came first — on the FU740 that is the DBI window (wrong), not the config window. The fix makes the commit block idempotent and re-reads on the second pass once reg-names is available, correcting ecam_base to reg[1].

On the Unmatched: unroll mode detected, full PCIe topology enumerated.


PCIe INTx: Three Bugs in One Session

Getting interrupt-driven NVMe working on QEMU required fixing three latent bugs in the PCIe INTx allocation path, all of which meant pci_device_read_irq returned 0.

Bug 1: libpci's pci_device_attach was missing PCI_INIT_IRQ from the attach flags sent to pci-server. Without it, the server's entire IRQ-allocation path was skipped unconditionally.

Bug 2: The server's IRQ resource pool was never seeded on RISC-V. The original QNX flow relied on BIOS-programmed Interrupt_Line registers that pci_enum could pci_reserve_irq() from. QEMU virt leaves Interrupt_Line at 0xFF. Nothing got reserved; every subsequent rsrcdbmgr_attach call failed. Fixed by seeding INTA..INTD (irq_base..irq_base+3) at ecam_attach() time.

Bug 3: ecam_avail_irq() returned the full unswizzled [INTA, INTB, INTC, INTD] list, so pci_alloc_irq picked the numerically lowest free IRQ rather than the pin-routed one. The correct result for a PCIe device is a single wire: irq_base + ((device + pin - 1) % 4). Read Interrupt_Pin from config space and return that one.

After all three: pci_device_read_irq on the QEMU NVMe at 00:02.0 with pin A returns PLIC IRQ 34 — the wire the device actually asserts.

devb-nvme then switched from polling to proper IST-driven command completion: a dedicated IST thread drains both admin and I/O CQs on each wake (INTx is shared between queues), latches (status, cid) into a per-queue mutex/condvar/done gate, signals the submitter, and unmasks the kernel vector. The polling path remains as fallback for admin commands during bringup before the IST starts.


The Synopsys DW MSI Controller (Stages 1–3)

For MSI-X support — which NVMe devices strongly prefer over INTx — the FU740's internal Synopsys DesignWare MSI controller needs to be brought up. It receives posted MSI writes from devices, sets bits in MSI_INTR_STATUS_0, and asserts one aggregate PLIC IRQ (vector 56 on the Unmatched).

Three stages landed this release:

Stage 1: FDT parsing of the DBI window and MSI IRQ number. The kernel publishes a dw_pcie_msi child device under the PCI bus, carrying the DBI window location and aggregate PLIC IRQ. On QEMU virt nothing publishes this; the existing INTx path is completely unchanged.

Stage 2: ecam.c consumes the child device when present, picking up the DBI window address and MSI IRQ for later use.

Stage 3: pci_dwmsi.c — the controller bringup driver. Maps 4 KiB of DBI (the DTB declares a 2 GiB window; mapping it in full would consume pci-server's entire address space, so it is capped), allocates an anonymous page as the MSI target, programs ADDR_LO/ADDR_HI/INTR_ENABLE/STATUS, and attaches an IST to the aggregate PLIC IRQ. Each MSI fire is logged. Per-vector pulse dispatch to waiting drivers is Stage 4, deferred.


Two Other Bugs Worth Noting

Recursive kmutex initialization. mutex_init() with PTHREAD_RECURSIVE_ENABLE was using KMUTEX_RMUTEX_INIT, which sets owner to QRV_SYNC_INITIALIZER (0xffffffff). That sentinel is for statically-initialized user-space pthread mutexes; kernel mutexes never go through the sync subsystem and their mutex_lock() spins waiting for owner==0. 0xffffffff never becomes 0. Any lock of a recursively-initialized kmutex spun forever.

Separately, taskman/sys/support.c's per-process lock was being created with a NULL attr (non-recursive + error-check), so reentrant taskman paths that take the same process's mutex twice on one thread — QueryObject → handler → proc_lock_pid(same pid) — hit EDEADLK and crashed. Both fixed; together they were taking down taskman on the Unmatched during pci-server startup.

vfprintf NUL-byte truncation. PRINT(ox, 2) and PRINT(&sign, 1) stash pointers to stack-local variables into the IOV vector for deferred __sfvwrite consumption. Both locals get reset to '\0' at the top of every new format specifier. When one format spec's PRINT'd pointer was still queued and the next spec's reset fired first, the pointed-to bytes became NUL before __sfvwrite read them — emitting a stray NUL mid-output that truncated any consumer reading the buffer as a C string.

The symptom: pci_slogf("%#lx PLIC IRQ %u", ...) on the Unmatched produced "0xe00000000/0" — the 0x prefix of the second %#lx became "0" + NUL when the trailing %u iteration wiped ox[1] before the flush. Fix: drain the IOV via FLUSH() at the end of the pforw block so the locals are consumed before the next iteration resets them. Thirteen new printf regression tests, all passing.


The Test Suite

61 tests across three subsystems now run on every build: 25 for TimerCreate/TimerDestroy, 23 for the sync subsystem (mutexes, condvars, semaphores — including error paths, resource exhaustion, and object reuse), and 13 for printf (including the %#lx+%u regression that caught the NUL bug). All 61 pass on QEMU virt.


The Partition That Waited

The CHANGELOG notes it plainly: devb-nvme on the Unmatched read the Samsung's GPT and found 8 partitions.

That last partition was formatted and left empty in July 2021 when the board first arrived. The intent, at the time, was to eventually fill it with a custom filesystem on a QNX-compatible microkernel ported to RISC-V. Four years and a few months later, mount -t qrv /dev/nvme0n1p8 /disk2 is a working command.

Thursday, April 23, 2026

QRV v0.24: NVMe Driver, GPT Partitions, and libpci

QRV v0.24 is the NVMe release. From zero to a working block device driver with GPT partition support, developed in one day across four phases. The driver lives entirely in user space, built on top of the QNX resource manager framework and the PCI server that has been accumulating infrastructure since v0.22. The boot sequence now includes:

devb-nvme: GPT OK, 2 partition(s)
  p1: LBA 34..32801 (16 MiB) type=0fc63daf-... name="qrv-test-p1"
  p2: LBA 32802..65569 (16 MiB) type=0fc63daf-... name="qrv-test-p2"
devb-nvme: ready (3 path(s), first NS 64 MiB 512 B/LBA)
br--r--r--  1     0     0 67108864 nvme0n1
br--r--r--  1     0     0 16777216 nvme0n1p1
br--r--r--  1     0     0 16777216 nvme0n1p2

This is read-only for now. Writing to NVMe, mounting a filesystem on a partition, and the Unmatched bring-up are the next steps.


libpci: A Proper Client Library

Before the NVMe driver, the cleanup. Every /dev/pci client — lspci, devb-nvme in its prototype form — was carrying its own MsgSend plumbing, its own pci_cfg_read{8,16,32}, and its own IOM_PCI_ATTACH_DEVICE wrapper. That is the wrong way to grow a driver ecosystem.

lib/libpci pulls all of it up into a shared library with a public API modeled on QNX 8.0's <pci/pci.h>: pci_bdf_t, pci_devhdl_t, pci_ccode_t, and the full pci_device_{attach, detach, find, read_*, cfg_rd*, cfg_wr*, find_capid*, is_multi_func, read_ba, read_irq, write_cmd} surface. The shim underneath still speaks the QNX 6.4 IOM_PCI_* message protocol that pci-server uses.

lspci shrank from 370 lines to 219. The NVMe driver skeleton shrank from 290 lines to 133. Future drivers start with discovery, attach, and config reads already working.

A related fix: pci_device_find() was previously a client-side scan — 256 × 32 × 8 iterations of cfg_rd32, each a MsgSend round-trip to the PCI server, about 80 µs each. Five seconds of wall time just to find (or not find) a device. The server has already enumerated everything at startup and holds the list. pci_device_find() is now a single IOM_PCI_FIND_CLASS message. Boot time dropped noticeably.


The NVMe Driver: Four Phases

Phase A: PCI probe and skeleton

devb-nvme starts by scanning /dev/pci for base class 0x01 / subclass 0x08 (NVMe storage), attaches with PCI_INIT_BASE0 | PCI_MASTER_ENABLE so the server programs the BARs, and reads the NVMe capability registers. This is the scaffolding phase — no queues, no commands, just confirming the controller is present and readable.

emu.sh gets a -nvme flag that attaches a 64 MiB emulated NVMe drive (nvme.img, auto-created on first use). Without the flag, the driver finds nothing and exits cleanly.

Phase B: Controller reset, admin queue, Identify

The controller is reset (CC.EN=0, wait for CSTS.RDY=0), an admin submission and completion queue pair is allocated with MAP_ANON and their physical addresses discovered via mem_offset(), and the controller is re-enabled (program AQA/ASQ/ACQ, set CC, wait for CSTS.RDY=1).

Three Identify commands enumerate the namespaces: Identify Controller (CNS 0x01), Identify Active Namespace List (CNS 0x02), and Identify Namespace (CNS 0x00) for each active NSID.

One bug was caught here that is worth noting explicitly. Each Identify call was allocating a fresh MAP_ANON page and munmap()ing it between calls. The second and third Identify results came back as garbage — non-deterministic garbage, the kind that a debug print appears to "fix", which is the first warning sign of a memory-ordering issue. The underlying problem: physical page reuse with stale cache state, combined with the fact that RISC-V plain loads are not ordered and volatile only blocks the compiler. Switching to a single persistent Identify buffer (one mmap at init, zeroed before each use, never unmapped) produced 8/8 clean runs.

Phase C: I/O queue and first Read

An I/O queue pair is created at qid=1 via the admin Create-IO-CQ and Create-IO-SQ commands. A 1-LBA Read against LBA 0 of the first active namespace is issued and the result hexdumped.

With a recognizable pattern planted in nvme.img:

devb-nvme: Read nsid=1 LBA=0 (512 bytes) OK
  0000: 51 52 56 2d 4f 53 20 4e 56 4d 65 20 74 65 73 74  |QRV-OS NVMe test|
  0010: 20 70 61 74 74 65 72 6e 3a 20 4c 42 41 20 30 20  | pattern: LBA 0 |

A second ordering bug surfaced here. The polling path read cqe->status, checked the phase bit, and then read cqe->cid and the PRP1 payload buffer. On x86 this is fine because loads are ordered. On RISC-V they are not — the CPU is free to reorder the payload read before the phase check. The symptom was non-deterministic garbage in consecutive Identify buffers, with each run corrupting differently. The fix: __atomic_thread_fence(__ATOMIC_ACQUIRE) immediately after the phase check. Five clean runs afterwards.

This is a class of bug that trips up essentially everyone porting code from x86 to a weakly-ordered architecture. The code was correct for the hardware it was written on. RISC-V doesn't make the same promises, and neither volatile nor careful reading of generated assembly makes it obvious.

Phase D: Resource manager, /dev/nvme0n1

The driver is promoted from a probe tool to a resident daemon. After controller bring-up, it registers /dev/nvme0n1 as a read-only block-device resource manager and enters the dispatch loop.

The read path: io_read computes the LBA span covering the requested byte range, issues nvme_cmd_read() into a per-driver 4 KiB bounce buffer, and MsgReplys the requested slice. One 4 KiB page per call so PRP1 covers the entire transfer.

One sharp edge caught during bring-up: dispatch_create must come before any /dev/pci traffic. The first attempted resmgr_attach(/dev/nvme0n1) after a successful pci_device_attach() failed with errno = -EBUSY. Creating the dispatch — and calling the "set up internals" resmgr_attach(path=NULL) — at the very top of main(), before any PCI messages, makes the later path registration succeed 5/5. This is almost certainly a channel/connection slot ordering issue inside libc's dispatch and message plumbing. Documented here for the next driver to hit it.

Phase E: GPT partitions

After registering /dev/nvme0n1 for the whole namespace, the driver walks the GPT via libgpt and registers one resource manager path per partition. Each partition node gets its own nvme_region_t (an iofunc_attr_t subclass) carrying start_lba, nlba, nsid, and size_bytes. Every handler recovers the region from ocb->attr and enforces partition boundaries — a partition node cannot read outside its own range.

If there is no GPT, a bad CRC, or a hybrid MBR, the driver falls back to raw /dev/nvme0n1 only. This is non-fatal and produces a diagnostic.


libgpt

lib/libgpt is a pure GPT parser. The caller supplies a read-LBA callback; the library does no I/O of its own. It validates the protective MBR, GPT header signature and revision, header CRC32, and partition-array CRC32. UTF-16LE partition names are unpacked to UTF-8. Plain MBR and hybrid GPT are rejected outright — a hybrid GPT (MBR carrying partitions other than a single 0xEE protective entry) is explicitly not supported and the library says so clearly rather than guessing.

A host-side self-test (make test in lib/libgpt) exercises the valid path and four rejection paths: plain MBR, hybrid MBR, bad header CRC, bad entry-array CRC.

scripts/mkgpt.py is a pure-Python host tool that writes a valid GPT to an image file — protective MBR, primary header, primary entry array (128 × 128 bytes), using the Linux filesystem data type GUID. The emu.sh -nvme path uses it to lay out two 16 MiB test partitions on first run.


PCI Discovery via Syspage

pci-server was using a hardcoded platforms[] table to know where the ECAM region lives on supported boards. That is the wrong architecture for a system that is supposed to be device-tree-driven.

The kernel's FDT parser now publishes the PCI host bridge to hwinfo as an HWI_ITEM_BUS_PCI bus with location, IRQ, and PCI window tags (memory, IO, prefetchable, all with correct flags). pci-server discovers the bridge via hwinfo_find_bus(). The hardcoded table is gone. If the kernel didn't publish a PCIe bus, pci-server fails with ENODEV rather than silently defaulting to a QEMU virt address.


What Comes Next

/dev/nvme0n1 and /dev/nvme0n1p1/2 exist, are readable, and have correct sizes. The next step is mounting fs-qrv on a partition instead of on the virtio block device — and then bringing this up on the Unmatched, which has a real M.2 NVMe slot and a PCIe bus waiting.