In this blog I share my observations, thoughts and experience about computers, linguistics, philosophy and many other things that interest me.

Monday, April 20, 2026

QRV v0.22: 50/50 Clean Boots, PCI, and the L2 Page Table Hunt

50/50 clean boots on QEMU. Eight CPUs, 10 kHz timer tick, CONFIG_DEBUG_PRIL and CONFIG_DEBUG_VECTOR both enabled. This number was 27/50 at the start of this release cycle. The other 23 were the "fs-qrv hang" — a silent lockup at 100% host CPU, no crash banner, no diagnostic output, just the system sitting there. Finding and fixing it was the main event of v0.22, and the story of how it was found is worth telling properly.

Everything else — PCI bring-up, system logging, event-driven serial RX, per-thread errno, a full warning cleanup from 559 to zero, and a dozen smaller things — happened alongside and after that hunt.


The Crown Jewel: Sharing the Kernel L2

What the hang looked like

When fs-qrv started — and only after fs-qrv started — the system would sometimes freeze silently. Host CPU at 100%. No output. No crash banner. The --disable-timer boot option (added during the investigation, described below) provided the first clue: disabling the timer reduced the hang rate but didn't eliminate it. Timer-driven preemption was a necessary condition, not the whole story.

Eventually, a GDB session caught it in the act.

Six harts were spinning in the IPI_PARKIT for-loop at trap_handler.c:69 — the code that parks secondary CPUs while one CPU emits a crash dump. But there was no crash dump. Something had sent IPI_PARKIT and then never followed through. Looking closer: two harts were taking nested store page faults at the very first instruction of __strap_entry — the register save in trap.S. Both faulting harts shared the same satp value: a single user process's page table.

The 100% host CPU explained itself too: the parked harts loop on wfi with sie.SEIE/sie.STIE still enabled, so external IRQs keep waking them one instruction at a time, immediately re-entering wfi. A hall of spinning mirrors, invisible and silent.

GDB couldn't read kernel-RAM virtual addresses through the faulting satp — not _actives[2], not cpupage, nothing in the kernel data segment. But it could fetch kernel text. The page table had kernel text mapped but kernel data and stacks missing.

Both faulting sp values landed in L2 slot 1 of the Sv39 address space — the range 0xffffffc040000000–0xffffffc080000000. That is the gigabyte slot immediately below the RAM direct-map. The slot that would be needed for any per-thread kernel stack allocated there.

The root cause

cpu_pgdir_create() — called every time a new user process is created — copied the kernel root page table's L2 entries (indices 256..511) into the new user pgdir. This is correct and necessary: user-space page tables need the kernel mappings so the kernel can run after a trap while the user process's satp is still loaded.

The problem is the word copied. It is a snapshot. A photograph taken at the moment the process was created.

Any kernel mapping installed after that snapshot — in particular, any per-thread kernel stack that happened to be allocated in a 1 GiB Sv39 slot that was empty at process-creation time — would populate the kernel's own L2 with a new PTE pointing to the L1 for that slot. But every user process created before that moment has a stale copy of the kernel L2 that still shows that slot as empty.

When the kernel subsequently runs under such a user satp and touches an address in that slot — for example, during the trap prologue that saves the faulting thread's registers onto a kernel stack — the MMU walks the user pgdir's copy of L2, finds an empty entry, and raises a store page fault. That fault is taken while already in S-mode, which cannot be handled normally. The double-fault path sends IPI_PARKIT to all other CPUs. If two harts hit this simultaneously, each parks the other, neither prints a crash banner, and the system halts silently at 100% host CPU.

The fs-qrv hang was always this bug. fs-qrv was just the first process that allocated enough kernel threads to push a new stack into a previously-empty L2 slot, after enough other processes already existed with stale pgdirs.

The fix

At the end of init_mmu(), before enable_mmu(), walk every kernel L2 slot (indices 256..511). For any slot that is still empty, allocate an empty L1 page and install it. From that point on, the kernel L2 has no empty slots.

Now when cpu_pgdir_create() copies the kernel L2 into a new user pgdir, it copies PTEs that point to real L1 pages — shared L1 pages. Any future kernel mapping that adds L0/L1 entries within those already-valid L1 pages is immediately visible in every user pgdir, because they all share the same L1. The snapshot is now a snapshot of pointers to shared structure, not a snapshot of the structure itself.

Cost: at most 256 × 4 KiB = 1 MiB of pinned RAM, allocated at startup. In practice, on our boot, 254 of 256 slots were empty and got populated. The other two already had the low-PA and RAM direct-map L1 pages.

Belt-and-braces: cpu_pgdir_create() now calls crash_halt() loudly if any kernel L2 slot is empty when a user pgdir is created. A future regression will be visible immediately rather than quietly rotting. A new chk_kernel_sp() inline asserts that any value written to ker_stack[cpu] or thp->trap_aux.kernel_sp lies within the direct-mapped RAM VA range.

Results:

metric before after
PASS (shell prompt) 27/50 50/50
fs-qrv silent hang intermittent zero
0x40116 ra crashes 5/50 zero

The 0x40116 crash class — a pool thread's saved ra containing the io_close_t type+combine_len constant — also vanished with the same fix. The most likely explanation: those were the same bug, a half-completed register save into a kernel stack that the running satp couldn't actually reach, leaving the previous ra value unmodified.


The Tools That Found It

Two pieces of debug infrastructure were built during the hunt, both of which will remain useful permanently.

--disable-timer hard-disables the S-mode supervisor timer. Every site that arms it becomes a no-op: arch_timer_load, arch_timer_reload, cpu_idle's initial SIE enable, the per-dispatch sbi_set_timer, the CLINT rearm in the trap handler. With the timer off, 0/10 boots crashed — confirming that timer-driven preemption was a necessary condition for the hang, and giving a clean environment for other debugging.

-T (tracelogger auto-start) allocates four 16 KiB trace ring buffers at kernel boot time and enables all kercall and COMM event masks. On an S-mode crash, the trap handler sends IPI_PARKIT to all other CPUs, waits for a grace period so their output settles, then walks the ring and dumps every event to the console. The last ~hundred kernel events before any crash are now automatically attached to the crash log — no user-space tracelogger daemon needed.

The first witness capture (trT_2.log) caught the 0x40116 crash cleanly: 439 events, cpu 1, tid=8, with no interleaving from other CPUs.


Other Reliability Fixes

Several other SMP bugs were found and fixed during the same period.

PLIC lost-interrupt race. The PLIC trap handler was calling ready() directly in interrupt context to wake the IST thread. If the IST was not in STATE_INTR at IRQ time — already processing a previous interrupt in user space — the wakeup was silently dropped. Boots hung on the first virtio block I/O fairly often. Fix: route IST wakeups through intrevent_add() with a synthesized SIGEV_INTR event, matching the QNX/MIPS pattern. intrevent_drain() from __ker_exit under the kernel lock then safely ready()s or sets the pending flag.

-P 1 boot hang. Four distinct gaps contributed. The virtio IST was created at default priority 10, same as every other worker — it never preempted. SIGEV_INTR priority wasn't bumped to force preemption at equal priority. __kercall_dispatch's fast path returned directly via __ker_restore without checking intrevent_pending, leaving queued events to wait for the next syscall. The trap-return path similarly bypassed __ker_exit, so interrupt-queued events didn't drain until the next kernel entry. All four closed; -P 1 now boots 5/5.

PLIC enabled on halted harts. riscv_plic_unmask was enabling IRQs on all physical harts from the FDT, including those sitting in OpenSBI's halt loop under -P N. A halted hart can claim a PLIC interrupt and then have no S-mode handler to process it — consuming the interrupt silently. Fixed to enable only on sp->num_cpu (the OS-active harts).

SUM bit stripped on ecall. The S-mode ecall path was writing act->reg.sstatus with just SPP|SPIE, stripping the SUM bit (which permits S-mode to access user-page-protected memory). Any kercall handler that subsequently touched a user VA — sync_create, anything using user-space sync objects — faulted with scause=15. SUM and FS are now preserved from the incoming sstatus.

AP startup missing SUM. The AP (application processor) startup path in _smpstart never set SSTATUS_SUM. Only the boot CPU's init_cpu() did. Secondary CPUs couldn't touch user pages during kercall handlers.


Per-Thread errno

The kernel's __get_errno_ptr() was returning the address of a single global int errno shared across all S-mode threads — kernel and taskman alike. Two threads on different CPUs could race in the three instructions between a DebugProcess() call setting errno = -r and the caller reading return errno. About 10% of boots showed pidin printing phantom thread rows: DebugProcess returned ESRCH, another thread clobbered errno to 0 before procfs read it, pidin saw a successful devctl with a stale status struct and printed num_threads*2+16 rows of garbage.

Fix: route __get_errno_ptr() through the current thread's _thread_local_storage->__errptr via the tp register, falling back to the global only before TLS is initialized. 0/40 phantom thread entries after the fix.

Two companion fixes: clearing KERERR_SET/BUFF_MSG flags at kercall entry (they were leaking across calls, causing cond_wait to abort without blocking), and disabling KERERR_SKIPAHEAD (advancing sepc past a nonexistent error branch in our libc syscall wrappers, corrupting the caller's return value).


PCI Bring-Up

The PCI server is ported from QNX 6.6 (bios-v2 variant), split into modules, stripped of all x86-specific code (Intel/AMD/VIA/SiS/NVIDIA vendor dispatch, DLL loading), and cleaned to LP64 with zero warnings. It registers /dev/pci as a QNX-style resource manager.

Startup now does a second FDT pass to find pci@/pcie@ nodes, extract the ECAM base, memory/IO/prefetchable window ranges, and INTA IRQ number, populating syspage with PCI memory windows via rsrcdbmgr.

lspci is a new userland utility — queries /dev/pci through the message-passing protocol, lists all enumerated devices, supports -v for BARs/IRQs/capabilities and -vv for config space hex dump.

On QEMU virt, two devices enumerate: host bridge (1b36:0008) and PCI serial (1b36:0002). The rsrcdbmgr — which was a stub since v0.6 — is now a full implementation ported from QNX, handling minor-device-number allocations and named resource reservations.

This is the preparatory step for NVMe on the SiFive Unmatched.


System Logging

slogger is a log daemon ported from QNX, running as a user-space resource manager at /dev/slog. Any process can call slogf() (a new libc helper) to emit a log entry. sloginfo reads and filters the circular buffer, with -w to wait for new events. slogger also serves /dev/console and accepts a -t flag to tee text events to stdout.


Event-Driven Serial RX

devc-ser8250's ser_read used to poll for incoming data by sleeping 50 ms between retries via TimerTimeout(NANOSLEEP). This made the shell unresponsive with --disable-timer (the main thread sat in STATE_NANOSLEEP forever), and introduced up to 50 ms of keystroke latency even with the timer on.

The replacement is a proper QNX-style deferred reply: ser_read checks the RX buffer once. If a byte is already there, deliver it synchronously. Otherwise, stash the rcvid in pending_rcvid and return _RESMGR_NOREPLY — the resmgr framework doesn't reply, the main thread drops back to MsgReceive. When the IST fires and delivers a byte via pulse, rx_pulse_handler applies ICRNL+ECHO and calls MsgReply to the saved rcvid directly. No timer involvement. Zero keystroke-to-echo latency regardless of CONFIG_TIMER_HZ.

A double-check pattern closes the IST race: set pending_rcvid, then re-read the RX buffer. If a byte appeared in that window, the IST's pulse already fired — deliver synchronously and undo the stash.


Per-IRQ Fire Counter

struct interrupt_entry grows a uint64_t fire_count, incremented on every PLIC and timer dispatch. Surfaced through debug_irq_t / DCMD_PROC_IRQS / pidin -F irqs:

timer (vec 0x5)    fires=~2500/s at 10 kHz tick
UART  (vec 0x2a)   fires climbing on each keystroke
virtio (vec 0x21)  fires steady at idle

Distinguishing "IRQ not firing" from "IRQ fires but isn't reaching the IST" turns out to be exactly the question you need to answer when debugging driver initialization. This counter answers it directly.


Warning Cleanup: 559 → 0

559 compiler warnings at the start of this cycle. Zero at the end, with CONFIG_DEBUG_PRIL=y + CONFIG_DEBUG_VECTOR=y and a heavy -W set.

Most of the work was in the trace/instrumentation path, where the original QNX trace wire format uses 32-bit slots — correct for 32-bit, a source of -Wsign-conversion and -Wconversion noise on LP64. The approach was to make the narrowings explicit rather than silent: TRACE_PACK_PTR, TRACE_PACK_INT, TRACE_PTR_LOW32 macros for the trace boundary; typed helpers replacing generic __max/__min macros; (uint32_t) casts at emit sites. Six sweeps across kernel, taskman, libc, and userland.

The CONFIG_DEBUG_PRIL cleanup also found a real bug: check_vec() had K&R-style declaration and three LP64 truncations that turned vector entries into 32-bit values; pril_validate() was dereferencing n->type before checking n == NULL, so corrupted lists crashed the validator before it could report the corruption.


libc Cleanup

The LP64-only pass through the support headers: _intsup.h trimmed from 200 to 42 lines (MSP430 __int20, GCC < 3.2 fallbacks, combinatorial type-size detection — all gone); cdefs.h and features.h stripped of Watcom, HighC, Metrowerks, pre-ANSI C, Cygwin, and RTEMS paths; platform.h down to a clean LP64-only file; dead headers _ansi.h, _pack1.h, _packpop.h, sched_aps.h deleted; the last directory named qnx/ renamed to qrv/.

time() now calls ClockTime(CLOCK_REALTIME) instead of returning 0. vfprintf had a trailing-space bug: %02x with value 0 produced "00 " instead of "00" — a missing LADJUST guard on the trailing pad.


Documentation

Three new chapters in the PortingStory document:

Chapter 5 covers the fault-service RPC — how PageWaitInfo and PageCont bracket every invocation of taskman's fault handler, handing off the faulting thread's identity and verdict between kernel and taskman.

Chapter 6 is the fs-qrv hang investigation written as a debugging narrative: the GDB session, decoding the satp+sp+scause fingerprint, the L2-snapshot root cause, and the fix.

SysArch Chapter 8 and Porting Chapter 2 cover the tracing and instrumentation subsystem — the CONFIG_INSTRUMENTATION gate, the parallel kercall dispatch table, the ring buffer format, and the crash dumper.


What Comes Next

50/50. The baseline is solid. NVMe on the Unmatched is the next hardware target — the PCI server is running on QEMU; connecting it to real PCIe hardware and then to devb-nvme is the path to persistent storage on real silicon.

No comments: