Tuesday, April 21, 2026

QRV v0.23: The Stale SATP Bug, SiFive UART, and PLIC on Real Silicon

QRV v0.23 closes a ~10% daemon-spawn hang that had survived v0.22, brings a native UART driver to the SiFive Unmatched, and fixes the PLIC interrupt delivery path for the FU740's heterogeneous hart layout. After v0.22 the system booted reliably but still hung with measurable frequency during the init script's driver startup sequence. That hang is gone.


The Stale SATP

What was happening

The hang pattern: after the L2 preallocation fix in v0.22, the system reached 50/50 clean boots under the controlled test conditions, but real-world daemon spawning — particularly during the parallel startup of fs-qrv, devb-virtio, and devc-ser8250 — produced silent hangs at roughly 10% frequency. Same signature as before: 100% host CPU, no crash banner, no stall detector firing.

A GDB session on April 22 caught it live. Three CPUs were parked in IPI_PARKIT. One hart was trapped in a loop between trap.S:53 and trap.S:203 — the trap prologue's register save, double-faulting on stores into the kernel stack, over and over. Its satp pointed at a physical page whose contents disassembled as RISC-V instructions, not page table entries. The page sat near the top of RAM — exactly where modpkg ELF content lives. aspaces_prp[1] and aspaces_prp[2] held the same tProcess pointer, yet the two CPUs' satp CSRs pointed at different physical pages.

The root cause

A kernel-level optimization made some time ago had modified the remote-CPU arm of kerext_process_shutdown(): instead of the original IPI_TLB_SAFE round-trip with try_again and spin-wait (the pattern from the openqnx reference code), it rewrote aspaces_prp[i] inline and sent only IPI_TLB_FLUSH. The stated rationale was that idle CPUs only touch kernel memory, so a stale satp is harmless — the optimization looked reasonable on the surface, and the immediate tests didn't catch it.

The rationale was accurate for about one millisecond. The moment memmgr.mdestroy() ran and returned the dying process's pgdir page to the physical allocator, the page was available for reuse. The next consumer was typically loader_elf mapping an ELF segment for the next daemon to spawn. The remote hart's satp now pointed at module code bytes; its next TLB miss walked them as page table entries, decoded bogus PPNs, and store-page-faulted trying to save registers on the kernel stack. The double-fault path sent IPI_PARKIT to all other CPUs, and the system locked up silently.
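To see why stale code bytes make such convincing page table entries, here is a small illustration. It is not taken from the QRV sources: it assumes Sv39, the helper names are made up for the example, and the two instruction words (addi sp,sp,-16 and sd ra,8(sp)) are just a typical function prologue. Every uncompressed RISC-V instruction has its two low bits set, and those bits land exactly on the PTE's V and R flags.

```c
#include <stdbool.h>
#include <stdint.h>

/* Sv39 PTE fields as laid out in the RISC-V privileged specification. */
#define PTE_V         (1ULL << 0)
#define PTE_R         (1ULL << 1)
#define PTE_W         (1ULL << 2)
#define PTE_X         (1ULL << 3)
#define PTE_PPN_SHIFT 10
#define PTE_PPN_MASK  ((1ULL << 44) - 1)

/* Hypothetical helper: glue two little-endian 32-bit instruction words
 * into the 64-bit value a page-table walker would load from that address. */
uint64_t words_as_pte(uint32_t lo, uint32_t hi)
{
    return ((uint64_t)hi << 32) | lo;
}

bool pte_valid(uint64_t pte)    { return (pte & PTE_V) != 0; }
bool pte_writable(uint64_t pte) { return (pte & PTE_W) != 0; }

/* A PTE with any of R/W/X set is a leaf; otherwise it points to the next level. */
bool pte_leaf(uint64_t pte)
{
    return (pte & (PTE_R | PTE_W | PTE_X)) != 0;
}

/* The physical page number the walker would extract from the entry. */
uint64_t pte_ppn(uint64_t pte)
{
    return (pte >> PTE_PPN_SHIFT) & PTE_PPN_MASK;
}
```

Feeding it words_as_pte(0xff010113, 0x00113423) yields an entry with V=1, R=1 and W=0: a valid, read-only leaf whose PPN is whatever the instructions' upper bits happen to be. Whether the real walk faulted for exactly this reason is not something the sketch can tell; it only shows that code bytes do not look obviously invalid to the walker.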

The lesson, now filed in the project's documentation: when ported reference code does something that looks like a wasteful SMP round-trip, the default assumption should be "it's a synchronisation point, not decoration". The IPI_TLB_SAFE round-trip exists precisely to guarantee that the remote CPU has truly swapped its satp away from the dying pgdir before mdestroy() is allowed to reclaim the physical page.

The fix

The fix restores the original QNX pattern: send IPI_TLB_SAFE, not IPI_TLB_FLUSH. The remote hart's IPI handler calls set_safe_aspace(cpu), which calls memmgr.aspace(sysmgr_prp, &aspaces_prp[cpu]), atomically updating both the array slot and the remote hart's satp CSR. The ProcessShutdown wrapper's existing spin-wait on ipicmds[cpu] & IPI_TLB_SAFE then comes into play. try_again=1 is set so that kerext_process_shutdown returns EAGAIN and the caller re-reads aspaces_prp; shutdown proceeds to mdestroy() only once every remote hart has genuinely switched away from the dying pgdir.
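The shape of that handshake can be modelled in a few lines of single-threaded C. This is a sketch under stated assumptions, not the kernel code: process_t, process_shutdown_step() and handle_ipi_tlb_safe() are hypothetical stand-ins, the IPI_TLB_SAFE bit value is invented, and the "IPI" is just a flag in an array. The point is only the ordering: the page is released after the remote side has acknowledged, never before.

```c
#include <errno.h>

#define NCPU         4
#define IPI_TLB_SAFE 0x1u            /* bit value is an assumption */

typedef struct process {
    int pgdir_freed;                 /* 1 once the pgdir page was released */
} process_t;

process_t  sysmgr;                   /* stand-in for sysmgr_prp */
process_t *aspaces_prp[NCPU];        /* address space each CPU is running on */
unsigned   ipicmds[NCPU];            /* pending IPI command bits per CPU */

/* Remote side: what the IPI handler does on IPI_TLB_SAFE. In the real
 * kernel this step also rewrites the hart's satp; the model only tracks
 * the array slot. */
void handle_ipi_tlb_safe(int cpu)
{
    aspaces_prp[cpu] = &sysmgr;
    ipicmds[cpu] &= ~IPI_TLB_SAFE;
}

/* Initiator side: returns EAGAIN while any remote CPU still runs on the
 * dying address space, and releases the pgdir only when none does.
 * The initiating CPU's own slot is ignored in this model. */
int process_shutdown_step(process_t *dying, int self)
{
    int try_again = 0;

    for (int cpu = 0; cpu < NCPU; cpu++) {
        if (cpu != self && aspaces_prp[cpu] == dying) {
            ipicmds[cpu] |= IPI_TLB_SAFE;   /* "send" the IPI */
            try_again = 1;
        }
    }
    if (try_again)
        return EAGAIN;                      /* caller waits, then retries */

    dying->pgdir_freed = 1;                 /* stand-in for memmgr.mdestroy() */
    return 0;
}
```

A caller would loop on process_shutdown_step() until it stops returning EAGAIN. The broken variant corresponds to overwriting aspaces_prp[cpu] from the initiator and freeing the page right away, which is exactly the window the round-trip closes.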

Verification: 50 runs of 9 seconds each with the fix gave 49/50 passes. The one remaining failure is an unrelated residual race in fs-qrv's own rtld path, a different class of bug from this one.


Stall Detector

Two improvements to the in-tree stall watchdog.

Threshold raised from 500 ms to 1 s. A shell waiting for keyboard input is legitimately silent for well over 500 ms between keystrokes. The old threshold produced a steady stream of false positives that drowned out the dumps we actually care about.

Kernel-held condition added. The watchdog now requires two conditions to fire simultaneously: no kprintf activity for 1 s and the same CPU continuously holding the kernel lock for at least 1 s. Post-prompt idle has INKERNEL_NOW=0, so the kernel-held counter stays at zero and the dump is suppressed even if kprintf is silent for minutes. A genuinely wedged kernel call pegs the owner in inkernel for the duration, so the watchdog catches it cleanly.
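The two-condition rule boils down to a small predicate. The following is a minimal sketch, assuming a millisecond clock and a single tracked lock holder; stall_state_t and stall_should_fire() are illustrative names, not the in-tree ones.

```c
#include <stdbool.h>
#include <stdint.h>

#define STALL_THRESHOLD_MS 1000u

typedef struct {
    uint64_t last_kprintf_ms;       /* timestamp of the last kprintf */
    uint64_t kernel_held_since_ms;  /* when the current holder entered the kernel */
    bool     kernel_held;           /* false while idle (INKERNEL_NOW == 0) */
} stall_state_t;

/* Fire only when kprintf has been silent AND the kernel lock has been
 * held continuously, both for at least the threshold. */
bool stall_should_fire(const stall_state_t *s, uint64_t now_ms)
{
    bool quiet = (now_ms - s->last_kprintf_ms) >= STALL_THRESHOLD_MS;
    bool held  = s->kernel_held &&
                 (now_ms - s->kernel_held_since_ms) >= STALL_THRESHOLD_MS;
    return quiet && held;
}
```

An idle shell prompt has kernel_held == false, so the predicate stays false no matter how long kprintf is silent; a wedged kernel call satisfies both halves after one second.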

Also: stall_dump() force-releases kprintf_slock before its first print. Defensive, but necessary: if the dump wins the CAS and another CPU happens to be stuck mid-print holding that lock, without the release the dump itself would spin forever — precisely the symptom under investigation.

A companion fix: the per-call clear of the KERERR_SET/BUFF_MSG flags, added in v0.22 to __kercall_dispatch (the kernel-internal syscall path), was missing from the parallel U-mode ECALL path in strap_handler_c. Both paths call ker_call_table[num], but only one was clearing the flags. The leaked flags made ker_sync_condvar_wait abort without blocking, so virtio's pthread_cond_wait completion loop spun forever while the IST starved. A 100-run torture test after the fix showed 0 early condvar returns.


SiFive Unmatched: Native UART Driver

Up to v0.22, the Unmatched's console ran through the in-kernel devtext device as a fallback, because devc-ser8250 targets the ns16550a register layout, which the FU740 does not have. v0.23 adds devc-sersifive, a user-space resource manager for the SiFive FU540/FU740 UART IP block.

The architecture mirrors devc-ser8250: IST thread, MsgSendPulse, and event-driven blocking reads via _RESMGR_NOREPLY / deferred MsgReply. The register layout is different: 32-bit MMIO, TXDATA/RXDATA with bit-31 full/empty flags, separate TXCTRL/RXCTRL, IE/IP, and DIV for baud rate. The driver discovers its hardware via hwinfo_find_device("sifive-uart", …) and exits silently if nothing is found, so the init script can start both drivers and each finds its own hardware or bows out.
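For readers who have not met this UART, here is a minimal polled sketch of the register interface described above. It is not the devc-sersifive source: the function names are hypothetical, the register offsets and the divisor relation baud = clock / (div + 1) follow the SiFive UART documentation, and rounding to the nearest divisor is a choice made for this example.

```c
#include <stdint.h>

/* Register indices (byte offset / 4) of the SiFive UART block. */
enum {
    UART_TXDATA = 0,   /* 0x00: write byte, bit 31 = TX FIFO full  */
    UART_RXDATA = 1,   /* 0x04: read byte,  bit 31 = RX FIFO empty */
    UART_TXCTRL = 2,   /* 0x08 */
    UART_RXCTRL = 3,   /* 0x0c */
    UART_IE     = 4,   /* 0x10 */
    UART_IP     = 5,   /* 0x14 */
    UART_DIV    = 6    /* 0x18 */
};

#define UART_TXDATA_FULL  (1u << 31)
#define UART_RXDATA_EMPTY (1u << 31)

/* Returns 1 if the byte was queued, 0 if the TX FIFO is full. */
int sifive_uart_try_putc(volatile uint32_t *regs, uint8_t ch)
{
    if (regs[UART_TXDATA] & UART_TXDATA_FULL)
        return 0;
    regs[UART_TXDATA] = ch;
    return 1;
}

/* Returns the received byte, or -1 if the RX FIFO is empty.
 * RXDATA is read exactly once, since a read dequeues the byte. */
int sifive_uart_try_getc(volatile uint32_t *regs)
{
    uint32_t v = regs[UART_RXDATA];
    if (v & UART_RXDATA_EMPTY)
        return -1;
    return (int)(v & 0xffu);
}

/* Divisor for the DIV register, rounded to the nearest achievable baud. */
uint32_t sifive_uart_div(uint32_t clk_hz, uint32_t baud)
{
    return (clk_hz + baud / 2) / baud - 1;
}
```

The real driver is interrupt-driven through its IST rather than polled; the sketch only shows why the ns16550a driver cannot be reused, since the full/empty status lives in the data registers themselves instead of a separate line status register.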

Validated on the Unmatched: interactive shell reached; ls, pwd, pidin irq, and shutdown all executed through the SiFive UART.


PLIC Context Mapping for the FU740

The PLIC register offsets for per-hart S-mode contexts were previously computed as hart * 0x100 (S-enable) and hart * 0x2000 (S-priority / S-claim). This formula assumes every hart has both M-mode and S-mode contexts in (M, S) pairs starting from hart 0 — which holds on QEMU virt, but not on the SiFive FU540/FU740.

On the FU740, hart 0 is the S7 monitor core, which has no MMU and no S-mode. It contributes only one PLIC context (M-mode), which shifts every subsequent S-mode context down by one. As a result the OS was writing and reading the wrong PLIC context registers on the Unmatched, and interrupts never arrived.

Fix: compute plic_s_ctx[hart] from the FDT cpu@N layout after parse_fdt() and index SENABLE/SPRIORITY/SCLAIM by that table. On QEMU: plic_s_ctx[N] = 2N+1 (unchanged behavior). On FU740: cpu@0 has no mmu-type, contributes 1 context, so harts 1..4 get S-contexts 2/4/6/8.
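The table construction can be sketched as follows. This is an illustration rather than the QRV code: plic_build_s_ctx() and the offset helpers are hypothetical names, the has_mmu array stands in for "cpu@N has an mmu-type property" as read from the FDT, and the model assumes contexts are numbered in hart order with M before S. The base addresses and strides are the standard ones from the RISC-V PLIC specification.

```c
#include <stdbool.h>
#include <stdint.h>

#define PLIC_ENABLE_BASE    0x002000u  /* enable bits, 0x80 bytes per context */
#define PLIC_ENABLE_STRIDE  0x80u
#define PLIC_CONTEXT_BASE   0x200000u  /* threshold/claim, 0x1000 per context */
#define PLIC_CONTEXT_STRIDE 0x1000u
#define PLIC_CLAIM_OFFSET   0x4u       /* claim/complete follows threshold */

/* Fill s_ctx[hart] with that hart's S-mode context number, or -1 if the
 * hart has no S-mode context (no MMU). */
void plic_build_s_ctx(const bool *has_mmu, int nharts, int *s_ctx)
{
    int next = 0;

    for (int hart = 0; hart < nharts; hart++) {
        next++;                       /* every hart owns one M-mode context */
        if (has_mmu[hart])
            s_ctx[hart] = next++;     /* S-mode context follows directly */
        else
            s_ctx[hart] = -1;
    }
}

/* Byte offset of the enable bitmap for a given context. */
uint32_t plic_senable_off(int ctx)
{
    return PLIC_ENABLE_BASE + (uint32_t)ctx * PLIC_ENABLE_STRIDE;
}

/* Byte offset of the claim/complete register for a given context. */
uint32_t plic_sclaim_off(int ctx)
{
    return PLIC_CONTEXT_BASE + (uint32_t)ctx * PLIC_CONTEXT_STRIDE
         + PLIC_CLAIM_OFFSET;
}
```

With has_mmu all true the table comes out as 1, 3, 5, 7, matching the old 2N+1 behaviour; with hart 0 lacking an MMU it comes out as -1, 2, 4, 6, 8.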

A companion change rewrites the PLIC mask/unmask strategy to priority-based control (priority=0/1) rather than per-hart S-enable toggling, matching FreeBSD's approach. On FU740, clearing S-enable across the claim→complete window left the gateway unable to redeliver; QEMU was forgiving. Interrupt complete is also deferred for IST-dispatched sources: previously the trap handler wrote SCLAIM immediately after claiming, while the device was still asserting, producing one spurious IRQ per real IRQ once the IST unmasked. A plic_record_claim() / plic_complete_deferred() pair defers the complete to riscv_plic_unmask(), by which time the IST has drained the source.
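A rough model of the deferred complete, with the caveat that only the function names plic_record_claim() and plic_complete_deferred() come from the release; their signatures, the plic_t structure and the order of operations inside the unmask step are assumptions made for this sketch. Masking is done by source priority, and the write to the claim/complete register is postponed until unmask.

```c
#include <stdint.h>

#define PLIC_MAX_IRQ 64

typedef struct {
    volatile uint32_t *sclaim;     /* claim/complete register of this context */
    volatile uint32_t *priority;   /* per-source priority registers */
    uint8_t pending_complete[PLIC_MAX_IRQ];
} plic_t;

/* Priority 0 means "never interrupt", so this masks the source without
 * touching the per-context enable bits. */
void plic_mask(plic_t *p, uint32_t irq)
{
    p->priority[irq] = 0;
}

/* Called from the trap handler after claiming an IST-dispatched source. */
void plic_record_claim(plic_t *p, uint32_t irq)
{
    if (irq < PLIC_MAX_IRQ)
        p->pending_complete[irq] = 1;
}

/* Writes the complete only if a claim is still outstanding for irq. */
void plic_complete_deferred(plic_t *p, uint32_t irq)
{
    if (irq < PLIC_MAX_IRQ && p->pending_complete[irq]) {
        p->pending_complete[irq] = 0;
        *p->sclaim = irq;          /* writing the ID signals completion */
    }
}

/* Called once the IST has drained the device: complete, then re-arm. */
void plic_unmask(plic_t *p, uint32_t irq)
{
    plic_complete_deferred(p, irq);
    p->priority[irq] = 1;
}
```

Because the gateway does not forward a new request for a source until its previous claim has been completed, holding the complete back until the device has been drained is what removes the extra interrupt per real one.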

Validated on the Unmatched: continuous IRQ 39 delivery with four consecutive drained=1 ISR cycles (previously alternating spurious cycles).


What Comes Next

With the Unmatched now running on its native UART with working PLIC interrupt delivery, the path to NVMe storage on real silicon is open. The PCI server is already running. The next release covers the NVMe driver.
