QRV v0.23 closes a ~10% daemon-spawn hang that had survived v0.22, brings a native UART driver to the SiFive Unmatched, and fixes the PLIC interrupt delivery path for the FU740's heterogeneous hart layout. After v0.22 the system booted reliably but still hung with measurable frequency during the init script's driver startup sequence. That hang is gone.
The Stale SATP
What was happening
The hang pattern: after the L2 preallocation fix in v0.22, the system reached 50/50 clean boots under the controlled test conditions, but real-world daemon spawning — particularly during the parallel startup of fs-qrv, devb-virtio, and devc-ser8250 — produced silent hangs at roughly 10% frequency. Same signature as before: 100% host CPU, no crash banner, no stall detector firing.
A GDB session on April 22 caught it live. Three CPUs were parked in
IPI_PARKIT. One hart was trapped in a loop between trap.S:53 and
trap.S:203 — the trap prologue's register save, double-faulting on
stores into the kernel stack, over and over. Its satp pointed at a
physical page whose contents disassembled as RISC-V instructions, not page
table entries. The page sat near the top of RAM — exactly where modpkg ELF
content lives. aspaces_prp[1] and aspaces_prp[2] held the same
tProcess pointer, yet the two CPUs' satp CSRs pointed at different
physical pages.
The root cause
A kernel-level optimization made some time ago had modified the remote-CPU
arm of kerext_process_shutdown(): instead of the original IPI_TLB_SAFE
round-trip with try_again and spin-wait (the pattern from the openqnx
reference code), it rewrote aspaces_prp[i] inline and sent only
IPI_TLB_FLUSH. The stated rationale was that idle CPUs only touch kernel
memory, so a stale satp is harmless — the optimization looked reasonable
on the surface, and the immediate tests didn't catch it.
The rationale was accurate for about one millisecond. The moment
memmgr.mdestroy() ran and returned the dying process's pgdir page to the
physical allocator, the page was available for reuse. The next consumer was
typically loader_elf mapping an ELF segment for the next daemon to spawn.
The remote hart's satp now pointed at module code bytes; its next TLB
miss walked them as page table entries, decoded bogus PPNs, and
store-page-faulted trying to save registers on the kernel stack. The
double-fault path sent IPI_PARKIT to all other CPUs, and the system
locked up silently.
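To make "walked them as page table entries" concrete, here is a minimal sketch of what a page-table walker sees when it reads eight bytes of code instead of a PTE. It assumes the standard RISC-V Sv39 PTE layout (V in bit 0, R/W/X in bits 1 to 3, a 44-bit PPN starting at bit 10); the post does not say which paging mode QRV uses, and the helper names and the two sample instructions are purely illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* Sv39-style PTE fields (assumed layout, see lead-in). */
#define PTE_V          (1ull << 0)
#define PTE_R          (1ull << 1)
#define PTE_W          (1ull << 2)
#define PTE_X          (1ull << 3)
#define PTE_PPN_SHIFT  10
#define PTE_PPN_MASK   ((1ull << 44) - 1)

struct pte_view {
    int      valid;   /* V bit set */
    int      leaf;    /* valid and at least one of R/W/X set */
    uint64_t ppn;     /* physical page number the walker would follow */
};

/* Two consecutive 32-bit instructions occupy one 64-bit PTE slot
 * (little-endian: the first instruction is the low word). */
uint64_t pte_from_insns(uint32_t first, uint32_t second)
{
    return ((uint64_t)second << 32) | first;
}

/* Decode a raw 64-bit word the way the hardware walker would. */
struct pte_view pte_decode(uint64_t raw)
{
    struct pte_view v;

    v.valid = (raw & PTE_V) != 0;
    v.leaf  = v.valid && (raw & (PTE_R | PTE_W | PTE_X)) != 0;
    v.ppn   = (raw >> PTE_PPN_SHIFT) & PTE_PPN_MASK;
    assert(v.ppn <= PTE_PPN_MASK);
    return v;
}
```

Feeding it a typical function prologue, `addi sp,sp,-16` (0xff010113) followed by `sd ra,8(sp)` (0x00113423), yields an entry that looks valid and carries a PPN far beyond any real RAM, which is the kind of garbage translation the stuck hart kept faulting on.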
The lesson, now filed in the project's documentation: when ported reference
code does something that looks like a wasteful SMP round-trip, the default
assumption should be "it's a synchronisation point, not decoration". The
IPI_TLB_SAFE round-trip exists precisely to guarantee that the remote CPU
has truly swapped its satp away from the dying pgdir before mdestroy()
is allowed to reclaim the physical page.
The fix
Restore the original QNX pattern. Send IPI_TLB_SAFE, not IPI_TLB_FLUSH.
The remote hart's IPI handler calls set_safe_aspace(cpu), which calls
memmgr.aspace(sysmgr_prp, &aspaces_prp[cpu]), atomically updating both
the array slot and the remote hart's satp CSR. The ProcessShutdown
wrapper's existing spin-wait on ipicmds[cpu] & IPI_TLB_SAFE activates.
try_again=1 is set so kerext_process_shutdown returns EAGAIN and the
caller re-reads aspaces_prp — only when every remote hart has genuinely
switched away from the dying pgdir does shutdown proceed to mdestroy().
Verification: 50 runs of 9 seconds each with the fix applied gave 49/50 passes. The one remaining failure is an unrelated residual race in fs-qrv's own rtld path: a different class of bug, not this one.
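The handshake can be modelled in a few lines of single-threaded C. This is a sketch of the ordering only, not the kernel code: `struct process`, the `model_*` functions, the bit value of `IPI_TLB_SAFE`, and the array-backed `satp` are all stand-ins invented for illustration, and the real spin-wait is replaced by the caller retrying on EAGAIN.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define NCPU          4
#define IPI_TLB_SAFE  0x1u            /* hypothetical bit value */

struct process { uintptr_t pgdir; };  /* stand-in for tProcess */

static struct process  sysmgr = { 0x80001000u };  /* always-valid kernel aspace */
static struct process *aspaces_prp[NCPU];
static uintptr_t       satp[NCPU];     /* models each hart's satp CSR */
static unsigned        ipicmds[NCPU];  /* pending IPI commands per hart */

/* Point a hart at a process: array slot and satp change together. */
void model_set_aspace(int cpu, struct process *p)
{
    assert(cpu >= 0 && cpu < NCPU);
    aspaces_prp[cpu] = p;
    satp[cpu] = p->pgdir;
}

/* Remote side: what a hart does when it takes the IPI. */
void model_handle_ipi(int cpu)
{
    if (ipicmds[cpu] & IPI_TLB_SAFE) {
        model_set_aspace(cpu, &sysmgr);
        ipicmds[cpu] &= ~IPI_TLB_SAFE;  /* acknowledge */
    }
}

/* Shutdown side: returns 0 only when no hart still references `dying`.
 * EAGAIN tells the caller to wait for the acknowledgements and re-check. */
int model_process_shutdown(struct process *dying, int self)
{
    int try_again = 0;

    for (int cpu = 0; cpu < NCPU; cpu++) {
        if (aspaces_prp[cpu] != dying)
            continue;
        if (cpu == self) {
            model_set_aspace(cpu, &sysmgr);  /* local switch is immediate */
        } else {
            ipicmds[cpu] |= IPI_TLB_SAFE;    /* ask the remote hart to switch */
            try_again = 1;
        }
    }
    return try_again ? EAGAIN : 0;
}

/* True if any hart's satp still names the given page directory. */
int model_pgdir_in_use(uintptr_t pgdir)
{
    for (int cpu = 0; cpu < NCPU; cpu++)
        if (satp[cpu] == pgdir)
            return 1;
    return 0;
}
```

The property worth noticing is that the page directory can only be handed back to the allocator after `model_process_shutdown()` returns 0, and that cannot happen until every remote hart has run its handler. The broken variant is equivalent to rewriting `aspaces_prp[cpu]` directly in the loop without touching `satp[cpu]`.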
Stall Detector
Two improvements to the in-tree stall watchdog.
Threshold raised from 500 ms to 1 s. A shell waiting for keyboard input is legitimately silent for well over 500 ms between keystrokes. The old threshold produced a steady stream of false positives that drowned out the dumps we actually care about.
Kernel-held condition added. The watchdog now fires only when two
conditions hold simultaneously: no kprintf activity for 1 s, and the same
CPU continuously holding the kernel lock for at least 1 s. Post-prompt idle
has INKERNEL_NOW=0, so the kernel-held counter stays at zero and the
dump is suppressed even if kprintf is silent for minutes. A genuinely
wedged kernel call pegs the owner in inkernel for the duration, so the
watchdog catches it cleanly.
Also: stall_dump() force-releases kprintf_slock before its first
print. Defensive, but necessary: if the dump wins the CAS and another CPU
happens to be stuck mid-print holding that lock, without the release the
dump itself would spin forever — precisely the symptom under investigation.
A companion fix: the KERERR_SET/BUFF_MSG per-call flag clear that was
added to __kercall_dispatch (the kernel-internal syscall path) in v0.22
was missing from the parallel U-mode ECALL path in strap_handler_c. Both
paths call ker_call_table[num], but only one was clearing the flags.
The leaked flags caused ker_sync_condvar_wait to abort without blocking, so
virtio's pthread_cond_wait completion loop spun forever while the IST
starved. A 100-run torture test after the fix showed 0 early condvar returns.
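One way to keep two entry paths from drifting apart like this is to route both through a single helper that owns the flag clear. The sketch below illustrates that idea only; it is not how QRV is structured (the release simply added the missing clear to strap_handler_c). The flag bit values, `struct thread`, and every `model_*` function are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define KERERR_SET 0x1u  /* hypothetical bit values */
#define BUFF_MSG   0x2u

struct thread {
    uint32_t flags;
    int      blocked;
};

typedef int (*ker_call_fn)(struct thread *);

/* A call that fails and leaves the error flag set on the thread. */
static int model_failing_call(struct thread *t)
{
    t->flags |= KERERR_SET;
    return -1;
}

/* A blocking call that bails out early if it sees a stale error flag. */
static int model_condvar_wait(struct thread *t)
{
    if (t->flags & KERERR_SET)
        return -1;           /* aborts without blocking */
    t->blocked = 1;
    return 0;
}

static ker_call_fn ker_call_table[] = { model_failing_call, model_condvar_wait };

/* Single choke point: per-call flags are cleared before every dispatch. */
int model_dispatch(struct thread *t, unsigned num)
{
    assert(num < sizeof ker_call_table / sizeof ker_call_table[0]);
    t->flags &= ~(KERERR_SET | BUFF_MSG);
    return ker_call_table[num](t);
}

/* Both entry paths share the helper, so neither can forget the clear. */
int model_kercall_entry(struct thread *t, unsigned num) { return model_dispatch(t, num); }
int model_ecall_entry(struct thread *t, unsigned num)   { return model_dispatch(t, num); }
```

Without the clear in `model_dispatch()`, a failed call followed by a condvar wait on the same thread would return immediately, which matches the spinning completion loop described above.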
SiFive Unmatched: Native UART Driver
Up to v0.22, the Unmatched's console ran through the in-kernel devtext
device, falling back from devc-ser8250 (which targets the ns16550a
register layout, absent on the FU740). v0.23 adds devc-sersifive — a
user-space resource manager for the SiFive FU540/FU740 UART IP block.
The architecture mirrors devc-ser8250: IST thread, MsgSendPulse, and
event-driven blocking reads via _RESMGR_NOREPLY / deferred MsgReply.
The register layout is different: 32-bit MMIO, TXDATA/RXDATA with
bit-31 full/empty flags, separate TXCTRL/RXCTRL, IE/IP, and DIV
for baud rate. The driver discovers its hardware via
hwinfo_find_device("sifive-uart", …) and exits silently if nothing is
found, so the init script can start both drivers and each finds its own
hardware or bows out.
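For readers unfamiliar with the SiFive UART, here is a minimal polled-access sketch of the register layout just described. The offsets and the bit-31 flags follow the published SiFive UART register map as I understand it; the function names are hypothetical, the divisor formula assumes baud = clock / (DIV + 1), and none of this is taken from devc-sersifive itself, which is interrupt-driven.

```c
#include <assert.h>
#include <stdint.h>

/* Register byte offsets within the UART block. */
enum {
    UART_TXDATA = 0x00,
    UART_RXDATA = 0x04,
    UART_TXCTRL = 0x08,
    UART_RXCTRL = 0x0c,
    UART_IE     = 0x10,
    UART_IP     = 0x14,
    UART_DIV    = 0x18
};

#define UART_TXDATA_FULL   (1u << 31)  /* TX FIFO cannot accept a byte */
#define UART_RXDATA_EMPTY  (1u << 31)  /* RX FIFO had nothing to return */

static inline uint32_t reg_read(volatile uint32_t *base, unsigned off)
{
    return base[off / 4];
}

static inline void reg_write(volatile uint32_t *base, unsigned off, uint32_t v)
{
    base[off / 4] = v;
}

/* Returns 1 if the byte was queued, 0 if the TX FIFO is full. */
int sifive_uart_try_putc(volatile uint32_t *base, uint8_t ch)
{
    if (reg_read(base, UART_TXDATA) & UART_TXDATA_FULL)
        return 0;
    reg_write(base, UART_TXDATA, ch);
    return 1;
}

/* Returns the received byte, or -1 if the RX FIFO is empty.
 * RXDATA is read exactly once because a read dequeues the byte. */
int sifive_uart_try_getc(volatile uint32_t *base)
{
    uint32_t v = reg_read(base, UART_RXDATA);

    if (v & UART_RXDATA_EMPTY)
        return -1;
    return (int)(v & 0xffu);
}

/* DIV value for a target baud rate, rounding the ratio up so the
 * actual rate never exceeds the requested one. */
uint32_t sifive_uart_div(uint32_t clk_hz, uint32_t baud)
{
    assert(baud != 0 && clk_hz >= baud);
    return (uint32_t)(((uint64_t)clk_hz + baud - 1) / baud) - 1;
}
```

Taking `base` as a pointer makes the helpers usable against a plain array in a host-side test as well as against the mapped MMIO window.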
Validated on the Unmatched: interactive shell reached; ls, pwd,
pidin irq, and shutdown all executed through the SiFive UART.
PLIC Context Mapping for the FU740
The PLIC register offsets for per-hart S-mode contexts were previously
computed as hart * 0x100 (S-enable) and hart * 0x2000 (S-priority /
S-claim). This formula assumes every hart has both M-mode and S-mode
contexts in (M, S) pairs starting from hart 0 — which holds on QEMU
virt, but not on the SiFive FU540/FU740.
On the FU740, hart 0 is the S7 monitor core, which has no MMU and therefore no S-mode context. It contributes only one PLIC context (its M-mode one), which shifts every subsequent S-mode context down by one relative to the (M, S)-pair formula. The OS was therefore reading and writing the wrong PLIC context registers on the Unmatched, and interrupts never arrived.
Fix: compute plic_s_ctx[hart] from the FDT cpu@N layout after
parse_fdt() and index SENABLE/SPRIORITY/SCLAIM by that table. On
QEMU: plic_s_ctx[N] = 2N+1 (unchanged behavior). On FU740: cpu@0 has
no mmu-type, contributes 1 context, so harts 1..4 get S-contexts
2/4/6/8.
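The table construction can be sketched as follows. The input here is a plain array of "this hart has an MMU" flags, standing in for whatever parse_fdt() extracts from the cpu@N nodes; `plic_build_s_ctx()` and the offset helpers are hypothetical names. The enable and threshold/claim strides are the ones from the RISC-V PLIC specification (0x80 and 0x1000 bytes per context).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PLIC_ENABLE_BASE     0x002000u
#define PLIC_ENABLE_STRIDE   0x80u      /* per context */
#define PLIC_CONTEXT_BASE    0x200000u
#define PLIC_CONTEXT_STRIDE  0x1000u    /* per context */
#define PLIC_CLAIM_OFFSET    0x4u       /* claim/complete follows threshold */

/* Walk the harts in order, handing out context numbers: every hart gets an
 * M-mode context; only harts with an MMU also get an S-mode context.
 * plic_s_ctx[h] is -1 for harts that have no S-mode context. */
void plic_build_s_ctx(const int *has_mmu, size_t nharts, int *plic_s_ctx)
{
    int next = 0;

    assert(has_mmu != NULL && plic_s_ctx != NULL);
    for (size_t h = 0; h < nharts; h++) {
        next++;                              /* M-mode context for this hart */
        plic_s_ctx[h] = has_mmu[h] ? next++ : -1;
    }
}

/* Byte offset of the enable bitmap for a context. */
uint32_t plic_enable_off(int ctx)
{
    return PLIC_ENABLE_BASE + (uint32_t)ctx * PLIC_ENABLE_STRIDE;
}

/* Byte offset of the priority-threshold register for a context. */
uint32_t plic_threshold_off(int ctx)
{
    return PLIC_CONTEXT_BASE + (uint32_t)ctx * PLIC_CONTEXT_STRIDE;
}

/* Byte offset of the claim/complete register for a context. */
uint32_t plic_claim_off(int ctx)
{
    return plic_threshold_off(ctx) + PLIC_CLAIM_OFFSET;
}
```

For a four-hart QEMU virt layout this reproduces 1/3/5/7, and for the FU740's `{0, 1, 1, 1, 1}` it gives -1/2/4/6/8. Hart 1's enable bitmap then sits at 0x2100 rather than the 0x2180 that the old (M, S)-pair arithmetic would have produced.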
A companion change rewrites the PLIC mask/unmask strategy to
priority-based control (priority=0/1) rather than per-hart S-enable
toggling, matching FreeBSD's approach. On FU740, clearing S-enable across
the claim→complete window left the gateway unable to redeliver; QEMU was
forgiving. Interrupt complete is also deferred for IST-dispatched sources:
previously the trap handler wrote SCLAIM immediately after claiming,
while the device was still asserting, producing one spurious IRQ per real
IRQ once the IST unmasked. A plic_record_claim() / plic_complete_deferred()
pair defers the complete to riscv_plic_unmask(), by which time the IST
has drained the source.
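The deferred-complete bookkeeping might look roughly like this. The function names plic_record_claim() and plic_complete_deferred() come from the release, but their signatures, `struct plic_model`, and `plic_claim()` are guesses made for the sake of a self-contained example; the claim/complete register is passed in as a pointer so the logic can be exercised against an ordinary variable.

```c
#include <assert.h>
#include <stdint.h>

#define PLIC_MAX_SRC 64   /* illustrative bound, not the hardware limit */

struct plic_model {
    volatile uint32_t *sclaim;          /* this hart's S-mode claim/complete register */
    uint8_t deferred[PLIC_MAX_SRC];     /* 1 = claimed, complete not yet written */
};

/* Trap handler, step 1: reading the register claims the highest pending source. */
uint32_t plic_claim(struct plic_model *p)
{
    return *p->sclaim;
}

/* Trap handler, step 2 (IST-dispatched sources): remember that a complete is
 * owed instead of writing it while the device is still asserting. */
void plic_record_claim(struct plic_model *p, uint32_t irq)
{
    assert(irq < PLIC_MAX_SRC);
    p->deferred[irq] = 1;
}

/* Unmask path: write the owed complete, if any. Returns 1 if one was written. */
int plic_complete_deferred(struct plic_model *p, uint32_t irq)
{
    assert(irq < PLIC_MAX_SRC);
    if (!p->deferred[irq])
        return 0;
    p->deferred[irq] = 0;
    *p->sclaim = irq;   /* writing the source ID back signals completion */
    return 1;
}
```

Calling `plic_complete_deferred()` from the unmask path means the gateway is only re-armed once the IST has serviced the device, so the line is no longer asserted when the next interrupt can be latched.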
Validated on the Unmatched: continuous IRQ 39 delivery with four
consecutive drained=1 ISR cycles (previously alternating spurious cycles).
What Comes Next
With the Unmatched now running on its native UART with working PLIC interrupt delivery, the path to NVMe storage on real silicon is open. The PCI server is already running. The next release covers the NVMe driver.