50/50 clean boots on QEMU. Eight CPUs, 10 kHz timer tick,
CONFIG_DEBUG_PRIL and CONFIG_DEBUG_VECTOR both enabled. This number
was 27/50 at the start of this release cycle. The other 23 were the
"fs-qrv hang" — a silent lockup at 100% host CPU, no crash banner, no
diagnostic output, just the system sitting there. Finding and fixing it was
the main event of v0.22, and the story of how it was found is worth telling
properly.
Everything else — PCI bring-up, system logging, event-driven serial RX, per-thread errno, a full warning cleanup from 559 to zero, and a dozen smaller things — happened alongside and after that hunt.
The Crown Jewel: Sharing the Kernel L2
What the hang looked like
When fs-qrv started — and only after fs-qrv started — the system would
sometimes freeze silently. Host CPU at 100%. No output. No crash banner.
The --disable-timer boot option (added during the investigation,
described below) provided the first clue: disabling the timer reduced
the hang rate but didn't eliminate it. Timer-driven preemption was a
necessary condition, not the whole story.
Eventually, a GDB session caught it in the act.
Six harts were spinning in the IPI_PARKIT for-loop at
trap_handler.c:69 — the code that parks secondary CPUs while one CPU
emits a crash dump. But there was no crash dump. Something had sent
IPI_PARKIT and then never followed through. Looking closer: two harts
were taking nested store page faults at the very first instruction of
__strap_entry — the register save in trap.S. Both faulting harts
shared the same satp value: a single user process's page table.
The 100% host CPU explained itself too: the parked harts loop on wfi
with sie.SEIE/sie.STIE still enabled, so external IRQs keep waking
them one instruction at a time, immediately re-entering wfi. A hall of
spinning mirrors, invisible and silent.
GDB couldn't read kernel-RAM virtual addresses through the faulting
satp — not _actives[2], not cpupage, nothing in the kernel data
segment. But it could fetch kernel text. The page table had kernel text
mapped but kernel data and stacks missing.
Both faulting sp values landed in L2 slot 1 of the Sv39 address space
— the range 0xffffffc040000000–0xffffffc080000000. That is the gigabyte
slot immediately below the RAM direct-map. The slot that would be needed
for any per-thread kernel stack allocated there.
The root cause
cpu_pgdir_create() — called every time a new user process is created —
copied the kernel root page table's L2 entries (indices 256..511) into the
new user pgdir. This is correct and necessary: user-space page tables need
the kernel mappings so the kernel can run after a trap while the user
process's satp is still loaded.
The problem is the word copied. It is a snapshot. A photograph taken at the moment the process was created.
Any kernel mapping installed after that snapshot — in particular, any per-thread kernel stack that happened to be allocated in a 1 GiB Sv39 slot that was empty at process-creation time — would populate the kernel's own L2 with a new PTE pointing to the L1 for that slot. But every user process created before that moment has a stale copy of the kernel L2 that still shows that slot as empty.
When the kernel subsequently runs under such a user satp and touches an
address in that slot — for example, during the trap prologue that saves
the faulting thread's registers onto a kernel stack — the MMU walks the
user pgdir's copy of L2, finds an empty entry, and raises a store page
fault. That fault is taken while already in S-mode, which cannot be
handled normally. The double-fault path sends IPI_PARKIT to all other
CPUs. If two harts hit this simultaneously, each parks the other, neither
prints a crash banner, and the system halts silently at 100% host CPU.
The fs-qrv hang was always this bug. fs-qrv was just the first process that allocated enough kernel threads to push a new stack into a previously-empty L2 slot, after enough other processes already existed with stale pgdirs.
The fix
At the end of init_mmu(), before enable_mmu(), walk every kernel L2
slot (indices 256..511). For any slot that is still empty, allocate an
empty L1 page and install it. From that point on, the kernel L2 has no
empty slots.
Now when cpu_pgdir_create() copies the kernel L2 into a new user pgdir,
it copies PTEs that point to real L1 pages — shared L1 pages. Any future
kernel mapping that adds L0/L1 entries within those already-valid L1 pages
is immediately visible in every user pgdir, because they all share the
same L1. The snapshot is now a snapshot of pointers to shared structure,
not a snapshot of the structure itself.
Cost: at most 256 × 4 KiB = 1 MiB of pinned RAM, allocated at startup. In practice, on our boot, 254 of 256 slots were empty and got populated. The other two already had the low-PA and RAM direct-map L1 pages.
Belt-and-braces: cpu_pgdir_create() now calls crash_halt() loudly if
any kernel L2 slot is empty when a user pgdir is created. A future
regression will be visible immediately rather than quietly rotting. A new
chk_kernel_sp() inline asserts that any value written to ker_stack[cpu]
or thp->trap_aux.kernel_sp lies within the direct-mapped RAM VA range.
Results:
| metric | before | after |
|---|---|---|
| PASS (shell prompt) | 27/50 | 50/50 |
| fs-qrv silent hang | intermittent | zero |
| 0x40116 ra crashes | 5/50 | zero |
The 0x40116 crash class — a pool thread's saved ra containing the
io_close_t type+combine_len constant — also vanished with the same fix.
The most likely explanation: those were the same bug, a half-completed
register save into a kernel stack that the running satp couldn't actually
reach, leaving the previous ra value unmodified.
The Tools That Found It
Two pieces of debug infrastructure were built during the hunt, both of which will remain useful permanently.
--disable-timer hard-disables the S-mode supervisor timer. Every
site that arms it becomes a no-op: arch_timer_load, arch_timer_reload,
cpu_idle's initial SIE enable, the per-dispatch sbi_set_timer, the
CLINT rearm in the trap handler. With the timer off, 0/10 boots crashed —
confirming that timer-driven preemption was a necessary condition for the
hang, and giving a clean environment for other debugging.
-T (tracelogger auto-start) allocates four 16 KiB trace ring buffers
at kernel boot time and enables all kercall and COMM event masks. On an
S-mode crash, the trap handler sends IPI_PARKIT to all other CPUs, waits
for a grace period so their output settles, then walks the ring and dumps
every event to the console. The last ~hundred kernel events before any
crash are now automatically attached to the crash log — no user-space
tracelogger daemon needed.
The first witness capture (trT_2.log) caught the 0x40116 crash
cleanly: 439 events, cpu 1, tid=8, with no interleaving from other CPUs.
Other Reliability Fixes
Several other SMP bugs were found and fixed during the same period.
PLIC lost-interrupt race. The PLIC trap handler was calling ready()
directly in interrupt context to wake the IST thread. If the IST was not
in STATE_INTR at IRQ time — already processing a previous interrupt in
user space — the wakeup was silently dropped. Boots hung on the first
virtio block I/O fairly often. Fix: route IST wakeups through
intrevent_add() with a synthesized SIGEV_INTR event, matching the
QNX/MIPS pattern. intrevent_drain() from __ker_exit under the kernel
lock then safely ready()s or sets the pending flag.
-P 1 boot hang. Four distinct gaps contributed. The virtio IST was
created at default priority 10, same as every other worker — it never
preempted. SIGEV_INTR priority wasn't bumped to force preemption at
equal priority. __kercall_dispatch's fast path returned directly via
__ker_restore without checking intrevent_pending, leaving queued events
to wait for the next syscall. The trap-return path similarly bypassed
__ker_exit, so interrupt-queued events didn't drain until the next kernel
entry. All four closed; -P 1 now boots 5/5.
PLIC enabled on halted harts. riscv_plic_unmask was enabling IRQs
on all physical harts from the FDT, including those sitting in OpenSBI's
halt loop under -P N. A halted hart can claim a PLIC interrupt and then
have no S-mode handler to process it — consuming the interrupt silently.
Fixed to enable only on sp->num_cpu (the OS-active harts).
SUM bit stripped on ecall. The S-mode ecall path was writing
act->reg.sstatus with just SPP|SPIE, stripping the SUM bit (which
permits S-mode to access user-page-protected memory). Any kercall handler
that subsequently touched a user VA — sync_create, anything using
user-space sync objects — faulted with scause=15. SUM and FS are
now preserved from the incoming sstatus.
AP startup missing SUM. The AP (application processor) startup path
in _smpstart never set SSTATUS_SUM. Only the boot CPU's init_cpu()
did. Secondary CPUs couldn't touch user pages during kercall handlers.
Per-Thread errno
The kernel's __get_errno_ptr() was returning the address of a single
global int errno shared across all S-mode threads — kernel and taskman
alike. Two threads on different CPUs could race in the three instructions
between a DebugProcess() call setting errno = -r and the caller reading
return errno. About 10% of boots showed pidin printing phantom thread
rows: DebugProcess returned ESRCH, another thread clobbered errno to 0
before procfs read it, pidin saw a successful devctl with a stale status
struct and printed num_threads*2+16 rows of garbage.
Fix: route __get_errno_ptr() through the current thread's
_thread_local_storage->__errptr via the tp register, falling back to
the global only before TLS is initialized. 0/40 phantom thread entries
after the fix.
Two companion fixes: clearing KERERR_SET/BUFF_MSG flags at kercall
entry (they were leaking across calls, causing cond_wait to abort
without blocking), and disabling KERERR_SKIPAHEAD (advancing sepc past
a nonexistent error branch in our libc syscall wrappers, corrupting the
caller's return value).
PCI Bring-Up
The PCI server is ported from QNX 6.6 (bios-v2 variant), split into
modules, stripped of all x86-specific code (Intel/AMD/VIA/SiS/NVIDIA
vendor dispatch, DLL loading), and cleaned to LP64 with zero warnings. It
registers /dev/pci as a QNX-style resource manager.
Startup now does a second FDT pass to find pci@/pcie@ nodes, extract
the ECAM base, memory/IO/prefetchable window ranges, and INTA IRQ number,
populating syspage with PCI memory windows via rsrcdbmgr.
lspci is a new userland utility — queries /dev/pci through the
message-passing protocol, lists all enumerated devices, supports -v for
BARs/IRQs/capabilities and -vv for config space hex dump.
On QEMU virt, two devices enumerate: host bridge (1b36:0008) and PCI
serial (1b36:0002). The rsrcdbmgr — which was a stub since v0.6 —
is now a full implementation ported from QNX, handling minor-device-number
allocations and named resource reservations.
This is the preparatory step for NVMe on the SiFive Unmatched.
System Logging
slogger is a log daemon ported from QNX, running as a user-space
resource manager at /dev/slog. Any process can call slogf() (a new
libc helper) to emit a log entry. sloginfo reads and filters the circular
buffer, with -w to wait for new events. slogger also serves
/dev/console and accepts a -t flag to tee text events to stdout.
Event-Driven Serial RX
devc-ser8250's ser_read used to poll for incoming data by sleeping
50 ms between retries via TimerTimeout(NANOSLEEP). This made the shell
unresponsive with --disable-timer (the main thread sat in
STATE_NANOSLEEP forever), and introduced up to 50 ms of keystroke
latency even with the timer on.
The replacement is a proper QNX-style deferred reply: ser_read checks
the RX buffer once. If a byte is already there, deliver it synchronously.
Otherwise, stash the rcvid in pending_rcvid and return
_RESMGR_NOREPLY — the resmgr framework doesn't reply, the main thread
drops back to MsgReceive. When the IST fires and delivers a byte via
pulse, rx_pulse_handler applies ICRNL+ECHO and calls MsgReply to the
saved rcvid directly. No timer involvement. Zero keystroke-to-echo latency
regardless of CONFIG_TIMER_HZ.
A double-check pattern closes the IST race: set pending_rcvid, then
re-read the RX buffer. If a byte appeared in that window, the IST's pulse
already fired — deliver synchronously and undo the stash.
Per-IRQ Fire Counter
struct interrupt_entry grows a uint64_t fire_count, incremented on
every PLIC and timer dispatch. Surfaced through debug_irq_t /
DCMD_PROC_IRQS / pidin -F irqs:
timer (vec 0x5) fires=~2500/s at 10 kHz tick
UART (vec 0x2a) fires climbing on each keystroke
virtio (vec 0x21) fires steady at idle
Distinguishing "IRQ not firing" from "IRQ fires but isn't reaching the IST" turns out to be exactly the question you need to answer when debugging driver initialization. This counter answers it directly.
Warning Cleanup: 559 → 0
559 compiler warnings at the start of this cycle. Zero at the end, with
CONFIG_DEBUG_PRIL=y + CONFIG_DEBUG_VECTOR=y and a heavy -W set.
Most of the work was in the trace/instrumentation path, where the original
QNX trace wire format uses 32-bit slots — correct for 32-bit, a source of
-Wsign-conversion and -Wconversion noise on LP64. The approach was to
make the narrowings explicit rather than silent: TRACE_PACK_PTR,
TRACE_PACK_INT, TRACE_PTR_LOW32 macros for the trace boundary; typed
helpers replacing generic __max/__min macros; (uint32_t) casts at
emit sites. Six sweeps across kernel, taskman, libc, and userland.
The CONFIG_DEBUG_PRIL cleanup also found a real bug: check_vec() had
K&R-style declaration and three LP64 truncations that turned vector entries
into 32-bit values; pril_validate() was dereferencing n->type before
checking n == NULL, so corrupted lists crashed the validator before it
could report the corruption.
libc Cleanup
The LP64-only pass through the support headers: _intsup.h trimmed from
200 to 42 lines (MSP430 __int20, GCC < 3.2 fallbacks, combinatorial
type-size detection — all gone); cdefs.h and features.h stripped of
Watcom, HighC, Metrowerks, pre-ANSI C, Cygwin, and RTEMS paths; platform.h
down to a clean LP64-only file; dead headers _ansi.h, _pack1.h,
_packpop.h, sched_aps.h deleted; the last directory named qnx/
renamed to qrv/.
time() now calls ClockTime(CLOCK_REALTIME) instead of returning 0.
vfprintf had a trailing-space bug: %02x with value 0 produced "00 "
instead of "00" — a missing LADJUST guard on the trailing pad.
Documentation
Three new chapters in the PortingStory document:
Chapter 5 covers the fault-service RPC — how PageWaitInfo and
PageCont bracket every invocation of taskman's fault handler, handing off
the faulting thread's identity and verdict between kernel and taskman.
Chapter 6 is the fs-qrv hang investigation written as a debugging
narrative: the GDB session, decoding the satp+sp+scause fingerprint,
the L2-snapshot root cause, and the fix.
SysArch Chapter 8 and Porting Chapter 2 cover the tracing and
instrumentation subsystem — the CONFIG_INSTRUMENTATION gate, the
parallel kercall dispatch table, the ring buffer format, and the crash
dumper.
What Comes Next
50/50. The baseline is solid. NVMe on the Unmatched is the next hardware
target — the PCI server is running on QEMU; connecting it to real PCIe
hardware and then to devb-nvme is the path to persistent storage on real
silicon.
No comments:
Post a Comment