v0.20 is the release where SMP stopped being a liability. The system now runs 20/20 clean SMP boot sequences — four CPUs, full IPC stack, drivers loading, shell responding — without crashes. That number, 20/20, took a while to reach and required fixing several genuinely interesting bugs in the kernel's trap entry, context switching, and timer handling. This post is mostly about those bugs, because they are the substance of this release.
The other significant pieces: a complete overhaul of the process exit path,
a proper procmgr_detach mechanism for resource managers, FDT-based device
discovery replacing hardcoded hardware addresses, and initial support for
heterogeneous RISC-V SoCs. More on all of that below.
The SMP Bug Hunt
v0.19 left things in a state where single-CPU was mostly reliable but SMP
was not — crashes ranged from act=NULL in the ecall handler to mysterious
illegal instruction traps, to the kernel restoring a user thread with kernel
pointers in its registers. The failures were non-deterministic, which is the
characteristic signature of race conditions in the trap entry path.
There were several distinct bugs. Here they are in the order they were found and fixed, because the order matters for understanding what was actually going on.
The stale a7 deadlock
The first one sounds almost comically simple in retrospect. MsgReceive and
MsgReceivePulse share a single kernel handler (ker_msg_receivev), which
uses the a7 register — the syscall number — to distinguish between them
via the KTYPE() macro. A pool thread blocked in MsgReceive has a7 = __KER_MSG_RECEIVEV. A pool thread that once called MsgReceivePulse has
a7 = __KER_MSG_RECEIVEPULSEV.
The problem: when object_alloc recycles a dead thread's soul object for a
new pool thread, it does not zero the register save area. So a new pool
thread could inherit a7=24 (the pulse-receive syscall number) from
whatever thread had previously occupied that slot. The kernel then marks it
with QRV_ITF_RCVPULSE — pulse-only — and it becomes invisible to regular
MsgSend. Over time, as threads were recycled, all pool threads accumulated
this flag. No thread remained to handle regular messages. Clean deadlock,
and an intermittent one because it depended on which threads got recycled in
which order.
Fix: memset(&thp->reg, 0, sizeof(thp->reg)) at thread creation.
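
The failure and the fix can be modelled in a few lines of user-space C. This is a sketch under assumptions, not kernel code: the struct layout and the slot_* and looks_pulse_only names are hypothetical, and only the memset of the save area and the value 24 come from the description above.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins for the kernel's types; only the a7 slot matters. */
#define KER_MSG_RECEIVEPULSEV 24u   /* pulse-receive syscall number quoted above */

struct cpu_regs {
    uint64_t a[8];                  /* a0..a7; a7 carries the syscall number */
};

struct thread_slot {
    struct cpu_regs reg;            /* register save area */
    int in_use;
};

/* A thread dies: its saved registers stay behind in the slot. */
void slot_release(struct thread_slot *s)
{
    s->in_use = 0;                  /* reg is deliberately left untouched */
}

/* Buggy reuse: the new thread inherits whatever a7 the old one left. */
void slot_reuse_buggy(struct thread_slot *s)
{
    s->in_use = 1;
}

/* Fixed reuse: clear the save area so no stale syscall number survives. */
void slot_reuse_fixed(struct thread_slot *s)
{
    memset(&s->reg, 0, sizeof(s->reg));
    s->in_use = 1;
}

/* Would a7 make the kernel treat this thread as pulse-only? */
int looks_pulse_only(const struct thread_slot *s)
{
    return s->reg.a[7] == KER_MSG_RECEIVEPULSEV;
}
```

With the buggy path, a slot last used by a pulse receiver still reports pulse-only after reuse; with the fixed path it does not.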
The actives[] race in kercall save
The __kercall_entry assembly for RISC-V was saving registers directly into
actives[cpu]->reg. The problem: on SMP, another CPU's mark_running() can
change actives[this_cpu] between the moment you read the pointer and the
moment you finish writing the registers. You end up saving the ecall
arguments into the wrong thread's register save area. The thread that
actually made the ecall never gets its registers saved correctly.
Fix: __kercall_entry now saves to a per-CPU kernel stack first.
__kercall_dispatch copies the frame to act->reg after acquiring the
kernel lock, at which point actives[cpu] is stable.
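
A rough C model of the two-stage save, assuming a simple array of per-CPU staging frames in place of the real per-CPU kernel stack. The real entry path is assembly; the names percpu_frame and kernel_lock_held are hypothetical, and the lock is a toy flag.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NCPU 4                          /* four CPUs, as in the test runs */

struct cpu_regs { uint64_t x[32]; };
struct thread   { struct cpu_regs reg; };

/* Hypothetical staging area: one frame per CPU, private to that CPU. */
static struct cpu_regs percpu_frame[NCPU];
static struct thread  *actives[NCPU];
static int kernel_lock_held;            /* toy stand-in for the kernel lock */

/* Stage 1 (entry): touch only memory private to this CPU, never actives[]. */
void kercall_entry(int cpu, const struct cpu_regs *live)
{
    percpu_frame[cpu] = *live;
}

/* Stage 2 (dispatch): read actives[cpu] only while the lock is held. */
struct thread *kercall_dispatch(int cpu)
{
    kernel_lock_held = 1;
    struct thread *act = actives[cpu];
    assert(act != NULL);
    act->reg = percpu_frame[cpu];       /* copy the staged frame into act->reg */
    kernel_lock_held = 0;
    return act;
}
```

The point of the split is that stage 1 never dereferences a pointer another CPU can change.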
The sscratch redesign
The deeper structural issue was the trap entry convention. QRV had been
using actives[] to find the current thread on U-mode trap entry — a lookup
that is inherently racy on SMP for the same reason. The fix was to adopt the
Linux-style sscratch convention:
- U-mode running: sscratch = &thp->reg (the current thread's register save area, directly)
- S-mode running: sscratch = 0
On U-mode trap entry, registers are saved directly into thp->reg via the
sscratch pointer — no actives[] lookup, no copy. The thread pointer is
known at the moment the trap fires, because it was stored in sscratch
before the sret that entered U-mode.
This is a fundamental restructuring of trap.S, kernel.S, and
ker_exit.c, and it eliminated the race that was causing a U-mode thread's
registers to be written into the wrong slot.
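
The convention can be illustrated with a small C simulation in which sscratch is an ordinary struct field. This is only a sketch: the real logic lives in assembly, the function names are hypothetical, and saving S-mode traps to a kernel-stack frame is an assumption borrowed from the Linux convention rather than something stated here.

```c
#include <stddef.h>
#include <stdint.h>

struct cpu_regs { uint64_t x[32]; };
struct thread   { struct cpu_regs reg; };

/* Simulated per-hart CSR; on hardware this is the sscratch register. */
struct hart { uintptr_t sscratch; };

/* Before the sret into U-mode: publish the save area of the thread
 * that is about to run. */
void enter_user(struct hart *h, struct thread *thp)
{
    h->sscratch = (uintptr_t)&thp->reg;
}

/* Trap entry: nonzero sscratch means U-mode was interrupted and names
 * the save area directly; zero means the trap came from S-mode. */
struct cpu_regs *trap_entry(struct hart *h, const struct cpu_regs *live,
                            struct cpu_regs *kstack_frame)
{
    struct cpu_regs *save = (struct cpu_regs *)h->sscratch;
    if (save == NULL)
        save = kstack_frame;    /* assumed: S-mode frames go on the kernel stack */
    *save = *live;
    h->sscratch = 0;            /* the hart is now running in S-mode */
    return save;
}
```

No shared table is consulted on the U-mode path, so there is nothing for another CPU to change underneath the save.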
The tp/TLS corruption
After the sscratch redesign, a different SMP bug surfaced. get_cpunum() —
which needs to know which hart a piece of code is running on — was
implemented by reading through the tp register: tp → TLS.__reserved2[0] → cpupage → cpu. The issue is that tp points to a thread's TLS, not a
per-CPU resource. When __ker_exit dispatches an S-mode thread via
__ker_restore, it loads tp from that thread's TLS. If the thread later
blocks and migrates to another CPU, both CPUs now share the same TLS —
racing on the cpupage pointer at __reserved2[0].
Symptoms: idle_pick_thread asserting "cpu=X but inkernel owner=Y",
act=NULL crashes in the ecall handler, get_cpunum() returning the wrong
hart ID on secondary CPUs.
Fix: introduce idle_tls[PROCESSORS_MAX] — one TLS per CPU, populated at
init_objects() time from the idle thread. This TLS is private to its CPU
and never shared with migrating threads. On the thp==NULL path in
ker_exit.c, tp switches to idle_tls[cpu] before calling cpu_idle.
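
A toy model of why a per-CPU TLS fixes the lookup. The struct shapes are simplified stand-ins (the cpupage field plays the role of TLS.__reserved2[0]), the value of PROCESSORS_MAX is assumed, and the helper names are hypothetical.

```c
#define PROCESSORS_MAX 8            /* assumed value, for illustration only */

struct cpupage { int cpu; };
struct tls     { struct cpupage *cpupage; };   /* models TLS.__reserved2[0] */

static struct cpupage cpupages[PROCESSORS_MAX];
static struct tls     idle_tls[PROCESSORS_MAX];

/* One private TLS per CPU, each wired to that CPU's own cpupage. */
void idle_tls_init(void)
{
    for (int i = 0; i < PROCESSORS_MAX; i++) {
        cpupages[i].cpu = i;
        idle_tls[i].cpupage = &cpupages[i];
    }
}

/* get_cpunum() modelled as a read through tp. */
int cpunum_via_tp(const struct tls *tp)
{
    return tp->cpupage->cpu;
}

/* The tp value to load on the idle (thp == NULL) path for this CPU. */
const struct tls *idle_tp_for(int cpu)
{
    return &idle_tls[cpu];
}
```

A thread's own TLS keeps pointing at whichever cpupage it was last given, so reading through it after a migration can report the old CPU; idle_tls[cpu] cannot go stale because it never leaves its CPU.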
Unified trap frame
While all the above was being fixed, the trap entry assembly was also
cleaned up structurally. The old trap.S had separate U-mode and S-mode
save paths with different frame layouts and raw numeric offsets. It also had a
long-standing bug: t4 was clobbered by la t4, ker_stack before being saved,
so the user's actual t4 value was lost on entry.
The rewrite introduced a single entry point (__strap_entry), a shared
frame layout matching RISCV_CPU_REGISTERS (272 bytes), symbolic constants
in trap_asm.h, and both paths saving all 31 GPRs. The S-mode partial-save
bug — where callee-saved registers were lost on preemption — is gone.
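
The size arithmetic is easy to check: 31 GPRs at 8 bytes each take 248 bytes, leaving 24 bytes of the 272. The sketch below assumes those 24 bytes are three 8-byte CSR slots; the field names, the field order, and the TF_GPR macro are hypothetical and not taken from trap_asm.h.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed layout: the real RISCV_CPU_REGISTERS field order is not shown
 * in this post. */
struct trap_frame {
    uint64_t gpr[31];   /* x1..x31; x0 is hardwired to zero, needs no slot */
    uint64_t csr[3];    /* assumed: three CSR slots (for example sepc, sstatus) */
};

/* Symbolic offset of register x<n>, 1 <= n <= 31, replacing raw numbers. */
#define TF_GPR(n)  ((size_t)(((n) - 1) * 8))
#define TF_SIZE    (sizeof(struct trap_frame))

/* Fail the build if the C layout and the assembly constants ever diverge. */
_Static_assert(sizeof(struct trap_frame) == 272, "trap frame must be 272 bytes");
```

A compile-time check of this kind is one way to keep assembly offsets and the C struct from drifting apart again.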
Per-CPU clock handler
The last piece of the SMP stability puzzle was the timer handler. The
original clock_handler ran on all CPUs but processed a global loop, which
could corrupt the timer queue if two CPUs entered it simultaneously. The fix
matches RISC-V hardware reality: each hart has its own independent timer
(mtimecmp), so each CPU now processes only its own active thread. Global
time update and timer_expiry happen on the boot CPU only. When
am_inkernel() is true for this CPU, the handler takes a minimal path —
just sbi_set_timer, nothing else — preventing reentrancy on the same-CPU
kernel stack.
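
The control flow of the new handler, reduced to a C model with counters in place of real work. It assumes the boot CPU is CPU 0; the struct and its field names are hypothetical.

```c
#include <stdint.h>

#define NCPU     4
#define BOOT_CPU 0                      /* assumption: the boot CPU is CPU 0 */

struct clock_state {
    uint64_t global_ticks;              /* advanced by the boot CPU only */
    unsigned timers_rearmed[NCPU];      /* stands in for sbi_set_timer calls */
    unsigned slices_charged[NCPU];      /* work on this CPU's own active thread */
    int      inkernel[NCPU];            /* stands in for am_inkernel() */
};

void clock_tick(struct clock_state *s, int cpu)
{
    s->timers_rearmed[cpu]++;           /* always re-arm this hart's timer */

    if (s->inkernel[cpu])
        return;                         /* minimal path: nothing else */

    if (cpu == BOOT_CPU)
        s->global_ticks++;              /* global time and timer expiry */

    s->slices_charged[cpu]++;           /* only this CPU's thread is touched */
}
```

Because no CPU walks another CPU's state, two harts taking a timer interrupt at the same moment no longer contend for the same queue.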
After all of the above: 20/20 clean SMP runs. The crashes are gone.
Process Lifecycle Overhaul
In parallel with the SMP work, the process exit path was rebuilt from scratch. The old design used a terminator thread: a dedicated thread that woke up to clean up a dying process's resources, an approach inherited from QNX 6.4's multi-binary architecture. It was the source of several deadlocks in earlier versions.
The new design: _exit() sends a _PROC_EXIT message to procmgr. A pool
thread handles it through procmgr_exit, which performs cleanup in ten
explicit steps — threads, events, channels, timers, file descriptors,
connections, conf vars, memory, ProcessShutdown, parent notification. No
terminator thread, no separate cleanup pulse, no KILLSELF trampoline.
Abnormal termination (where a process dies without calling _exit(), e.g.
because the loader failed) was a separate problem: the
process_threads_destroyed kernel callback had been removed in an earlier
cleanup commit, leaving no notification path to taskman for this case.
The callback has been restored, guarded with !QRV_FLG_PROC_TERMING to avoid
duplicate pulses when procmgr_exit is already running.
Zombie reaping: when a child exits before the parent calls waitpid(),
procmgr now sends a PROC_CODE_CHILD_EXITED pulse to the parent's system
thread channel. The system thread buffers the {pid, status} pair in user
space. waitpid() checks that buffer first before making a procmgr
round-trip. The "zombie" kept alive until the parent reaps it is just a bare
tProcess entry — all resources already freed.
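
A minimal sketch of the user-space buffer and the waitpid() fast path, assuming a small fixed-size array. The capacity, the type names, and the exit_buf_* functions are hypothetical; only the {pid, status} pair and the check-the-buffer-first behaviour come from the description above.

```c
#include <stddef.h>

#define EXIT_BUF_MAX 16                 /* hypothetical capacity */

struct exit_rec { int pid; int status; };
struct exit_buf { struct exit_rec rec[EXIT_BUF_MAX]; size_t count; };

/* System thread side: record a child-exited pulse.
 * Returns 0 on success, -1 if the buffer is full. */
int exit_buf_push(struct exit_buf *b, int pid, int status)
{
    if (b->count == EXIT_BUF_MAX)
        return -1;
    b->rec[b->count].pid = pid;
    b->rec[b->count].status = status;
    b->count++;
    return 0;
}

/* waitpid() side: pid == -1 matches any child. Returns the reaped pid and
 * fills *status, or returns 0 when the caller must ask procmgr instead. */
int exit_buf_take(struct exit_buf *b, int pid, int *status)
{
    for (size_t i = 0; i < b->count; i++) {
        if (pid == -1 || b->rec[i].pid == pid) {
            int found = b->rec[i].pid;
            *status = b->rec[i].status;
            b->rec[i] = b->rec[b->count - 1];   /* swap-remove */
            b->count--;
            return found;
        }
    }
    return 0;
}
```

A real implementation would also need to synchronize the system thread with callers of waitpid(), which this single-threaded model leaves out.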
procmgr_detach
Resource managers need to detach from their parent process after
initialization and before entering their dispatch loop — otherwise the parent
(the init script) blocks forever in waitpid. The old workaround was
launching drivers with & in the init script. That is gone.
The new _PROC_DETACH message and procmgr_detach(int status) API handle
this properly: the parent's waitpid() is unblocked with the given status,
and the resource manager is reparented to pid 1. devc-ser8250 and
devb-virtio now call procmgr_detach(EXIT_SUCCESS) before entering their
dispatch loops.
Driver Infrastructure
FDT-based device discovery
Up to v0.19, device drivers used hardcoded physical addresses for their hardware registers. That works for QEMU's virt machine, where the memory map is fixed and documented, but it is not how a real system works.
v0.20 adds a second FDT pass in init_hwinfo() that walks the device tree,
discovers ns16550a and virtio,mmio nodes, and populates the syspage
hwinfo section with base addresses, sizes, and IRQ numbers. Drivers then
call hwinfo_find_device() to discover their hardware, falling back to
command-line flags if needed. Both devc-ser8250 and devb-virtio use
this path now. The system should work on any board with a correct device
tree — QEMU virt and SiFive Unmatched alike — without source changes.
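
The driver-side policy (prefer what the device tree reported, fall back to the command line) can be sketched as follows. The real signature of hwinfo_find_device() is not shown in this post, so the struct and the hw_lookup and hw_resolve functions here are hypothetical stand-ins.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical record for one discovered device. */
struct hw_dev {
    const char *compatible;     /* FDT "compatible" string, e.g. "ns16550a" */
    uint64_t    base;           /* register base address */
    uint64_t    size;           /* register window size */
    int         irq;
};

/* Find the first table entry with a matching compatible string. */
const struct hw_dev *hw_lookup(const struct hw_dev *tab, size_t n,
                               const char *compatible)
{
    for (size_t i = 0; i < n; i++)
        if (strcmp(tab[i].compatible, compatible) == 0)
            return &tab[i];
    return NULL;
}

/* Prefer the discovered entry; otherwise use the command-line override.
 * Returns 0 and fills *out on success, -1 if neither source has the device. */
int hw_resolve(const struct hw_dev *tab, size_t n, const char *compatible,
               const struct hw_dev *cmdline, struct hw_dev *out)
{
    const struct hw_dev *d = hw_lookup(tab, n, compatible);
    if (d == NULL)
        d = cmdline;
    if (d == NULL)
        return -1;
    *out = *d;
    return 0;
}
```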
The virtio DMA collision
devb-virtio was allocating its DMA buffer at a hardcoded physical address:
0x8FF00000. The kernel's page-table allocator had no knowledge of this
reservation and could place L2 page-table pages at the same PA. When the
virtio device wrote to its DMA buffer it was overwriting the process's page
table entries. The resulting failures — stack page faults, illegal
instruction traps — were non-deterministic and looked completely unrelated to
the driver.
Fix: use MAP_ANON for the DMA buffer and mem_offset() to discover its
physical address. The memory manager picks a conflict-free PA; the driver
then gives that address to the virtio device. mem_offset64() was added to
libc for this purpose.
Heterogeneous SoCs: SiFive U740
The SiFive Unmatched board uses the U740 SoC, which has five harts: one S7
monitor core (hart 0) without MMU support and four U74 application cores
with Sv39. The FDT parser was previously adding every cpu@N node to
hart_ids[] unconditionally, which would crash when QRV tried to enable
Sv39 or execute FPU instructions on the S7.
Fix: a hart is now accepted only if the FDT lists an mmu-type property and
its value is not riscv,none. Any of sv39, sv48, or sv57 passes. On the
Unmatched, hart 0 is skipped with a diagnostic. On QEMU virt (all
harts have mmu-type=riscv,sv39), nothing changes.
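
The acceptance rule fits in one small predicate. The function name is hypothetical, and the sketch assumes the caller passes NULL when the cpu node has no mmu-type property.

```c
#include <stddef.h>
#include <string.h>

/* mmu_type is the value of the cpu node's "mmu-type" property,
 * or NULL when the property is absent.
 * Returns 1 if the hart should be added to hart_ids[], 0 otherwise. */
int hart_is_usable(const char *mmu_type)
{
    if (mmu_type == NULL)
        return 0;                               /* no property: reject */
    return strcmp(mmu_type, "riscv,none") != 0; /* explicit "no MMU": reject */
}
```

Note that this accepts any value other than riscv,none, so a paging mode the kernel does not implement would still pass the filter.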
Cross-Address-Space Message Passing
pageman_map_xfer — the function that translates user virtual addresses to
kernel-accessible addresses for cross-process MsgSend — was a stub
returning 0 in earlier versions. That worked as long as sender and receiver
shared an address space, which is not the general case.
v0.20 replaces it with a real implementation: walk the target process's Sv39
page table, translate each user VA to a kernel direct-map VA (PA +
KERNEL_VIRT_BASE), and build output IOV entries that xfermsg can copy
through. The companion fix in nano_xfer.c handles the loader case: the
loader thread runs directly in the child process via ThreadCreate_r, so
aspace_prp is NULL but the address space still differs from the server's.
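
For reference, a generic Sv39 walk over simulated physical memory, followed by the direct-map step. This is a sketch of the technique rather than the pageman_map_xfer code: the function names and the physmem buffer are hypothetical, the caller supplies the direct-map base, and permission and canonical-address checks are omitted.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PTE_V      0x1u
#define PTE_R      0x2u
#define PTE_W      0x4u
#define PTE_X      0x8u
#define PAGE_SHIFT 12
#define PPN_MASK   ((UINT64_C(1) << 44) - 1)    /* PTE bits 53:10 */

/* Walk a three-level Sv39 table stored in physmem (physical address 0 is
 * physmem[0]). Returns 0 and sets *pa on success, -1 on any fault. */
int sv39_translate(const uint8_t *physmem, size_t memsize,
                   uint64_t root_pa, uint64_t va, uint64_t *pa)
{
    uint64_t table = root_pa;

    for (int level = 2; level >= 0; level--) {
        unsigned shift = PAGE_SHIFT + 9u * (unsigned)level;
        uint64_t slot  = table + ((va >> shift) & 0x1ff) * 8;
        uint64_t pte;

        if (slot + sizeof(pte) > memsize)
            return -1;                          /* table outside the model */
        memcpy(&pte, physmem + slot, sizeof(pte));
        if (!(pte & PTE_V))
            return -1;                          /* not mapped */

        uint64_t base = ((pte >> 10) & PPN_MASK) << PAGE_SHIFT;
        if (pte & (PTE_R | PTE_W | PTE_X)) {    /* leaf: 4 KiB, 2 MiB or 1 GiB */
            uint64_t mask = (UINT64_C(1) << shift) - 1;
            if (base & mask)
                return -1;                      /* misaligned superpage */
            *pa = base | (va & mask);
            return 0;
        }
        table = base;                           /* descend one level */
    }
    return -1;                                  /* pointer PTE at level 0 */
}

/* Kernel direct-map address for a physical address: PA + base. */
uint64_t direct_map_va(uint64_t pa, uint64_t kernel_virt_base)
{
    return pa + kernel_virt_base;
}
```

A transfer that spans several pages repeats the translation once per page, because consecutive user pages need not be physically contiguous.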
Other Changes Worth Noting
Stack guard pages. Kernel per-CPU stacks and taskman pool thread stacks now have guard pages: the bottom page's PTE is invalidated, so a stack overflow produces a clean page fault rather than silent corruption. The guard page logic walks the Sv39 table and splits megapages where needed.
Source tree reorganization. Kernel and taskman internal headers moved
out of include/ — which now contains only the user-space ABI — and into
kernel/include/ and taskman/include/. Architecture sources migrated from
arch/riscv/ to kernel/arch/riscv/. A cleanup that should have happened
earlier; the public/private boundary is now explicit.
Documentation. A start on formal documentation: a SysArch document
covering the RISC-V ecall problem (S-mode cannot trap to itself),
__kercall_entry, scheduling, context switching, and the Ring0 mechanism.
A KerCallRef LaTeX reference for individual kernel calls, with entries for
InterruptAttachThread and ConnectAttach.
pidin — the process information utility — now works without a
feedback loop. The old implementation iterated thread IDs until ESRCH,
but each query caused the thread pool to briefly spawn new threads to handle
it, which pidin would then discover and keep scanning. The scan is now capped
at 2× num_threads from the initial snapshot.
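
The cap amounts to a bounded loop. In this sketch the probe callback, the function names, and the assumption that thread IDs start at 1 are all hypothetical; the toy probe reproduces the feedback loop by reporting that every queried ID exists.

```c
/* Probe callback: returns 0 if the thread id exists, nonzero (think ESRCH)
 * when it does not. */
typedef int (*probe_fn)(int tid, void *ctx);

/* Scan thread ids upward until the probe fails or the cap is reached.
 * Returns the number of threads found. */
int scan_threads(probe_fn probe, void *ctx, int num_threads_snapshot)
{
    int limit = 2 * num_threads_snapshot;   /* cap from the initial snapshot */
    int found = 0;

    for (int tid = 1; tid <= limit; tid++) {
        if (probe(tid, ctx) != 0)
            break;                          /* normal end of the thread list */
        found++;
    }
    return found;
}

/* Toy probe: every query "spawns" another thread, so the id always exists.
 * ctx points to an int that counts the queries made. */
int always_growing_probe(int tid, void *ctx)
{
    int *queries = ctx;
    (void)tid;
    (*queries)++;
    return 0;
}
```

Without the limit, a scan against always_growing_probe would never end; with a snapshot of 5 threads it stops after 10 queries.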
What Comes Next
The immediate targets: SMP timer handling improvements (deferring
clock_handler to __ker_exit for cleaner reentrancy), more user-space
utilities, and getting the system running on real SiFive Unmatched hardware
rather than just QEMU.