Friday, April 03, 2026

QRV v0.20: SMP Stable, Drivers Work, Process Lifecycle Complete

v0.20 is the release where SMP stopped being a liability. The system now runs 20/20 clean SMP boot sequences — four CPUs, full IPC stack, drivers loading, shell responding — without crashes. That number, 20/20, took a while to reach and required fixing several genuinely interesting bugs in the kernel's trap entry, context switching, and timer handling. This post is mostly about those bugs, because they are the substance of this release.

The other significant pieces: a complete overhaul of the process exit path, a proper procmgr_detach mechanism for resource managers, FDT-based device discovery replacing hardcoded hardware addresses, and initial support for heterogeneous RISC-V SoCs. More on all of that below.


The SMP Bug Hunt

v0.19 left things in a state where single-CPU was mostly reliable but SMP was not — crashes ranged from act=NULL in the ecall handler to mysterious illegal instruction traps, to the kernel restoring a user thread with kernel pointers in its registers. The failures were non-deterministic, which is the characteristic signature of race conditions in the trap entry path.

There were several distinct bugs. Here they are in the order they were found and fixed, because the order matters for understanding what was actually going on.

The stale a7 deadlock

The first one sounds almost comically simple in retrospect. MsgReceive and MsgReceivePulse share a single kernel handler (ker_msg_receivev), which uses the a7 register — the syscall number — to distinguish between them via the KTYPE() macro. A pool thread blocked in MsgReceive has a7 = __KER_MSG_RECEIVEV. A pool thread that once called MsgReceivePulse has a7 = __KER_MSG_RECEIVEPULSEV.

The problem: when object_alloc recycles a dead thread's soul object for a new pool thread, it does not zero the register save area. So a new pool thread could inherit a7=24 (the pulse-receive syscall number) from whatever thread had previously occupied that slot. The kernel then marks it with QRV_ITF_RCVPULSE — pulse-only — and it becomes invisible to regular MsgSend. Over time, as threads were recycled, all pool threads accumulated this flag. No thread remained to handle regular messages. Clean deadlock, and an intermittent one because it depended on which threads got recycled in which order.

Fix: memset(&thp->reg, 0, sizeof(thp->reg)) at thread creation.
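
The recycling hazard can be reproduced in a few lines of C. This is a minimal sketch, assuming a one-slot object pool: the sketch_ types and functions are hypothetical stand-ins, and only the value 24 and the memset itself come from the description above.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-ins for kernel definitions. Only the value 24 comes
 * from the post; everything else is assumed for illustration. */
enum { SKETCH_KER_MSG_RECEIVEPULSEV = 24 };

struct sketch_regs { unsigned long a7; unsigned long gpr[30]; };
struct sketch_thread { struct sketch_regs reg; int in_use; };

static struct sketch_thread slot;   /* a single recyclable object slot */

/* Hand out the slot; zero_regs selects the fixed or the buggy behaviour. */
struct sketch_thread *sketch_thread_alloc(int zero_regs)
{
    assert(!slot.in_use);
    slot.in_use = 1;
    if (zero_regs)
        memset(&slot.reg, 0, sizeof(slot.reg));   /* the fix */
    return &slot;
}

void sketch_thread_free(struct sketch_thread *thp)
{
    thp->in_use = 0;   /* register contents are deliberately left behind */
}

/* Models the idea of classifying a receiver by its saved a7. */
int sketch_is_pulse_only(const struct sketch_thread *thp)
{
    return thp->reg.a7 == SKETCH_KER_MSG_RECEIVEPULSEV;
}
```

Allocating with zero_regs set to 0 after a pulse-receiving thread has been freed yields a thread that already looks pulse-only; allocating with zero_regs set to 1 does not.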

The actives[] race in kercall save

The __kercall_entry assembly for RISC-V was saving registers directly into actives[cpu]->reg. The problem: on SMP, another CPU's mark_running() can change actives[this_cpu] between the moment you read the pointer and the moment you finish writing the registers. You end up saving the ecall arguments into the wrong thread's register save area. The thread that actually made the ecall never gets its registers saved correctly.

Fix: __kercall_entry now saves to a per-CPU kernel stack first. __kercall_dispatch copies the frame to act->reg after acquiring the kernel lock, at which point actives[cpu] is stable.
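
The two-stage save can be modelled in plain C. This is a rough sketch rather than the actual assembly: the sketch_ names are hypothetical, kstack_frame[] stands in for the per-CPU kernel stack, and the kernel lock is assumed to be held by the caller of the dispatch step.

```c
#include <assert.h>

#define SKETCH_NCPU 4

struct sketch_regs { unsigned long a0, a7; };
struct sketch_thread { struct sketch_regs reg; };

/* Model of the shared array that another CPU may update at any time. */
static struct sketch_thread *actives[SKETCH_NCPU];

/* Stand-in for the per-CPU kernel stack: private to one CPU, never racy. */
static struct sketch_regs kstack_frame[SKETCH_NCPU];

/* Stage 1 (entry): save the live registers without touching actives[]. */
void sketch_kercall_entry(int cpu, const struct sketch_regs *live)
{
    assert(cpu >= 0 && cpu < SKETCH_NCPU);
    kstack_frame[cpu] = *live;
}

/* Stage 2 (dispatch): the kernel lock is assumed held here, so
 * actives[cpu] is stable and the copy lands in the right thread. */
struct sketch_thread *sketch_kercall_dispatch(int cpu)
{
    struct sketch_thread *act = actives[cpu];
    assert(act != 0);
    act->reg = kstack_frame[cpu];
    return act;
}
```

The point of the split is that stage 1 only writes CPU-private memory, so a transiently wrong actives[] entry cannot receive the ecall arguments.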

The sscratch redesign

The deeper structural issue was the trap entry convention. QRV had been using actives[] to find the current thread on U-mode trap entry — a lookup that is inherently racy on SMP for the same reason. The fix was to adopt the Linux-style sscratch convention:

  • U-mode running: sscratch = &thp->reg (the current thread's register save area, directly)
  • S-mode running: sscratch = 0

On U-mode trap entry, registers are saved directly into thp->reg via the sscratch pointer — no actives[] lookup, no copy. The thread pointer is known at the moment the trap fires, because it was stored in sscratch before the sret that entered U-mode.

This is a fundamental restructuring of trap.S, kernel.S, and ker_exit.c, and it eliminated the race that was causing a U-mode thread's registers to be written into the wrong slot.
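
The convention is easier to see as a small state machine. The following C model is only a sketch: the real code is assembly built around the sscratch CSR, the sketch_ names are hypothetical, and saving S-mode traps into a per-hart kernel frame is an assumption made here for illustration.

```c
#include <assert.h>
#include <stddef.h>

struct sketch_regs { unsigned long gpr[31]; };
struct sketch_thread { struct sketch_regs reg; };

struct sketch_hart {
    struct sketch_regs *sscratch;   /* models the sscratch CSR */
    struct sketch_regs kframe;      /* assumed save area for S-mode traps */
};

/* Just before the sret into U-mode: publish the thread's save area. */
void sketch_return_to_user(struct sketch_hart *h, struct sketch_thread *thp)
{
    assert(thp != NULL);
    h->sscratch = &thp->reg;
}

/* Trap entry: a non-zero sscratch means the trap came from U-mode and
 * names the save area directly; zero means the hart was already in S-mode. */
struct sketch_regs *sketch_trap_entry(struct sketch_hart *h,
                                      const struct sketch_regs *live)
{
    struct sketch_regs *dst = h->sscratch ? h->sscratch : &h->kframe;
    h->sscratch = NULL;             /* the hart is now running in S-mode */
    *dst = *live;
    return dst;
}
```

No shared array is consulted on entry; the destination is whatever this hart stored before it last left the kernel.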

The tp/TLS corruption

After the sscratch redesign, a different SMP bug surfaced. get_cpunum() — which needs to know which hart a piece of code is running on — was implemented by reading through the tp register: tp → TLS.__reserved2[0] → cpupage → cpu. The issue is that tp points to a thread's TLS, not a per-CPU resource. When __ker_exit dispatches an S-mode thread via __ker_restore, it loads tp from that thread's TLS. If the thread later blocks and migrates to another CPU, both CPUs now share the same TLS — racing on the cpupage pointer at __reserved2[0].

Symptoms: idle_pick_thread asserting "cpu=X but inkernel owner=Y", act=NULL crashes in the ecall handler, get_cpunum() returning the wrong hart ID on secondary CPUs.

Fix: introduce idle_tls[PROCESSORS_MAX] — one TLS per CPU, populated at init_objects() time from the idle thread. This TLS is private to its CPU and never shared with migrating threads. On the thp==NULL path in ker_exit.c, tp switches to idle_tls[cpu] before calling cpu_idle.
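
A simplified model shows why a per-CPU TLS removes the ambiguity. In this sketch the tp[] array stands in for each hart's tp register and the cpupage field models TLS.__reserved2[0]; all sketch_ names are hypothetical, and only idle_tls follows the naming used above.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_NCPU 4

struct sketch_cpupage { int cpu; };
struct sketch_tls { struct sketch_cpupage *cpupage; };  /* models __reserved2[0] */

static struct sketch_cpupage cpupages[SKETCH_NCPU];
static struct sketch_tls idle_tls[SKETCH_NCPU];   /* one private TLS per CPU */
static struct sketch_tls *tp[SKETCH_NCPU];        /* models each hart's tp */

/* Populate the per-CPU TLS blocks once at init time. */
void sketch_init_idle_tls(void)
{
    for (int i = 0; i < SKETCH_NCPU; i++) {
        cpupages[i].cpu = i;
        idle_tls[i].cpupage = &cpupages[i];
    }
}

/* The lookup chain described above: tp -> TLS -> cpupage -> cpu. */
int sketch_get_cpunum(int hart)
{
    assert(tp[hart] != NULL);
    return tp[hart]->cpupage->cpu;
}

/* On the no-thread path, switch tp to this CPU's private TLS. */
void sketch_enter_idle(int hart)
{
    tp[hart] = &idle_tls[hart];
}
```

With two harts pointing tp at the same thread TLS, the lookup on one of them reports the other's CPU number; after sketch_enter_idle() the answer is correct again.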

Unified trap frame

While all of the above was being fixed, the trap entry assembly was also cleaned up structurally. The old trap.S had separate U-mode and S-mode save paths with different frame layouts and raw numeric offsets. It also carried a long-standing bug: t4 was clobbered by la t4, ker_stack before being saved, so the user's actual t4 value was lost on entry.

The rewrite introduced a single entry point (__strap_entry), a shared frame layout matching RISCV_CPU_REGISTERS (272 bytes), symbolic constants in trap_asm.h, and both paths saving all 31 GPRs. The S-mode partial-save bug — where callee-saved registers were lost on preemption — is gone.

Per-CPU clock handler

The last piece of the SMP stability puzzle was the timer handler. The original clock_handler ran on all CPUs but processed a global loop, which could corrupt the timer queue if two CPUs entered it simultaneously. The fix matches RISC-V hardware reality: each hart has its own independent timer (mtimecmp), so each CPU now processes only its own active thread. Global time update and timer_expiry happen on the boot CPU only. When am_inkernel() is true for this CPU, the handler takes a minimal path — just sbi_set_timer, nothing else — preventing reentrancy on the same-CPU kernel stack.
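
The control flow of the reworked handler is roughly as follows. This is a sketch under stated assumptions: counters stand in for sbi_set_timer, per-thread accounting, and the global time update, and every sketch_ name is hypothetical.

```c
#include <assert.h>

#define SKETCH_NCPU 4
#define SKETCH_BOOT_CPU 0

struct sketch_cpu {
    int inkernel;            /* models am_inkernel() for this CPU */
    unsigned rearms;         /* counts stand-in sbi_set_timer calls */
    unsigned thread_ticks;   /* ticks charged to this CPU's active thread */
};

static struct sketch_cpu cpus[SKETCH_NCPU];
static unsigned long global_ticks;   /* updated on the boot CPU only */

void sketch_clock_handler(int cpu)
{
    assert(cpu >= 0 && cpu < SKETCH_NCPU);

    cpus[cpu].rearms++;              /* always re-arm this hart's timer */

    if (cpus[cpu].inkernel)
        return;                      /* minimal path: nothing else is touched */

    cpus[cpu].thread_ticks++;        /* only this CPU's own active thread */

    if (cpu == SKETCH_BOOT_CPU)
        global_ticks++;              /* global time and timer expiry */
}
```

No code path walks state belonging to another CPU, which is what removes the concurrent-entry hazard of the old global loop.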

After all of the above: 20/20 clean SMP runs. The crashes are gone.


Process Lifecycle Overhaul

In parallel with the SMP work, the process exit path was rebuilt from scratch. The old design used a terminator thread: a dedicated thread that woke up to clean up a dying process's resources, an approach inherited from QNX 6.4's multi-binary architecture. It was the source of several deadlocks in earlier versions.

The new design: _exit() sends a _PROC_EXIT message to procmgr. A pool thread handles it through procmgr_exit, which performs cleanup in ten explicit steps — threads, events, channels, timers, file descriptors, connections, conf vars, memory, ProcessShutdown, parent notification. No terminator thread, no separate cleanup pulse, no KILLSELF trampoline.

Abnormal termination (where a process dies without calling _exit(), e.g. because the loader failed) was a separate problem: the process_threads_destroyed kernel callback had been removed in an earlier cleanup commit, leaving no notification path to taskman for this case. v0.20 restores the callback, guarded with !QRV_FLG_PROC_TERMING to avoid duplicate pulses when procmgr_exit is already running.

Zombie reaping: when a child exits before the parent calls waitpid(), procmgr now sends a PROC_CODE_CHILD_EXITED pulse to the parent's system thread channel. The system thread buffers the {pid, status} pair in user space, and waitpid() checks that buffer before making a procmgr round-trip. The "zombie" kept alive until the parent reaps it is just a bare tProcess entry; all of its resources have already been freed.
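
The user-space side of this can be pictured as a small buffer that waitpid() consults first. The sketch below assumes a fixed-size, unordered buffer and omits the locking a real system thread would need; the sketch_ names and the capacity of 16 are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_MAX_EXITED 16   /* assumed capacity, for illustration only */

struct sketch_exit_rec { int pid; int status; };

static struct sketch_exit_rec exited[SKETCH_MAX_EXITED];
static size_t exited_count;

/* Called when a child-exited pulse arrives. Returns -1 if the buffer is full. */
int sketch_exit_buffer_put(int pid, int status)
{
    if (exited_count == SKETCH_MAX_EXITED)
        return -1;
    exited[exited_count].pid = pid;
    exited[exited_count].status = status;
    exited_count++;
    return 0;
}

/* Called first by a waitpid-like function. pid == -1 matches any child.
 * Returns the reaped pid, or 0 if nothing is buffered and the caller
 * has to fall back to a procmgr round-trip. */
int sketch_exit_buffer_take(int pid, int *status)
{
    assert(status != NULL);
    for (size_t i = 0; i < exited_count; i++) {
        if (pid == -1 || exited[i].pid == pid) {
            int reaped = exited[i].pid;
            *status = exited[i].status;
            exited[i] = exited[--exited_count];   /* unordered removal */
            return reaped;
        }
    }
    return 0;
}
```

A hit in the buffer means the parent never has to block or message procmgr for a child that has already exited.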

procmgr_detach

Resource managers need to detach from their parent process after initialization and before entering their dispatch loop. Otherwise the parent (the init script) blocks forever in waitpid. The old workaround was launching drivers with & in the init script; that workaround is no longer needed.

The new _PROC_DETACH message and procmgr_detach(int status) API handle this properly: the parent's waitpid() is unblocked with the given status, and the resource manager is reparented to pid 1. devc-ser8250 and devb-virtio now call procmgr_detach(EXIT_SUCCESS) before entering their dispatch loops.


Driver Infrastructure

FDT-based device discovery

Up to v0.19, device drivers used hardcoded physical addresses for their hardware registers. That works for QEMU's virt machine, where the memory map is fixed and documented, but it is not how a real system works.

v0.20 adds a second FDT pass in init_hwinfo() that walks the device tree, discovers ns16550a and virtio,mmio nodes, and populates the syspage hwinfo section with base addresses, sizes, and IRQ numbers. Drivers then call hwinfo_find_device() to discover their hardware, falling back to command-line flags if needed. Both devc-ser8250 and devb-virtio use this path now. The system should work on any board with a correct device tree — QEMU virt and SiFive Unmatched alike — without source changes.
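
The discover-then-fall-back pattern looks roughly like this. The real signature of hwinfo_find_device() is not shown in this post, so the sketch uses hypothetical sketch_ functions and a hypothetical record layout that only mirrors the fields named above (base address, size, IRQ).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical record; mirrors the fields the hwinfo section is said to hold. */
struct sketch_hwdev {
    const char *compatible;
    uint64_t base;
    uint64_t size;
    int irq;
};

/* Look a device up by its FDT compatible string. */
const struct sketch_hwdev *sketch_find_device(const struct sketch_hwdev *tbl,
                                              size_t n, const char *compatible)
{
    assert(compatible != NULL);
    for (size_t i = 0; i < n; i++)
        if (strcmp(tbl[i].compatible, compatible) == 0)
            return &tbl[i];
    return NULL;
}

/* Prefer the discovered entry; otherwise use the command-line values
 * (cmdline may be NULL). Returns 0 on success, -1 if neither is available. */
int sketch_resolve_device(const struct sketch_hwdev *tbl, size_t n,
                          const char *compatible,
                          const struct sketch_hwdev *cmdline,
                          struct sketch_hwdev *out)
{
    const struct sketch_hwdev *dev = sketch_find_device(tbl, n, compatible);
    if (dev == NULL)
        dev = cmdline;
    if (dev == NULL)
        return -1;
    *out = *dev;
    return 0;
}
```

A driver written this way carries no board-specific addresses of its own; they come either from the device tree or from the operator.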

The virtio DMA collision

devb-virtio was allocating its DMA buffer at a hardcoded physical address: 0x8FF00000. The kernel's page-table allocator had no knowledge of this reservation and could place L2 page-table pages at the same PA. When the virtio device wrote to its DMA buffer it was overwriting the process's page table entries. The resulting failures — stack page faults, illegal instruction traps — were non-deterministic and looked completely unrelated to the driver.

Fix: use MAP_ANON for the DMA buffer and mem_offset() to discover its physical address. The memory manager picks a conflict-free PA; the driver then gives that address to the virtio device. mem_offset64() was added to libc for this purpose.

Heterogeneous SoCs: SiFive U740

The SiFive Unmatched board uses the U740 SoC, which has five harts: one S7 monitor core (hart 0) without MMU support and four U74 application cores with Sv39. The FDT parser was previously adding every cpu@N node to hart_ids[] unconditionally, which would crash when QRV tried to enable Sv39 or execute FPU instructions on the S7.

Fix: a hart is now accepted only if the FDT lists an mmu-type property and its value is not riscv,none. Any of sv39, sv48, or sv57 passes. On the Unmatched, hart 0 is skipped with a diagnostic. On QEMU virt (all harts have mmu-type=riscv,sv39), nothing changes.
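
The acceptance rule reduces to a small predicate. In this sketch the function names are hypothetical, a NULL string stands for a missing mmu-type property, and using the array index as the hart ID is a simplification (a real parser would read it from the cpu node).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* A hart is usable only if mmu-type exists and is not "riscv,none". */
int sketch_hart_is_usable(const char *mmu_type)
{
    if (mmu_type == NULL)
        return 0;                                /* property missing */
    return strcmp(mmu_type, "riscv,none") != 0;  /* explicit "no MMU" */
}

/* Filter a list of per-cpu-node mmu-type values into hart_ids[].
 * hart_ids must have room for n entries. Returns the number kept. */
size_t sketch_collect_harts(const char *const mmu_types[], size_t n,
                            int hart_ids[])
{
    assert(mmu_types != NULL && hart_ids != NULL);
    size_t kept = 0;
    for (size_t i = 0; i < n; i++)
        if (sketch_hart_is_usable(mmu_types[i]))
            hart_ids[kept++] = (int)i;
    return kept;
}
```

Fed a five-entry list where only the first entry lacks a usable mmu-type, the filter keeps harts 1 through 4.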


Cross-Address-Space Message Passing

pageman_map_xfer — the function that translates user virtual addresses to kernel-accessible addresses for cross-process MsgSend — was a stub returning 0 in earlier versions. That worked as long as sender and receiver shared an address space, which is not the general case.

v0.20 replaces it with a real implementation: walk the target process's Sv39 page table, translate each user VA to a kernel direct-map VA (PA + KERNEL_VIRT_BASE), and build output IOV entries that xfermsg can copy through. The companion fix in nano_xfer.c handles the loader case: the loader thread runs directly in the child process via ThreadCreate_r, so aspace_prp is NULL but the address space still differs from the server's.
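
The translation step follows the standard Sv39 walk. The sketch below runs against a tiny simulated physical memory so that it is self-contained; the sketch_ names, the fake_phys array, and the value chosen for SKETCH_KERNEL_VIRT_BASE are assumptions for illustration, not QRV's actual code or constants.

```c
#include <assert.h>
#include <stdint.h>

#define PTE_V (1ULL << 0)
#define PTE_R (1ULL << 1)
#define PTE_W (1ULL << 2)
#define PTE_X (1ULL << 3)
#define SKETCH_PAGE_SHIFT 12
#define SKETCH_KERNEL_VIRT_BASE 0xFFFFFFC000000000ULL  /* placeholder value */
#define SKETCH_NPAGES 4

/* Simulated physical memory holding page tables: page i lives at PA i * 4096. */
static uint64_t fake_phys[SKETCH_NPAGES][512];

uint64_t *sketch_table_at(uint64_t pa)
{
    assert((pa >> SKETCH_PAGE_SHIFT) < SKETCH_NPAGES);
    return fake_phys[pa >> SKETCH_PAGE_SHIFT];
}

/* Build a PTE: the PPN field starts at bit 10, flags sit in the low bits. */
uint64_t sketch_make_pte(uint64_t pa, uint64_t flags)
{
    return ((pa >> SKETCH_PAGE_SHIFT) << 10) | flags;
}

/* Three-level Sv39 walk. Returns 0 and stores the PA, or -1 if unmapped.
 * A real walker would also fault on a misaligned superpage; this one masks. */
int sketch_sv39_walk(uint64_t root_pa, uint64_t va, uint64_t *pa_out)
{
    uint64_t table_pa = root_pa;
    for (int level = 2; level >= 0; level--) {
        unsigned shift = SKETCH_PAGE_SHIFT + 9u * (unsigned)level;
        uint64_t pte = sketch_table_at(table_pa)[(va >> shift) & 0x1FF];
        if (!(pte & PTE_V) || (!(pte & PTE_R) && (pte & PTE_W)))
            return -1;                          /* invalid or reserved encoding */
        uint64_t target = ((pte >> 10) & ((1ULL << 44) - 1)) << SKETCH_PAGE_SHIFT;
        if (pte & (PTE_R | PTE_X)) {            /* leaf: 4 KiB, 2 MiB or 1 GiB */
            uint64_t mask = (1ULL << shift) - 1;
            *pa_out = (target & ~mask) | (va & mask);
            return 0;
        }
        table_pa = target;                      /* pointer to the next level */
    }
    return -1;                                  /* non-leaf PTE at level 0 */
}

/* User VA -> kernel direct-map VA, as PA + base. */
int sketch_user_va_to_kernel_va(uint64_t root_pa, uint64_t va, uint64_t *kva_out)
{
    uint64_t pa;
    if (sketch_sv39_walk(root_pa, va, &pa) != 0)
        return -1;
    *kva_out = pa + SKETCH_KERNEL_VIRT_BASE;
    return 0;
}
```

Because adjacent user pages need not be physically contiguous, a buffer that crosses a page boundary has to be translated one page at a time, producing one output IOV entry per contiguous piece.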


Other Changes Worth Noting

Stack guard pages. Kernel per-CPU stacks and taskman pool thread stacks now have guard pages: the bottom page's PTE is invalidated, so a stack overflow produces a clean page fault rather than silent corruption. The guard page logic walks the Sv39 table and splits megapages where needed.

Source tree reorganization. Kernel and taskman internal headers moved out of include/ — which now contains only the user-space ABI — and into kernel/include/ and taskman/include/. Architecture sources migrated from arch/riscv/ to kernel/arch/riscv/. A cleanup that should have happened earlier; the public/private boundary is now explicit.

Documentation. A start on formal documentation: a SysArch document covering the RISC-V ecall problem (S-mode cannot trap to itself), __kercall_entry, scheduling, context switching, and the Ring0 mechanism. A KerCallRef LaTeX reference for individual kernel calls, with entries for InterruptAttachThread and ConnectAttach.

pidin, the process information utility, now works without a feedback loop. The old implementation iterated thread IDs until ESRCH, but each query caused the thread pool to briefly spawn new threads to handle it, which pidin would then discover and keep scanning. The scan is now capped at 2× num_threads from the initial snapshot.
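
The feedback loop and the cap can be imitated with a toy thread pool. Everything in this sketch is hypothetical: the query callback returns 0 for an existing thread and -1 in place of ESRCH, and the demo pool grows by one thread on every successful query to mimic the behaviour described above.

```c
#include <assert.h>

/* Toy pool: thread IDs 1..live exist. */
struct sketch_pool { int live; };

/* Demo query that reproduces the feedback: answering a query spawns a thread. */
int sketch_growing_query(int tid, void *ctx)
{
    struct sketch_pool *p = ctx;
    if (tid > p->live)
        return -1;          /* stands in for ESRCH */
    p->live++;              /* the pool grows while being scanned */
    return 0;
}

/* Scan thread IDs, but never beyond twice the initial snapshot. */
int sketch_scan_threads(int num_threads, int (*query)(int, void *), void *ctx)
{
    assert(num_threads >= 0 && query != 0);
    int limit = 2 * num_threads;
    int seen = 0;
    for (int tid = 1; tid <= limit; tid++) {
        if (query(tid, ctx) != 0)
            break;          /* ran off the end of the thread list */
        seen++;
    }
    return seen;
}
```

Without the limit, the demo pool would keep the loop running indefinitely, since the end of the list recedes by one with every query.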


What Comes Next

The immediate targets: SMP timer handling improvements (deferring clock_handler to __ker_exit for cleaner reentrancy), more user-space utilities, and getting the system running on real SiFive Unmatched hardware rather than just QEMU.
