
Wednesday, May 06, 2026

QRV v0.34–v0.40: Taskman Moves to User Mode

Since the project began, taskman has run in S-mode — the RISC-V supervisor mode, shared with the kernel. It was kerlinked into the kernel address space at boot, called kernel functions by name, and reached kernel data structures directly. This was inherited from QNX's procnto-as-kernel-module model and it worked, but it was architecturally wrong: there is no meaningful trust boundary between the kernel and taskman when both run in the same address space under the same privilege level.

v0.40 completes the migration of taskman to U-mode. Taskman now runs as an ordinary user-space process at address 0xC0000000, with its own page tables, communicating with the kernel exclusively through the syscall_tm_priv mechanism. The kernel heap has PTE_U=0. A bug in taskman can no longer corrupt kernel data by accident.

This took about a week of intensive work spread across versions 0.34 through 0.40. Here is how it went.


v0.34–v0.35 — Groundwork: Relocating Bodies, Retiring __Ring0

The first problem was naming. The historical __Ring0(func_ptr, arg) primitive — kernel calls a function pointer supplied by taskman, executed with kernel privilege — was an x86 concept with x86 terminology. RISC-V has no rings. And beyond the name, the mechanism was architecturally dangerous: the kernel jumped to whatever address taskman supplied, with no whitelist beyond a capability flag.

syscall_tm_priv replaces __Ring0. The new mechanism: <sys/kercalls.h> defines a __KER_TM_* sub-namespace of 53 enum constants. syscall_tm_priv(id, arg) is a plain ecall that dispatches through a static kernel-owned table (ker_tm_priv.c), gated on QRV_FLG_PROC_TM_PRIV. The table is a whitelist: taskman passes an index, never an address. A _Static_assert catches enum/table drift at compile time. Every EXPORT_SYMBOL(kerext_*) entry was removed from the kernel's symbol table, so the kerext bodies are truly kernel-internal now.
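A minimal sketch of the whitelist idea, not the actual ker_tm_priv.c. Every demo_* and DEMO_* name is hypothetical, the two-entry enum stands in for the real 53 constants, and the error codes are an assumption.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the __KER_TM_* sub-namespace. */
enum demo_ker_tm_id {
    DEMO_KER_TM_NOP,
    DEMO_KER_TM_ADD_ONE,
    DEMO_KER_TM_COUNT                 /* must stay last */
};

#define DEMO_FLG_PROC_TM_PRIV 0x1u    /* stand-in for QRV_FLG_PROC_TM_PRIV */

struct demo_proc { uint32_t flags; };

typedef long (*demo_tm_handler_t)(struct demo_proc *caller, uintptr_t arg);

static long demo_nop(struct demo_proc *caller, uintptr_t arg)
{
    (void)caller; (void)arg;
    return 0;
}

static long demo_add_one(struct demo_proc *caller, uintptr_t arg)
{
    (void)caller;
    return (long)arg + 1;
}

/* Kernel-owned, read-only table: the caller picks an index, never an address. */
static const demo_tm_handler_t demo_tm_table[] = {
    [DEMO_KER_TM_NOP]     = demo_nop,
    [DEMO_KER_TM_ADD_ONE] = demo_add_one,
};

/* Fails the build if an enum constant is appended without a table entry. */
_Static_assert(sizeof demo_tm_table / sizeof demo_tm_table[0]
                   == (size_t)DEMO_KER_TM_COUNT,
               "enum and dispatch table are out of sync");

static long demo_tm_priv_dispatch(struct demo_proc *caller, unsigned id,
                                  uintptr_t arg)
{
    assert(caller != NULL);
    if (!(caller->flags & DEMO_FLG_PROC_TM_PRIV))
        return -EPERM;                /* capability gate */
    if (id >= DEMO_KER_TM_COUNT || demo_tm_table[id] == NULL)
        return -ENOSYS;               /* not on the whitelist */
    return demo_tm_table[id](caller, arg);
}
```

The size check only catches a missing tail entry; a hole in the middle of a designated-initializer table shows up as a NULL slot, which is why the dispatcher also tests for NULL.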

This rename was a tree-wide sweep: ~90 call sites, one sed pass.

Phase 0a–0c: kerext bodies moved to kernel. Before taskman text can be marked PTE_U=1, every function that the kernel calls through a function pointer must live in kernel text. S-mode instruction fetch from PTE_U=1 pages is illegal regardless of SUM. Three phases relocated the bodies:

  • Phase 0a: kerext_reparent, SignalKill, and the five exit-path bodies (kerext_pulse_deliver, exit_destroy_threads, channel_destroy, timer_destroy, connect_detach) moved to kernel/kext/.
  • Phase 0b: sysaddr_map, ext_vaddrinfo, ker_manipulate moved via a cross-boundary helper table — the bodies called back into taskman helpers (do_manipulation, pte_map) that still lived in taskman text.
  • Phase 0c: The entire pa allocator cluster (kerext_pa_alloc, _alloc_given, _free, _free_info) moved alongside a new kerext_mm_xboundary table registering 10 function pointers to taskman-side helpers.

Phase 0 ends with the condition: every __Ring0 callback body lives in kernel text. The SPP-flip preconditions are in place.
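The rule behind Phase 0 can be written down as a small predicate. This is a simplified model of the Sv39 permission check (it ignores MXR and the A/D bits), and the function and enum names are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Low PTE bits as defined by the RISC-V privileged spec. */
#define PTE_V (1ull << 0)
#define PTE_R (1ull << 1)
#define PTE_W (1ull << 2)
#define PTE_X (1ull << 3)
#define PTE_U (1ull << 4)

enum demo_priv   { DEMO_PRIV_U, DEMO_PRIV_S };
enum demo_access { DEMO_LOAD, DEMO_STORE, DEMO_FETCH };

/* Returns true if the access would be permitted by this leaf PTE. */
static bool demo_access_ok(enum demo_priv mode, enum demo_access acc,
                           uint64_t pte, bool sum)
{
    if (!(pte & PTE_V))
        return false;

    bool user_page = (pte & PTE_U) != 0;

    if (mode == DEMO_PRIV_U && !user_page)
        return false;                 /* U-mode never touches PTE_U=0 pages */

    if (mode == DEMO_PRIV_S && user_page) {
        if (acc == DEMO_FETCH)
            return false;             /* S-mode fetch: illegal regardless of SUM */
        if (!sum)
            return false;             /* S-mode load/store needs sstatus.SUM=1 */
    }

    switch (acc) {
    case DEMO_LOAD:  return (pte & PTE_R) != 0;
    case DEMO_STORE: return (pte & PTE_W) != 0;
    case DEMO_FETCH: return (pte & PTE_X) != 0;
    }
    return false;
}
```

With taskman text marked PTE_U=1, the DEMO_FETCH case for S-mode is the one that bites: any callback body left in taskman text becomes unreachable from the kernel.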

Phase 1a–1c: arch helpers moved to kernel. pageman_aspace, called on every context switch via memmgr_p->aspace, moved to kernel/arch/riscv/pageman_aspace.c. The Sv39 PTE engine (cpu_pte_manipulate, cpu_pte_merge, cpu_pte_split, sv39_walk, prot_to_pte_bits) moved to kernel/arch/riscv/cpu_pte.c. cpu_sysvaddr_find and cpu_pageman_vaddrinfo moved alongside. The kx_* indirection layer, which by then only forwarded from one kernel symbol to another, was collapsed into direct calls.


v0.35 — pa.c Migrates to the Kernel

The physical-quantum allocator had always been semantically wrong in taskman: physical memory belongs in the kernel by every microkernel convention. After Phase 0c moved the kerext bodies, moving the allocator itself followed naturally.

kernel/kext/pa.c (~1900 lines) now carries all of it: the global state (blk_head[], mem_free_size, mem_reserved_size, restriction chain, quantum pool), the pure helpers (pa_carve, _pa_free, pa_quantum_to_paddr, enqueue_run, dequeue_run), init code, and six new __KER_TM_PA_* slots.

Taskman's pa.c was replaced with a thin syscall-wrapper file (~380 lines). The parts that stayed taskman-side: pa_alloc_fake / pa_free_fake (fake quantum allocator using taskman heap), the high-level pa_alloc retry/purger loop, and pa_free_info's restriction-filter walk.

Six kernel-data globals that taskman had been reaching into directly (mem_free_size, blk_head, etc.) now live on the kernel side and are accessed only through kerext calls. Net leak from kernel data into taskman binary: zero.


v0.36 — kernel/mm/ Promoted to First Class

The kernel needed its own permanent memory manager — one that lives in S-mode kernel text and is callable without indirecting through taskman. Pre-v0.36 the kernel had emm.c ("early memory manager"), a bootstrap stub that handed off to taskman's pageman table at startup. After the U-mode flip, memmgr_p->FOO calls could no longer land in taskman text.

kernel/emm.c → kernel/mm/mm.c, symbols emm_* → mm_*. init_memmgr wires the full table (mmap, munmap, vaddr_to_memobj, vaddrinfo, mcreate, mdestroy, aspace, pagesize) to kernel-resident symbols:

  • mm_vaddrinfo (new): read-only Sv39 walk returning the leaf PTE, using cpu_pte_lookup. Correct and conservative; no memobj/mm_map walk.
  • mm_mcreate / mm_mdestroy (new): process-aspace setup via cpu_pgdir_create. Allocates pgdir + tAddress. Deliberately omits the mm_map list, rwlock, rlimit — taskman owns that bookkeeping.
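For readers who have not written one, a read-only Sv39 walk is short. The sketch below runs on a toy "physical memory" array so it can execute on any host; it is not cpu_pte_lookup, whose signature is not shown in this post, and all demo_* names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PTE_V        (1ull << 0)
#define PTE_R        (1ull << 1)
#define PTE_W        (1ull << 2)
#define PTE_X        (1ull << 3)
#define PTE_U        (1ull << 4)
#define PTE_LEAF     (PTE_R | PTE_W | PTE_X)     /* any of these set: leaf */
#define PTE_PPN(pte) (((pte) >> 10) & ((1ull << 44) - 1))

#define DEMO_ENTRIES  512
#define DEMO_PT_PAGES 4

/* Toy physical memory: the page-table page with PPN n is demo_phys[n]. */
static uint64_t demo_phys[DEMO_PT_PAGES][DEMO_ENTRIES];

/*
 * Walk from the root table at root_ppn and return the leaf PTE that maps va,
 * or 0 if the address is unmapped. Nothing is modified (no A/D updates).
 * *level_out receives 2 (1 GiB), 1 (2 MiB) or 0 (4 KiB) for the leaf level.
 */
static uint64_t demo_pte_lookup(uint64_t root_ppn, uint64_t va, int *level_out)
{
    uint64_t ppn = root_ppn;

    for (int level = 2; level >= 0; level--) {
        assert(ppn < DEMO_PT_PAGES);             /* toy memory bound only */
        unsigned idx = (unsigned)((va >> (12 + 9 * level)) & 0x1ff);
        uint64_t pte = demo_phys[ppn][idx];

        if (!(pte & PTE_V))
            return 0;                            /* hole in the tree */
        if (pte & PTE_LEAF) {
            if (level_out != NULL)
                *level_out = level;
            return pte;
        }
        ppn = PTE_PPN(pte);                      /* descend one level */
    }
    return 0;                                    /* pointer PTE at level 0: malformed */
}
```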

kerext_register_memmgr and kerext_register_procmgr defanged to no-ops. The kernel never indirects through taskman text again.


v0.37–v0.38 — First U-Mode Boot and Its Bugs

With the groundwork in place, the Phase 4 work began: actually running taskman in U-mode. This was not a clean single commit — it was a debugging session across multiple evenings, each commit fixing the current crash and revealing the next one.

Phase 4a (v0.34): The predicate for S→U-mode thread entry changed from !QRV_FLG_PROC_RING0 to !QRV_FLG_PROC_LOADING. The distinction matters: taskman has RING0 set permanently (it's a privileged process), but its pool threads should run in U-mode. Loader threads, which run in the child process's context during spawn, should stay in S-mode until ProcessStartup promotes them.

Phase 4b: Marking taskman text PTE_U=1 in kerlink. This made taskman reachable from U-mode but also made it readable from every user process (the kernel L2 is shared). Acknowledged as a transitional step; Phase 5's ET_DYN taskman in its own address space fixes the isolation.

The first boot to message_start: Eight orthogonal fixes in one commit that together carried taskman through all init phases in U-mode: PTE_U=1 on the kernel heap, SUM set early, a kerext for reading satp from U-mode, CPIO ramdisk PA → high VA conversion, rlimits initialised for taskman, PROC_LOADING cleared after the first taskman thread, pool stacks pre-allocated (96 × 8 KiB via pageman_mmap before thread_pool_start), and aspace_prp = prp set uniformly for all pool threads.

main() must not return (v0.38). Taskman is now a real ET_EXEC. When taskman_main returns, libc's crt0 calls exit() → _exit() → _PROC_EXIT → procmgr → ProcessThreadsDestroy → tear down all of taskman's threads. Fix: pthread_exit(NULL) instead of return.

The NOZOMBIE race (v0.38). A subtle spawn-ordering bug: every process is created with QRV_FLG_PROC_NOZOMBIE set. Before the U-mode flip, proc_start cleared it for normal spawns. After the flip, kerext_register_procmgr is a no-op, so procmgr_p->process_start is NULL, and NOZOMBIE stays set permanently. Every parent's waitpid returns ECHILD immediately. The init script became fire-and-forget instead of synchronous. Fix: replicate the NOZOMBIE clearing in proc_loader.c.

mm_map_xfer wired (v0.37). memmgr_p->map_xfer had been NULL since the kernel/mm/ promotion — left as a stub with an explicit "deferred" comment. The first time a pool thread's msg_replyv had a cross-aspace destination, the kernel called NULL. The fault chain was interesting: the NULL jalr → PC=0 → xfer fault recovery path → _longjmp into a stale jmp_buf whose bytes held kprintf output → PC = 0xffffffc0000a3036 → second instruction page fault → "XFER BUG" banner. mm_map_xfer is now a proper Sv39 walk.

LP64 intrinfo truncation (v0.38). init_intrinfo runs before switch_to_high_va. GCC under -mcmodel=medany resolves symbol addresses PC-relative at the PA, not the linked VA. The stored mask/unmask function pointers were PA values (0x80207b88) instead of kernel VAs (0xffffffc080207b88). Every subsequent ilp->info->unmask() call jumped to low-half user space, which isn't mapped. Fix: KVA(sym) — OR the symbol address with KERNEL_VIRT_BASE.
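The fix is a one-liner, sketched here with host integers so it can run anywhere. The value of DEMO_KERNEL_VIRT_BASE is inferred from the two addresses quoted above, not taken from the QRV headers, and the DEMO_/demo_ names are placeholders.

```c
#include <assert.h>
#include <stdint.h>

/* Inferred: 0x80207b88 becomes 0xffffffc080207b88 once this is ORed in. */
#define DEMO_KERNEL_VIRT_BASE 0xffffffc000000000ull

/* OR rather than add, so a value that is already a kernel VA is unchanged. */
#define DEMO_KVA(addr) ((uint64_t)(addr) | DEMO_KERNEL_VIRT_BASE)

/* Convert an address captured before the switch to the high mapping. */
static uint64_t demo_fixup_early_pointer(uint64_t early_addr)
{
    uint64_t high = early_addr & DEMO_KERNEL_VIRT_BASE;
    /* Expect either a pure low (PA) value or one already in the high half. */
    assert(high == 0 || high == DEMO_KERNEL_VIRT_BASE);
    return DEMO_KVA(early_addr);
}
```

Using OR makes the conversion idempotent, which matters when the same init path may run both before and after the switch to the high mapping.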


v0.39–v0.40 — Phase 5: The Real Architecture

Phase 5 was the fundamental ABI redesign. All of the Phase 4 work had taskman running in U-mode while still touching kernel heap data directly (the heap was PTE_U=1). Phase 5 flips the heap to PTE_U=0 and builds the full structure to make U-mode taskman correct.

tTMprocess: taskman-side process state. The kernel-allocated tProcess contains kernel-managed fields. But it also carried POSIX-policy fields (pgrp, sid, session, umask, root, cwd, guardian, siginfo, events, resource lists, conf table) that belong in taskman. Under PTE_U=0, taskman can no longer dereference a tProcess * for any of these.

The solution: taskman/include/tm_process.h defines tTMprocess — a fixed-size BSS array of TM_PROCESS_MAX slots indexed by PINDEX(pid). Every POSIX-policy field migrated from kernel-half tProcess to tTMprocess. The locking state for the address space (rwlock_lock, fault_owner) migrated from tAddress to tTMprocess. The kernel retains only what it genuinely owns: scheduling state, IPC vectors, trap/fault state, credentials.
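A rough sketch of the shape of such a table follows. The slot count, the mask-based DEMO_PINDEX, and the field selection are assumptions for illustration; the real tTMprocess carries many more fields and its PINDEX definition is not shown here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef int32_t demo_pid_t;

#define DEMO_TM_PROCESS_MAX 256u      /* assumed power of two */
#define DEMO_PINDEX(pid)    ((unsigned)(pid) & (DEMO_TM_PROCESS_MAX - 1u))

/* Taskman-side, POSIX-policy state only; no kernel pointers in here. */
struct demo_tm_process {
    demo_pid_t pid;                   /* 0 means the slot is free */
    demo_pid_t pgrp;
    demo_pid_t sid;
    uint32_t   umask;
};

/* Fixed-size and zero-initialised: lives in BSS, never on the kernel heap. */
static struct demo_tm_process demo_tm_procs[DEMO_TM_PROCESS_MAX];

/* Claim the slot for a new pid; NULL if the slot is already in use. */
static struct demo_tm_process *demo_tm_claim(demo_pid_t pid)
{
    assert(pid > 0);
    struct demo_tm_process *tp = &demo_tm_procs[DEMO_PINDEX(pid)];
    if (tp->pid != 0)
        return NULL;
    tp->pid = pid;
    return tp;
}

/* Find the state for pid; NULL if the slot holds a different process. */
static struct demo_tm_process *demo_tm_lookup(demo_pid_t pid)
{
    struct demo_tm_process *tp = &demo_tm_procs[DEMO_PINDEX(pid)];
    return (pid != 0 && tp->pid == pid) ? tp : NULL;
}
```

Comparing the stored pid on lookup guards against a stale pid that happens to share an index with a live process.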

KERNEL_INTERNAL struct gating. Every kernel-private struct body (tProcess, tThread, tChannel, tConnect, and 15 others) is now gated behind #ifdef KERNEL_INTERNAL. Kernel sources define KERNEL_INTERNAL and see full struct bodies. Taskman sees only the typedef pointer-to-incomplete. Any prp->X deref in taskman code is now a compile error — by design.
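The pattern is the classic opaque-type idiom. The sketch below squeezes both views into one file for demonstration, with hypothetical demo names; in a real tree the struct body would sit in a header and only kernel sources would define the macro.

```c
#include <assert.h>
#include <stddef.h>

/* Visible to everyone: the name exists, the layout does not. */
typedef struct demo_process tDemoProcess;

/*
 * "Taskman" view. Code here can hold and compare pointers, but writing
 * prp->pid at this point would not compile: the type is still incomplete.
 */
static int demo_tm_same_process(const tDemoProcess *a, const tDemoProcess *b)
{
    return a == b;
}

/* "Kernel" view. Kernel sources would pass -DDEMO_KERNEL_INTERNAL instead. */
#define DEMO_KERNEL_INTERNAL
#ifdef DEMO_KERNEL_INTERNAL
struct demo_process {
    int pid;
    int priority;
};
#endif

/* Only code that sees the full body can dereference. */
static int demo_kernel_pid(const tDemoProcess *prp)
{
    assert(prp != NULL);
    return prp->pid;
}
```

Moving the dereference above the #define turns it into an "incomplete type" compile error, which is exactly the failure mode the gating is meant to provoke.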

pa_pq accessors. struct pa_quantum lives on the kernel heap (PTE_U=0). Direct field access from U-mode taskman faults. The new pa_pq_flags() / pa_pq_blk() / pa_pq_run() / pa_pq_modify_flags() inline wrappers route through __KER_TM_PA_PQ_OP.
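One way to picture these accessors: a single multiplexed operation on the kernel side and thin inline wrappers on the taskman side. The request layout, the index-based handle, and every demo_* name are guesses for illustration; a plain function call stands in for the ecall.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

enum demo_pq_op {
    DEMO_PQ_GET_FLAGS,
    DEMO_PQ_GET_BLK,
    DEMO_PQ_GET_RUN,
    DEMO_PQ_MODIFY_FLAGS
};

struct demo_pq_req {
    unsigned        index;            /* which quantum */
    enum demo_pq_op op;
    uint32_t        set_bits;         /* MODIFY_FLAGS only */
    uint32_t        clear_bits;       /* MODIFY_FLAGS only */
    uint32_t        result;           /* filled in by the "kernel" */
};

/* "Kernel heap" data that U-mode code may not touch directly. */
struct demo_quantum { uint32_t flags, blk, run; };
#define DEMO_POOL_SIZE 8u
static struct demo_quantum demo_kernel_pool[DEMO_POOL_SIZE];

/* Stand-in for the kernel handler behind a PA_PQ_OP-style slot. */
static long demo_ker_pq_op(struct demo_pq_req *req)
{
    if (req->index >= DEMO_POOL_SIZE)
        return -EINVAL;
    struct demo_quantum *pq = &demo_kernel_pool[req->index];
    switch (req->op) {
    case DEMO_PQ_GET_FLAGS: req->result = pq->flags; return 0;
    case DEMO_PQ_GET_BLK:   req->result = pq->blk;   return 0;
    case DEMO_PQ_GET_RUN:   req->result = pq->run;   return 0;
    case DEMO_PQ_MODIFY_FLAGS:
        pq->flags = (pq->flags & ~req->clear_bits) | req->set_bits;
        req->result = pq->flags;
        return 0;
    }
    return -EINVAL;
}

/* Taskman-side wrappers: the only way to read or change a quantum. */
static inline uint32_t demo_pq_get(unsigned index, enum demo_pq_op op)
{
    struct demo_pq_req req = { index, op, 0, 0, 0 };
    long rc = demo_ker_pq_op(&req);
    assert(rc == 0);                  /* getters assume a valid index */
    (void)rc;
    return req.result;
}

static inline uint32_t demo_pq_flags(unsigned i) { return demo_pq_get(i, DEMO_PQ_GET_FLAGS); }
static inline uint32_t demo_pq_blk(unsigned i)   { return demo_pq_get(i, DEMO_PQ_GET_BLK); }
static inline uint32_t demo_pq_run(unsigned i)   { return demo_pq_get(i, DEMO_PQ_GET_RUN); }

static inline long demo_pq_modify_flags(unsigned i, uint32_t set, uint32_t clear)
{
    struct demo_pq_req req = { i, DEMO_PQ_MODIFY_FLAGS, set, clear, 0 };
    return demo_ker_pq_op(&req);
}
```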

80 TM_PRIV slots. The dispatch table grew from 53 to 80 entries during Phase 5. 25 new slots cover everything taskman needs to do that involves kernel data: PA_ALLOC_FULL, PA_PQ_OP, ASPACE_MEMCPY, KPAGE_ZERO, REGISTER_POOL_STACK, PROC_QUERY, PROC_SET_SESSION, CHANNEL_HAS_RECEIVERS, MMAP_PHYS_USER, MMAP_ANON_USER, and more.

taskman linked as ET_EXEC at 0xC0000000. The old ET_DYN-against-libc.qrl build required libc.qrl to live in the kernel heap (PTE_U=0 after Phase 5 → taskman can't reach it). The new build: -nostdlib -static -Ttext-segment=0xC0000000, libc.a inside the link group. Every libc symbol now lives inside taskman's own image. No DT_NEEDED, no PLT/GOT for libc.

0xC0000000 is Sv39 L2[3]. cpu_pgdir_create copies only L2[1] and L2[256..511] into every user pgdir. L2[3] stays zero in every user process — taskman's image is invisible to other processes.
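The slot arithmetic is easy to check by hand: each L2 entry covers 1 GiB, so the index is bits 38:30 of the virtual address. A small helper, with hypothetical names, makes the claim testable.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Sv39: VPN[2] is bits 38:30 of the virtual address, one slot per GiB. */
static unsigned demo_sv39_l2_index(uint64_t va)
{
    return (unsigned)((va >> 30) & 0x1ff);
}

/* Per the copy rule described above: only L2[1] and L2[256..511] are shared. */
static bool demo_l2_slot_is_shared(unsigned idx)
{
    assert(idx < 512);
    return idx == 1 || idx >= 256;
}
```

For example, 0xC0000000 >> 30 is 3, which is neither slot 1 nor in the upper half, so that gigabyte is never copied into another process's page directory.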

The loader thread problem. During spawn, a loader thread runs in the child process's context but needs to fetch taskman text. The solution: mm_taskman_view_install(child_prp) temporarily copies taskman's L1 page pointer into the child's pgdir at L2[3]. ProcessStartup calls mm_taskman_view_remove(child_prp) before the first U-mode sret into the child — at that point the child is fully isolated.
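Conceptually the install and remove steps are a single-entry copy and a single-entry clear. The sketch below works on bare arrays rather than on a process structure, so its signatures differ from mm_taskman_view_install(child_prp); a real kernel would presumably also need a TLB flush after each change, which is omitted here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_L2_ENTRIES   512
#define DEMO_TASKMAN_SLOT 3           /* 0xC0000000 >> 30 */

/* Lend taskman's 1 GiB window to the child for the duration of loading. */
static void demo_view_install(uint64_t child_pgdir[DEMO_L2_ENTRIES],
                              const uint64_t taskman_pgdir[DEMO_L2_ENTRIES])
{
    assert(child_pgdir[DEMO_TASKMAN_SLOT] == 0);   /* slot must be unused */
    child_pgdir[DEMO_TASKMAN_SLOT] = taskman_pgdir[DEMO_TASKMAN_SLOT];
}

/* Take the window away again before the child first runs its own code. */
static void demo_view_remove(uint64_t child_pgdir[DEMO_L2_ENTRIES])
{
    child_pgdir[DEMO_TASKMAN_SLOT] = 0;
}

static bool demo_child_sees_taskman(const uint64_t child_pgdir[DEMO_L2_ENTRIES])
{
    return child_pgdir[DEMO_TASKMAN_SLOT] != 0;
}
```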

Pool stacks registered by taskman. The kernel's procmgr_stack_alloc fallback to memmgr_p->mmap was removed — the kernel heap is PTE_U=0, unusable as a user thread stack. Taskman pre-allocates 96 user-VA stacks via TaskmanMmapAnonUser and pushes each into the kernel's free list via __KER_TM_REGISTER_POOL_STACK. The kernel contributes only the free-list structure; taskman contributes the memory.
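The division of labour can be sketched as a free list whose nodes belong to the kernel while the stack memory they describe comes from the caller. The names, the LIFO order, and the error codes are assumptions; the 96 × 8 KiB figures are the ones quoted earlier in this post.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_POOL_STACKS 96u
#define DEMO_STACK_SIZE  (8u * 1024u)
#define DEMO_PAGE_SIZE   4096u

/* Kernel-owned bookkeeping; base/size describe taskman-owned user memory. */
struct demo_stack {
    uintptr_t          base;
    size_t             size;
    struct demo_stack *next;
};

static struct demo_stack  demo_slots[DEMO_POOL_STACKS];
static unsigned           demo_slots_used;
static struct demo_stack *demo_free_head;

/* Handler side of a REGISTER_POOL_STACK-style call: push one stack. */
static int demo_register_pool_stack(uintptr_t base, size_t size)
{
    if (size == 0 || base % DEMO_PAGE_SIZE != 0 || size % DEMO_PAGE_SIZE != 0)
        return -EINVAL;
    if (demo_slots_used == DEMO_POOL_STACKS)
        return -ENOMEM;
    struct demo_stack *s = &demo_slots[demo_slots_used++];
    s->base = base;
    s->size = size;
    s->next = demo_free_head;
    demo_free_head = s;
    return 0;
}

/* Pop a stack for a new pool thread. There is no fallback allocation. */
static int demo_stack_alloc(uintptr_t *base_out, size_t *size_out)
{
    assert(base_out != NULL && size_out != NULL);
    if (demo_free_head == NULL)
        return -ENOMEM;
    struct demo_stack *s = demo_free_head;
    demo_free_head = s->next;
    *base_out = s->base;
    *size_out = s->size;
    return 0;
}
```

If the list runs dry, allocation simply fails, which mirrors the removal of the old fallback path.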

826 → 0 warnings. The final commit of the cycle drove the warning count from 826 to zero. 12 were genuine QRV-proper issues fixed at the root. 814 were -Wconversion in the mksh fork — suppressed at the Makefile level with an explicit rationale: the mksh fork is diverging and will eventually be renamed qsh; fixing 800 upstream conversion sites would cost effort that belongs in the architectural work instead.


What This Means

Taskman is now an ordinary user-space process. It communicates with the kernel through a defined, whitelist-gated ecall interface. A bug in taskman — a corrupted pointer, an out-of-bounds write — cannot reach kernel data. The KERNEL_INTERNAL guard makes this a compile-time property, not just a runtime one: the compiler rejects any attempt to dereference a kernel struct from taskman code.

The BKL work and the U-mode work ran on separate branches. They are now merged into a single tree. The system boots, logs in, and runs on real hardware with both changes in place.
