Since the project began, taskman has run in S-mode — the RISC-V supervisor mode, shared with the kernel. It was kerlinked into the kernel address space at boot, called kernel functions by name, and reached kernel data structures directly. This was inherited from QNX's procnto-as-kernel-module model and it worked, but it was architecturally wrong: there is no meaningful trust boundary between the kernel and taskman when both run in the same address space under the same privilege level.
v0.40 completes the migration of taskman to U-mode. Taskman now runs as an
ordinary user-space process at address 0xC0000000, with its own page tables,
communicating with the kernel exclusively through the syscall_tm_priv
mechanism. The kernel heap has PTE_U=0. A bug in taskman can no longer
corrupt kernel data by accident.
This took about a week of intensive work spread across versions 0.34 through 0.40. Here is how it went.
v0.34–v0.35 — Groundwork: Relocating Bodies, Retiring __Ring0
The first problem was naming. The historical __Ring0(func_ptr, arg)
primitive — kernel calls a function pointer supplied by taskman, executed with
kernel privilege — was an x86 concept with x86 terminology. RISC-V has no
rings. And beyond the name, the mechanism was architecturally dangerous: the
kernel jumped to whatever address taskman supplied, with no whitelist beyond
a capability flag.
syscall_tm_priv replaces __Ring0. The new mechanism:
<sys/kercalls.h> defines a __KER_TM_* sub-namespace of 53 enum constants.
syscall_tm_priv(id, arg) is a plain ecall that dispatches through a static
kernel-owned table (ker_tm_priv.c), gated on QRV_FLG_PROC_TM_PRIV. The
table is a whitelist. A _Static_assert catches enum/table drift at compile
time. Every EXPORT_SYMBOL(kerext_*) entry was removed from the kernel's symbol
table, so the kerext bodies are truly kernel-internal now.
This rename was a tree-wide sweep: ~90 call sites, one sed pass.
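A minimal sketch of that dispatch shape, for illustration only. The enum constants, the flag value, the handler names and the tm_priv_dispatch function are all hypothetical stand-ins; the real table lives in ker_tm_priv.c and has many more entries.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sub-namespace; the real one has far more constants. */
enum ker_tm_id {
    KER_TM_PA_ALLOC,
    KER_TM_PA_FREE,
    KER_TM_PROC_QUERY,
    KER_TM_COUNT                /* must stay last */
};

#define FLG_PROC_TM_PRIV 0x1u   /* stand-in for QRV_FLG_PROC_TM_PRIV */

struct proc { uint32_t flags; };

typedef long (*tm_handler_t)(struct proc *caller, void *arg);

static long h_pa_alloc(struct proc *p, void *arg)   { (void)p; (void)arg; return 100; }
static long h_pa_free(struct proc *p, void *arg)    { (void)p; (void)arg; return 0; }
static long h_proc_query(struct proc *p, void *arg) { (void)p; (void)arg; return 42; }

/* Kernel-owned whitelist: the only code reachable through the call. */
static const tm_handler_t tm_table[] = {
    [KER_TM_PA_ALLOC]   = h_pa_alloc,
    [KER_TM_PA_FREE]    = h_pa_free,
    [KER_TM_PROC_QUERY] = h_proc_query,
};

/* Fires when the table length differs from the enum count, for example
 * a constant appended without a handler. Holes in the middle of the
 * table are caught at run time by the NULL check below. */
_Static_assert(sizeof tm_table / sizeof tm_table[0] == KER_TM_COUNT,
               "tm_table out of sync with enum ker_tm_id");

long tm_priv_dispatch(struct proc *caller, unsigned id, void *arg)
{
    if (!(caller->flags & FLG_PROC_TM_PRIV))
        return -EPERM;          /* caller lacks the capability flag */
    if (id >= KER_TM_COUNT || tm_table[id] == NULL)
        return -ENOSYS;         /* not on the whitelist */
    return tm_table[id](caller, arg);
}
```

The difference from __Ring0 is that the caller supplies an index, never an address: the kernel only ever jumps to a handler it chose itself.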
Phase 0a–0c: kerext bodies moved to kernel. Before taskman text can be
marked PTE_U=1, every function that the kernel calls through a function
pointer must live in kernel text. S-mode instruction fetch from PTE_U=1
pages is illegal regardless of SUM. Three phases relocated the bodies:
- Phase 0a: kerext_reparent, SignalKill, and the five exit-path bodies (kerext_pulse_deliver, exit_destroy_threads, channel_destroy, timer_destroy, connect_detach) moved to kernel/kext/.
- Phase 0b: sysaddr_map, ext_vaddrinfo, ker_manipulate moved via a cross-boundary helper table: the bodies called back into taskman helpers (do_manipulation, pte_map) that still lived in taskman text.
- Phase 0c: The entire pa allocator cluster (kerext_pa_alloc, _alloc_given, _free, _free_info) moved alongside a new kerext_mm_xboundary table registering 10 function pointers to taskman-side helpers.
Phase 0 ends with the condition: every __Ring0 callback body lives in kernel
text. The SPP-flip preconditions are in place.
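The rule that forced all this relocation can be written down as a small predicate. This is a simplified model of the RISC-V leaf-PTE permission check, not QRV code: it ignores MXR, the A/D bits and PMP, and the function and enum names are made up for the sketch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Sv39 PTE permission bits as defined by the RISC-V privileged spec. */
#define PTE_V (1u << 0)
#define PTE_R (1u << 1)
#define PTE_W (1u << 2)
#define PTE_X (1u << 3)
#define PTE_U (1u << 4)

enum priv   { PRIV_U, PRIV_S };
enum access { ACC_LOAD, ACC_STORE, ACC_FETCH };

/* Simplified leaf-PTE check; `sum` models sstatus.SUM. */
bool pte_access_ok(enum priv mode, enum access acc, uint64_t pte, bool sum)
{
    if (!(pte & PTE_V))
        return false;

    if (mode == PRIV_U && !(pte & PTE_U))
        return false;           /* U-mode can only touch U=1 pages */

    if (mode == PRIV_S && (pte & PTE_U)) {
        if (acc == ACC_FETCH)
            return false;       /* S-mode never executes U=1 pages */
        if (!sum)
            return false;       /* loads and stores need SUM=1 */
    }

    switch (acc) {
    case ACC_LOAD:  return (pte & PTE_R) != 0;
    case ACC_STORE: return (pte & PTE_W) != 0;
    case ACC_FETCH: return (pte & PTE_X) != 0;
    }
    return false;
}
```

With taskman text marked U=1, the fetch case is the one that bites: SUM helps the kernel read taskman data, but nothing lets it execute taskman code.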
Phase 1a–1c: arch helpers moved to kernel. pageman_aspace — called on
every context switch via memmgr_p->aspace — moved to
kernel/arch/riscv/pageman_aspace.c. The Sv39 PTE engine
(cpu_pte_manipulate, cpu_pte_merge, cpu_pte_split, sv39_walk,
prot_to_pte_bits) moved to kernel/arch/riscv/cpu_pte.c.
cpu_sysvaddr_find and cpu_pageman_vaddrinfo moved alongside. A kx_*
indirection that was now pointing kernel symbol to kernel symbol was collapsed
to direct calls.
v0.35 — pa.c Migrates to the Kernel
The physical-quantum allocator had always been semantically wrong in taskman: physical memory belongs in the kernel by every microkernel convention. After Phase 0c moved the kerext bodies, moving the allocator itself followed naturally.
kernel/kext/pa.c (~1900 lines) now carries all of it: the global state
(blk_head[], mem_free_size, mem_reserved_size, restriction chain, quantum
pool), the pure helpers (pa_carve, _pa_free, pa_quantum_to_paddr,
enqueue_run, dequeue_run), init code, and six new __KER_TM_PA_* slots.
Taskman's pa.c was replaced with a thin syscall-wrapper file (~380 lines).
The parts that stayed taskman-side: pa_alloc_fake / pa_free_fake (fake
quantum allocator using taskman heap), the high-level pa_alloc retry/purger
loop, and pa_free_info's restriction-filter walk.
Six kernel-data globals that taskman had been reaching into directly
(mem_free_size, blk_head, etc.) now live on the kernel side and are
accessed only through kerext calls. Net leak from kernel data into taskman
binary: zero.
v0.36 — kernel/mm/ Promoted to First Class
The kernel needed its own permanent memory manager — one that lives in S-mode
kernel text and is callable without indirecting through taskman. Pre-v0.36 the
kernel had emm.c ("early memory manager"), a bootstrap stub that handed off
to taskman's pageman table at startup. After the U-mode flip, memmgr_p->FOO
calls could no longer land in taskman text.
kernel/emm.c became kernel/mm/mm.c, and its emm_* symbols became mm_*. init_memmgr
wires the full table (mmap, munmap, vaddr_to_memobj, vaddrinfo, mcreate,
mdestroy, aspace, pagesize) to kernel-resident symbols:
- mm_vaddrinfo (new): read-only Sv39 walk returning the leaf PTE, using cpu_pte_lookup. Correct and conservative; no memobj/mm_map walk.
- mm_mcreate / mm_mdestroy (new): process-aspace setup via cpu_pgdir_create. Allocates pgdir + tAddress. Deliberately omits the mm_map list, rwlock, and rlimit; taskman owns that bookkeeping.
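For readers who have not written one, a read-only Sv39 walk looks roughly like this. It is a host-side sketch under stated assumptions: physical memory is simulated as a small array indexed by page number, and sv39_lookup and sim_phys are hypothetical names, not the real cpu_pte_lookup.

```c
#include <stdbool.h>
#include <stdint.h>

#define PTE_V 0x01u
#define PTE_R 0x02u
#define PTE_W 0x04u
#define PTE_X 0x08u
#define PTE_PPN_SHIFT 10
#define PTE_PPN_MASK  ((1ull << 44) - 1)    /* PPN occupies bits 53:10 */

#define SIM_PAGES 8
/* Simulated physical memory: page number -> 512 eight-byte entries. */
static uint64_t sim_phys[SIM_PAGES][512];

/* 9-bit table index for the given level (2 = root, 0 = last). */
static inline unsigned vpn(uint64_t va, int level)
{
    return (unsigned)((va >> (12 + 9 * level)) & 0x1ff);
}

/* Read-only walk: on success stores the leaf PTE and the level at
 * which it was found (2 = 1 GiB, 1 = 2 MiB, 0 = 4 KiB page). */
bool sv39_lookup(uint64_t root_ppn, uint64_t va, uint64_t *leaf, int *level_out)
{
    uint64_t ppn = root_ppn;

    for (int level = 2; level >= 0; level--) {
        if (ppn >= SIM_PAGES)
            return false;                   /* outside simulated memory */
        uint64_t pte = sim_phys[ppn][vpn(va, level)];
        if (!(pte & PTE_V))
            return false;                   /* not mapped */
        if (pte & (PTE_R | PTE_W | PTE_X)) {
            *leaf = pte;                    /* any of R/W/X marks a leaf */
            *level_out = level;
            return true;
        }
        ppn = (pte >> PTE_PPN_SHIFT) & PTE_PPN_MASK;  /* next-level table */
    }
    return false;       /* pointer entry at level 0 is malformed */
}
```

The walk never writes a PTE and never allocates, which is what makes it safe to call from anywhere in the kernel.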
kerext_register_memmgr and kerext_register_procmgr were defanged to no-ops.
The kernel never indirects through taskman text again.
v0.37–v0.38 — First U-Mode Boot and Its Bugs
With the groundwork in place, the Phase 4 work began: actually running taskman in U-mode. This was not a clean single commit — it was a debugging session across multiple evenings, each commit fixing the current crash and revealing the next one.
Phase 4a (v0.34): The predicate for S→U-mode thread entry changed from
!QRV_FLG_PROC_RING0 to !QRV_FLG_PROC_LOADING. The distinction matters:
taskman has RING0 set permanently (it's a privileged process), but its pool
threads should run U-mode. Loader threads — which run in the child process's
context during spawn — should stay S-mode until ProcessStartup promotes them.
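The two predicates side by side, as a sketch. The flag bit values and function names here are invented for illustration; only the flag names come from the tree.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative bit values; the real encodings are not shown here. */
#define FLG_PROC_RING0   0x1u
#define FLG_PROC_LOADING 0x2u

/* Old predicate: any privileged process kept its threads in S-mode. */
bool enters_umode_old(uint32_t proc_flags)
{
    return !(proc_flags & FLG_PROC_RING0);
}

/* New predicate: only a process that is still loading stays in S-mode. */
bool enters_umode_new(uint32_t proc_flags)
{
    return !(proc_flags & FLG_PROC_LOADING);
}
```

A taskman pool thread (RING0 set, LOADING clear) is the case where the two disagree: the old test keeps it in S-mode, the new one sends it to U-mode.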
Phase 4b: Marking taskman text PTE_U=1 in kerlink. This made taskman
reachable from U-mode but also made it readable from every user process (the
kernel L2 is shared). Acknowledged as a transitional step; Phase 5's ET_EXEC
taskman in its own address space fixes the isolation.
The first boot to message_start: Eight orthogonal fixes in one commit
that together carried taskman through all init phases in U-mode:
PTE_U=1 on the kernel heap, SUM set early, a kerext for reading satp from
U-mode, CPIO ramdisk PA → high VA conversion, rlimits initialised for taskman,
PROC_LOADING cleared after the first taskman thread, pool stacks pre-allocated
(96 × 8 KiB via pageman_mmap before thread_pool_start), and
aspace_prp = prp set uniformly for all pool threads.
main() must not return (v0.38). Taskman is now a real ET_EXEC. When
taskman_main returns, libc's crt0 calls exit() → _exit() → _PROC_EXIT
→ procmgr → ProcessThreadsDestroy, which tears down all of taskman's threads.
Fix: call pthread_exit(NULL) instead of returning, which ends only the main
thread and leaves the pool threads running.
The NOZOMBIE race (v0.38). A subtle spawn-ordering bug: every process is
created with QRV_FLG_PROC_NOZOMBIE set. Before the U-mode flip, proc_start
cleared it for normal spawns. After the flip, kerext_register_procmgr is a
no-op, so procmgr_p->process_start is NULL, and NOZOMBIE stays set
permanently. Every parent's waitpid returns ECHILD immediately. The init
script became fire-and-forget instead of synchronous. Fix: replicate the
NOZOMBIE clearing in proc_loader.c.
mm_map_xfer wired (v0.37). memmgr_p->map_xfer had been NULL since the
kernel/mm/ promotion — left as a stub with an explicit "deferred" comment. The
first time a pool thread's msg_replyv had a cross-aspace destination, the
kernel called NULL. The fault chain was interesting: the NULL jalr → PC=0 →
xfer fault recovery path → _longjmp into a stale jmp_buf whose bytes held
kprintf output → PC = 0xffffffc0000a3036 → second instruction page fault →
"XFER BUG" banner. mm_map_xfer is now a proper Sv39 walk.
LP64 intrinfo truncation (v0.38). init_intrinfo runs before
switch_to_high_va. GCC under -mcmodel=medany resolves symbol addresses
PC-relative at the PA, not the linked VA. The stored mask/unmask function
pointers were PA values (0x80207b88) instead of kernel VAs
(0xffffffc080207b88). Every subsequent ilp->info->unmask() call jumped to
low-half user space, which isn't mapped. Fix: KVA(sym) — OR the symbol
address with KERNEL_VIRT_BASE.
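The arithmetic, as a sketch. The KERNEL_VIRT_BASE value below is inferred from the two addresses quoted above rather than copied from the tree, and kva() is a function stand-in for the KVA(sym) macro.

```c
#include <stdint.h>

/* Inferred from 0x80207b88 vs 0xffffffc080207b88; an assumption. */
#define KERNEL_VIRT_BASE 0xffffffc000000000ull

/* Turn a PA-resolved symbol address into its high-half kernel VA. */
static inline uint64_t kva(uint64_t addr)
{
    return addr | KERNEL_VIRT_BASE;
}
```

Because it is an OR and not an add, the conversion is idempotent: applying it to an address that is already in the high half changes nothing, so it is safe on either side of switch_to_high_va.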
v0.39–v0.40 — Phase 5: The Real Architecture
Phase 5 was the fundamental ABI redesign. All of the Phase 4 work had taskman
running in U-mode while still touching kernel heap data directly (the heap was
PTE_U=1). Phase 5 flips the heap to PTE_U=0 and builds the full structure
to make U-mode taskman correct.
tTMprocess: taskman-side process state. The kernel-allocated tProcess
contains kernel-managed fields. But it also carried POSIX-policy fields
(pgrp, sid, session, umask, root, cwd, guardian, siginfo,
events, resource lists, conf table) that belong in taskman. Under
PTE_U=0, taskman can no longer dereference a tProcess * for any of these.
The solution: taskman/include/tm_process.h defines tTMprocess — a
fixed-size BSS array of TM_PROCESS_MAX slots indexed by PINDEX(pid). Every
POSIX-policy field migrated from kernel-half tProcess to tTMprocess. The
locking state for the address space (rwlock_lock, fault_owner) migrated
from tAddress to tTMprocess. The kernel retains only what it genuinely
owns: scheduling state, IPC vectors, trap/fault state, credentials.
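A rough sketch of a fixed-size, pid-indexed table of this kind. The capacity, the PINDEX definition, the reduced field set and the attach/lookup helpers are all assumptions made for the example; the real layout is in tm_process.h.

```c
#include <stddef.h>
#include <sys/types.h>

#define TM_PROCESS_MAX 256      /* assumed capacity, power of two */
/* Assumed definition: low bits of the pid select the slot. */
#define PINDEX(pid) ((unsigned)(pid) & (TM_PROCESS_MAX - 1))

typedef struct tm_process {
    pid_t  pid;                 /* 0 means the slot is free */
    pid_t  pgrp;
    pid_t  sid;
    mode_t umask;
} tTMprocess;

/* Lives in taskman's BSS: zero-initialised, no heap allocation. */
static tTMprocess tm_procs[TM_PROCESS_MAX];

/* Claim the slot for a new pid; NULL if another pid occupies it. */
tTMprocess *tm_process_attach(pid_t pid)
{
    if (pid <= 0)
        return NULL;
    tTMprocess *tp = &tm_procs[PINDEX(pid)];
    if (tp->pid != 0 && tp->pid != pid)
        return NULL;
    tp->pid = pid;
    return tp;
}

/* Find the slot for a pid; NULL if the pid is unknown. */
tTMprocess *tm_process_lookup(pid_t pid)
{
    if (pid <= 0)
        return NULL;
    tTMprocess *tp = &tm_procs[PINDEX(pid)];
    return tp->pid == pid ? tp : NULL;
}
```

The point of the shape is that taskman can go from a pid to its policy state without ever holding a kernel pointer.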
KERNEL_INTERNAL struct gating. Every kernel-private struct body
(tProcess, tThread, tChannel, tConnect, and 15 others) is now gated
behind #ifdef KERNEL_INTERNAL. Kernel sources define KERNEL_INTERNAL and
see full struct bodies. Taskman sees only the typedef pointer-to-incomplete.
Any prp->X deref in taskman code is now a compile error — by design.
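The pattern is the standard C opaque-type idiom. A single-file sketch, with a made-up struct body and a hypothetical helper name; compiled without KERNEL_INTERNAL it shows the taskman view.

```c
/* Header-style declarations, inlined so the sketch is one file. */
typedef struct process tProcess;    /* everyone sees the incomplete type */

#ifdef KERNEL_INTERNAL
struct process {                    /* body visible to kernel sources only */
    int      pid;
    unsigned flags;
};
#endif

/* Allowed in taskman: hold, compare and pass the pointer around. */
int tm_same_process(const tProcess *a, const tProcess *b)
{
    return a == b;
}

/* Not allowed in taskman: without KERNEL_INTERNAL the line
 *     return a->pid;
 * fails to compile, because the struct type is incomplete. */
```

The pointer stays usable as a handle for syscall arguments, which is all taskman needs it for once the heap is PTE_U=0.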
pa_pq accessors. struct pa_quantum lives on the kernel heap
(PTE_U=0). Direct field access from U-mode taskman faults. The new
pa_pq_flags() / pa_pq_blk() / pa_pq_run() / pa_pq_modify_flags()
inline wrappers route through __KER_TM_PA_PQ_OP.
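One way such accessors can be shaped, as a sketch only. The opcode enum, the argument struct, the integer handle and the wrapper signatures are assumptions, and a plain function call stands in for the ecall so the example runs on a host.

```c
#include <stdint.h>

enum pq_op { PQ_GET_FLAGS, PQ_GET_BLK, PQ_GET_RUN, PQ_MODIFY_FLAGS };

struct pq_op_args {
    unsigned   handle;  /* opaque index; taskman never holds a pointer */
    enum pq_op op;
    uint32_t   set, clear;
};

/* ---- kernel side (would sit behind PTE_U=0 in the real system) ---- */
struct pa_quantum { uint32_t flags; uint32_t blk; int32_t run; };
#define NQUANTA 4
static struct pa_quantum kernel_quanta[NQUANTA];

/* Stand-in for the handler behind __KER_TM_PA_PQ_OP. */
static long ker_pa_pq_op(const struct pq_op_args *a)
{
    if (a->handle >= NQUANTA)
        return -1;                      /* validate before touching data */
    struct pa_quantum *pq = &kernel_quanta[a->handle];
    switch (a->op) {
    case PQ_GET_FLAGS: return pq->flags;
    case PQ_GET_BLK:   return pq->blk;
    case PQ_GET_RUN:   return pq->run;
    case PQ_MODIFY_FLAGS:
        pq->flags = (pq->flags & ~a->clear) | a->set;
        return pq->flags;               /* new flag word */
    }
    return -1;
}

/* ---- taskman side: thin inline wrappers over one slot ---- */
static inline long pa_pq_call(unsigned h, enum pq_op op,
                              uint32_t set, uint32_t clear)
{
    struct pq_op_args a = { h, op, set, clear };
    return ker_pa_pq_op(&a);    /* real code would issue the ecall here */
}
static inline long pa_pq_flags(unsigned h) { return pa_pq_call(h, PQ_GET_FLAGS, 0, 0); }
static inline long pa_pq_blk(unsigned h)   { return pa_pq_call(h, PQ_GET_BLK, 0, 0); }
static inline long pa_pq_run(unsigned h)   { return pa_pq_call(h, PQ_GET_RUN, 0, 0); }
static inline long pa_pq_modify_flags(unsigned h, uint32_t set, uint32_t clear)
{
    return pa_pq_call(h, PQ_MODIFY_FLAGS, set, clear);
}
```

Multiplexing four accessors through one opcode-carrying slot keeps the whitelist small, at the cost of one extra switch in the kernel.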
80 TM_PRIV slots. The dispatch table grew from 53 to 80 entries during
Phase 5. 25 new slots cover everything taskman needs to do that involves kernel
data: PA_ALLOC_FULL, PA_PQ_OP, ASPACE_MEMCPY, KPAGE_ZERO,
REGISTER_POOL_STACK, PROC_QUERY, PROC_SET_SESSION,
CHANNEL_HAS_RECEIVERS, MMAP_PHYS_USER, MMAP_ANON_USER, and more.
taskman linked as ET_EXEC at 0xC0000000. The old ET_DYN-against-libc.qrl
build required libc.qrl to live in the kernel heap (PTE_U=0 after Phase 5
→ taskman can't reach it). The new build: -nostdlib -static -Ttext-segment=0xC0000000, libc.a inside the link group. Every libc symbol
now lives inside taskman's own image. No DT_NEEDED, no PLT/GOT for libc.
0xC0000000 is Sv39 L2[3]. cpu_pgdir_create copies only L2[1] and
L2[256..511] into every user pgdir. L2[3] stays zero in every user process —
taskman's image is invisible to other processes.
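The index arithmetic and the copy policy can be checked in a few lines. This is a sketch: pgdir_create_sketch is a hypothetical stand-in for the copy step inside cpu_pgdir_create, operating on plain arrays instead of real page tables.

```c
#include <stdint.h>
#include <string.h>

#define SV39_ENTRIES 512

/* Root-table index: VA bits 38:30, so each entry spans 1 GiB. */
static inline unsigned sv39_l2_index(uint64_t va)
{
    return (unsigned)((va >> 30) & 0x1ff);
}

/* Build a user root table that shares only L2[1] and the kernel
 * half L2[256..511]; every other entry, including L2[3], is zero. */
void pgdir_create_sketch(uint64_t *child_l2, const uint64_t *kernel_l2)
{
    memset(child_l2, 0, SV39_ENTRIES * sizeof *child_l2);
    child_l2[1] = kernel_l2[1];
    memcpy(&child_l2[256], &kernel_l2[256], 256 * sizeof *child_l2);
}
```

0xC0000000 is 3 << 30, which is why taskman's whole image sits behind a single root entry that ordinary processes never receive.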
The loader thread problem. During spawn, a loader thread runs in the child
process's context but needs to fetch taskman text. The solution:
mm_taskman_view_install(child_prp) temporarily copies taskman's L1 page
pointer into the child's pgdir at L2[3]. ProcessStartup calls
mm_taskman_view_remove(child_prp) before the first U-mode sret into the
child — at that point the child is fully isolated.
Pool stacks registered by taskman. The kernel's procmgr_stack_alloc
fallback to memmgr_p->mmap was removed — the kernel heap is PTE_U=0,
unusable as a user thread stack. Taskman pre-allocates 96 user-VA stacks via
TaskmanMmapAnonUser and pushes each into the kernel's free list via
__KER_TM_REGISTER_POOL_STACK. The kernel contributes only the free-list
structure; taskman contributes the memory.
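A sketch of that division of labour, with a fixed array standing in for the kernel's free list. The function names and the LIFO layout are assumptions; the 96 and 8 KiB figures are the pool dimensions mentioned earlier.

```c
#include <stddef.h>
#include <stdint.h>

#define POOL_STACKS     96
#define POOL_STACK_SIZE (8u * 1024u)

/* Kernel-owned bookkeeping: only stack base addresses are stored,
 * the memory itself is user VA supplied by taskman. */
static uintptr_t free_stacks[POOL_STACKS];
static size_t    free_count;

/* Handler-side stand-in for __KER_TM_REGISTER_POOL_STACK. */
int register_pool_stack(uintptr_t base)
{
    if (free_count == POOL_STACKS)
        return -1;                      /* list is full */
    free_stacks[free_count++] = base;
    return 0;
}

/* Take a stack for a new pool thread; 0 means none available.
 * There is deliberately no fallback allocation from kernel memory. */
uintptr_t pool_stack_alloc(void)
{
    return free_count ? free_stacks[--free_count] : 0;
}
```

Returning a stack when a thread exits can reuse register_pool_stack, so the kernel never needs to know where the memory came from.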
826 → 0 warnings. The final commit of the cycle drove the warning count
from 826 to zero. 12 were genuine QRV-proper issues fixed at the root. 814
were -Wconversion in the mksh fork — suppressed at the Makefile level with
an explicit rationale: the mksh fork is diverging and will eventually be
renamed qsh; fixing 800 upstream conversion sites would cost effort that
belongs in the architectural work instead.
What This Means
Taskman is now an ordinary user-space process. It communicates with the kernel
through a defined, whitelist-gated ecall interface. A bug in taskman — a
corrupted pointer, an out-of-bounds write — cannot reach kernel data. The
KERNEL_INTERNAL guard makes this a compile-time property, not just a runtime
one: the compiler rejects any attempt to dereference a kernel struct from
taskman code.
The BKL work and the U-mode work ran on sequential branches. They are now merged into a single tree. The system boots, logs in, and runs on real hardware with both changes in place.