
Saturday, April 25, 2026

QRV v0.25: mount -t qrv /dev/nvme0n1p8 /disk2

That mount command works. On QEMU virt, a qrvfs filesystem written by the host mkfs-qrv tool to NVMe partition 8 is mounted at /disk2, browsable with ls, readable with cat, with all four resource managers — devb-nvme, fs-qrv on /dev/nvme0n1p8, devb-virtio, fs-qrv on /dev/vblk0 — running simultaneously in user space.

On the SiFive HiFive Unmatched, the DesignWare PCIe controller enumerates the full board topology: USB hub, Samsung MZVL2512HCJQ NVMe (144d:a80a), NVIDIA VGA, NVIDIA audio. devb-nvme reads the Samsung's GPT — eight partitions, ~2 GiB at partition 8 — and registers /dev/nvme0n1 and /dev/nvme0n1p8 in the QRV namespace.

That last partition has been sitting empty since July 2021, when the board arrived and a deliberate choice was made to leave space for an operating system that didn't exist yet. Today it does.


The sync_hash Collision

Before any of the above was possible, a fundamental kernel bug had to be found. The symptom: the second instance of any resmgr binary — fs-qrv spawned for /dev/nvme0n1p5 while another fs-qrv was already mounted on /dev/vblk0 — failed with EBUSY on pthread_mutex_init during resmgr_attach. The two processes were otherwise healthy. Killing one let the other proceed. Running them sequentially worked fine.

The root cause is in RISC-V's cpu_pageman_vaddrinfo. Rather than walking the user page tables to find the physical address backing a virtual address, it does a plain arithmetic subtraction: PA = VA - KERNEL_VIRT_BASE. For kernel addresses this gives the right answer, because every kernel VA maps to exactly one PA at that fixed offset. For user addresses it does not. Processes loaded at the same virtual base (no ASLR, same ET_DYN layout) place their arenas at identical virtual addresses, and the subtraction then produces identical "physical" addresses for both.

The kernel's sync_hash uses the physical address as the key. Two processes calling pthread_mutex_init at the same user VA produce the same (obj=_syspage_ptr, offset=VA-BIAS) key. The second one collides with the first's entry and gets EBUSY.

x86_64 never hit this because its cpu_pageman_vaddrinfo walks the actual page tables, so two processes at the same VA get genuinely different PAs. The bug was a RISC-V-specific shortcut that stayed harmless while only one instance of each resmgr ran and became fatal as soon as two instances of the same binary ran simultaneously.

Fix: in pageman_vaddr_to_memobj and pageman_munmap, return the calling tProcess * as the hash object for private mutexes, making the key (prp, offset) instead of (_syspage_ptr, offset). Per-process keys cannot collide across processes by construction.
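
To make the collision concrete, here is a minimal sketch of the two key schemes. It is not the kernel's code: struct sync_key, the helper names, and the KERNEL_VIRT_BASE value are all assumptions made for illustration. Only the shape of the key, (object, offset), comes from the description above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed value, for illustration only; the real constant is not shown here. */
#define KERNEL_VIRT_BASE UINT64_C(0xffffffc000000000)

/* Hypothetical stand-in for a sync_hash key: (object, offset). */
struct sync_key {
    const void *obj;
    uint64_t    offset;
};

static bool sync_key_equal(struct sync_key a, struct sync_key b)
{
    return a.obj == b.obj && a.offset == b.offset;
}

/* Old scheme: one shared object for everyone, so equal VAs give equal keys. */
static struct sync_key key_shared(const void *syspage, uint64_t va)
{
    struct sync_key k = { syspage, va - KERNEL_VIRT_BASE };
    return k;
}

/* Fixed scheme: the owning process is the object for private mutexes. */
static struct sync_key key_private(const void *prp, uint64_t va)
{
    struct sync_key k = { prp, va - KERNEL_VIRT_BASE };
    return k;
}
```

With the shared object, two processes that initialise a mutex at the same VA compare equal and the second sees EBUSY. With a per-process object the offsets still match, but the keys differ.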


The DesignWare iATU

Getting PCI enumeration to work on the Unmatched required understanding something non-obvious about the FU740's PCIe controller.

QEMU's pcie-ecam-generic host bridge provides a flat ECAM memory window: a contiguous region where the address encodes bus<<20 | slot<<15 | func<<12 | register. You mmap it once and read any device's config space by computing an offset. Straightforward.
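
The flat ECAM offset can be sketched as below. The function names are hypothetical and this is not taken from the pci-server sources. The bit layout is the one given above, with the field masks following the standard 5-bit slot and 3-bit function widths.

```c
#include <stdint.h>

/* Byte offset of a config register inside a flat ECAM window. */
static uint32_t ecam_offset(uint8_t bus, uint8_t slot, uint8_t func, uint16_t reg)
{
    return ((uint32_t)bus << 20) |
           ((uint32_t)(slot & 0x1f) << 15) |
           ((uint32_t)(func & 0x07) << 12) |
           ((uint32_t)reg & 0xfffu);
}

/* Read one aligned 32-bit config word through an already-mapped window. */
static uint32_t ecam_read32(const volatile uint8_t *ecam,
                            uint8_t bus, uint8_t slot, uint8_t func, uint16_t reg)
{
    uint32_t off = ecam_offset(bus, slot, func, (uint16_t)(reg & 0xffcu));
    return *(const volatile uint32_t *)(ecam + off);
}
```

Each function gets a 4 KiB slice of the window, so the whole 256-bus space fits in 256 MiB and one mapping is enough.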

The SiFive FU740 uses a Synopsys DesignWare PCIe controller. Its "config" memory window is not flat ECAM. The DW iATU encodes CFG TLPs as bus<<24 | slot<<19 | func<<16 | register — a different bit layout, wider per-field. A single iATU mapping cannot cover the whole bus space. The canonical pattern (FreeBSD pci_dw_read_config, Linux dw_pcie_other_conf_map_bus) is to re-aim outbound iATU region 0 to the specific (bus, slot, func) before each config access.
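
For comparison, the DesignWare target encoding can be sketched like this. dw_cfg_target is a hypothetical name, and the real driver writes this value into iATU registers that are not modelled here.

```c
#include <stdint.h>

/* Target value for a CFG TLP: bus<<24 | slot<<19 | func<<16. */
static uint32_t dw_cfg_target(uint8_t bus, uint8_t slot, uint8_t func)
{
    return ((uint32_t)bus << 24) |
           ((uint32_t)(slot & 0x1f) << 19) |
           ((uint32_t)(func & 0x07) << 16);
}
```

In this layout each function occupies a 64 KiB stride, so a flat map of all 256 buses would span 4 GiB. That arithmetic is why a single window is retargeted before each access instead.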

New file servers/pci/pci_dw_atu.c implements this:

  • pci_dw_atu_init(dbi_pa) mmaps 4 MiB of the DBI register space and auto-detects whether the controller uses legacy iATU (at DBI+0x900) or unroll mode (at DBI+0x300000) by reading DW_IATU_VIEWPORT, which returns 0xFFFFFFFF in unroll mode.
  • pci_dw_atu_aim_cfg(bus, slot, func, type, cfg_pa, cfg_size) programs region 0 with TYPE_CFG0 (immediate child, bus 1) or TYPE_CFG1 (downstream of a bridge, bus ≥ 2), then polls CTRL2.REGION_EN for completion.
  • Bus 0 (the controller's own pseudo-bridge) reads directly from DBI; non-zero devfunc on bus 0 returns 0xFFFFFFFF (no device).
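
The routing rules in the list above can be summarised as a small decision function. This is only a sketch: the enum and function names are invented for illustration, and the register programming itself is left out.

```c
#include <stdint.h>

/* Hypothetical labels for the four outcomes described above. */
enum cfg_route {
    ROUTE_DBI,        /* bus 0, devfunc 0: read the pseudo-bridge from DBI */
    ROUTE_NO_DEVICE,  /* bus 0, any other devfunc: report 0xFFFFFFFF */
    ROUTE_CFG0,       /* bus 1: immediate child, TYPE_CFG0 */
    ROUTE_CFG1        /* bus >= 2: behind a bridge, TYPE_CFG1 */
};

static enum cfg_route dw_cfg_route(uint8_t bus, uint8_t slot, uint8_t func)
{
    if (bus == 0)
        return (slot == 0 && func == 0) ? ROUTE_DBI : ROUTE_NO_DEVICE;
    return (bus == 1) ? ROUTE_CFG0 : ROUTE_CFG1;
}
```

Only the last two outcomes need the iATU to be re-aimed before the access.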

The FDT parser was also reworked: the FU740's device tree uses reg-names and interrupt-names to distinguish the DBI window from the ECAM/config window and MSI interrupt. Previously the parser read them in order and assigned ecam_base to whichever reg entry came first — on the FU740 that is the DBI window (wrong), not the config window. The fix makes the commit block idempotent and re-reads on the second pass once reg-names is available, correcting ecam_base to reg[1].
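
Looking a window up by name rather than by position might look like the sketch below. It assumes the usual FDT string-list encoding (NUL-terminated strings packed back to back). The function name and the example entry names "dbi", "config" and "mgmt" are assumptions, not quotes from the QRV parser or the board's DTB.

```c
#include <stddef.h>
#include <string.h>

/* Index of `name` in a packed string-list property, or -1 if absent
 * or if the list is malformed (last entry not NUL-terminated). */
static int fdt_stringlist_index(const char *list, size_t len, const char *name)
{
    size_t pos = 0;
    int idx = 0;

    while (pos < len) {
        const char *entry = list + pos;
        const char *end = memchr(entry, '\0', len - pos);
        if (end == NULL)
            return -1;
        if (strcmp(entry, name) == 0)
            return idx;
        pos += (size_t)(end - entry) + 1;
        idx++;
    }
    return -1;
}
```

The returned index then selects the matching reg entry, so the config window is found at index 1 regardless of which entry happens to come first.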

On the Unmatched: unroll mode detected, full PCIe topology enumerated.


PCIe INTx: Three Bugs in One Session

Getting interrupt-driven NVMe working on QEMU required fixing three latent bugs in the PCIe INTx allocation path, all of which meant pci_device_read_irq returned 0.

Bug 1: libpci's pci_device_attach was missing PCI_INIT_IRQ from the attach flags sent to pci-server. Without it, the server's entire IRQ-allocation path was skipped unconditionally.

Bug 2: The server's IRQ resource pool was never seeded on RISC-V. The original QNX flow relied on BIOS-programmed Interrupt_Line registers, which pci_enum reserved through pci_reserve_irq(). QEMU virt leaves Interrupt_Line at 0xFF, so nothing got reserved and every subsequent rsrcdbmgr_attach call failed. Fixed by seeding INTA..INTD (irq_base..irq_base+3) at ecam_attach() time.

Bug 3: ecam_avail_irq() returned the full unswizzled [INTA, INTB, INTC, INTD] list, so pci_alloc_irq picked the numerically lowest free IRQ rather than the pin-routed one. The correct result for a PCIe device is a single wire: irq_base + ((device + pin - 1) % 4). The fix reads Interrupt_Pin from config space and returns only that one.

After all three: pci_device_read_irq on the QEMU NVMe at 00:02.0 with pin A returns PLIC IRQ 34 — the wire the device actually asserts.
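
The swizzle from Bug 3 is small enough to show as a worked example. The function name is hypothetical, and returning 0 for "no pin" is an assumption. The value irq_base = 32 is not stated above; it is inferred from the result of 34 for device 2, pin A.

```c
/* pin: 1 = INTA, 2 = INTB, 3 = INTC, 4 = INTD (value of Interrupt_Pin).
 * Returns 0 when the device reports no interrupt pin. */
static unsigned intx_irq(unsigned irq_base, unsigned device, unsigned pin)
{
    if (pin < 1 || pin > 4)
        return 0;
    return irq_base + ((device + pin - 1) % 4);
}
```

For the NVMe at 00:02.0 with pin A this gives 32 + ((2 + 1 - 1) % 4) = 34, matching the IRQ reported above.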

devb-nvme then switched from polling to proper IST-driven command completion: a dedicated IST thread drains both admin and I/O CQs on each wake (INTx is shared between queues), latches (status, cid) into a per-queue mutex/condvar/done gate, signals the submitter, and unmasks the kernel vector. The polling path remains as fallback for admin commands during bringup before the IST starts.
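
A mutex/condvar/done gate of the kind described can be sketched with plain pthreads. The struct layout and function names are assumptions, not the driver's own, and the sketch tracks one outstanding command per queue.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct cq_gate {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    bool            done;    /* set by the IST, cleared by the waiter */
    uint16_t        status;  /* latched completion status */
    uint16_t        cid;     /* latched command identifier */
};

static int cq_gate_init(struct cq_gate *g)
{
    g->done = false;
    g->status = 0;
    g->cid = 0;
    int rc = pthread_mutex_init(&g->lock, NULL);
    if (rc != 0)
        return rc;
    return pthread_cond_init(&g->cond, NULL);
}

/* IST side: latch the completion and wake the submitter. */
static void cq_gate_post(struct cq_gate *g, uint16_t status, uint16_t cid)
{
    pthread_mutex_lock(&g->lock);
    g->status = status;
    g->cid = cid;
    g->done = true;
    pthread_cond_signal(&g->cond);
    pthread_mutex_unlock(&g->lock);
}

/* Submitter side: block until a completion has been latched. */
static void cq_gate_wait(struct cq_gate *g, uint16_t *status, uint16_t *cid)
{
    pthread_mutex_lock(&g->lock);
    while (!g->done)
        pthread_cond_wait(&g->cond, &g->lock);
    *status = g->status;
    *cid = g->cid;
    g->done = false;
    pthread_mutex_unlock(&g->lock);
}
```

The done flag is what makes the gate safe when the completion arrives before the submitter starts waiting: the wait loop sees done already set and returns at once.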


The Synopsys DW MSI Controller (Stages 1–3)

For MSI-X support — which NVMe devices strongly prefer over INTx — the FU740's internal Synopsys DesignWare MSI controller needs to be brought up. It receives posted MSI writes from devices, sets bits in MSI_INTR_STATUS_0, and asserts one aggregate PLIC IRQ (vector 56 on the Unmatched).

Three stages landed this release:

Stage 1: FDT parsing of the DBI window and MSI IRQ number. The kernel publishes a dw_pcie_msi child device under the PCI bus, carrying the DBI window location and aggregate PLIC IRQ. On QEMU virt nothing publishes this; the existing INTx path is completely unchanged.

Stage 2: ecam.c consumes the child device when present, picking up the DBI window address and MSI IRQ for later use.

Stage 3: pci_dwmsi.c, the controller bringup driver. It maps 4 KiB of DBI (the DTB declares a 2 GiB window; mapping it in full would consume pci-server's entire address space, so the mapping is capped), allocates an anonymous page as the MSI target, programs ADDR_LO/ADDR_HI/INTR_ENABLE/STATUS, and attaches an IST to the aggregate PLIC IRQ. Each MSI fire is logged. Per-vector pulse dispatch to waiting drivers is Stage 4, deferred.
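
Because one aggregate IRQ stands in for many vectors, the IST has to decode the status word to learn which MSIs fired. A minimal sketch of that decode step follows. The function name is hypothetical, and reading or acknowledging the hardware register is not modelled.

```c
#include <stdint.h>

/* Collect the vector numbers whose bits are set in a 32-bit status word.
 * `out` must have room for 32 entries. Returns the number of vectors found. */
static unsigned msi_pending_vectors(uint32_t status, uint8_t out[32])
{
    unsigned count = 0;

    for (unsigned v = 0; v < 32; v++) {
        if (status & (UINT32_C(1) << v))
            out[count++] = (uint8_t)v;
    }
    return count;
}
```

At this stage each returned vector would simply be logged; a later per-vector dispatch could walk the same list.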


Two Other Bugs Worth Noting

Recursive kmutex initialization. mutex_init() with PTHREAD_RECURSIVE_ENABLE was using KMUTEX_RMUTEX_INIT, which sets owner to QRV_SYNC_INITIALIZER (0xffffffff). That sentinel is for statically-initialized user-space pthread mutexes; kernel mutexes never go through the sync subsystem and their mutex_lock() spins waiting for owner==0. 0xffffffff never becomes 0. Any lock of a recursively-initialized kmutex spun forever.

Separately, taskman/sys/support.c's per-process lock was being created with a NULL attr (non-recursive + error-check), but reentrant taskman paths take the same process's mutex twice on one thread (QueryObject → handler → proc_lock_pid(same pid)), so they hit EDEADLK and crashed. Both are fixed; together they were taking down taskman on the Unmatched during pci-server startup.
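
The sentinel problem in the first of these bugs can be reproduced in miniature. The sketch below is not the kernel's mutex: the struct, the names, and the bounded spin count are assumptions, with the bound added only so that the failure is observable rather than an infinite loop.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* The value given above for QRV_SYNC_INITIALIZER. */
#define SENTINEL_OWNER UINT32_C(0xffffffff)

struct kmutex_sketch {
    _Atomic uint32_t owner;   /* 0 means free */
};

/* Spin until owner == 0, then claim it. Gives up after max_spins tries.
 * `self` must be non-zero. */
static bool kmutex_spin_acquire(struct kmutex_sketch *m, uint32_t self,
                                unsigned max_spins)
{
    for (unsigned i = 0; i < max_spins; i++) {
        uint32_t expected = 0;
        if (atomic_compare_exchange_strong(&m->owner, &expected, self))
            return true;
    }
    return false;
}
```

A mutex whose owner starts at 0 is acquired on the first try. One whose owner starts at the sentinel never is, because nothing ever stores 0 into it.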

vfprintf NUL-byte truncation. PRINT(ox, 2) and PRINT(&sign, 1) stash pointers to stack-local variables into the IOV vector for deferred __sfvwrite consumption. Both locals get reset to '\0' at the top of every new format specifier. When one format spec's PRINT'd pointer was still queued and the next spec's reset fired first, the pointed-to bytes became NUL before __sfvwrite read them — emitting a stray NUL mid-output that truncated any consumer reading the buffer as a C string.

The symptom: pci_slogf("%#lx PLIC IRQ %u", ...) on the Unmatched produced "0xe00000000/0" — the 0x prefix of the second %#lx became "0" + NUL when the trailing %u iteration wiped ox[1] before the flush. Fix: drain the IOV via FLUSH() at the end of the pforw block so the locals are consumed before the next iteration resets them. Thirteen new printf regression tests, all passing.
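
The aliasing behind this bug can be shown with a toy deferred-write queue. Nothing here is the libc code: struct out, out_queue, out_flush and emit_hex_then_unsigned are invented for illustration, and only the pattern (queue a pointer to a local, reset the local, flush later) follows the description above.

```c
#include <stddef.h>
#include <string.h>

struct iov_sketch { const char *base; size_t len; };

struct out {
    struct iov_sketch q[8];  /* deferred writes: pointers, not copies */
    size_t n;
    char   buf[64];          /* what the consumer finally sees */
    size_t used;
};

static void out_queue(struct out *o, const char *p, size_t len)
{
    if (o->n < 8) {
        o->q[o->n].base = p;
        o->q[o->n].len = len;
        o->n++;
    }
}

static void out_flush(struct out *o)
{
    for (size_t i = 0; i < o->n; i++) {
        size_t room = sizeof o->buf - o->used;
        size_t len = o->q[i].len < room ? o->q[i].len : room;
        memcpy(o->buf + o->used, o->q[i].base, len);
        o->used += len;
    }
    o->n = 0;
}

/* Models a "%#x" spec followed by a "%u" spec. */
static void emit_hex_then_unsigned(struct out *o, int flush_per_spec)
{
    char ox[2];

    /* Spec 1: build the "0x" prefix and queue a pointer to it. */
    ox[0] = '0';
    ox[1] = 'x';
    out_queue(o, ox, 2);
    if (flush_per_spec)
        out_flush(o);        /* the fix: consume before the next reset */

    /* Spec 2: the per-spec reset clobbers bytes that may still be queued. */
    ox[1] = '\0';

    out_flush(o);
}
```

Without the per-spec flush the buffer receives '0' followed by a NUL; with it, the buffer receives "0x".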


The Test Suite

61 tests across three subsystems now run on every build: 25 for TimerCreate/TimerDestroy, 23 for the sync subsystem (mutexes, condvars, semaphores — including error paths, resource exhaustion, and object reuse), and 13 for printf (including the %#lx+%u regression that caught the NUL bug). All 61 pass on QEMU virt.


The Partition That Waited

The CHANGELOG notes it plainly: devb-nvme on the Unmatched read the Samsung's GPT and found 8 partitions.

That last partition was formatted and left empty in July 2021 when the board first arrived. The intent, at the time, was to eventually fill it with a custom filesystem on a QNX-compatible microkernel ported to RISC-V. Four years and nine months later, mount -t qrv /dev/nvme0n1p8 /disk2 is a working command.
