That mount command works. On QEMU virt, a qrvfs filesystem written by the host
mkfs-qrv tool to NVMe partition 8 is mounted at /disk2, browsable with
ls, readable with cat, with all four resource managers — devb-nvme,
fs-qrv on /dev/nvme0n1p8, devb-virtio, fs-qrv on /dev/vblk0 —
running simultaneously in user space.
On the SiFive HiFive Unmatched, the DesignWare PCIe controller enumerates
the full board topology: USB hub, Samsung MZVL2512HCJQ NVMe
(144d:a80a), NVIDIA VGA, NVIDIA audio. devb-nvme reads the Samsung's
GPT — eight partitions, ~2 GiB at partition 8 — and registers
/dev/nvme0n1 and /dev/nvme0n1p8 in
the QRV namespace.
That last partition has been sitting empty since July 2021, when the board arrived and a deliberate choice was made to leave space for an operating system that didn't exist yet. Today it does.
The sync_hash Collision
Before any of the above was possible, a fundamental kernel bug had to be
found. The symptom: the second instance of any resmgr binary — fs-qrv
spawned for /dev/nvme0n1p5 while another fs-qrv was already mounted on
/dev/vblk0 — failed with EBUSY on pthread_mutex_init during
resmgr_attach. The two processes were otherwise healthy. Killing one let
the other proceed. Running them sequentially worked fine.
The root cause is in RISC-V's cpu_pageman_vaddrinfo. Rather than walking
the user page tables to find the physical address backing a virtual address,
it does a lexical subtraction: PA = VA - KERNEL_VIRT_BASE. For kernel
addresses this fixed-offset translation gives the right answer: every kernel VA maps to exactly
one PA. For user addresses from processes loaded at the same virtual base
(no ASLR, same ET_DYN layout), two processes end up at identical virtual
addresses for their arena, and the lexical subtraction produces identical
"physical" addresses for both.
The kernel's sync_hash uses the physical address as the key. Two
processes calling pthread_mutex_init at the same user VA produce the same
(obj=_syspage_ptr, offset=VA-BIAS) key. The second one collides with the
first's entry and gets EBUSY.
On x86_64 the collision never appeared because cpu_pageman_vaddrinfo walks the actual
page tables, so two processes at the same VA get genuinely different PAs.
The bug was a RISC-V-specific shortcut that was harmless when only one
instance of each resmgr ran, and fatal as soon as two instances of the same
binary ran simultaneously.
Fix: in pageman_vaddr_to_memobj and pageman_munmap, return the calling
tProcess * as the hash object for private mutexes, making the key
(prp, offset) instead of (_syspage_ptr, offset). Per-process keys
cannot collide across processes by construction.
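The difference between the two key schemes can be sketched in a few lines of C. This is only an illustration: the struct, the helper names, the stand-in for _syspage_ptr, and the KERNEL_VIRT_BASE value are all assumptions made for the example, not the kernel's actual definitions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed value, for illustration only. */
#define KERNEL_VIRT_BASE UINT64_C(0xffffffc000000000)

/* Hypothetical shape of a sync_hash key: (object, offset). */
struct sync_key {
    const void *obj;
    uint64_t    off;
};

/* Stand-in for _syspage_ptr: one global object shared by every process. */
static const int fake_syspage = 0;

/* Old scheme: the object is always the same, and the offset comes from the
 * lexical subtraction, so it depends only on the virtual address. */
static struct sync_key key_old(const void *proc, uint64_t va)
{
    (void)proc; /* the caller's identity is ignored */
    struct sync_key k = { &fake_syspage, va - KERNEL_VIRT_BASE };
    return k;
}

/* Fixed scheme: the calling process itself is the hash object. */
static struct sync_key key_new(const void *proc, uint64_t va)
{
    struct sync_key k = { proc, va - KERNEL_VIRT_BASE };
    return k;
}

static bool key_equal(struct sync_key a, struct sync_key b)
{
    return a.obj == b.obj && a.off == b.off;
}
```

With two distinct process objects and the same user VA, key_old yields equal keys (the EBUSY case), while key_new yields different ones.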
The DesignWare iATU
Getting PCI enumeration to work on the Unmatched required understanding something non-obvious about the FU740's PCIe controller.
QEMU's pcie-ecam-generic host bridge provides a flat ECAM memory window:
a contiguous region where the address encodes bus<<20 | slot<<15 | func<<12 | register. You mmap it once and read any device's config space
by computing an offset. Straightforward.
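As a rough sketch of that flat layout, the offset computation and a 32-bit config read might look like this. The function names are made up for the example, and the pointer is assumed to be the already-mapped ECAM window.

```c
#include <stdint.h>

/* Flat ECAM: bus<<20 | slot<<15 | func<<12 | register. */
static inline uint32_t ecam_offset(unsigned bus, unsigned slot,
                                   unsigned func, unsigned reg)
{
    return ((uint32_t)(bus  & 0xffu) << 20) |
           ((uint32_t)(slot & 0x1fu) << 15) |
           ((uint32_t)(func & 0x07u) << 12) |
           (uint32_t)(reg & 0xfffu);
}

/* Read one aligned 32-bit config register through the mapped window. */
static inline uint32_t ecam_read32(const volatile uint8_t *ecam,
                                   unsigned bus, unsigned slot,
                                   unsigned func, unsigned reg)
{
    return *(const volatile uint32_t *)(ecam + ecam_offset(bus, slot, func, reg));
}
```

For the QEMU NVMe at 00:02.0, ecam_offset(0, 2, 0, 0) is 0x10000, so its config space starts 64 KiB into the window.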
The SiFive FU740 uses a Synopsys DesignWare PCIe controller. Its "config"
memory window is not flat ECAM. The DW iATU encodes CFG TLPs as
bus<<24 | slot<<19 | func<<16 | register, a different bit layout with
every field shifted four bits higher. A single iATU mapping cannot cover the whole bus space.
The canonical pattern (FreeBSD pci_dw_read_config, Linux
dw_pcie_other_conf_map_bus) is to re-aim outbound iATU region 0 to the
specific (bus, slot, func) before each config access.
New file servers/pci/pci_dw_atu.c implements this:
- pci_dw_atu_init(dbi_pa) mmaps 4 MiB of the DBI register space and auto-detects whether the controller uses legacy iATU (at DBI+0x900) or unroll mode (at DBI+0x300000) by reading DW_IATU_VIEWPORT, which returns 0xFFFFFFFF in unroll mode.
- pci_dw_atu_aim_cfg(bus, slot, func, type, cfg_pa, cfg_size) programs region 0 with TYPE_CFG0 (immediate child, bus 1) or TYPE_CFG1 (downstream of bridge, bus ≥ 2), then polls CTRL2.REGION_EN for completion.
- Bus 0 (the controller's own pseudo-bridge) reads directly from DBI; a non-zero devfunc on bus 0 returns 0xFFFFFFFF (no device).
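The routing decision and the iATU target encoding described above can be sketched as two small pure functions. The enum and function names here are hypothetical and do not come from pci_dw_atu.c; only the bit layout and the bus rules are taken from the text.

```c
#include <stdint.h>

/* Hypothetical names for the four ways a config access can be served. */
enum dw_cfg_route {
    DW_ROUTE_DBI,   /* bus 0, devfunc 0: read the controller's own DBI space */
    DW_ROUTE_NONE,  /* bus 0, any other devfunc: no device, 0xFFFFFFFF */
    DW_ROUTE_CFG0,  /* bus 1: immediate child, TYPE_CFG0 */
    DW_ROUTE_CFG1   /* bus >= 2: downstream of a bridge, TYPE_CFG1 */
};

/* iATU target address: bus<<24 | slot<<19 | func<<16. */
static inline uint32_t dw_cfg_target(unsigned bus, unsigned slot, unsigned func)
{
    return ((uint32_t)(bus  & 0xffu) << 24) |
           ((uint32_t)(slot & 0x1fu) << 19) |
           ((uint32_t)(func & 0x07u) << 16);
}

/* Pick the access path for a given (bus, slot, func). */
static inline enum dw_cfg_route dw_cfg_route_for(unsigned bus, unsigned slot,
                                                 unsigned func)
{
    if (bus == 0)
        return (slot == 0 && func == 0) ? DW_ROUTE_DBI : DW_ROUTE_NONE;
    return (bus == 1) ? DW_ROUTE_CFG0 : DW_ROUTE_CFG1;
}
```

A real driver would write dw_cfg_target() into the region's target registers before each access; the register offsets differ between legacy and unroll mode and are left out here.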
The FDT parser was also reworked: the FU740's device tree uses reg-names
and interrupt-names to distinguish the DBI window from the ECAM/config
window and MSI interrupt. Previously the parser read them in order and
assigned ecam_base to whichever reg entry came first — on the FU740
that is the DBI window (wrong), not the config window. The fix makes the
commit block idempotent and re-reads on the second pass once reg-names
is available, correcting ecam_base to reg[1].
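Looking a window up by name rather than by position could be sketched like this. FDT string-list properties are a run of NUL-terminated strings with a known total length; the helper name is invented for the example, and the entry names used in the usage note ("dbi", "config", "mgmt") are an assumption about the FU740 device tree rather than something quoted from the parser.

```c
#include <stddef.h>
#include <string.h>

/* Return the index of `name` within an FDT-style string list
 * (consecutive NUL-terminated strings, `len` bytes in total),
 * or -1 if it is absent or the list is malformed. */
static int stringlist_index(const char *list, size_t len, const char *name)
{
    size_t pos = 0;
    int idx = 0;

    while (pos < len) {
        const char *entry = list + pos;
        const char *end = memchr(entry, '\0', len - pos);
        if (end == NULL)
            return -1; /* last entry is not terminated */
        if (strcmp(entry, name) == 0)
            return idx;
        pos += (size_t)(end - entry) + 1;
        idx++;
    }
    return -1;
}
```

With a reg-names value of "dbi\0config\0mgmt", looking up "config" gives index 1, which is the reg entry that ecam_base should come from.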
On the Unmatched: unroll mode detected, full PCIe topology enumerated.
PCIe INTx: Three Bugs in One Session
Getting interrupt-driven NVMe working on QEMU required fixing three latent
bugs in the PCIe INTx allocation path, all of which meant pci_device_read_irq
returned 0.
Bug 1: libpci's pci_device_attach was missing PCI_INIT_IRQ from
the attach flags sent to pci-server. Without it, the server's entire
IRQ-allocation path was skipped unconditionally.
Bug 2: The server's IRQ resource pool was never seeded on RISC-V. The
original QNX flow relied on BIOS-programmed Interrupt_Line registers that
pci_enum could pci_reserve_irq() from. QEMU virt leaves Interrupt_Line
at 0xFF. Nothing got reserved; every subsequent rsrcdbmgr_attach call
failed. Fixed by seeding INTA..INTD (irq_base..irq_base+3) at
ecam_attach() time.
Bug 3: ecam_avail_irq() returned the full unswizzled [INTA, INTB, INTC, INTD] list, so pci_alloc_irq picked the numerically lowest free
IRQ rather than the pin-routed one. The correct result for a PCIe device
is a single wire: irq_base + ((device + pin - 1) % 4). Read
Interrupt_Pin from config space and return that one.
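The swizzle from Bug 3 fits in one function. This is a sketch with an invented name; pin is the raw Interrupt_Pin value (1 for INTA through 4 for INTD), and the -1 return for an out-of-range pin is a choice made for the example.

```c
/* Map (device, Interrupt_Pin) to the single wire the device asserts.
 * pin: 1 = INTA ... 4 = INTD, as read from config space.
 * Returns -1 when the device reports no INTx pin. */
static inline int intx_irq(unsigned irq_base, unsigned device, unsigned pin)
{
    if (pin < 1 || pin > 4)
        return -1;
    return (int)(irq_base + ((device + pin - 1) % 4));
}
```

Working backwards from the result quoted for the QEMU NVMe (device 2, pin A, IRQ 34), irq_base there must be 32: 32 + ((2 + 1 - 1) % 4) = 34.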
After all three: pci_device_read_irq on the QEMU NVMe at 00:02.0 with
pin A returns PLIC IRQ 34 — the wire the device actually asserts.
devb-nvme then switched from polling to proper IST-driven command
completion: a dedicated IST thread drains both admin and I/O CQs on each
wake (INTx is shared between queues), latches (status, cid) into a
per-queue mutex/condvar/done gate, signals the submitter, and unmasks the
kernel vector. The polling path remains as fallback for admin commands
during bringup before the IST starts.
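A per-queue gate of that kind might look roughly like the following. It is a generic pthread sketch, not devb-nvme's code: the struct layout and function names are assumptions, and it handles one outstanding command per queue for simplicity.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-queue completion gate. */
struct cq_gate {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    bool            done;    /* set by the IST, cleared by the submitter */
    uint16_t        status;  /* latched completion status */
    uint16_t        cid;     /* latched command identifier */
};

static void gate_init(struct cq_gate *g)
{
    pthread_mutex_init(&g->lock, NULL);
    pthread_cond_init(&g->cond, NULL);
    g->done = false;
    g->status = 0;
    g->cid = 0;
}

/* IST side: latch the completion and wake the submitter. */
static void gate_complete(struct cq_gate *g, uint16_t status, uint16_t cid)
{
    pthread_mutex_lock(&g->lock);
    g->status = status;
    g->cid = cid;
    g->done = true;
    pthread_cond_signal(&g->cond);
    pthread_mutex_unlock(&g->lock);
}

/* Submitter side: block until the IST has latched a completion. */
static uint16_t gate_wait(struct cq_gate *g, uint16_t *cid_out)
{
    pthread_mutex_lock(&g->lock);
    while (!g->done) /* loop guards against spurious wakeups */
        pthread_cond_wait(&g->cond, &g->lock);
    uint16_t status = g->status;
    *cid_out = g->cid;
    g->done = false; /* re-arm for the next command */
    pthread_mutex_unlock(&g->lock);
    return status;
}
```

Because done is checked under the mutex, a completion that lands before the submitter starts waiting is not lost.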
The Synopsys DW MSI Controller (Stages 1–3)
For MSI-X support — which NVMe devices strongly prefer over INTx — the
FU740's internal Synopsys DesignWare MSI controller needs to be brought up.
It receives posted MSI writes from devices, sets bits in MSI_INTR_STATUS_0,
and asserts one aggregate PLIC IRQ (vector 56 on the Unmatched).
Three stages landed this release:
Stage 1: FDT parsing of the DBI window and MSI IRQ number. The kernel
publishes a dw_pcie_msi child device under the PCI bus, carrying the DBI
window location and aggregate PLIC IRQ. On QEMU virt nothing publishes this;
the existing INTx path is completely unchanged.
Stage 2: ecam.c consumes the child device when present, picking up the
DBI window address and MSI IRQ for later use.
Stage 3: pci_dwmsi.c — the controller bringup driver. Maps 4 KiB of
DBI (the DTB declares a 2 GiB window; mapping it in full would consume
pci-server's entire address space, so it is capped), allocates an
anonymous page as the MSI target, programs ADDR_LO/ADDR_HI/
INTR_ENABLE/STATUS, and attaches an IST to the aggregate PLIC IRQ. Each
MSI fire is logged. Per-vector pulse dispatch to waiting drivers is Stage 4,
deferred.
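The core of such an IST is turning a status word into individual vector numbers. A minimal sketch, assuming the handler has already read MSI_INTR_STATUS_0 into a local copy (the function name is invented, and acknowledging the hardware register is left out):

```c
#include <stdint.h>

/* Pop the lowest pending vector from a copy of the status word.
 * Returns the bit index (0..31) and clears it, or -1 when none remain. */
static int msi_next_vector(uint32_t *status)
{
    for (int bit = 0; bit < 32; bit++) {
        uint32_t mask = UINT32_C(1) << bit;
        if (*status & mask) {
            *status &= ~mask;
            return bit;
        }
    }
    return -1;
}
```

An IST would call this in a loop until it returns -1, logging each vector as Stage 3 does, or pulsing the owning driver once per-vector dispatch exists.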
Two Other Bugs Worth Noting
Recursive kmutex initialization. mutex_init() with
PTHREAD_RECURSIVE_ENABLE was using KMUTEX_RMUTEX_INIT, which sets
owner to QRV_SYNC_INITIALIZER (0xffffffff). That sentinel is for
statically-initialized user-space pthread mutexes; kernel mutexes never
go through the sync subsystem and their mutex_lock() spins waiting for
owner==0. 0xffffffff never becomes 0. Any lock of a
recursively-initialized kmutex spun forever.
Separately, taskman/sys/support.c's per-process lock was being created
with NULL attr (non-recursive + error-check). Reentrant taskman paths
that take the same process's mutex twice on one thread, such as
QueryObject → handler → proc_lock_pid(same pid), hit EDEADLK and
crashed. Both are fixed; together they were taking down taskman on the
Unmatched during pci-server startup.
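Both failure modes show up in a toy model of an owner-word lock. This is a C11-atomics sketch with invented names, not the kernel's kmutex; only the 0xffffffff sentinel value is taken from the text, and a try-lock stands in for the spin loop so the stuck case returns instead of hanging.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define QRV_SYNC_INITIALIZER 0xffffffffu /* user-space sentinel, per the text */

/* Hypothetical kernel mutex: owner == 0 means free. */
struct kmutex {
    _Atomic uint32_t owner;
    uint32_t         count; /* recursion depth */
};

/* One acquisition attempt; the real lock would spin on this. */
static bool kmutex_trylock(struct kmutex *m, uint32_t self)
{
    uint32_t expected = 0;
    if (atomic_compare_exchange_strong(&m->owner, &expected, self)) {
        m->count = 1;
        return true;
    }
    if (expected == self) { /* recursive re-entry by the current owner */
        m->count++;
        return true;
    }
    return false; /* held by someone else, or never actually free */
}
```

A mutex whose owner starts at QRV_SYNC_INITIALIZER never satisfies owner == 0, so every attempt fails and a spinning lock never returns. A mutex that starts at 0 can be taken twice by the same thread, which is what the reentrant taskman paths need.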
vfprintf NUL-byte truncation. PRINT(ox, 2) and PRINT(&sign, 1)
stash pointers to stack-local variables into the IOV vector for deferred
__sfvwrite consumption. Both locals get reset to '\0' at the top of
every new format specifier. When one format spec's PRINT'd pointer was
still queued and the next spec's reset fired first, the pointed-to bytes
became NUL before __sfvwrite read them — emitting a stray NUL
mid-output that truncated any consumer reading the buffer as a C string.
The symptom: pci_slogf("%#lx PLIC IRQ %u", ...) on the Unmatched
produced "0xe00000000/0" — the 0x prefix of the second %#lx became
"0" + NUL when the trailing %u iteration wiped ox[1] before the
flush. Fix: drain the IOV via FLUSH() at the end of the pforw block
so the locals are consumed before the next iteration resets them. Thirteen
new printf regression tests, all passing.
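The hazard is easy to reproduce in miniature. The sketch below is a simplified model, not the libc source: it queues pointers into a small IOV, resets ox[1] at the top of each "format specifier" the way the text describes, and either flushes per specifier (the fix) or only at the end (the bug). All names are invented for the example.

```c
#include <stddef.h>
#include <string.h>

struct iov { const char *base; size_t len; };
struct out { char buf[32]; size_t len; };

/* Copy every queued chunk to the output and empty the queue. */
static void flush_iov(struct out *o, const struct iov *v, int *n)
{
    for (int i = 0; i < *n; i++) {
        memcpy(o->buf + o->len, v[i].base, v[i].len);
        o->len += v[i].len;
    }
    *n = 0;
}

/* Model of formatting "%#x%u" with values 0xe and 7.
 * flush_each_spec != 0 drains the queue before the next spec resets ox. */
static size_t emit(struct out *o, int flush_each_spec)
{
    static const char *digits[2] = { "e", "7" };
    static const int   alt[2]    = { 1, 0 }; /* only the first spec has '#' */
    struct iov v[8];
    int n = 0;
    char ox[2] = { 0, 0 };

    o->len = 0;
    for (int spec = 0; spec < 2; spec++) {
        ox[1] = '\0'; /* per-spec reset of the stack-local prefix */
        if (alt[spec]) {
            ox[0] = '0';
            ox[1] = 'x';
            v[n++] = (struct iov){ ox, 2 }; /* pointer to a local, not a copy */
        }
        v[n++] = (struct iov){ digits[spec], 1 };
        if (flush_each_spec)
            flush_iov(o, v, &n);
    }
    flush_iov(o, v, &n);
    return o->len;
}
```

With per-spec flushing the buffer holds "0xe7". Without it, the second iteration's reset lands before the queued prefix is copied, and the output starts with '0' followed by a NUL byte.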
The Test Suite
61 tests across three subsystems now run on every build: 25 for
TimerCreate/TimerDestroy, 23 for the sync subsystem (mutexes,
condvars, semaphores — including error paths, resource exhaustion, and
object reuse), and 13 for printf (including the %#lx+%u regression
that caught the NUL bug). All 61 pass on QEMU virt.
The Partition That Waited
The CHANGELOG notes it plainly: devb-nvme on the Unmatched read the
Samsung's GPT and found 8 partitions.
That last partition was formatted and left empty in July 2021 when the board
first arrived. The intent, at the time, was to eventually fill it with a
custom filesystem on a QNX-compatible microkernel ported to RISC-V.
Four years and a few months later, mount -t qrv /dev/nvme0n1p8 /disk2
is a working command.