
Thursday, April 23, 2026

QRV v0.24: NVMe Driver, GPT Partitions, and libpci

QRV v0.24 is the NVMe release. From zero to a working block device driver with GPT partition support, developed in one day across four phases. The driver lives entirely in user space, built on top of the QNX resource manager framework and the PCI server that has been accumulating infrastructure since v0.22. The boot sequence now includes:

devb-nvme: GPT OK, 2 partition(s)
  p1: LBA 34..32801 (16 MiB) type=0fc63daf-... name="qrv-test-p1"
  p2: LBA 32802..65569 (16 MiB) type=0fc63daf-... name="qrv-test-p2"
devb-nvme: ready (3 path(s), first NS 64 MiB 512 B/LBA)
br--r--r--  1     0     0 67108864 nvme0n1
br--r--r--  1     0     0 16777216 nvme0n1p1
br--r--r--  1     0     0 16777216 nvme0n1p2

This is read-only for now. Writing to NVMe, mounting a filesystem on a partition, and the Unmatched bring-up are the next steps.


libpci: A Proper Client Library

Before the NVMe driver, the cleanup. Every /dev/pci client — lspci, devb-nvme in its prototype form — was carrying its own MsgSend plumbing, its own pci_cfg_read{8,16,32}, and its own IOM_PCI_ATTACH_DEVICE wrapper. That is the wrong way to grow a driver ecosystem.

lib/libpci pulls all of it up into a shared library with a public API modeled on QNX 8.0's <pci/pci.h>: pci_bdf_t, pci_devhdl_t, pci_ccode_t, and the full pci_device_{attach, detach, find, read_*, cfg_rd*, cfg_wr*, find_capid*, is_multi_func, read_ba, read_irq, write_cmd} surface. The shim underneath still speaks the QNX 6.4 IOM_PCI_* message protocol that pci-server uses.

lspci shrank from 370 lines to 219. The NVMe driver skeleton shrank from 290 lines to 133. Future drivers start with discovery, attach, and config reads already working.

A related fix: pci_device_find() was previously a client-side scan — 256 × 32 × 8 iterations of cfg_rd32, each a MsgSend round-trip to the PCI server, about 80 µs each. Five seconds of wall time just to find (or not find) a device. The server has already enumerated everything at startup and holds the list. pci_device_find() is now a single IOM_PCI_FIND_CLASS message. Boot time dropped noticeably.
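
The five-second figure follows directly from the numbers above. A minimal sketch of the arithmetic, with hypothetical helper names that are not part of libpci:

```c
#include <assert.h>
#include <stdint.h>

/* Number of config-space probes in a full client-side scan:
 * every bus (256) x every device slot (32) x every function (8). */
uint32_t pci_scan_probes(void)
{
    return 256u * 32u * 8u;
}

/* Total wall time in microseconds for one full scan, given the
 * cost of a single MsgSend round-trip to the PCI server. */
uint64_t pci_scan_cost_us(uint32_t per_probe_us)
{
    assert(per_probe_us > 0);
    return (uint64_t)pci_scan_probes() * per_probe_us;
}
```

With the 80 µs round-trip quoted above, pci_scan_cost_us(80) comes to 5,242,880 µs, a little over five seconds for 65,536 probes.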


The NVMe Driver: Four Phases

Phase A: PCI probe and skeleton

devb-nvme starts by scanning /dev/pci for base class 0x01 / subclass 0x08 (NVMe storage), attaches with PCI_INIT_BASE0 | PCI_MASTER_ENABLE so the server programs the BARs, and reads the NVMe capability registers. This is the scaffolding phase — no queues, no commands, just confirming the controller is present and readable.
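
For illustration, the class match can be sketched against a raw copy of the config header. The offsets are the standard PCI ones; the function name and the idea of passing a byte buffer are assumptions for this example, not how libpci or devb-nvme actually do it:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Standard PCI config header offsets for the class code bytes. */
#define PCI_CFG_PROG_IF     0x09u
#define PCI_CFG_SUBCLASS    0x0Au
#define PCI_CFG_BASE_CLASS  0x0Bu

/* Hypothetical helper: cfg points at (at least) the first 16 bytes
 * of a function's config space. Returns true for base class 0x01
 * (mass storage) with subclass 0x08 (non-volatile memory). */
bool pci_cfg_is_nvm_storage(const uint8_t *cfg)
{
    assert(cfg != NULL);
    return cfg[PCI_CFG_BASE_CLASS] == 0x01 &&
           cfg[PCI_CFG_SUBCLASS] == 0x08;
}
```

An AHCI controller (subclass 0x06) shares the base class but fails the subclass test, so the driver would skip it.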

emu.sh gets a -nvme flag that attaches a 64 MiB emulated NVMe drive (nvme.img, auto-created on first use). Without the flag, the driver finds nothing and exits cleanly.

Phase B: Controller reset, admin queue, Identify

The controller is reset (CC.EN=0, wait for CSTS.RDY=0), an admin submission and completion queue pair is allocated with MAP_ANON and their physical addresses discovered via mem_offset(), and the controller is re-enabled (program AQA/ASQ/ACQ, set CC, wait for CSTS.RDY=1).
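
The register values involved in the re-enable step can be sketched as follows. The offsets and bit positions come from the NVMe base specification; the macro and function names are made up for this example, and the choice of 4 KiB pages with 64-byte and 16-byte queue entries is an assumption about what the driver programs:

```c
#include <assert.h>
#include <stdint.h>

/* Controller register offsets (NVMe base specification). */
#define NVME_REG_CC    0x14u  /* Controller Configuration */
#define NVME_REG_CSTS  0x1Cu  /* Controller Status, bit 0 = RDY */
#define NVME_REG_AQA   0x24u  /* Admin Queue Attributes */
#define NVME_REG_ASQ   0x28u  /* Admin SQ base address, 64-bit */
#define NVME_REG_ACQ   0x30u  /* Admin CQ base address, 64-bit */

#define NVME_CC_EN         (1u << 0)
#define NVME_CC_MPS(n)     ((uint32_t)(n) << 7)   /* page size = 2^(12+n) */
#define NVME_CC_IOSQES(n)  ((uint32_t)(n) << 16)  /* log2 of SQ entry size */
#define NVME_CC_IOCQES(n)  ((uint32_t)(n) << 20)  /* log2 of CQ entry size */

/* AQA stores both admin queue sizes as zero-based entry counts:
 * ASQS in bits 11:0, ACQS in bits 27:16. */
uint32_t nvme_aqa(uint32_t sq_entries, uint32_t cq_entries)
{
    assert(sq_entries >= 2 && sq_entries <= 4096);
    assert(cq_entries >= 2 && cq_entries <= 4096);
    return ((cq_entries - 1) << 16) | (sq_entries - 1);
}

/* CC value that enables the controller with 4 KiB pages,
 * 64-byte SQ entries (2^6) and 16-byte CQ entries (2^4). */
uint32_t nvme_cc_enable_4k(void)
{
    return NVME_CC_IOCQES(4) | NVME_CC_IOSQES(6) |
           NVME_CC_MPS(0) | NVME_CC_EN;
}
```

The ordering matters: AQA, ASQ, and ACQ have to be written while CC.EN is still 0, and only then is CC written with EN set.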

Three Identify commands enumerate the namespaces: Identify Controller (CNS 0x01), Identify Active Namespace List (CNS 0x02), and Identify Namespace (CNS 0x00) for each active NSID.
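
A sketch of how such an Identify command might be assembled. The 64-byte submission entry layout, the 0x06 opcode, and the CNS values are from the NVMe specification; the struct and function names are hypothetical and not taken from devb-nvme:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Common 64-byte submission queue entry. */
struct nvme_sqe {
    uint32_t cdw0;        /* opcode in bits 7:0, command id in 31:16 */
    uint32_t nsid;
    uint32_t cdw2, cdw3;
    uint64_t mptr;
    uint64_t prp1;        /* physical address of the data buffer */
    uint64_t prp2;
    uint32_t cdw10, cdw11, cdw12, cdw13, cdw14, cdw15;
};
static_assert(sizeof(struct nvme_sqe) == 64, "SQE must be 64 bytes");

enum { NVME_ADMIN_IDENTIFY = 0x06 };

enum nvme_cns {
    NVME_CNS_NAMESPACE      = 0x00,  /* Identify Namespace (per NSID) */
    NVME_CNS_CONTROLLER     = 0x01,  /* Identify Controller */
    NVME_CNS_ACTIVE_NS_LIST = 0x02   /* Active Namespace ID list */
};

/* Build an Identify command. buf_phys must be 4 KiB aligned so that
 * PRP1 alone covers the 4096-byte Identify data structure. */
struct nvme_sqe nvme_build_identify(uint16_t cid, uint32_t nsid,
                                    enum nvme_cns cns, uint64_t buf_phys)
{
    struct nvme_sqe sqe;

    assert((buf_phys & 0xFFFu) == 0);
    memset(&sqe, 0, sizeof sqe);
    sqe.cdw0  = NVME_ADMIN_IDENTIFY | ((uint32_t)cid << 16);
    sqe.nsid  = nsid;
    sqe.prp1  = buf_phys;
    sqe.cdw10 = (uint32_t)cns;       /* CNS lives in CDW10 bits 7:0 */
    return sqe;
}
```

The three calls then differ only in the CNS value and in whether NSID is meaningful: it is used for Identify Namespace, and set to 0 here for the controller and namespace-list forms.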

One bug was caught here that is worth noting explicitly. Each Identify call was allocating a fresh MAP_ANON page and munmap()ing it between calls. The second and third Identify results came back as garbage, and non-deterministic garbage at that. When inserting a debug print appears to fix a bug like this, that is the first warning sign of a memory ordering issue. The underlying problem: physical page reuse with stale cache state, combined with the fact that RISC-V plain loads are not ordered and volatile only blocks the compiler. Switching to a single persistent Identify buffer (one mmap at init, zeroed before each use, never unmapped) produced 8/8 clean runs.

Phase C: I/O queue and first Read

An I/O queue pair is created at qid=1 via the admin Create-IO-CQ and Create-IO-SQ commands. A 1-LBA Read against LBA 0 of the first active namespace is issued and the result hexdumped.

With a recognizable pattern planted in nvme.img:

devb-nvme: Read nsid=1 LBA=0 (512 bytes) OK
  0000: 51 52 56 2d 4f 53 20 4e 56 4d 65 20 74 65 73 74  |QRV-OS NVMe test|
  0010: 20 70 61 74 74 65 72 6e 3a 20 4c 42 41 20 30 20  | pattern: LBA 0 |

A second ordering bug surfaced here. The polling path read cqe->status, checked the phase bit, and then read cqe->cid and the PRP1 payload buffer. On x86 this is fine because loads are ordered. On RISC-V they are not: the CPU is free to reorder the payload read before the phase check. The symptom was non-deterministic garbage in consecutive Identify buffers, with each run corrupting differently. The fix: __atomic_thread_fence(__ATOMIC_ACQUIRE) immediately after the phase check. Five clean runs afterwards.

This is a class of bug that trips up essentially everyone porting code from x86 to a weakly-ordered architecture. The code was correct for the hardware it was written on. RISC-V doesn't make the same promises, and neither volatile nor careful reading of generated assembly makes it obvious.
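
The shape of the fixed polling check, as a minimal sketch. The completion entry layout follows the NVMe specification; the struct and function names are hypothetical, and the sketch assumes the queue lives in ordinary coherent memory so that an acquire fence (a GCC/Clang builtin) is sufficient:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* 16-byte completion queue entry. */
struct nvme_cqe {
    uint32_t dw0;       /* command specific */
    uint32_t dw1;       /* reserved */
    uint16_t sq_head;
    uint16_t sq_id;
    uint16_t cid;       /* command identifier */
    uint16_t status;    /* bit 0 = phase tag, bits 15:1 = status field */
};
static_assert(sizeof(struct nvme_cqe) == 16, "CQE must be 16 bytes");

/* Returns true if the entry carries the expected phase, in which case
 * *cid_out and *status_out are filled in. Returns false otherwise. */
bool nvme_cqe_ready(const volatile struct nvme_cqe *cqe,
                    unsigned expected_phase,
                    uint16_t *cid_out, uint16_t *status_out)
{
    assert(cqe != NULL && cid_out != NULL && status_out != NULL);

    uint16_t st = cqe->status;
    if ((st & 1u) != (expected_phase & 1u))
        return false;               /* controller has not posted yet */

    /* Keep every later load (cid, payload buffer) from being
     * satisfied before the phase check above. volatile does not
     * do this; it only constrains the compiler. */
    __atomic_thread_fence(__ATOMIC_ACQUIRE);

    *cid_out = cqe->cid;
    *status_out = (uint16_t)(st >> 1);
    return true;
}
```

The caller reads the PRP1 payload only after this function returns true, so the payload loads also fall behind the fence.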

Phase D: Resource manager, /dev/nvme0n1

The driver is promoted from a probe tool to a resident daemon. After controller bring-up, it registers /dev/nvme0n1 as a read-only block-device resource manager and enters the dispatch loop.

The read path: io_read computes the LBA span covering the requested byte range, issues nvme_cmd_read() into a per-driver 4 KiB bounce buffer, and MsgReplys the requested slice. One 4 KiB page per call so PRP1 covers the entire transfer.
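
The LBA span computation can be sketched like this. The names are hypothetical, and the short-read behaviour (returning fewer bytes when the request would overflow the bounce buffer) is an assumption about how the 4 KiB limit is enforced:

```c
#include <assert.h>
#include <stdint.h>

#define BOUNCE_BYTES 4096u   /* one page, so PRP1 covers the transfer */

struct lba_span {
    uint64_t first_lba;  /* first LBA to read from the device */
    uint32_t nlba;       /* number of LBAs to read */
    uint32_t skip;       /* bytes to skip at the start of the bounce buffer */
    uint32_t nbytes;     /* bytes actually returned to the client */
};

/* Map a byte range onto whole LBAs, clamped to the bounce buffer. */
struct lba_span lba_span_for(uint64_t offset, uint32_t nbytes,
                             uint32_t lba_size)
{
    struct lba_span s;

    assert(lba_size > 0 && lba_size <= BOUNCE_BYTES);
    assert(BOUNCE_BYTES % lba_size == 0);

    s.first_lba = offset / lba_size;
    s.skip = (uint32_t)(offset % lba_size);

    /* Clamp so that skip + nbytes fits in the bounce buffer. */
    uint32_t room = BOUNCE_BYTES - s.skip;
    s.nbytes = nbytes < room ? nbytes : room;

    /* Round up to whole LBAs; a zero-length read touches no LBA. */
    s.nlba = s.nbytes == 0
           ? 0
           : (s.skip + s.nbytes + lba_size - 1) / lba_size;
    return s;
}
```

A 100-byte read at offset 1000 with 512-byte LBAs starts at LBA 1, spans 2 LBAs, and skips 488 bytes of the bounce buffer before the slice that is replied.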

One sharp edge caught during bring-up: dispatch_create must come before any /dev/pci traffic. The first attempted resmgr_attach(/dev/nvme0n1) after a successful pci_device_attach() failed with errno = -EBUSY. Creating the dispatch — and calling the "set up internals" resmgr_attach(path=NULL) — at the very top of main(), before any PCI messages, makes the later path registration succeed 5/5. This is almost certainly a channel/connection slot ordering issue inside libc's dispatch and message plumbing. Documented here for the next driver to hit it.

Phase 3e: GPT partitions

After registering /dev/nvme0n1 for the whole namespace, the driver walks the GPT via libgpt and registers one resource manager path per partition. Each partition node gets its own nvme_region_t (an iofunc_attr_t subclass) carrying start_lba, nlba, nsid, and size_bytes. Every handler recovers the region from ocb->attr and enforces partition boundaries, so a partition node cannot read outside its own range.
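
A sketch of the boundary check, using a simplified stand-in for nvme_region_t. The struct below is hypothetical and omits the iofunc_attr_t base and the nsid field; only the clamping logic is shown:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for a partition region (hypothetical). */
struct part_region {
    uint64_t start_lba;  /* first LBA of the partition on the namespace */
    uint64_t nlba;       /* partition length in LBAs */
    uint32_t lba_size;   /* bytes per LBA */
};

/* Clamp a partition-relative read to the partition's own range.
 * Returns the number of bytes that may be read (0 means EOF) and
 * stores the namespace-absolute byte offset in *abs_offset. */
uint32_t part_region_clamp(const struct part_region *r,
                           uint64_t offset, uint32_t nbytes,
                           uint64_t *abs_offset)
{
    assert(r != NULL && abs_offset != NULL);

    uint64_t size = r->nlba * r->lba_size;
    if (offset >= size)
        return 0;                      /* at or past end of partition */

    uint64_t left = size - offset;
    if (nbytes > left)
        nbytes = (uint32_t)left;       /* truncate at the boundary */

    *abs_offset = r->start_lba * r->lba_size + offset;
    return nbytes;
}
```

For the first test partition (start LBA 34, 32768 LBAs of 512 bytes), a read at partition offset 0 lands at byte 17408 of the namespace, and a 512-byte read starting 100 bytes before the end is cut to 100 bytes.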

If there is no GPT, a bad CRC, or a hybrid MBR, the driver falls back to raw /dev/nvme0n1 only. This is non-fatal and produces a diagnostic.


libgpt

lib/libgpt is a pure GPT parser. The caller supplies a read-LBA callback; the library does no I/O of its own. It validates the protective MBR, GPT header signature and revision, header CRC32, and partition-array CRC32. UTF-16LE partition names are unpacked to UTF-8. Plain MBR and hybrid MBR layouts are rejected outright: a hybrid MBR (an MBR carrying partitions other than a single 0xEE protective entry) is explicitly not supported, and the library says so clearly rather than guessing.
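
The header CRC check can be sketched as below. This is not libgpt's code; the function names are hypothetical. It relies on two facts from the UEFI specification: the CRC32 field sits at byte offset 16 of the header, and the checksum is computed over HeaderSize bytes with that field treated as zero:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Bitwise CRC-32 (reflected polynomial 0xEDB88320), as used by GPT. */
uint32_t gpt_crc32(const void *data, size_t len)
{
    const uint8_t *p = data;
    uint32_t crc = 0xFFFFFFFFu;

    while (len--) {
        crc ^= *p++;
        for (int i = 0; i < 8; i++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

#define GPT_HDR_CRC_OFFSET 16u
#define GPT_HDR_MIN_SIZE   92u

/* hdr points at the raw header sector, hdr_size is its HeaderSize
 * field. Returns 1 if the stored CRC matches, 0 otherwise. */
int gpt_header_crc_ok(const uint8_t *hdr, uint32_t hdr_size)
{
    uint8_t copy[512];

    assert(hdr != NULL);
    if (hdr_size < GPT_HDR_MIN_SIZE || hdr_size > sizeof copy)
        return 0;

    memcpy(copy, hdr, hdr_size);
    uint32_t stored = (uint32_t)copy[16]
                    | ((uint32_t)copy[17] << 8)
                    | ((uint32_t)copy[18] << 16)
                    | ((uint32_t)copy[19] << 24);

    /* The CRC is computed with its own field zeroed. */
    memset(copy + GPT_HDR_CRC_OFFSET, 0, 4);
    return gpt_crc32(copy, hdr_size) == stored;
}
```

The partition-array CRC works the same way, except that the array is checksummed as-is and compared against the value stored in the header.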

A host-side self-test (make test in lib/libgpt) exercises the valid path and four rejection paths: plain MBR, hybrid MBR, bad header CRC, bad entry-array CRC.

scripts/mkgpt.py is a pure-Python host tool that writes a valid GPT to an image file — protective MBR, primary header, primary entry array (128 × 128 bytes), using the Linux filesystem data type GUID. The emu.sh -nvme path uses it to lay out two 16 MiB test partitions on first run.
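
The layout numbers line up with the boot log at the top of this post. A short sketch of the arithmetic, written in C rather than Python to match the driver code; the helper names are hypothetical and are not taken from mkgpt.py:

```c
#include <assert.h>
#include <stdint.h>

#define GPT_ENTRIES      128u
#define GPT_ENTRY_BYTES  128u
#define GPT_ARRAY_LBA    2u   /* LBA 0 = protective MBR, LBA 1 = header */

/* First LBA available for partitions: the entry array starts at
 * LBA 2 and occupies 128 * 128 = 16384 bytes. */
uint64_t gpt_first_usable_lba(uint32_t lba_size)
{
    assert(lba_size >= 512);
    uint32_t array_bytes = GPT_ENTRIES * GPT_ENTRY_BYTES;
    uint32_t array_lbas = (array_bytes + lba_size - 1) / lba_size;
    return GPT_ARRAY_LBA + array_lbas;
}

/* Inclusive last LBA of a partition of size_bytes starting at first_lba. */
uint64_t gpt_part_last_lba(uint64_t first_lba, uint64_t size_bytes,
                           uint32_t lba_size)
{
    assert(lba_size > 0 && size_bytes >= lba_size);
    return first_lba + size_bytes / lba_size - 1;
}
```

With 512-byte LBAs the entry array takes 32 sectors, so the first usable LBA is 34. Two 16 MiB partitions placed back to back then cover LBA 34..32801 and 32802..65569, matching the p1 and p2 lines in the log.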


PCI Discovery via Syspage

pci-server was using a hardcoded platforms[] table to know where the ECAM region lives on supported boards. That is the wrong architecture for a system that is supposed to be device-tree-driven.

The kernel's FDT parser now publishes the PCI host bridge to hwinfo as an HWI_ITEM_BUS_PCI bus with location, IRQ, and PCI window tags (memory, IO, prefetchable, all with correct flags). pci-server discovers the bridge via hwinfo_find_bus(). The hardcoded table is gone. If the kernel didn't publish a PCIe bus, pci-server fails with ENODEV rather than silently defaulting to a QEMU virt address.


What Comes Next

/dev/nvme0n1, /dev/nvme0n1p1, and /dev/nvme0n1p2 exist, are readable, and have correct sizes. The next step is mounting fs-qrv on a partition instead of on the virtio block device, and then bringing this up on the Unmatched, which has a real M.2 NVMe slot and a PCIe bus waiting.
