QRV v0.24 is the NVMe release. From zero to a working block device driver with GPT partition support, developed in one day across four phases. The driver lives entirely in user space, built on top of the QNX resource manager framework and the PCI server that has been accumulating infrastructure since v0.22. The boot sequence now includes:
devb-nvme: GPT OK, 2 partition(s)
p1: LBA 34..32801 (16 MiB) type=0fc63daf-... name="qrv-test-p1"
p2: LBA 32802..65569 (16 MiB) type=0fc63daf-... name="qrv-test-p2"
devb-nvme: ready (3 path(s), first NS 64 MiB 512 B/LBA)
br--r--r-- 1 0 0 67108864 nvme0n1
br--r--r-- 1 0 0 16777216 nvme0n1p1
br--r--r-- 1 0 0 16777216 nvme0n1p2
This is read-only for now. Writing to NVMe, mounting a filesystem on a partition, and the Unmatched bring-up are the next steps.
libpci: A Proper Client Library
Before the NVMe driver, the cleanup. Every /dev/pci client — lspci,
devb-nvme in its prototype form — was carrying its own MsgSend plumbing,
its own pci_cfg_read{8,16,32}, and its own IOM_PCI_ATTACH_DEVICE
wrapper. That is the wrong way to grow a driver ecosystem.
lib/libpci pulls all of it up into a shared library with a public API
modeled on QNX 8.0's <pci/pci.h>: pci_bdf_t, pci_devhdl_t,
pci_ccode_t, and the full pci_device_{attach, detach, find, read_*, cfg_rd*, cfg_wr*, find_capid*, is_multi_func, read_ba, read_irq, write_cmd} surface. The
shim underneath still speaks the QNX 6.4 IOM_PCI_* message protocol that
pci-server uses.
lspci shrank from 370 lines to 219. The NVMe driver skeleton shrank from
290 lines to 133. Future drivers start with discovery, attach, and config
reads already working.
A related fix: pci_device_find() was previously a client-side scan of
256 × 32 × 8 = 65,536 iterations of cfg_rd32, each a MsgSend round-trip to the
PCI server at about 80 µs. That is roughly five seconds of wall time just to
find (or not find) a device. The server has already enumerated everything at
startup and holds the list, so pci_device_find() is now a single
IOM_PCI_FIND_CLASS message. Boot time dropped noticeably.
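The five-second figure follows directly from the numbers above. A minimal sketch of the arithmetic, where pci_scan_cost_us is a hypothetical helper and not part of libpci:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not a libpci function): total cost, in microseconds,
 * of probing every possible bus/device/function with one config read each,
 * where every read is one message round-trip. */
uint64_t pci_scan_cost_us(uint32_t buses, uint32_t devs,
                          uint32_t funcs, uint32_t us_per_read)
{
    assert(buses <= 256 && devs <= 32 && funcs <= 8);
    return (uint64_t)buses * devs * funcs * us_per_read;
}
```

With 256 buses, 32 devices, 8 functions and 80 µs per read this gives 5,242,880 µs, or about 5.2 seconds, against a single round-trip for the server-side lookup.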
The NVMe Driver: Four Phases
Phase A: PCI probe and skeleton
devb-nvme starts by scanning /dev/pci for base class 0x01 / subclass
0x08 (NVMe storage), attaches with PCI_INIT_BASE0 | PCI_MASTER_ENABLE so
the server programs the BARs, and reads the NVMe capability registers.
This is the scaffolding phase — no queues, no commands, just confirming the
controller is present and readable.
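The class match itself is a simple decode of the 32-bit value at PCI config offset 0x08. A sketch of that decode, with struct pci_class and both function names being illustrative and not the libpci API:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PCI_CLASS_MASS_STORAGE 0x01u
#define PCI_SUBCLASS_NVM       0x08u

/* Illustrative type, not taken from libpci. */
struct pci_class {
    uint8_t base;      /* bits 31:24 */
    uint8_t sub;       /* bits 23:16 */
    uint8_t prog_if;   /* bits 15:8  */
    uint8_t revision;  /* bits 7:0   */
};

/* Split the class/revision dword found at config offset 0x08. */
void pci_decode_class(uint32_t class_rev, struct pci_class *out)
{
    assert(out != NULL);
    out->base     = (uint8_t)(class_rev >> 24);
    out->sub      = (uint8_t)(class_rev >> 16);
    out->prog_if  = (uint8_t)(class_rev >> 8);
    out->revision = (uint8_t)class_rev;
}

/* True when the function is mass storage / non-volatile memory. */
bool pci_is_nvme(uint32_t class_rev)
{
    struct pci_class c;
    pci_decode_class(class_rev, &c);
    return c.base == PCI_CLASS_MASS_STORAGE && c.sub == PCI_SUBCLASS_NVM;
}
```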
emu.sh gets a -nvme flag that attaches a 64 MiB emulated NVMe drive
(nvme.img, auto-created on first use). Without the flag, the driver finds
nothing and exits cleanly.
Phase B: Controller reset, admin queue, Identify
The controller is reset (CC.EN=0, wait for CSTS.RDY=0), an admin submission
and completion queue pair is allocated with MAP_ANON and their physical
addresses discovered via mem_offset(), and the controller is re-enabled
(program AQA/ASQ/ACQ, set CC, wait for CSTS.RDY=1).
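The register values written during re-enable can be sketched as below. The offsets and bit positions follow the NVMe base specification; the function names and the choice of 64-byte SQEs and 16-byte CQEs are assumptions for illustration, not taken from devb-nvme.

```c
#include <assert.h>
#include <stdint.h>

/* Controller register offsets (NVMe base specification). */
enum {
    NVME_REG_CC   = 0x14,  /* Controller Configuration */
    NVME_REG_CSTS = 0x1C,  /* Controller Status */
    NVME_REG_AQA  = 0x24,  /* Admin Queue Attributes */
    NVME_REG_ASQ  = 0x28,  /* Admin SQ base address (64-bit) */
    NVME_REG_ACQ  = 0x30   /* Admin CQ base address (64-bit) */
};

#define NVME_CC_EN    (1u << 0)
#define NVME_CSTS_RDY (1u << 0)

/* AQA carries both admin queue sizes as 0-based entry counts:
 * ASQS in bits 11:0, ACQS in bits 27:16. */
uint32_t nvme_aqa(uint32_t sq_entries, uint32_t cq_entries)
{
    assert(sq_entries >= 2 && sq_entries <= 4096);
    assert(cq_entries >= 2 && cq_entries <= 4096);
    return ((cq_entries - 1) << 16) | (sq_entries - 1);
}

/* CC value for enabling the controller: EN=1, CSS=0 (NVM command set),
 * MPS = page_shift - 12, IOSQES=6 (2^6-byte SQEs), IOCQES=4 (2^4-byte CQEs). */
uint32_t nvme_cc_enable(uint32_t page_shift)
{
    assert(page_shift >= 12 && page_shift <= 27);
    uint32_t mps = page_shift - 12;
    return (4u << 20) | (6u << 16) | (mps << 7) | NVME_CC_EN;
}
```

The order matters: AQA, ASQ and ACQ are programmed while CC.EN is still 0, and only then is CC written with EN=1 and CSTS.RDY polled.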
Three Identify commands enumerate the namespaces: Identify Controller (CNS 0x01), Identify Active Namespace List (CNS 0x02), and Identify Namespace (CNS 0x00) for each active NSID.
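All three Identify variants share one command layout and differ only in CNS and NSID. A minimal sketch of building the submission queue entry, assuming a single page-aligned 4 KiB data buffer so that PRP1 alone is enough; struct nvme_sqe and nvme_build_identify are illustrative names:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* 64-byte submission queue entry viewed as 16 dwords (illustrative type). */
struct nvme_sqe { uint32_t dw[16]; };

enum { NVME_ADMIN_IDENTIFY = 0x06 };
enum {
    NVME_CNS_NAMESPACE      = 0x00,
    NVME_CNS_CONTROLLER     = 0x01,
    NVME_CNS_ACTIVE_NS_LIST = 0x02
};

void nvme_build_identify(struct nvme_sqe *sqe, uint16_t cid, uint32_t nsid,
                         uint8_t cns, uint64_t prp1)
{
    assert(sqe != NULL);
    assert((prp1 & 0xFFFu) == 0);          /* sketch assumes page alignment */
    memset(sqe, 0, sizeof *sqe);
    sqe->dw[0]  = NVME_ADMIN_IDENTIFY | ((uint32_t)cid << 16); /* opcode, CID */
    sqe->dw[1]  = nsid;                    /* namespace ID */
    sqe->dw[6]  = (uint32_t)prp1;          /* PRP1 low  */
    sqe->dw[7]  = (uint32_t)(prp1 >> 32);  /* PRP1 high */
    sqe->dw[10] = cns;                     /* CDW10 bits 7:0 = CNS */
}
```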
One bug was caught here that is worth noting explicitly. Each Identify call
was allocating a fresh MAP_ANON page and munmap()ing it between calls.
The second and third Identify results came back as garbage: non-deterministic
garbage, the kind that disappears when a debug print is inserted, which is the
first warning sign of a memory-ordering issue. The underlying problem: physical
page reuse with stale cache state, combined with the fact that RISC-V plain
loads are not ordered and volatile only blocks the compiler. Switching to
a single persistent Identify buffer (one mmap at init, zeroed before each
use, never unmapped) produced 8/8 clean runs.
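The shape of the fix is roughly the following. This is a sketch only: the function names are hypothetical, and the real driver additionally needs the buffer's physical address (via mem_offset()) for PRP1, which is omitted here.

```c
#define _DEFAULT_SOURCE   /* exposes MAP_ANON on glibc */
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

#define IDENTIFY_BUF_SIZE 4096u

static void *identify_buf;   /* mapped once at init, never unmapped */

/* Map the Identify buffer once. Returns 0 on success, -1 on failure. */
int identify_buf_init(void)
{
    void *p = mmap(NULL, IDENTIFY_BUF_SIZE, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANON, -1, 0);
    if (p == MAP_FAILED)
        return -1;
    identify_buf = p;
    return 0;
}

/* Hand out the same page for every Identify call, zeroed first so that
 * stale data from the previous command can never be mistaken for a result. */
void *identify_buf_get(void)
{
    assert(identify_buf != NULL);
    memset(identify_buf, 0, IDENTIFY_BUF_SIZE);
    return identify_buf;
}
```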
Phase C: I/O queue and first Read
An I/O queue pair is created at qid=1 via the admin Create-IO-CQ and Create-IO-SQ commands. A 1-LBA Read against LBA 0 of the first active namespace is issued and the result hexdumped.
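The command dwords involved are small enough to sketch. Opcodes and bit positions follow the NVMe base specification; the helper names are hypothetical, and leaving interrupts disabled on the completion queue (IEN=0) is an assumption based on the driver polling for completions.

```c
#include <assert.h>
#include <stdint.h>

enum {
    NVME_ADMIN_CREATE_SQ = 0x01,
    NVME_ADMIN_CREATE_CQ = 0x05,
    NVME_CMD_READ        = 0x02
};

/* CDW10 for both Create commands: 0-based queue size in bits 31:16,
 * queue ID in bits 15:0. */
uint32_t nvme_create_q_cdw10(uint16_t qid, uint32_t entries)
{
    assert(qid != 0 && entries >= 2 && entries <= 65536);
    return ((entries - 1) << 16) | qid;
}

/* Create I/O CQ, CDW11: PC (bit 0) = physically contiguous,
 * IEN (bit 1) = 0 because this sketch assumes a polled queue. */
uint32_t nvme_create_cq_cdw11(void)
{
    return 1u;
}

/* Create I/O SQ, CDW11: paired CQ ID in bits 31:16, PC in bit 0. */
uint32_t nvme_create_sq_cdw11(uint16_t cqid)
{
    assert(cqid != 0);
    return ((uint32_t)cqid << 16) | 1u;
}

/* Read, CDW12: number of logical blocks, 0-based, in bits 15:0.
 * The starting LBA goes in CDW10 (low) and CDW11 (high). */
uint32_t nvme_read_cdw12(uint32_t nlb)
{
    assert(nlb >= 1 && nlb <= 65536);
    return nlb - 1;
}
```

The completion queue has to be created before the submission queue that refers to it, and a 1-LBA Read encodes its length as 0.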
With a recognizable pattern planted in nvme.img:
devb-nvme: Read nsid=1 LBA=0 (512 bytes) OK
0000: 51 52 56 2d 4f 53 20 4e 56 4d 65 20 74 65 73 74 |QRV-OS NVMe test|
0010: 20 70 61 74 74 65 72 6e 3a 20 4c 42 41 20 30 20 | pattern: LBA 0 |
A second ordering bug surfaced here. The polling path read cqe->status,
checked the phase bit, and then read cqe->cid and the PRP1 payload
buffer. On x86 this is fine because loads are ordered. On RISC-V they are
not — the CPU is free to reorder the payload read before the phase check.
The symptom was non-deterministic garbage in consecutive Identify buffers,
with each run corrupting differently. The fix: __atomic_thread_fence(__ATOMIC_ACQUIRE)
immediately after the phase check. Five clean runs afterwards.
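A sketch of the corrected polling step, using the GCC/Clang __atomic builtins. The struct layout matches the 16-byte NVMe completion entry; nvme_cqe_poll and its parameters are illustrative names, not the driver's.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* 16-byte completion queue entry. */
struct nvme_cqe {
    uint32_t result;
    uint32_t reserved;
    uint16_t sq_head;
    uint16_t sq_id;
    uint16_t cid;
    uint16_t status;   /* bit 0 = phase tag, bits 15:1 = status field */
};

/* Returns true if the entry belongs to the current phase. Only then are
 * *cid and *sf (status field, 0 = success) filled in. */
bool nvme_cqe_poll(struct nvme_cqe *cqe, unsigned phase,
                   uint16_t *cid, uint16_t *sf)
{
    assert(cqe && cid && sf && phase <= 1);

    uint16_t st = __atomic_load_n(&cqe->status, __ATOMIC_RELAXED);
    if ((st & 1u) != phase)
        return false;              /* controller has not posted this slot */

    /* Without this fence a weakly-ordered CPU may satisfy the loads below
     * (and the caller's reads of the data buffer) before the phase check. */
    __atomic_thread_fence(__ATOMIC_ACQUIRE);

    *cid = cqe->cid;
    *sf  = (uint16_t)(st >> 1);
    return true;
}
```

The caller should touch the PRP1 payload only after this function returns true, so that the fence also orders the payload reads.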
This is a class of bug that trips up essentially everyone porting code from
x86 to a weakly-ordered architecture. The code was correct for the hardware
it was written on. RISC-V doesn't make the same promises, and neither
volatile nor careful reading of generated assembly makes it obvious.
Phase D: Resource manager, /dev/nvme0n1
The driver is promoted from a probe tool to a resident daemon. After
controller bring-up, it registers /dev/nvme0n1 as a read-only
block-device resource manager and enters the dispatch loop.
The read path: io_read computes the LBA span covering the requested byte
range, issues nvme_cmd_read() into a per-driver 4 KiB bounce buffer, and
MsgReplys the requested slice. One 4 KiB page per call so PRP1 covers
the entire transfer.
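The span computation can be sketched as follows. The struct and function names are hypothetical, and clamping each call to whatever fits in the 4 KiB bounce buffer (returning a short read otherwise) is an assumption about how the driver handles larger requests.

```c
#include <assert.h>
#include <stdint.h>

#define BOUNCE_SIZE 4096u

struct lba_span {
    uint64_t first_lba;  /* first LBA to read */
    uint32_t nlba;       /* number of LBAs to read */
    uint32_t skip;       /* bytes to skip at the start of the bounce buffer */
    uint32_t nbytes;     /* bytes to reply with from this call */
};

/* Map a byte range onto whole LBAs that fit in one bounce buffer. */
struct lba_span lba_span_for(uint64_t offset, uint32_t len, uint32_t lba_size)
{
    assert(lba_size != 0 && lba_size <= BOUNCE_SIZE);
    assert(BOUNCE_SIZE % lba_size == 0);

    struct lba_span s;
    s.first_lba = offset / lba_size;
    s.skip      = (uint32_t)(offset % lba_size);

    uint32_t room = BOUNCE_SIZE - s.skip;      /* usable bytes after skip */
    s.nbytes = len < room ? len : room;
    s.nlba   = s.nbytes == 0
             ? 0
             : (s.skip + s.nbytes + lba_size - 1) / lba_size;
    return s;
}
```

For example, a 100-byte read at byte offset 700 with 512-byte LBAs touches only LBA 1, skipping the first 188 bytes of the buffer.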
One sharp edge caught during bring-up: dispatch_create must come before
any /dev/pci traffic. The first attempted resmgr_attach(/dev/nvme0n1)
after a successful pci_device_attach() failed with errno = -EBUSY.
Creating the dispatch — and calling the "set up internals"
resmgr_attach(path=NULL) — at the very top of main(), before any PCI
messages, makes the later path registration succeed 5/5. This is almost
certainly a channel/connection slot ordering issue inside libc's dispatch
and message plumbing. Documented here for the next driver to hit it.
Phase 3e: GPT partitions
After registering /dev/nvme0n1 for the whole namespace, the driver walks
the GPT via libgpt and registers one resource manager path per partition.
Each partition node gets its own nvme_region_t (an iofunc_attr_t
subclass) carrying start_lba, nlba, nsid, and size_bytes. Every
handler recovers the region from ocb->attr and enforces partition
boundaries — a partition node cannot read outside its own range.
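The bounds check amounts to clamping every request against the region and translating to an absolute LBA. A simplified sketch: struct region stands in for nvme_region_t without the iofunc_attr_t base, and region_clamp is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for nvme_region_t (no iofunc_attr_t base here). */
struct region {
    uint64_t start_lba;  /* first LBA of the partition on the namespace */
    uint64_t nlba;       /* partition length in LBAs */
    uint32_t lba_size;   /* bytes per LBA */
};

/* Returns how many bytes may be read at partition-relative byte offset
 * `off` (0 at or past the end). On a non-zero return, *dev_lba holds the
 * absolute LBA on the namespace that contains `off`. */
uint64_t region_clamp(const struct region *r, uint64_t off, uint64_t len,
                      uint64_t *dev_lba)
{
    assert(r != NULL && dev_lba != NULL && r->lba_size != 0);

    uint64_t size = r->nlba * r->lba_size;
    if (off >= size)
        return 0;                         /* EOF for this partition node */

    uint64_t left = size - off;
    *dev_lba = r->start_lba + off / r->lba_size;
    return len < left ? len : left;
}
```

Using p1 from the boot log (LBA 34..32801), a read starting 100 bytes before the end of the partition is cut to 100 bytes and lands on LBA 32801, never on p2's first block.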
If there is no GPT, a bad CRC, or a hybrid MBR, the driver falls back to
raw /dev/nvme0n1 only. This is non-fatal and produces a diagnostic.
libgpt
lib/libgpt is a pure GPT parser. The caller supplies a read-LBA callback;
the library does no I/O of its own. It validates the protective MBR, GPT
header signature and revision, header CRC32, and partition-array CRC32.
UTF-16LE partition names are unpacked to UTF-8. Plain MBR and hybrid MBR
are rejected outright: a hybrid MBR (an MBR carrying partitions other than
a single 0xEE protective entry) is explicitly not supported, and the library
says so clearly rather than guessing.
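The protective/hybrid/plain distinction can be sketched from the standard MBR layout (four 16-byte entries at offset 446, type byte at offset 4 of each entry, 0x55 0xAA signature at offset 510). The enum and function below are illustrative, and libgpt's exact rules may be stricter.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum mbr_kind { MBR_INVALID, MBR_PROTECTIVE, MBR_HYBRID, MBR_PLAIN };

/* Classify LBA 0 by looking only at the four partition type bytes. */
enum mbr_kind mbr_classify(const uint8_t *sector /* 512 bytes */)
{
    assert(sector != NULL);
    if (sector[510] != 0x55 || sector[511] != 0xAA)
        return MBR_INVALID;               /* no boot signature */

    unsigned ee = 0, other = 0;
    for (unsigned i = 0; i < 4; i++) {
        uint8_t type = sector[446 + 16 * i + 4];
        if (type == 0xEE)
            ee++;                         /* GPT protective entry */
        else if (type != 0x00)
            other++;                      /* any other in-use entry */
    }

    if (ee == 1 && other == 0)
        return MBR_PROTECTIVE;            /* the only accepted case */
    if (ee >= 1)
        return MBR_HYBRID;                /* 0xEE plus something else */
    return MBR_PLAIN;                     /* no 0xEE entry at all */
}
```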
A host-side self-test (make test in lib/libgpt) exercises the valid
path and four rejection paths: plain MBR, hybrid MBR, bad header CRC, bad
entry-array CRC.
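The bad-header-CRC path exercises a check along these lines. The CRC is the standard reflected CRC-32 (polynomial 0xEDB88320) computed over the header with its own CRC field zeroed; the function names are hypothetical, and the sketch covers only the signature, header size and header CRC.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Bitwise CRC-32 (IEEE 802.3, reflected), as used by GPT. */
uint32_t crc32_ieee(const void *data, size_t len)
{
    const uint8_t *p = data;
    uint32_t crc = 0xFFFFFFFFu;
    while (len--) {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

static uint32_t le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Check signature (offset 0), header size (offset 12) and header CRC32
 * (offset 16) of a GPT header held in a buffer of at least 512 bytes. */
bool gpt_header_ok(const uint8_t *hdr)
{
    assert(hdr != NULL);
    if (memcmp(hdr, "EFI PART", 8) != 0)
        return false;

    uint32_t hsize = le32(hdr + 12);
    if (hsize < 92 || hsize > 512)
        return false;

    uint8_t tmp[512];
    memcpy(tmp, hdr, hsize);
    memset(tmp + 16, 0, 4);               /* CRC field counts as zero */
    return crc32_ieee(tmp, hsize) == le32(hdr + 16);
}
```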
scripts/mkgpt.py is a pure-Python host tool that writes a valid GPT to an
image file — protective MBR, primary header, primary entry array (128 × 128
bytes), using the Linux filesystem data type GUID. The emu.sh -nvme path
uses it to lay out two 16 MiB test partitions on first run.
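The partition LBAs in the boot log fall out of this layout: LBA 0 holds the protective MBR, LBA 1 the header, and the 128 × 128-byte entry array occupies the next 32 sectors. A sketch of the arithmetic in C; the helper names are hypothetical, and packing the partitions back to back from the first usable LBA is inferred from the log, not from mkgpt.py itself.

```c
#include <assert.h>
#include <stdint.h>

enum { GPT_ENTRIES = 128, GPT_ENTRY_SIZE = 128 };

/* First LBA after the protective MBR (LBA 0), the primary header (LBA 1)
 * and the primary entry array. */
uint64_t gpt_first_usable_lba(uint32_t lba_size)
{
    assert(lba_size >= 512);
    assert((GPT_ENTRIES * GPT_ENTRY_SIZE) % lba_size == 0);
    return 2 + (uint64_t)(GPT_ENTRIES * GPT_ENTRY_SIZE) / lba_size;
}

struct lba_range { uint64_t first, last; };  /* inclusive */

/* Place a partition of size_bytes starting at first_lba. */
struct lba_range gpt_place(uint64_t first_lba, uint64_t size_bytes,
                           uint32_t lba_size)
{
    assert(lba_size != 0 && size_bytes != 0 && size_bytes % lba_size == 0);
    struct lba_range r = { first_lba,
                           first_lba + size_bytes / lba_size - 1 };
    return r;
}
```

With 512-byte LBAs the first usable LBA is 34, so two 16 MiB partitions packed back to back cover LBA 34..32801 and 32802..65569, matching the p1 and p2 lines above.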
PCI Discovery via Syspage
pci-server was using a hardcoded platforms[] table to know where the
ECAM region lives on supported boards. That is the wrong architecture for a
system that is supposed to be device-tree-driven.
The kernel's FDT parser now publishes the PCI host bridge to hwinfo as an
HWI_ITEM_BUS_PCI bus with location, IRQ, and PCI window tags (memory, IO,
prefetchable, all with correct flags). pci-server discovers the bridge via
hwinfo_find_bus(). The hardcoded table is gone. If the kernel didn't
publish a PCIe bus, pci-server fails with ENODEV rather than silently
defaulting to a QEMU virt address.
What Comes Next
/dev/nvme0n1, /dev/nvme0n1p1, and /dev/nvme0n1p2 exist, are readable, and have correct
sizes. The next step is mounting fs-qrv on a partition instead of on the
virtio block device — and then bringing this up on the Unmatched, which has
a real M.2 NVMe slot and a PCIe bus waiting.