Today marks a significant milestone for the GMS/359 project: the successful integration and verification of the GMS 2950 Cryptographic Processor — a modular exponentiation accelerator connected to our Multiplexor Channel.
The Achievement
GMS 2950 Crypto Test
====================
Test: 3^5 mod 7 = ?
Computing...
SUCCESS! Z = 0x05
That humble 0x05 represents something remarkable: our little FPGA mainframe just computed its first cryptographic operation. The number 243 (which is 3⁵) modulo 7 equals 5 — verified in hardware, orchestrated through authentic System/360-style Channel I/O.
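The same result can be worked by hand with square-and-multiply, the algorithm the hardware implements. Writing the exponent in binary, 5 = 101₂, means squaring the base twice and multiplying in the original base once:

```latex
\begin{aligned}
3^2 &= 9 \equiv 2 \pmod{7} \\
3^4 &= \left(3^2\right)^2 \equiv 2^2 = 4 \pmod{7} \\
3^5 &= 3^4 \cdot 3 \equiv 4 \cdot 3 = 12 \equiv 5 \pmod{7}
\end{aligned}
```

This agrees with the direct calculation 243 = 34 · 7 + 5.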
Why Modular Exponentiation?
The operation z = y^x mod m is the mathematical foundation of modern public-key cryptography:
- RSA encryption and decryption
- Diffie-Hellman key exchange
- Digital signatures
The original System/360 had no cryptographic hardware at all, and IBM didn't build cryptographic assist instructions (CPACF) into its mainframe processors until the z990 in 2003. Our 1965-era architecture recreation now has a capability that Big Iron took nearly four decades to acquire!
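Both RSA operations are exactly this one primitive. The toy round trip below is illustrative only, not GMS/359 code: Python's built-in three-argument pow stands in for the accelerator, the modexp wrapper name is hypothetical, and the textbook-sized primes are wildly insecure.

```python
# Toy RSA: encryption and decryption are each one modular exponentiation,
# z = y^x mod m. Parameters are far too small to be secure.

p, q = 61, 53
n = p * q                  # modulus m
phi = (p - 1) * (q - 1)    # Euler's totient of n
e = 17                     # public exponent, coprime to phi
d = pow(e, -1, phi)        # private exponent: modular inverse (Python 3.8+)

def modexp(y: int, x: int, m: int) -> int:
    """z = y^x mod m, the operation the GMS 2950 accelerates."""
    return pow(y, x, m)

message = 65
ciphertext = modexp(message, e, n)    # encrypt with the public key (e, n)
recovered = modexp(ciphertext, d, n)  # decrypt with the private key (d, n)
print(ciphertext, recovered)          # 2790 65
```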
Architecture
The GMS 2950 connects to the Multiplexor Channel as device 0x2A:
┌─────────────────────────────────────────────────────┐
│ 2870 MULTIPLEXOR CHANNEL │
├─────────────────────────────────────────────────────┤
│ 10h Video Controller ──── VGA Output │
│ 11h Keyboard Controller ──── PS/2 Input │
│ 12h UART ──── Serial Console │
│ 2Ah Crypto Processor ──── NEW! │
│ 2Bh SYSINFO ──── System Information │
│ 2Eh Console ──── Console Area │
└─────────────────────────────────────────────────────┘
The programming model follows our Channel I/O conventions:
WRITE 25 bytes to load operands and start computation:
Bytes 0-7: X (exponent) - 64-bit little-endian
Bytes 8-15: Y (base) - 64-bit little-endian
Bytes 16-23: M (modulus) - 64-bit little-endian
Byte 24: CONTROL - bit 0 = START
READ 8 bytes to retrieve the result:
Bytes 0-7: Z (result) - 64-bit little-endian
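The same layout can be expressed with Python's struct module, which is handy for generating test vectors on a host PC. The helper names (pack_operands, unpack_result) are hypothetical; only the byte layout comes from the programming model. For X = 5, Y = 3, M = 7 the packed payload should match the crypto_operands block in the test program.

```python
import struct

# X, Y, M as 64-bit little-endian words, then one CONTROL byte (bit 0 = START).
WRITE_FORMAT = "<QQQB"   # 8 + 8 + 8 + 1 = 25 bytes, no padding with "<"
READ_FORMAT = "<Q"       # 8-byte result Z
CONTROL_START = 0x01

def pack_operands(x: int, y: int, m: int, start: bool = True) -> bytes:
    """Build the 25-byte WRITE payload: exponent, base, modulus, control."""
    control = CONTROL_START if start else 0
    return struct.pack(WRITE_FORMAT, x, y, m, control)

def unpack_result(data: bytes) -> int:
    """Decode the 8-byte READ payload into the integer result Z."""
    (z,) = struct.unpack(READ_FORMAT, data)
    return z

payload = pack_operands(x=5, y=3, m=7)
print(len(payload), payload.hex(" "))
```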
The beauty of Channel I/O shines here: the CPU issues SIO 02Ah, and the channel handles all the byte-by-byte transfers autonomously. The CPU can poll with TIO or (eventually) receive an interrupt when computation completes.
The Test Program
; Load operands and start crypto
LFI R1, crypto_write_ccw
ST [tMSVA.CAW], R1
SIO 02Ah ; Start WRITE to crypto
poll_write:
TIO 02Ah
BB poll_write ; Wait for completion
; ... computation happens in hardware ...
; Read result
LFI R1, crypto_read_ccw
ST [tMSVA.CAW], R1
SIO 02Ah ; Start READ from crypto
poll_read:
TIO 02Ah
BB poll_read ; Wait for the READ to complete
; Check result
LFI R2, result_buffer
LB R3, [R2] ; Load byte 0 of Z (its least significant byte, since Z is little-endian)
LFI R4, 5 ; Expected value
CMP R3, R4
BNE test_failed ; Branch if not equal
; SUCCESS!
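To sanity-check the control flow off-board, here is a Python mirror of the test program running against a mock device. Everything about the mock is hypothetical: MockCrypto2950, its latency_polls parameter, and the rule that TIO reports busy for a few polls after a START are assumptions standing in for the real RTL and channel, and pow() plays the role of the hardware.

```python
import struct

class MockCrypto2950:
    """Hypothetical behavioural model of device 2Ah (not the real RTL).

    WRITE loads X, Y, M and, if CONTROL bit 0 is set, starts a computation.
    While computing, tio() reports busy for a fixed number of polls, an
    assumption standing in for the hardware latency.
    """

    def __init__(self, latency_polls: int = 3):
        self.latency_polls = latency_polls
        self.busy = 0
        self.z = 0

    def write(self, data: bytes) -> None:
        x, y, m, control = struct.unpack("<QQQB", data)
        if control & 0x01:             # START bit
            self.z = pow(y, x, m)      # stands in for the hardware result
            self.busy = self.latency_polls

    def tio(self) -> bool:
        """Return True while the device is still busy."""
        if self.busy:
            self.busy -= 1
            return True
        return False

    def read(self) -> bytes:
        return struct.pack("<Q", self.z)

def run_test(dev: MockCrypto2950) -> bool:
    # SIO 02Ah with the WRITE CCW, then spin like poll_write.
    dev.write(struct.pack("<QQQB", 5, 3, 7, 0x01))
    while dev.tio():
        pass
    # SIO 02Ah with the READ CCW, then spin like poll_read.
    result = dev.read()
    while dev.tio():
        pass
    # LB R3, [R2]: byte 0 is the least significant byte of Z.
    return result[0] == 5

print("SUCCESS!" if run_test(MockCrypto2950()) else "FAILED")
```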
The CCW (Channel Command Word) setup is straightforward:
crypto_write_ccw:
DB 01h ; WRITE command
DB 00h ; No flags
DW 25 ; 25 bytes
DD crypto_operands ; Source address
crypto_operands:
DB 05h, 00h, 00h, 00h, 00h, 00h, 00h, 00h ; X = 5
DB 03h, 00h, 00h, 00h, 00h, 00h, 00h, 00h ; Y = 3
DB 07h, 00h, 00h, 00h, 00h, 00h, 00h, 00h ; M = 7
DB 01h ; START
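For readers decoding these CCWs in a memory dump, a small Python encoder makes the byte layout explicit. It rests on assumptions not stated in the listing: DB is 1 byte, DW a 16-bit word, DD a 32-bit address, all little-endian with no padding. encode_ccw and decode_ccw are hypothetical helpers, and 0x1000 is a made-up stand-in for the address of crypto_operands. This follows the GMS/359 listing, not the classic System/360 CCW format.

```python
import struct

# Assumed layout: command (DB), flags (DB), byte count (DW), data address (DD).
CCW_FORMAT = "<BBHI"
CMD_WRITE = 0x01

def encode_ccw(command: int, flags: int, count: int, address: int) -> bytes:
    """Pack one CCW into its 8-byte in-memory form."""
    return struct.pack(CCW_FORMAT, command, flags, count, address)

def decode_ccw(raw: bytes) -> dict:
    """Split an 8-byte CCW back into its fields."""
    command, flags, count, address = struct.unpack(CCW_FORMAT, raw)
    return {"command": command, "flags": flags, "count": count, "address": address}

# 0x1000 is a hypothetical address standing in for crypto_operands.
ccw = encode_ccw(CMD_WRITE, 0x00, 25, 0x1000)
print(ccw.hex(" "), decode_ccw(ccw))
```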
Implementation Details
The crypto core implements the classic square-and-multiply algorithm for modular exponentiation, built from three layers:
- modm_adder — computes (x + y) mod m in 3 pipeline stages
- modm_multiplier — computes (x × y) mod m using repeated addition
- modm_exponentiation — computes y^x mod m using square-and-multiply
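For simulation it helps to have a bit-accurate golden model. This Python sketch mirrors the three layers under stated assumptions: the "repeated addition" multiplier is modeled as MSB-first shift-and-add (double, then conditionally add), exponentiation is left-to-right square-and-multiply, and the adder's 3-stage pipelining is ignored. The function names echo the VHDL entities but are not taken from the source.

```python
def modm_add(a: int, b: int, m: int) -> int:
    """(a + b) mod m for reduced inputs (a, b < m): at most one subtraction."""
    s = a + b
    return s - m if s >= m else s

def modm_mul(a: int, b: int, m: int, bits: int = 64) -> int:
    """(a * b) mod m by MSB-first shift-and-add, built only from modm_add.

    Requires a < m. Each multiplier bit costs one doubling and, if the bit
    is set, one more addition.
    """
    acc = 0
    for i in reversed(range(bits)):
        acc = modm_add(acc, acc, m)        # acc = 2 * acc mod m
        if (b >> i) & 1:
            acc = modm_add(acc, a, m)      # acc = acc + a mod m
    return acc

def modm_exp(y: int, x: int, m: int, bits: int = 64) -> int:
    """y^x mod m by left-to-right square-and-multiply, built from modm_mul."""
    z = 1 % m                              # handles m == 1
    y %= m
    for i in reversed(range(bits)):
        z = modm_mul(z, z, m, bits)        # square
        if (x >> i) & 1:
            z = modm_mul(z, y, m, bits)    # multiply by the base
    return z

print(modm_exp(3, 5, 7))
```

Comparing modm_exp against Python's pow(y, x, m) on random operands is a cheap way to produce expected values for a VHDL testbench.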
For 64-bit operands, a full exponentiation takes approximately 30,000 clock cycles, about 2.4 ms at our 12.5 MHz system clock. That is imperceptible to humans, and the CPU is free to do other work (or service other channel programs) while the computation runs.
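The design is parameterized: changing two generic constants scales it to 128-bit or 256-bit operands. The tradeoff is FPGA resources, which grow with the datapath width, and computation time, which grows roughly quadratically: doubling the width doubles the number of modular multiplications in square-and-multiply and doubles the number of addition steps inside each one.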
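A back-of-the-envelope count shows where the quadratic time growth comes from. This Python sketch assumes a shift-and-add multiplier (at most two modular additions per multiplier bit) and square-and-multiply (at most two multiplications per exponent bit). It counts modular additions, not clock cycles, since the real cycle count also depends on how the pipelined adder is scheduled.

```python
def worst_case_additions(bits: int) -> int:
    """Upper bound on modular additions for one bits-wide exponentiation.

    Assumes up to 2 additions per multiplier bit (double, then add) and
    up to 2 multiplications per exponent bit (square, then multiply).
    """
    adds_per_multiply = 2 * bits
    multiplies = 2 * bits
    return multiplies * adds_per_multiply

# Each doubling of the operand width quadruples the count.
for bits in (64, 128, 256):
    print(bits, worst_case_additions(bits))
```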
Modular Design
The channel controller now supports compile-time configuration:
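entity gms_2870_multiplexor_channel is
  generic (
    ENABLE_CRYPTO : boolean := false  -- true: include the GMS 2950 at device 2Ah
  );
  -- port list omitted for brevity
end entity gms_2870_multiplexor_channel;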
When ENABLE_CRYPTO is false, all crypto-related logic is optimized away by the synthesizer. Our Makefile selects the appropriate top-level:
ifeq ($(WITH_CRYPTO),1)
TOP_FILE = rtl/gms359_top_crypto.vhd
else
TOP_FILE = rtl/gms359_top.vhd
endif
Building with crypto is now a simple make WITH_CRYPTO=1; a plain make builds the baseline design without it.
CPU Enhancements
Testing the crypto processor also drove expansion of the GMS 2050 instruction set. New instructions added today:
| Opcode | Mnemonic | Description |
|---|---|---|
| 0x19 | CR | Compare Register |
| 0x1B | SR | Subtract Register |
| 0x14 | NR | AND Register |
| 0x16 | OR | OR Register |
| 0x17 | XR | XOR Register |
| 0x88 | SHR | Shift Right Logical |
| 0x89 | SHL | Shift Left Logical |
| 0x8A | SAR | Shift Right Arithmetic |
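The difference between the three shifts is easiest to see on a value with the sign bit set. This Python sketch assumes 32-bit registers, as on System/360, and treats SHR/SHL/SAR like S/360's SRL/SLL/SRA (whose opcodes, 0x88 to 0x8A, match the table); condition codes are not modeled.

```python
MASK32 = 0xFFFF_FFFF

def shr(value: int, amount: int) -> int:
    """SHR: logical right shift; zeros enter from the left."""
    return (value & MASK32) >> amount

def shl(value: int, amount: int) -> int:
    """SHL: logical left shift; bits pushed past bit 31 are lost."""
    return ((value & MASK32) << amount) & MASK32

def sar(value: int, amount: int) -> int:
    """SAR: arithmetic right shift; copies of the sign bit enter from the left."""
    v = value & MASK32
    signed = v - (1 << 32) if v & 0x8000_0000 else v
    return (signed >> amount) & MASK32

x = 0xF000_0010
print(hex(shr(x, 4)), hex(shl(x, 4)), hex(sar(x, 4)))  # 0xf000001 0x100 0xff000001
```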
The assembler now supports "smart" mnemonics that automatically select RR (register-register) or RX (register-memory) format based on operands. LD R4, R3 assembles to LR, while LD R4, [R5+100] assembles to L.
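The selection rule amounts to inspecting the second operand. Here is a hypothetical Python sketch, assuming the operand syntax in the examples (Rn for registers, [Rn] or [Rn+disp] for memory); the real assembler's parser and tables are not shown in the post, and the SMART table here contains only LD.

```python
import re
from typing import Tuple

REG = re.compile(r"^R(\d{1,2})$", re.IGNORECASE)
MEM = re.compile(r"^\[\s*R(\d{1,2})\s*(?:\+\s*(\d+))?\s*\]$", re.IGNORECASE)

# Hypothetical table: smart mnemonic -> (RR form, RX form)
SMART = {"LD": ("LR", "L")}

def select_form(mnemonic: str, dest: str, src: str) -> Tuple[str, str]:
    """Return (real mnemonic, format) for a smart mnemonic and its operands."""
    rr_form, rx_form = SMART[mnemonic.upper()]
    if not REG.match(dest.strip()):
        raise ValueError(f"first operand must be a register: {dest!r}")
    src = src.strip()
    if REG.match(src):
        return rr_form, "RR"
    if MEM.match(src):
        return rx_form, "RX"
    raise ValueError(f"unsupported operand: {src!r}")

print(select_form("LD", "R4", "R3"))        # ('LR', 'RR')
print(select_form("LD", "R4", "[R5+100]"))  # ('L', 'RX')
```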
Resource Utilization
Adding the crypto processor and new CPU instructions had a noticeable impact:
| Metric | Before | After |
|---|---|---|
| Wires | ~100K | 251K |
| Fmax | ~33 MHz | 21.75 MHz |
| Build time | 1× | 4× |
The maximum clock frequency dropped significantly, primarily due to the shift instructions (barrel shifter logic) and the 64-bit crypto datapath. However, we still have 73% timing margin over our 12.56 MHz target — plenty of headroom.
What's Next
The GMS 2950 opens up interesting possibilities:
- Larger key sizes — scale to 256-bit operands and beyond (real-world RSA uses moduli of 2048 bits or more)
- RSA implementation — full encrypt/decrypt in software using the accelerator
- Performance measurement — compare against pure software implementation
- Integration with the Pico 2 — the planned GMS 2350 RISC-V accelerator could work alongside the crypto unit
Closing Thoughts
There's something deeply satisfying about watching a recreation of 1960s mainframe architecture perform modern cryptographic operations. The GateMate A1 FPGA, drawing a few hundred milliwatts from a USB port, now does what would have been unimaginable to the engineers who designed the original System/360.
The Channel I/O model proves its elegance once again: talking to a complex coprocessor required no changes to the CPU's I/O path. The channel handles all communication; the CPU just says "start I/O to device 2A" and waits for completion. This is the architecture that ran the world's banking systems, airline reservations, and scientific computing for decades, and it's still teaching us lessons about clean system design.
3⁵ mod 7 = 5
A small calculation. A big step for GMS/359.
The GMS/359 project recreates IBM System/360 architecture on modern FPGA hardware. Source code will be available at gitlab.com/gatemate/s359.