Previous post: Exploring x86-64 assembly through a Brainfuck JIT compiler

After the previous post, I added support for the ARM64 architecture (also called AArch64) in my Brainfuck JIT compiler. This post goes through the code generation for ARM64 and its differences from x86-64. The Zig compiler supports cross-compilation out of the box, so I was able to easily cross-compile the project to ARM on my x86-64 machine and test the JIT compiler with qemu.

Unlike for x86-64, I had no prior knowledge of ARM assembly when I started implementing the code generator. I used Godbolt to see how various C constructs can be compiled to ARM, and I checked the encoding of these instructions using an online disassembler. After some initial exploration, I found Arm’s documentation on the instruction set pretty good for writing the code generator.

ARM64 primer

On ARM64 CPUs we have 31 general purpose registers that are 64 bits long, named x0x30. In the instructions the registers are simply encoded with their 5-bit number. This encoding allows for a 32th register, which is either the stack pointer sp or the zero register xzr, depending on the instruction. The zero register acts as a register hardwired to zero, allowing us to discard the result of an operation or to use a constant 0 in an instruction that operates on registers.

We can use their lower half as 32-bit registers with the names w0w30 and wzr. Unlike x86, the registers cannot act as 16-bit or 8-bit registers.

The AArch64 ABI defines the calling convention for the architecture. Arguments are passed in x0-x8, they are also used for returning values. Other than that, they can be used as caller-saved scratch registers. The x9-x15 registers are also caller-saved, while x19-x28 are callee-saved.

ARM instructions can have a different destination register from their sources: on x86 we can use add rax, rbx to increment rax with rbx, while on ARM we can do add x0, x1, x2 to store x1 + x2 in x0.

We’ll see that ARM has separate instructions for loading and storing registers from/in memory and doing arithmetics on registers. To increment a value in memory, we need separate load, add and store instructions, unlike x86’s add [rbp], 3. A CPU architecture like this is called a load-store architecture.

Code generation

With all this knowledge, we can think about how we’ll use our registers in the generated code. Since x19 and x20 are callee-saved, we’ll have them hold the tape pointer and the environment pointer – the latter has the address of the functions written in Zig for printing and reading. The calling convention dictates that x0 and x1 are the first two arguments for calling Zig functions or being called from Zig, and x0 is also the return value from those. Finally, we’ll use x9 as a scratch register for arithmetics.

ARM encodes all instructions uniformly as 32-bit numbers. The instruction encoding is simpler compared to x86, so instead of using magic byte literals in the code generator, we’ll write Zig functions that encapsulate the encoding.

As in the previous post, we’ll map each Brainfuck instruction to CPU instructions. Our internal representation uses add with an amount for sequences of + and - instructions, move for sequences of < and >, jump_if_zero and jump_back_if_non_zero for [ and ], and write and read for . and ,.

Compiling the simplest program

Let’s use an incremental approach to first compile programs with only a few instruction types, and work our way up from there. The simplest possible Brainfuck program is the empty program. Our expectation is that we can compile and execute the empty program without crashing due to an invalid instruction or a segmentation fault. To achieve that, we still need to emit some ARM instructions to transfer control back to the function that called the compiled program.

On x86 our generated code would be jumped to – since it’s called from Zig through a function pointer using the C calling convention – with the CALL instruction, which pushes the return address on the stack.

ARM has a BLR (branch with link) instruction instead, which puts the return address in the link register x30. This way the called function can decide what to do with the return address – push it on the stack, move it to another register, or maybe even leave it in x30 if it doesn’t call other functions. BLR can only put the return address in x30, but the corresponding RET can get it from any register.

We’ll emit a single RET instruction to return immediately without doing anything, and we’ll also introduce a few registers and encoding helpers along the way:

pub fn gen(ops: []const bf.Op, builder: *jit.Builder) !void {
    for (ops) |op| {
        switch (op) {
            .add => {},
            .move => {},
            .jump_if_zero => {},
            .jump_back_if_non_zero => {},
            .write => {},
            .read => {},
        }
    }

    try genEpilogue(builder);
}

pub fn genEpilogue(builder: *jit.Builder) !void {
    try builder.emit32(
        ret(Regs.link_reg),
    );
}

const Reg = u5;

const Regs = struct {
    pub const arg0: Reg = 0;
    pub const arg1: Reg = 1;
    pub const scratch: Reg = 9;
    pub const tape_ptr: Reg = 19;
    pub const env_ptr: Reg = 20;
    pub const link_reg: Reg = 30;
    pub const zero: Reg = 31;
    pub const stack_ptr: Reg = 31;
};

fn ret(reg: Reg) u32 {
    return (0b1101011001_011111000000 << 10) | reg_reg(reg, 0);
}

fn reg_reg(r1: Reg, r2: Reg) u32 {
    return (@as(u32, r1) << 5) | @as(u32, r2);
}

Compiling writes

Now that we can compile and execute an empty program, let’s raise the bar a bit. The next simplest program that we can observe the behavior of is printing a few zero bytes to the standard output: ...

To do the actual printing, we need to load the byte pointed by the tape pointer into the x0 register, where functions expect their first parameter, and call the envWrite Zig function with that value. We can obtain the address of that function from the first entry in the environment struct, to which we have a pointer in the x20 register.

Loading a byte is done with LDRB, which loads a byte from memory and zero-extends it to a 32-bit register, so the upper 3 bytes will be zero. For loading a pointer – a 64-bit number – we have the 64-bit variant of LDR. Finally, we’ll use the previously mentioned BLR to call the function:

ldrb w0, [x19]  ; w0 = zero_extend(memory[x19])
ldr x9, [x20]   ; x9 = memory[x20]
blr x9          ; x30 = address of next instruction
                ; jump to x9

This is great, but we haven’t set up the x19 and x20 registers to actually hold the tape pointer and the environment pointer. Since we get those in function arguments, we can copy them from x0 and x1. ARM’s MOV instruction for transferring values between registers is just an alias to ORR, which performs a bitwise or: we move registers by or’ing the source with the zero register into the destination.

The x19 and x20 registers are callee-saved, so we need to restore their original value before we return to the code that called us. We also need to restore the link register x30, since we overwrite it when calling the envWrite function. To save these values, we’ll push them to the stack.

We’ll use STP to do that. It stores a pair of registers in memory, and – like the store instructions in general – it has a pre-index mode that adds an offset to the base register before storing. Using it with sp decreases the stack pointer first and stores the registers to its new value, which is exactly what we need for pushing them on the stack. Since we want to push three registers, we’ll push two register pairs with STP and one of the four registers will be xzr.

When we’re done, we’ll restore the registers with corresponding LDP instructions, using the post-index mode so it increments the stack pointer after loading.

; before the Brainfuck code
stp x30, xzr, [sp, -16]!  ; sp = sp - 16
                          ; memory[sp] = x30
                          ; memory[sp + 8] = 0
stp x19, x20, [sp, -16]!  ; sp = sp - 16
                          ; memory[sp] = x19
                          ; memory[sp + 8] = x20
orr x19, xzr, x0          ; x19 = 0 | x0
orr x20, xzr, x1          ; x20 = 0 | x1

; after the Brainfuck code
ldp x19, x20, [sp], 16    ; x19 = memory[sp]
                          ; x20 = memory[sp + 8]
                          ; sp = sp + 16
ldp x30, xzr, [sp], 16    ; x30 = memory[sp]
                          ; sp = sp + 16
ret                       ; jump to x30

One thing to note about the encoding of loading and storing 8-byte values is that the offset from the base register is scaled: it is assumed that the offset is aligned to 8 bytes, so the lower 3 bits are not stored in the instruction.

pub fn gen(ops: []const bf.Op, builder: *jit.Builder) !void {
    try genPrologue(builder);

    for (ops) |op| {
        switch (op) {
            // ...
            .write => {
                try builder.emit32s(&[_]u32{
                    // ldrb w0, [x19]
                    load_reg32_byte(Regs.arg0, Regs.tape_ptr, 0),
                    // ldr x9, [x20]
                    load_reg64(Regs.scratch, Regs.env_ptr, 0),
                    // blr x9
                    branch_link_reg(Regs.scratch),
                });
            },
            // ...
        }
    }

    try genEpilogue(builder);
}

pub fn genPrologue(builder: *jit.Builder) !void {
    try builder.emit32s(&[_]u32{
        // stp x30, xzr, [sp, -16]!
        store_pair64_pre_index(Regs.link_reg, Regs.zero, Regs.stack_ptr, -16),
        // stp x19, x20, [sp, -16]!
        store_pair64_pre_index(Regs.tape_ptr, Regs.env_ptr, Regs.stack_ptr, -16),
        // mov x19, x0
        or64(Regs.tape_ptr, Regs.zero, Regs.arg0),
        // mov x20, x1
        or64(Regs.env_ptr, Regs.zero, Regs.arg1),
    });
}

pub fn genEpilogue(builder: *jit.Builder) !void {
    try builder.emit32s(&[_]u32{
        // ldp x19, x20, [sp], 16
        load_pair64_post_index(Regs.tape_ptr, Regs.env_ptr, Regs.stack_ptr, 16),
        // ldp x30, xzr, [sp], 16
        load_pair64_post_index(Regs.link_reg, Regs.zero, Regs.stack_ptr, 16),
        // ret
        ret(Regs.link_reg),
    });
}

fn or64(dest: Reg, source1: Reg, source2: Reg) u32 {
    return (0b10101010_000 << 21) | reg_imm6_reg_reg(source2, 0, source1, dest);
}

fn load_reg32_byte(dest: Reg, base: Reg, offset: u12) u32 {
    return (0b0011100101 << 22) | imm12_reg_reg(offset, base, dest);
}

fn load_reg64(dest: Reg, base: Reg, offset: u15) u32 {
    const offset12: u12 = @intCast(@divExact(offset, 8));
    return (0b1111100101 << 22) | imm12_reg_reg(offset12, base, dest);
}

fn load_pair64_post_index(dest1: Reg, dest2: Reg, base: Reg, offset: i32) u32 {
    const offset7: i7 = @intCast(@divExact(offset, 8));
    return (0b1010100011 << 22) | imm7_reg_reg_reg(@bitCast(offset7), dest2, base, dest1);
}

fn store_pair64_pre_index(dest1: Reg, dest2: Reg, base: Reg, offset: i32) u32 {
    const offset7: i7 = @intCast(@divExact(offset, 8));
    return (0b1010100110 << 22) | imm7_reg_reg_reg(@bitCast(offset7), dest2, base, dest1);
}

fn branch_link_reg(reg: Reg) u32 {
    return (0b1101011000_111111000000 << 10) | reg_reg(reg, 0);
}

fn imm7_reg_reg_reg(imm: u7, r1: Reg, r2: Reg, r3: Reg) u32 {
    return (@as(u32, imm) << 15) | (@as(u32, r1) << 10) | (@as(u32, r2) << 5) | @as(u32, r3);
}

fn reg_imm6_reg_reg(r1: Reg, imm: u6, r2: Reg, r3: Reg) u32 {
    return (@as(u32, r1) << 16) | (@as(u32, imm) << 10) | (@as(u32, r2) << 5) | @as(u32, r3);
}

fn imm12_reg_reg(imm: u12, r1: Reg, r2: Reg) u32 {
    return (@as(u32, imm) << 10) | (@as(u32, r1) << 5) | @as(u32, r2);
}

Compiling reads

Next we can add generating code for reading a byte, allowing us to compile ,., which reads a byte and prints it back. It’s similar to writing, but the address of the envRead function is the second entry in the environment struct, so we need to load it with an offset.

After envRead returns, the result is in w0. We can store its lowest byte at a memory address with the STRB instruction.

ldr x9, [x20, 8]  ; x9 = memory[x20 + 8]
blr x9            ; x30 = address of next instruction
                  ; jump to x9
strb w0, [x19]    ; memory[x19] = lowest_byte(w0)
// ...
.read => {
    try builder.emit32s(&[_]u32{
        // ldr x9, [x20, 8]
        load_reg64(Regs.scratch, Regs.env_ptr, 8),
        // blr x9
        branch_link_reg(Regs.scratch),
        // strb w0, [x19]
        store_reg32_byte(Regs.arg0, Regs.tape_ptr, 0),
    });
},
// ...

fn store_reg32_byte(source: Reg, base: Reg, offset: u12) u32 {
    return (0b0011100100 << 22) | imm12_reg_reg(offset, base, source);
}

Compiling addition

With compiling addition we’ll cover the + and - Brainfuck instructions too, and we’ll be able to compile .+.+., which prints the bytes 0, 1 and 2.

As previously, we’ll load the byte from the current cell with LDRB. After that, we’ll perform the 32-bit addition to an immediate operand with ADD. This might set the 9th bit of the register if the unsigned result exceeds 255, but we’ll ignore that by writing the lowest byte back to memory using STRB.

ldrb w9, [x19]      ; w9 = zero_extend(memory[x19])
add w9, w9, amount  ; w9 = w9 + amount
strb w9, [x19]      ; memory[x19] = lowest_byte(w9)
// ...
.add => |amount| {
    const amount12: i12 = @intCast(amount);
    try builder.emit32s(&[_]u32{
        // ldrb w9, [x19]
        load_reg32_byte(Regs.scratch, Regs.tape_ptr, 0),
        // add w9, w9, amount
        add32_immediate(Regs.scratch, Regs.scratch, @bitCast(amount12)),
        // strb w9, [x19]
        store_reg32_byte(Regs.scratch, Regs.tape_ptr, 0),
    });
},
// ...

fn add32_immediate(dest: Reg, source: Reg, imm: u12) u32 {
    return (0b0001000100 << 22) | imm12_reg_reg(imm, source, dest);
}

Compiling moves

Next we’ll add code generation for moves, which will let us compile ,>,.<., which reads two bytes and prints them in reverse order.

We’ll use the 64-bit ADD instruction to change the tape pointer. The 12-bit immediate operand is not sign-extended to 64 bits, so adding an immediate with the highest bit set will not result in a 64-bit subtraction. This means we have to use the SUB instruction if the move amount is negative.

Similarly to the x86 code generator, we’ll assume that the move amount fits in 12 bits and let @intCast result in a runtime error otherwise.

// ...
.move => |amount| {
    try builder.emit32(
        if (amount >= 0)
            // add x19, x19, amount
            add64_immediate(Regs.tape_ptr, Regs.tape_ptr, @intCast(amount))
        else
            // sub x19, x19, -amount
            sub64_immediate(Regs.tape_ptr, Regs.tape_ptr, @intCast(-amount)),
    );
},
// ...

fn add64_immediate(dest: Reg, source: Reg, imm: u12) u32 {
    return (0b1001000100 << 22) | imm12_reg_reg(imm, source, dest);
}

fn sub64_immediate(dest: Reg, source: Reg, imm: u12) u32 {
    return (0b1101000100 << 22) | imm12_reg_reg(imm, source, dest);
}

Compiling jumps

Finally we’ll implement the last two instructions, jumping forward if the current cell is zero with [ and jumping back to the matching [ with ] if the current cell is non-zero.

First we’ll load the byte the tape pointer currently points to and check if the read byte is zero. We’ll use ANDS to do this, which does a bitwise and on two registers and sets the flags based on the result. We and the value with itself, so the zero flag tells us if it is zero. Since we don’t care about the result, just the flags, the destination register is wzr.

After that, we can use the B.cond conditional branching instruction to jump if the zero flag is set or not set:

; jump forward
ldrb w9, [x19]    ; w9 = zero_extend(memory[x19])
ands wzr, w9, w9  ; set flags based on w9 & w9
b.eq offset       ; add signed offset to instruction pointer if zero flag is set

; jump back
ldrb w9, [x19]    ; w9 = zero_extend(memory[x19])
ands wzr, w9, w9  ; set flags based on w9 & w9
b.ne offset       ; add signed offset to instruction pointer if zero flag is not set

Like in the x86 code generator, on a [ we emit an undefined instruction as a placeholder and push the current position on a stack. On the matching ] we pop the address, calculate the relative offset and fill in the jump forward.

Since all ARM instructions are four bytes long, all valid jump destinations are divisible by four, which means their lowest two bits are always zero. The instruction encoding uses this fact and doesn’t store the lowest two bits of the offset.

pub fn gen(ops: []const bf.Op, builder: *jit.Builder) !void {
    try genPrologue(builder);

    var jump_offsets = std.ArrayList(i32).init(std.heap.page_allocator);
    defer jump_offsets.deinit();

    for (ops) |op| {
        // ...
        .jump_if_zero => {
            try builder.emit32s(&[_]u32{
                // ldrb w9, [x19]
                load_reg32_byte(Regs.scratch, Regs.tape_ptr, 0),
                // ands wzr, w9, w9
                and32_flags(Regs.zero, Regs.scratch, Regs.scratch),
            });

            try jump_offsets.append(@intCast(builder.len()));
            // invalid instruction, to be filled by the matching jump back
            try builder.emit32(0x0000_dead);
        },

        .jump_back_if_non_zero => {
            const pair_offset = jump_offsets.pop();

            try builder.emit32s(&[_]u32{
                // ldrb w9, [x19]
                load_reg32_byte(Regs.scratch, Regs.tape_ptr, 0),
                // ands wzr, w9, w9
                and32_flags(Regs.zero, Regs.scratch, Regs.scratch),
            });

            const relative_offset: i19 =
                @intCast((pair_offset + 4) - (@as(i32, @intCast(builder.len()))));
            try builder.emit32(
                // b.ne dest
                branch(Cond.not_equal, relative_offset),
            );

            builder.fill32(
                // fill the matching jump:
                // b.eq $
                @intCast(pair_offset),
                branch(Cond.equal, -relative_offset),
            );
        },
        // ...
    }
}

const Cond = enum(u4) {
    equal = 0,
    not_equal = 1,
};

fn and32_flags(dest: Reg, source1: Reg, source2: Reg) u32 {
    return (0b01101010_000 << 21) | reg_imm6_reg_reg(source2, 0, source1, dest);
}

fn branch(cond: Cond, offset: i32) u32 {
    const uoffset: u19 = @bitCast(@as(i19, @intCast(@divExact(offset, 4))));
    return (0b01010100 << 24) | (@as(u32, uoffset) << 5) | @intFromEnum(cond);
}