Recently I tried out Zig by writing an interpreter and JIT compiler for the Brainfuck programming language. This post focuses on the compiler and producing machine code for the x86-64 CPU architecture. Our program reads a Brainfuck source file, compiles it to CPU instructions stored in memory and executes the generated code. The compiler targets Linux and MacOS, but it should be easy to change to support Windows as well.

Brainfuck primer

The Brainfuck programming language has only a few instructions and a very simple execution model. This makes reading and writing Brainfuck code a challenge, but it also makes the language a popular target for simple interpreters and compilers. There are also many available Brainfuck programs for testing and benchmarking, including a Mandelbrot fractal renderer and an integer factorizer.

Brainfuck programs operate on a tape, an array of byte-sized cells. A tape pointer or data pointer points to the currently targeted cell, on which the executed instruction operates. We’ll allocate 4 MiB for the tape and start with the tape pointer pointing to the middle, so we have plenty of space in both directions.

3 2 1 0 2 5 5 2 5 4 2 5 3 2 5 2
  • The + and - instructions increase and decrease the cell by one.
  • < and > move the pointer to the left or right.
  • [ jumps to the matching ] if the pointed cell is zero.
  • ] jumps back to the matching [ if the pointed cell is non-zero.
  • . prints the current cell and , reads a byte into it.

The last two instructions read and write raw bytes – if we want to print a character, the cell has to contain the ASCII value of that character. Any other character in the program source is considered a comment and ignored.

This program prints the character $:

++++++[>++++++<-]>.

It sets the first cell to 6, then adds 6 to the second cell until the first cell reaches zero. In other words, we add 6 to the second cell 6 times, so we print 36, the ASCII code of $.

Similarly, this program prints all characters from $ upwards, until the cell wraps over to zero:

++++++[>++++++<-]>[.+]

Representing Brainfuck programs

Since the language is so simple, a list of Brainfuck instructions is a representation that is already easy to interpret or compile. We’re going to make one change though, to make machine code generation simpler and to pave the way for optimizations.

The + and - instructions often occur in batches, to increase or decrease a cell by a number greater than one. Similarly, there are often consecutive < and > instructions moving the pointer by more than just one cell. We’ll store the increase and move amount along with these instructions. If this amount is signed, we don’t need to differentiate between + and -, and < and >.

For the increase instruction we’ll use a signed byte, since adding 255 to a byte is the same as subtracting 1.

To put all this in a Zig type definition:

pub const Op = union(enum) {
    add: i8,
    move: i32,
    jump_if_zero,
    jump_back_if_non_zero,
    write,
    read,
};

Executing compiled code

It’s fairly easy to parse an input file to the representation above and compress consecutive adds and moves into one instruction. Once this is done and we have our list of instructions, we want to build an array of machine code instructions in memory, and instruct the CPU to execute it. We’ll prepare the generated machine code so it ends with returning control to our program.

From the CPU’s perspective code is just data in memory. This means executing the code we stored at a memory address is just jumping to that address. However, as a user mode program running on an operating system, our process interfaces only with virtual memory managed by the OS. The CPU is only willing to execute instructions from virtual memory that has been marked as executable in the page table, and the OS won’t hand out executable memory by default.

On POSIX systems we can use the mmap function to ask for memory that can be read, written and executed, and we can write the instructions there. To avoid accidentally generating code that overwrites itself, we’ll first request memory with write permissions, copy the code to the buffer, then ask the OS to make it executable and read-only. Altering the permissions of an already mapped region of memory is done with the mprotect POSIX function. Once execution is finished, we can free this memory with munmap.

pub const Code = struct {
    const Self = @This();

    mmap_region: []u8,

    pub fn init(code: []u8) !Self {
        // mprotect expects mmap_region.len to be a multiple of the page size
        const len = (code.len / std.mem.page_size + 1) * std.mem.page_size;
        var mmap_region = try std.os.mmap(
            null, // Any address is fine
            len,
            std.os.PROT.READ | std.os.PROT.WRITE,
            std.os.MAP.ANONYMOUS | std.os.MAP.PRIVATE,
            -1,  // If we were mapping a file, this would be the descriptor
            0,   // and this would be the offset in the file
        );
        @memcpy(mmap_region[0..code.len], code);

        try std.os.mprotect(mmap_region, std.os.PROT.READ | std.os.PROT.EXEC);

        return Self{ .mmap_region = mmap_region };
    }

    pub fn deinit(self: *Self) void {
        std.os.munmap(self.mmap_region);
    }
    // ...
}

This covers writing the machine code and making it executable. How do we actually execute it? We need to prepare an environment for the generated code to run. First, we need to allocate the tape – an array of bytes –, and pass its address to the generated code. We also need to make it possible for the generated code to print and read bytes. We’ll implement these as Zig functions, and pass the address of these functions to the generated code so it can call them.

In Zig terms, it looks like this – more on the callconv parts later:

pub const Code = struct {
    // ...

    pub fn run(self: *const Self, tape: []u8) void {
        const Env = packed struct {
            write: *const fn (u8) callconv(.C) void,
            read: *const fn () callconv(.C) u8,
        };

        var env = Env{ .write = undefined, .read = undefined };
        env.write = envWrite;
        env.read = envRead;

        const code_pointer: *const fn (*u8, *const Env) callconv(.C) void =
            @alignCast(@ptrCast(self.mmap_region.ptr));
        var tape_pointer = &tape[bf.TAPE_SIZE / 2];
        code_pointer(tape_pointer, &env);
    }

    fn envWrite(ch: u8) callconv(.C) void {
        std.io.getStdOut().writeAll(
            &[1]u8{ch},
        ) catch unreachable;
    }

    fn envRead() callconv(.C) u8 {
        var buf: [1]u8 = undefined;
        const read_count = std.io.getStdIn().read(&buf) catch return 0;
        if (read_count == 0) {
            return 0;
        } else {
            return buf[0];
        }
    }
};

To jump to the generated code, we cast the pointer to the code buffer to a function pointer. By doing this, we are turning a pointer to data into a pointer to code. On the CPU level, both are just memory addresses, which are just 64-bit numbers on this architecture.

Generating CPU instructions

Now we know how we’ll execute the generated code, we just need to actually generate it from the list of Brainfuck instructions. We need a strategy to access our environment – the tape pointer and the address of the struct pointing to our external functions defined in Zig. Then, for each Brainfuck instruction we need to decide what CPU instructions we use to execute it, and encode those as bytes for the CPU.

The instructions of the CPU operate on its registers and memory. On the x86-64 architecture we have sixteen 64-bit general purpose registers: rax, rbx, rcx, rdx, rsi, rdi, rbp, rsp, and r8r15. We can also operate on the lower 32, 16 or 8 bits of these registers (e.g. rax, eax, ax, al). For the first four, we can also work with the upper byte of the lower 16 bits:

6 4 r a 3 x 2 e a 1 x 6 a h a x 8 a l 0

Since we’ll call functions written in Zig from our generated machine code, we need an agreement that defines what the caller and callee can expect of the state of the registers. For example, if we set rax to 1337 and call our envWrite function, can we expect that when the function returns rax is still 1337?

These agreements are calling conventions. We defined envWrite, envRead and the function pointer to the generated code as callconv(.C), which means they will be called and expect to be called using the C calling convention. On our target platform and architecture that’s the System V AMD64 ABI calling convention. It specifies that the rbx, rsp, rbp and r12r15 registers are callee-saved, meaning the callee must restore them before returning to the caller. The other registers are caller-saved; if we need them preserved, we must save them before calling a function, by e.g. writing them to memory. It also specifies that the first two integer arguments of a function are passed in rdi and rsi, and the return value is in rax.

It would be nice not having to always save the pointers to the tape and environment struct, so we’ll store them in callee-saved registers. We’ll store the tape pointer in rbp and the address of the environment struct in rbx. Reading 8 bytes from the address stored in rbx gives us the address of envWrite, and reading from 8 bytes further gives us envRead.

Compiling addition

After all this introduction, let’s start with compiling additions. We need to read the byte pointed by the tape pointer, increment it with the amount in our Brainfuck instruction, and write it back to memory. We can use al, the lowest byte of rax as a temporary scratch register:

mov al, [rbp]    ; al = memory[rbp]
add al, $amount  ; al = al + amount
mov [rbp], al    ; memory[rbp] = al

On x86 we can also do this with one instruction, since the destination of add can be a memory location. In the previous example using al as the destination register implied that we are reading and writing one byte of memory. When our destination is a memory location, we specify the operand size explicitly with the byte keyword:

add byte [rbp], $amount   ; memory[rbp] = memory[rbp] + amount

So how do we represent this instruction in memory for the CPU to execute? We can look that up in either Intel’s manual, or more condensed references like Félix Cloutier’s reference or x86asm.net. However, x86 instruction encoding is a bit complicated. We can first get a feel for it using an assembler and disassembler, like nasm and ndisasm, by writing out a few variants of an instruction and checking the output.

Let’s feed this to nasm:

[bits 64]
add byte [rbp], 0
add byte [rbp], 1
add byte [rbp], 100
add byte [rbp], 255
$ nasm tmp.s && ndisasm -b 64 tmp
00000000  80450000          add byte [rbp+0x0],0x0
00000004  80450001          add byte [rbp+0x0],0x1
00000008  80450064          add byte [rbp+0x0],0x64
0000000C  804500FF          add byte [rbp+0x0],0xff

Checking x86asm.net we can see that 80 encodes an 8-bit add with an immediate operand instead of using another register or memory as a source, or a bunch of other instructions like sub or and. The next byte is the ModR/M byte. 45 encodes that our operation is specifically add, and the destination is [rbp + disp8], where disp8 is the next byte. In our case the next byte is 00, so no offset is applied to the address in rbp. The last byte is the immediate operand we’re adding.

In our code generator we can emit these bytes when we encounter an add instruction, which will just append it to a buffer:

pub fn gen(ops: []const bf.Op, builder: *jit.Builder) !void {
    for (ops) |op| {
        switch (op) {
            .add => |amount| {
                // add byte [rbp], $amount
                try builder.emit(&[_]u8{ 0x80, 0x45, 0x00, @bitCast(amount) });
            },
            // ...
        }
    }
}

Compiling moves

To compile moving the tape pointer, we just need to change the pointer stored in rbp. Let’s do another round of assembly exploration with nasm:

[bits 64]
add rbp, 0
add rbp, 127
add rbp, 128
add rbp, -1
add rbp, -128
add rbp, -129
$ nasm tmp.s && ndisasm -b 64 tmp
00000000  4883C500          add rbp,byte +0x0
00000004  4883C57F          add rbp,byte +0x7f
00000008  4881C580000000    add rbp,0x80
0000000F  4883C5FF          add rbp,byte -0x1
00000013  4883C580          add rbp,byte -0x80
00000017  4881C57FFFFFFF    add rbp,0xffffffffffffff7f

That initial 48 is the first appearance of the REX prefix, since we are doing a 64-bit addition. We can see that once the immediate operand no longer fits in a signed byte, it is encoded as a longer, 32-bit immediate.

We can cut a corner here and throw an error if the move amount doesn’t fit in a byte. That would mean there are more than 127 consecutive < or > instructions in the Brainfuck code, which occurred in none of my test cases. In Zig @intCast conveniently casts i32 to i8 and throws a runtime error if that results in data loss:

// ...
.move => |amount| {
    // add rbp, byte $amount
    const byte: i8 = @intCast(amount);
    try builder.emit(&[_]u8{ 0x48, 0x83, 0xC5, @bitCast(byte) });
},
// ...

Compiling jumps

Next let’s compile [, which jumps to the matching ] if the tape pointer points to 0. To do this, we can load the pointed byte to al and subtract 0 from it. If al becomes zero, the instruction sets the zero flag in the rflags register. After that, we can use the jz (jump if zero) instruction to jump to the address.

Since this is the primary way we do comparisons, and we don’t always want to actually change the destination register, the cmp instruction performs a subtraction and changes only the flags, but does not store the result:

mov al, [rbp]  ; al = memory[rbp]
cmp al, 0      ; set flags based on (al - 0)
jz offset      ; add offset to the instruction pointer if the zero flag is set

Jumping back is similar, but we jump if the value is non-zero:

mov al, [rbp]  ; al = memory[rbp]
cmp al, 0      ; set flags based on (al - 0)
jnz offset     ; add offset to the instruction pointer if the zero flag is not set

So how do we know where to jump, how do we calculate the offset? We’ll use a stack during code generation to keep track of the address of the jumps. When we encounter a [, we emit a jump with a placeholder offset and we also push the address of this jump instruction on the stack. When we encounter a ], we pop the address of the matching [. From that, we can calculate the offset to jump back, and we can also fill in the placeholder with the forward jump offset. Note that the offset is a signed offset that is added to the address of the next instruction, rather than the current one.

pub fn gen(ops: []const bf.Op, builder: *jit.Builder) !void {
    var jump_offsets = std.ArrayList(i32).init(std.heap.page_allocator);
    defer jump_offsets.deinit();

    for (ops) |op| {
        switch (op) {
            // ...
            .jump_if_zero => {
                // mov al, [rbp]
                try builder.emit(&[_]u8{ 0x8a, 0x45, 0x00 });
                // cmp al, 0
                try builder.emit(&[_]u8{ 0x3c, 0x00 });

                try jump_offsets.append(@intCast(builder.len()));

                // jz ...
                try builder.emit(&[_]u8{ 0x0f, 0x84, 0, 0, 0, 0 });
            },

            .jump_back_if_non_zero => {
                const pair_offset = jump_offsets.pop();

                // mov al, [rbp]
                try builder.emit(&[_]u8{ 0x8a, 0x45, 0x00 });
                // cmp al, 0
                try builder.emit(&[_]u8{ 0x3c, 0x00 });
                // jnz pair_offset
                try builder.emit(&[_]u8{ 0x0f, 0x85 });

                const relative_offset =
                    (pair_offset + 6) - (@as(i32, @intCast(builder.len())) + 4);
                try builder.emit32(@bitCast(relative_offset));

                // fill the offset in the matching jz
                builder.fill32(
                    @intCast(pair_offset + 2),
                    @bitCast(-relative_offset),
                );
            },
            // ...
        }
    }
}

Compiling writes and reads

The only instructions left to implement are writes and reads. These will call the function addresses stored in [rbx] and [rbx + 8]. We need to keep in mind that the functions expect their first argument (if they have one) in rdi and their return value is in rax.

The call assembly instruction pushes the address of the next instruction on the stack and jumps to the supplied address. The called function can use ret to jump back to the address on the stack.

What is the stack? The stack is a region of memory we can use as, well, a stack with some special instructions. It is most commonly used for return addresses, and local variables and function arguments that don’t fit in the registers. The rsp register is the stack pointer. The push instruction subtracts the operand size from rsp and writes the value to [rsp]. The pop instruction does the opposite, it reads [rsp] and increments rsp. The more data we have pushed on the stack, the lower rsp is. This is usually phrased as “the stack grows downwards.”

The assembly for our writes and reads looks like this:

; write
mov rdi, [rbp]  ; rdi = memory[rbp]
call [rbx]      ; push return address, jump to [rbx]

; read
call [rbx + 8]  ; push return address, jump to [rbx + 8]
mov [rbp], al   ; memory[rbp] = al

And the code generation in Zig looks like this:

// ...
.write => {
    // mov rdi, [rbp]
    try builder.emit(&[_]u8{ 0x48, 0x8b, 0x7d, 0x00 });
    // call [rbx]
    try builder.emit(&[_]u8{ 0xff, 0x13 });
},

.read => {
    // call [rbx+8]
    try builder.emit(&[_]u8{ 0xff, 0x53, 0x08 });
    // mov [rbp], al
    try builder.emit(&[_]u8{ 0x88, 0x45, 0x00 });
},
// ...

Adding a prologue and epilogue

Now we have only a tiny bit to do before we have working code generation. Remember that our generated code is called from Zig using the C calling convention:

const code_pointer: *const fn (*u8, *const Env) callconv(.C) void =
    @alignCast(@ptrCast(self.mmap_region.ptr));
var tape_pointer = &tape[bf.TAPE_SIZE / 2];
code_pointer(tape_pointer, &env);

This means the tape pointer and the pointer to Env will be passed in rdi and rsi, but we need them to be in rbp and rbx. Since rbp and rbx are callee-saved – the exact reason we chose them – we also need to restore them to their original value once the generated code has run. Finally, we need to jump to the return address on the stack to give back control to the Zig code that called us.

To save the registers we push them to the stack, to restore them we pop them in reverse order:

; before the Brainfuck code
push rbp      ; save rbp to memory
push rbx      ; save rbx to memory
mov rbp, rdi  ; rbp = rdi
mov rbx, rsi  ; rbx = rsi

; after the Brainfuck code
pop rbx       ; restore rbx
pop rbp       ; restore rbp
ret           ; jump to return address on stack

Adding these instructions before and after the rest of the code concludes our code generation:

pub fn gen(ops: []const bf.Op, builder: *jit.Builder) !void {
    try genPrologue(builder);

    var jump_offsets = std.ArrayList(i32).init(std.heap.page_allocator);
    defer jump_offsets.deinit();

    for (ops) |op| {
        // ...
    }

    try genEpilogue(builder);
}

fn genPrologue(builder: *jit.Builder) !void {
    // push rbp
    try builder.emit8(0x55);
    // push rbx
    try builder.emit8(0x53);
    // mov rbp, rdi
    try builder.emit(&[_]u8{ 0x48, 0x89, 0xfd });
    // mov rbx, rsi
    try builder.emit(&[_]u8{ 0x48, 0x89, 0xf3 });
}

fn genEpilogue(builder: *jit.Builder) !void {
    // pop rbx
    try builder.emit8(0x5b);
    // pop rbp
    try builder.emit8(0x5d);
    // ret
    try builder.emit8(0xc3);
}

With this, we can generate machine code from a parsed Brainfuck program and execute it. The only things left to do are parsing the source file and run the whole thing with an allocated tape, and we have a simple working JIT compiler.

Next posts: