zondag 18 mei 2014

MoarVM as a Machine

If you read my blog, you'll likely know what MoarVM is and what it does. For readers who do not, MoarVM is a virtual machine that is designed to execute perl6 efficiently. Like a real computer, a virtual machine provides the following:
  • A 'processor', that is to say, something that reads a file and executes a program. This simulation is complete with registers and an instruction set.
  • An infinite amount of memory, using a garbage collector schema.
  • IO ports, including file and network access.
  • Concurrency (the simulation of an infinite amount of processors via threads)
In this post I'll focus on the 'processor' aspect of MoarVM. MoarVM is a 'register virtual machine'. This means simply that all instructions operate on a limited set of storage locations in which all variables reside. These storage locations are called registers. Every instruction in the bytecode stream contains the address of the memory locations (registers) on which it operates. For example, the MoarVM instruction for adding two integers is called add_i, and it takes three 'operands', one for the source registers to be added together and a third for the destination register to store the result. Many instructions are like that.

A register VM is often contrasted with a stack VM. The Java Virtual Machine is a well-known stack VM, as is the .NET CLR. In a stack VM values are held on an ever growing and shrinking stack. Instructions typically operate only on the top of the stack and do not contain any references to memory addresses. A typical stack VM would add two numbers by popping two of the stack and pushing the result.

Why was the choice for a register VM made? I'm not certain, but I think it likely that it was chosen because register machines are frequently faster in execution. In brief, the trade-off is between instruction size on one hand and total number of instructions needed to execute a given program. Because stack VM instructions do not contain any addresses (their operands are implicitly on the stack), they are smaller and the VM has to spend less time to decode them. However, values frequently have to be copied to the top of the stack in order for the stack machine to operate on them. In contrast, a register machine can just summon the right registers whenever they are required and only rarely has to copy a value. In most VM's, the time spent executing an instruction is much larger than the time spent decoding it, so register VM's are often faster. 

From the point of view of somebody writing a (JIT) compiler (like myself), both architectures are abstractions, and somewhat silly too. All actual silicon processor architectures have only a limited number of registers, yet most 'register' VM's - including MoarVM - happily dole out a new set of registers for every routine. In some cases, such as the Dalvik VM, these registers are explicitly stack-allocated, too! The 'register' abstraction in MoarVM does not translate into the registers of a real machine in any way.

Nonetheless, even for a compiler writer there is a definitive advantage to the register VM architecture. To the compiler, MoarVM's instructions are input, that is to be transformed into native instructions. The register VM's instructions are in this sense very similar to something called Three Address Code. (Actually, some MoarVM instructions take more than three operands, but I'll get to that in a later post). A very convenient property of TAC and MoarVM instructions alike is that every variable already has its own memory location. In contrast, in a stack VM the same variable may have many copies on the stack. This is inconvenient for efficient code generation for two reasons. 

First of all, naively copying values as they would be in the stack VM will lead to inefficient code. It may not be obvious which copies are necessary and which are redundant. Nor is it immediately obvious how much run-time memory compiled routine would use. To efficiently compile stack VM code a compiler might do best to translate it into Three Address Code first.

But the second reason is perhaps more profound. Modern JIT compilers use a technique called type feedback compilation. Briefly, the idea is that a compiler that is integrated into the runtime of the system can exploit information on how the program is actually used to compile more efficient code than would be possible on the basis of the program source code alone. A simple example in javascript would be the following routine:

function foo(a) {
    var r = 0;
    for (var i = 1; i < a.length; i++) {
        r += (a[i] * a[i-1]);
    return r;


If all calls to foo happen to have a single argument consisting of an array of integers, the semantics of this routine become much simpler than they are otherwise. (For example, in javascript, the addition of a number and a string produces a well-defined result, so it is totally valid to call foo with an array of strings). A type-feedback compiler might notice a large number of calls to foo, all with integer arrays as their sole argument, assume this will always be so, and compile a much faster routine. In order to correctly handle arrays of strings too, the compiler inserts a 'guard clause' that checks if a is really an array of integers. If not, the routine must be 'de-optimised'.  Note that spesh, which is the optimisation framework for MoarVM, also works this way

The goal of de-optimisation is to resume the execution of the interpreted (slow) routine where the assumptions of the compiled routine have failed. A typical place in our 'foo' function would be on entry or on the addition to r. The idea is that the values that are calculated in the optimised routine are copied to the locations of the values of the interpreted routine. In a register machine, this is conceptually simple because all variables already have a fixed locationHowever, the layout of the stack in a stack vm is dynamic and changes with the execution of the routine, and mapping between compiled and interpreted values may not be very simple at all. It is certainly doable - after all, the JVM famously has an efficient optimising JIT compiler - but not simple.

And in my opinion, simplicity wins.

zondag 11 mei 2014

DynASM is awesome

As part of my ‘community bonding’ period, I’ve taken it upon me to write a small series of blog posts explaining the various parts I’ll be using to add a JIT compiler to MoarVM. Today I’d like to focus on the DynASM project that originates from the awesome LuaJIT project.

DynASM is probably best described as an run-time assembler in two parts. One part is written in lua and acts as a source preprocessor. It takes a C source file in which special directives are placed that take the form of assembly-language statements. Here is a fully worked-out example. These are then transformed into run-time calls that construct the desired bytecode. The generated bytecode can be called like you would a regular function pointer.

DynASM has no run-time dependencies. But to run the preprocessor you will need lua as well as the Lua BitOp module (also from the luajit project). The run-time part is contained within the headers. DynASM is licensed under the MIT license and supports many different architectures, including x86, x64, ppc, and arm. DynASM also intergrates neatly into a Makefile-based build.

In many respects DynASM is an ideal tool for this particular job. However, it also has a few drawbacks. The most important of these is the lack of documentation. With the exception of a few scattered blog posts , there is barely any documentation at all. For many of the simple operations, this is sufficient. For the more complex things, such as dynamic register selection, or dynamic labels, it seems there is no other option than to ask directly. (FWIW, the 'dynamic registers' question was asked an answered only two days ago on the luajit mailing list). However, I think the benefits of using DynASM outwheigh these issues.

For my next blog, I'll be looking at the MoarVM machine model and bytecode set, especially in relation to x64. Hope to see you then.