Sunday, 20 September 2015

Most Significant Bits

This week I think I fixed the irregular behavior in DynASM's x64 instruction encoding register selection. It makes for a fun story to share, so I thought it was time to blog.

The astonishingly irregular thing about x64 instruction encoding is that it is mostly very regular. Ignoring for the moment instruction prefixes and constants, an x86 instruction consists of two bytes: one for the instruction proper, and one for its two operands. Take for example the addition of two registers:

+-----------------+
| add  | eax, ecx |
+-----------------+
| 0x01 |     0xc8 |
+-----------------+

We take the instruction byte as given. It is the second byte that concerns me, because it determines which operands to use and how. Like a good CISC architecture, the x86 supports a number of addressing modes, meaning that a register can be used as a value but also as (part of) a pointer. One of the reasons C does pointer arithmetic so freely is that this reflects the nature of the CPUs which were current when C was designed. The (relevant) x86 addressing modes are shown in the following table. (There are more, but you shouldn't use them):

  • Direct (byte flag 0xc0): Both operands are used as direct values.
  • Indirect (0x00): One of the operands is used as a memory reference; whether that is the source or the destination operand depends on the instruction.
  • Indirect with offset (0x80 or 0x40): One of the operands is used as a memory reference, offset by a constant which is encoded directly after the instruction.
  • Indexed (0x04): One operand consists of two registers, a base and an index, the latter multiplied by a scale, which together provide a memory reference. Optionally also has an offset, in which case the code byte is 0x44 or 0x84, depending on whether the offset fits in a single byte or not.
  • Instruction-relative (0x05): One of the operands is a reference to the current location in the code, offset by a constant (and the other refers to a register as usual).

Readers who are more careful than can reasonably be expected will have noticed that in the first three addressing modes, the lowest nibble is zero, whereas it is nonzero for the last two. This is in fact the source of the irregularities in instruction encoding. To appreciate this it helps to unpack the operand byte in octal rather than hexadecimal; octal is much closer to how x86 thinks about the world. As demonstrated in this table, the two lowest groups of 3 bits each encode one of the registers that should be used.

+------+------+-----+-----+-------------------+
| Byte | Mode | Op2 | Op1 | Meaning           |
+------+------+-----+-----+-------------------+
| 0xc0 |    3 |   0 |   0 | Direct            |
+------+------+-----+-----+-------------------+
| 0x80 |    2 |   0 |   0 | Indirect + offset |
+------+------+-----+-----+-------------------+
| 0x04 |    0 |   0 |   4 | Indexed           |
+------+------+-----+-----+-------------------+
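To make the octal decomposition concrete, here is a small Python sketch of my own (the field names are mine, not DynASM's) that splits an operand byte into its three fields:

```python
def unpack_modrm(byte):
    """Split an operand (ModR/M) byte into (mode, op2, op1):
    2 bits of addressing mode, then two 3-bit register fields."""
    mode = (byte >> 6) & 0x3   # top two bits: addressing mode
    op2 = (byte >> 3) & 0x7    # middle three bits: second operand
    op1 = byte & 0x7           # lowest three bits: first operand
    return mode, op2, op1

assert unpack_modrm(0xc0) == (3, 0, 0)  # direct
assert unpack_modrm(0x80) == (2, 0, 0)  # indirect + offset
assert unpack_modrm(0x04) == (0, 0, 4)  # indexed
```

The asserts reproduce exactly the rows of the table above.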

The upshot of this is that when the operand mode isn't direct and the first operand register is either 4 or 5, the meaning of the operand byte changes completely. For register 4, x86 suddenly expects another operand byte (a so-called SIB byte) to specify which register shall be base and which shall be index; for register 5 in plain indirect mode, it expects an instruction-relative reference instead.

Normally this isn't much of a problem, since the registers referred to by numbers 4 and 5 are rsp and rbp respectively, meaning the stack top and stack base registers. Fun fact: the x86 stack grows downward, so rbp > rsp in basically all cases. Also fun fact: because of this, writing upward from an rsp-relative reference can overwrite the return pointer held somewhere below rbp, which is the basis of most buffer overflow attacks. You thought NULL was a billion dollar mistake? Consider how the engineers who decided the stack should grow downward must feel.

Anyway, considering that rbp and rsp play such a pivotal role in your program, it's actually unlikely you'll encode them by mistake. So as long as you don't do that, it's safe to ignore this complexity and just 'add in' the correct bits into the operand byte. Thus, for instance, to refer to registers 7 and 1 in direct mode, we generate:

0300 + (07 << 3) + (01) == 0371 == 0xf9
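In Python, that computation (a sketch of my own, not DynASM code) looks like:

```python
def modrm(mode, op2, op1):
    """Assemble an operand (ModR/M) byte from its 2/3/3-bit fields."""
    return (mode << 6) | ((op2 & 0x7) << 3) | (op1 & 0x7)

# 0300 + (07 << 3) + (01) == 0371 == 0xf9
assert modrm(0o3, 0o7, 0o1) == 0o371 == 0xf9
```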

However, in the land of x64, things are not so happy. I have blogged earlier about how the x64 architecture gives you 8 extra registers on top of the 8 legacy x86 registers, and how the only difference between those registers is that x64 requires a prefix byte called REX. REX byte encoding is not so difficult; the trick is finding where to put it reliably. But because only the lower three bits of the 4-bit register number are placed in the operand byte, registers r12 and r13 look exactly like rsp and rbp to the CPU. Well, that's where the fun really starts, because it's all too easy to encode these 'accidentally'. They are, after all, perfectly regular registers.
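A quick sketch (in Python, my own illustration) of why r12 and r13 collide with rsp and rbp: the fourth register bit moves into the REX prefix, and only the low three bits reach the operand byte.

```python
RSP, RBP, R12, R13 = 4, 5, 12, 13

def split_reg(reg):
    """Return (rex_bit, low3): bit 3 goes into the REX prefix,
    and only the low three bits go into the operand byte."""
    return (reg >> 3) & 1, reg & 0x7

# r12 and rsp share low bits (4); r13 and rbp share low bits (5)
assert split_reg(R12)[1] == split_reg(RSP)[1] == 4
assert split_reg(R13)[1] == split_reg(RBP)[1] == 5
```

So as far as the operand byte is concerned, r12 *is* rsp and r13 *is* rbp; only the REX prefix tells them apart.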

For those not keeping score, we have two special cases to handle. First, whenever the first operand is either rsp or r12 and we're not using direct mode, an extra SIB byte needs to be encoded to specify that we are really talking about accessing rsp/r12 directly. This is done by encoding rsp as both the base and index, which the x86 understands because using rsp as an index is usually illegal. (The magic byte is thus 0044 or 0x24). Second, whenever the first operand is rbp or r13 and we're using indirect access without an offset, we need to encode indirect access with an offset instead, just with the offset at zero. This of course requires another byte. Somewhat byzantine, but manageable.
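Those two escapes can be sketched as follows (a hypothetical helper of my own, not DynASM's real encoder; it handles only direct and plain indirect mode, and leaves real offset encoding out):

```python
def encode_operand(mode, op2, op1):
    """Return the operand bytes for direct (0xc0) or plain indirect
    (0x00) mode, handling the rsp/r12 and rbp/r13 special cases."""
    modrm = mode | (op2 << 3) | op1
    if mode == 0x00 and op1 == 4:
        # rsp/r12 indirect: append a SIB byte with rsp as both base
        # and index (0044 == 0x24), read by the CPU as 'just rsp'
        return [modrm, 0x24]
    if mode == 0x00 and op1 == 5:
        # rbp/r13 indirect: switch to indirect-with-offset (0x40),
        # with the offset byte fixed at zero
        return [0x40 | (op2 << 3) | op1, 0x00]
    return [modrm]

assert encode_operand(0xc0, 0, 4) == [0xc4]        # direct: no escape
assert encode_operand(0x00, 0, 4) == [0x04, 0x24]  # rsp: SIB byte
assert encode_operand(0x00, 0, 5) == [0x45, 0x00]  # rbp: zero offset
```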

We are, unfortunately, not completely OK yet. It is the central hypothesis of this post that DynASM was not designed to handle register selection at runtime. Evidence for this hypothesis is that DynASM does 'weird' things like mixing data and instructions, and linking prior to encoding. Usually one encodes first and links afterwards, especially when encoding may involve decisions that influence the final positions of certain segments. DynASM does it the other way around, which means that during linking we must be able to calculate exactly how much space we need for each instruction. That is a pain, because DynASM mixes the data stream (which we need for inspection) with the instruction stream (which tells the runtime what to do with its input). It's possible to hack around this - basically by copying data into the instructions - but it's not elegant. Starting with this commit, I'm reasonably confident that this stuff works; a test case is provided here.

That almost concludes this week's madness. The only thing left is to question the method: why does x86 use this convoluted scheme? I could do detailed historical research, but I prefer to speculate that it is caused by economy of memory. After all, in the regular case you need only 2 bytes, which is - conveniently - equal to 16 bits, the original register size of the 8086. And since that chip was designed in the 1970s, it makes sense that instructions should be as space-efficient as possible. In contrast, ARM uses 32-bit instructions with 3 operands. So space economy seems a plausible cause to me.

See you next time!

Monday, 14 September 2015

Wrapping Up

Hi everybody, it's been a while since I've blogged, and I think you deserve an update. Last week, of course, was YAPC::EU, which was awesome. Granada is a very nice place, and the weather was excellent. The tapas lunch was very nice, as was the gala dinner (also with tapas). There were many interesting people and presentations (many more than I could actually see). It was also very interesting to present (slides) about the JIT, which I think went well. One of the comments I heard was that it was quite a high-level talk, so should anyone ask, I can and will give a talk describing the grueling details in the future. Oh, and I just found that Larry Wall's keynote has been uploaded to YouTube. Go ahead and watch; this page isn't going anywhere.

So what news from JIT compiler land? Well, since I last blogged I had to design and implement the register allocator and tie everything together into the final running compiler. That has been achieved, but there are still many bugs and limitations that prevent it from actually being effective. I will talk about these at length. But first, let me explain the register allocator.

The basic problem of register allocation is that a compiler can assign more values than the CPU has registers. Thus, only some values can reside in registers and others must be stored to memory (spilled). Computing the best set of values to keep in registers is in general a computationally intractable problem. To solve it I use the 'linear scan' allocation heuristic. This heuristic is as simple as can be: determine for each value allocated its last use, and when it's time to spill a value, spill the one whose last use lies furthest in the future. These slides also do a good job of explaining it.
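A minimal sketch of that spill choice (my own illustration, not the MoarVM code): given the live values, each annotated with the position of its last use, spill the one whose last use lies furthest away.

```python
def pick_spill(live):
    """live: dict mapping value name -> position of its last use.
    Spill the value that stays live furthest into the future."""
    return max(live, key=live.get)

live = {'a': 12, 'b': 40, 'c': 25}
assert pick_spill(live) == 'b'  # 'b' is used furthest in the future
```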

Aside from being simple to implement, it's also effective. Values that will expire soon are likely to be used soon as well, so their register will free up soon enough. On the other hand, values that are bound to live a long time will likely be spilled before they expire anyway, so it costs less to spill them now. Another benefit is that this can be evaluated online, i.e. as the JIT tree is being processed, so that it doesn't require a separate step. (One abstraction it did require, though, was order-numbering the tiles, which are currently attached to the JIT tree. This makes the tiles in essence a linear list, and it is my plan to convert them to a linear array in memory as well. That would reduce the number of tree traversals by one as well).

I will not maintain the illusion here that a register allocator is a trivial component, even with a conceptually simple algorithm like the one just outlined. One of the tricky bits is ensuring that values defined in a conditional branch do not accidentally 'escape' out of their branches, since they will be unavailable if the branch was not taken. In the same vein, after a call all 'volatile' registers are invalidated, which also requires some special treatment.

After the register allocator was finished, all that remained was ironing out the bugs. And they were (and are) many. The most annoying of these were the DynASM machine code generation bugs. The DynASM runtime is a tight state machine that uses a mix of data and instructions to generate machine code. The first bug was relatively simple: a missing REX byte marking, caused by the preprocessor looking at only one of the two operands. The second bug was positively evil. See, x86 uses so-called ModR/M bytes to specify the registers and memory access mode used for an instruction. It's not important that you know what it stands for, but it is important that each such byte - 8 bits - is divided into a mode of 2 bits and 2 register numbers of 3 bits each. (There are 8 legacy registers in x86, so that fits.) Except when the first register number is 4 (rsp, the stack pointer register). Then the CPU expects a SIB byte to follow, which is something different entirely - it refers to two registers at once. The upshot is that this SIB byte must now be appended to the ModR/M byte and filled with correct data; and that this SIB byte is then processed by DynASM as if it were a ModR/M byte anyway... I've patched DynASM to do this, but the fix is really quite sensitive and brittle, and I rather expect to have to redo it another way in the future.

That brings me to today. Unfortunately for my JIT aspirations, my educational obligations have caught up with me again. In other words: my studies have started and leave me with much less time to work on the JIT. So for clarity, in the last few months I have completed the following:
  • The expression tree intermediate format and template preprocessor
    • Documentation on the use of these
  • The tiler and tiler table generator
  • A register allocator and assorted support structures in the new JIT
  • Dynamic register support for x64 DynASM 
  • Periodic reports on progress by means of this blog
Timo Paulssen has developed a C-source to expression tree format preprocessor, which should in the best case help to quickly convert most of the old JIT compiler segments to the new format, once it is more stable.

Here is what I intended (and indeed promised) to do, but haven't finished yet:
  • A test suite - I haven't needed one during development as regular NQP compilation already involves plenty of JIT compiler activity. The benefit of a real test suite is of course that specific segments can be tested (like numeric registers).
  • A representation-based extension API. I haven't gotten around to it because the JIT has not been stable enough so far. It will be based on the use of templates specialized to a certain in-memory representation.
  • Documentation on the use and development of architecture-specific tiles.
And there are things which I haven't outlined explicitly but which I still want to do:
  • Full floating point operation support (the simplest way is probably to use a wholly separate allocator for numeric registers).
  • Automatic within-template ordering to maintain consistent branch semantics (rather than relying on let ordering to build values shared between branches).
  • A framework for optimizing transformations. I think the key to this is probably a replacement table allowing different algorithms to insert replacements for given nodes.
  • As I said above, a truly linear representation of the selected tiles, in order to allow reordering for the purposes of instruction scheduling.
And there are many more. But as I said, I cannot continue with them as I have in the past few months. I will continue to work to stabilise the JIT and I hope to succeed in that before Christmas, only at a slower pace than before. I realise it is somewhat disappointing not to be able to fully 'land' the new JIT yet, but I'm confident we'll get there in the end. See you next time!