woensdag 29 juli 2015

Tiles and Compiler Compilers

This week I made progress on the central 'tiling' algorithm used for code generation. I think it makes an interesting story and theory, so I will try to explain it in this post. Unfortunately, I don't think I can explain it very well without resorting to computer-science jargon, but I hope you can forgive me.

What is tiling?
Tiling refers to the process of breaking an input tree into many pieces. The tree refers to the data structures representing the code that the JIT should compile. As I wrote a few weeks back, these trees are ultimately generated from templates, mapped to the MoarVM opcodes. Let's suppose we have a tree for the expression: result = LOCAL[8] + 12 and that it'd look like the code below:

(add 
  (load
    (addr
      (local)
      8)
    int_sz)
  (const
     8
     int_sz))

I realize most people probably are not familiar with the syntax used here. These are called s-expressions, and there is fortunately little to know: '(' opens a list, ')' closes a list, and words represent symbols that are reproduced verbatim (as are the numbers). If we act like the first word in the list reperesents the name of a function and the rest of the list the arguments, it hopefully becomes clear how the text represents the expression (and you can add LISP to your linkedin profile at the same time).

For those who prefer a visual layout (which is most people, I guess), the following graph represents the exact same code:

The code fragment under consideration.

Tiling is necessary because there are many ways to implement this expression in machine code. For instance, every node can be mapped to a single instruction, as in the naive code below. Compare this to the improved code, which is only two instructions long:
TreeNaive codeImproved Code
(add 
  (load
    (addr
      (local)
      8)
    int_sz)
  (const
     12
     int_sz))
mov r0, 12
lea r1, [rbx+8]
mov r2, [r1]
mov r3, r0
add r3, r2
mov r0, [rbx+8]
add r0, 12


As you can hopefully see from the color, in the improved code each instruction refers to multiple parts of the tree. As a result, the improved code is much shorter, and probably faster to execute.
How do we do it?
There are two basic abstractions in tiling.  The first of these is the tile grammar. A very simple grammar is shown below. Each of these tile rules maps a tree fragment to an action, at terminal and a cost. A tile rule that matches a tree fragment conceptually replaces the fragment with it's terminal; the resulting terminal can be used to match other tiles.

1:  (tile: local    (local) reg 1)
2:  (tile: addr_reg (addr reg) mem 1)
3:  (tile: addr_mem (addr reg) reg 2)
4:  (tile: const    (const) reg 2)
5:  (tile: load_reg (load reg) reg 5)
6:  (tile: load_mem (load mem) reg 5)
7:  (tile: add_reg  (add reg reg) reg 2)
8:  (tile: add_cnst (add reg (const)) reg 3)
9:  (tile: add_ldrg (add reg (load reg)) reg 6)
10: (tile: add_ldmm (add reg (load mem)) reg 6)

The second abstraction is the rule set. A given tree node can potentially be matched by any number of rules. For instance, (const) may be matched by rule 4 or it may be matched as the 'subrule' of rule 8. (We denote the subrule of rule 8 as rule 8a). In fact, there is no way the matcher can distinguish between these two options when it evaluates the (const) node. The matcher can only distinguish between the two options when evaluating the parent node of const. (Parents often know more than their children). Hence, mapping a node often yields multiple rules.

This is something of a challenge for matching the parent node. In case of the (load) node in the graph above, do we match to rule 6 (load mem) or rule 5 (load reg)? The (addr) node can map to either a reg or a mem terminal, so it does not reduce the ambiguity. The answer is that rather than trying to cut through the ambiguity we should embrace it. That is to say, we represent the combination of rules as a single ruleset, and the ruleset represent all possible matching rules.

For example, in the grammar above, a (const) node by itself always matches to rule 4 or rule 8a. So the (const) matches a single ruleset consisting of { 4, 8a }. Similarily, an (addr) always takes a reg terminal and maps to both rules { 2, 3 }. In constrast, the (load) node can match rule 5 - if it's child matches to a reg terminal - or rule 6 if it's child matches to a mem terminal. (It can also match to rule 9a and 10a, but ignore that for simplicity). Since all nodes that generate a mem terminal (i.e. the (addr)) can also generate a reg terminal, rule 6 is always combined with rule 5, but the inverse is not the case. (It's perfectly possible to (load) the result of an (add) operation, for example). Thus, (load) maps to two distinct rulesets: { 5 } and { 5, 6 }.

Table generation
It is pretty easy to determine whether a rule will match a node and a combination of rulesets: just try if any of those rulesets can generate the required terminals. Checking this for all rules available will then give you a new combination of rules, which is also represented with a ruleset.  Better yet, knowing the costs associated with each rule, one can determine the optimal rule to compute a node to the terminal required. For instance, in the tree above:
  1. (local) can only be matched by rule 1 (ruleset {1}).
  2. (addr {1}) can be matched by rule 2 and 3 equally. (ruleset {2,3})
  3. (load {2,3}) can be matched by rule 5, 6, 9a and 10a, because the ruleset {2,3} from (addr) generates both mem and reg terminals. (ruleset {5,6,9a,10a}).
  4. (const) can be matched by rule 4 and 8a (ruleset: {4, 8a}).
  5. (add) can be matched by rule 7 and 8, because ruleset {5,6,9a,10a} can generate a reg, and ruleset {4, 8a} can generate the (const) placeholder expected by rule 8. Hence (add {5,6,9a, 10a} {4, 8a}) yields ruleset {7,8}.
Now to determine the optimum code:
  1. (add) can best be represented by rule 8, because this has lower cost than the combination of rule 4 and rule 7.
  2. (load) can best be represented by rule 6, because this has lower cost than rule 3 and 5 combined.
  3. (const) requires no specific representation being embedded in rule 8.
  4. The same goes for (addr) and (local).
In fact, you can compute this information prior to the actual compilation process, and stick it in a table - simply by applying all rules to all combinations of rulesets. Doing this transforms the ambiguous, nondeterministic matching process into a deterministic table lookup. In CS jargon, it transforms the NFA represented by the tile grammar into a DFA. As in all such transformations, this takes up significant amounts of space.

Let's keep it simple
So much space, in fact, that we're not home and dry just yet. A table mapping the combination of a node and two children - indexed by ruleset - must be at least O(nchild × n2ruleset) large. If we naively generate all combinations of rules that start with a given head node we generate 2n rulesets per type of head. Some heads are potentially involved with over 10 rules (consider (load), which is allowed in nearly all x86 instructions), giving - naively - 1024 rulesets. Most of these rulesets are impossible to generate. For example, in our miniature grammar, a ruleset containing {8,9} clearly cannot be generated. It is therefore in our keen interest to generate the minimum amount of rulesets.

But that is pretty complicated: it either requires rather sensitive analysis of the grammar, which isn't algorithmicly cheap by itself; or we can simply read all the rulesets that can be generated from the grammar, by constructing the table that generates them. Clearly that is a chicken-and-egg problem: to find the rulesets that can be generated by a grammar, we have to make a table based on just those rulesets. Fortunately, chicken-and-egg problems can usually be solved by using some form of topological sorting. To put it in other words, we don't need to have all rulesets available to find the combination of rules the grammar can produce, just some that generate all the terminals needed by a given node. In our grammar above, we can just start by generating all rules for (const) and (local), noting that they generate one ruleset each. After that is done, we can generate all rules that rely only on reg, which is the (addr) rule (generating mem). We continue process this until all rulesets have been generated. This dramatically reduces the size of the table, which is still pretty large. Without this procedure however, the time taken to build the table tends to explode on relatively small grammars.

Ultimately the tiler table must be available for the JIT compiler, which is written in C. The tile table generator is written in perl5 (just like the expression template compiler), because, manipulexity and whipuptitude, and it runs everywhere, you know? In fact, perl5 is already a requirement for building MoarVM, which means I wouldn't introduce new build-time dependencies. (Nobody likes installing a build-time dependency, least of all me). Perl5 natively supports hash tables; C does not. So I chose to represent the table as a sorted array of key + value and use binary search to find the right items. There are certainly more efficient representations, but this is very simple and still guarantees quite adequate lookup times. This is important in ensuring the JIT compiler won't become a bottleneck itself.

So that is the story of how I wrote the tiler table generator (and incidentally, a story how perl saved the day). With these in place, I can implement the final tiler quite trivially (I already did, again, in perl). I conclude with noting that while the Aho algorithm guarantees optimal tiling (within the limitations of the grammar), it is not an optimization method by itself. For truly good code - say, like GCC or LLVM can produce - much more is needed: instruction scheduling, register allocation, and true optimisation techniques. Until my next report, I'll be working on these topics. See you then!

3 opmerkingen:

  1. Hey Bart,
    nice explanation, I really like it.

    There's a minor glitch in the example s-expr: the "8" under "const" should be "12", I think.
    Btw, stating that an s-expr is nothing but a linear, and very concise notation for defining a tree, plus maybe a wikipedia link https://en.wikipedia.org/wiki/S-expression might be helpful for some.

    Also, your use of the term "terminal" seems not entirely to be in line in all places with what one's used to when thinking of it terms of grammars, which confused me a bit at first reading. Maybe recast it using "non-terminal" and "reduction" as well?

    Well then, apart from the above (minuscule) pts, really good job!
    Looking forward to reading your next writings, maybe including common sub-expression elimination?

    gtx, meisl

    BeantwoordenVerwijderen
    Reacties
    1. You're absolutely right, I've been sloppy with my terminology for terminals and nonterminals. Partly because I wanted to not introduce too much jargon, and partly because I'm not 100% sure about the correct use myself :-)

      I'll add the link to s-expression syntax. Thanks for the comment

      Verwijderen
  2. Hi Bart,

    normally, I manage to refrain from commenting right after having done only a preliminary cross-reading - but in this case I just have to say "kudos!", right away!

    Not only have you largely extended the "why&how"-part (while considering my humble incitations re "formal-languages-speak") but also included some complexity considerations. That's great, imho; I think you need NOT apologize for using CS concepts or jargon, really.

    So much for now (I'm too curious to read your follow-up post first).
    So, again: good job, thx!

    meisl

    BeantwoordenVerwijderen