Cranelift: consider adding a loop-peeling optimization pass #10111

fitzgen · 2025-01-24T23:57:19Z

That is, a pass that unrolls one iteration of the loop body just before the actual loop. This is often beneficial because loop bodies might contain conditional code that only executes on the first iteration.

There could be additional benefit for Cranelift in particular: right now, Cranelift will deduplicate various kinds side-effecting-but-idempotent instructions (such as trapzs) but will not move those instructions outside of their original blocks and hoist them just above the loop even when it would be valid to do so. But if we have already unrolled the first iteration of a loop, then that unrolled first iteration will dominate the rest of the loop and when we dedupe those instructions from inside the loop body to their unrolled counterparts, we are effectively doing exactly that code motion we wanted to do before but could not.

Random tidbits:

Could we do this at the same time as the aegraphs pass? I think so? But it would be in the same traversal, not actually in ISLE, similar to eg the alias analysis optimization.
We will need some heuristics for when to peel loops or not. Don't think we want the code bloat of peeling every single loop in the whole program. Maybe just innermost loops? Maybe just loops of up to a certain size? Lots of experimenting to be done in this regard.

The text was updated successfully, but these errors were encountered:

fitzgen · 2025-01-25T00:01:38Z

If we don't care about cleaning up conditional code that only happens in the first iteration of the loop, and can statically be shown as such after the peeling, and we only care about being able to LICM side-effectful-but-idempotent instructions, then we could also just investigate extending aegraph elaboration to support LICM'ing the idempotent parts of the side-effectful skeleton when that wouldn't move one side-effectful instruction across another one. But that kinda feels a little too specific and one-off for my tastes, in comparison to the peeling which effectively gives us that and also some more stuff all at once (ie is more general/powerful and probably not very different in terms of engineering effort?).

cfallin · 2025-01-25T00:20:42Z

I'd be interested in working through some example IR you've see to motivate this -- the main reason being that my preference is actually the opposite, toward better LICM rather than general peeling. The main reasons are that:

Peeling doubles the size of the loop (in the general case), as you mention in your second bullet point; that seems to me like it will be the dominant negative effect and a large one. Consider the two universes we could choose with our heuristics:
- We peel only inner loops or small loops, as suggested. Then any benefit of hoisting traps happens only for some code in the program; and there are surprising performance cliffs that occur. (Adding a few instructions causes the whole loop to slow significantly; adding another nested loop somewhere turns the main loop into "not an inner loop anymore".) This seems generally contrary to the "predictable performance" design goal of Wasm and by extension of Cranelift.
- We peel all loops. Then we have significant code expansion, and compile time slowdown: somewhere between 1x and 2x depending on how much of a program's body lives in loops. (Or more, depending on how we handle nested loops.) Note that will also turn into runtime slowdown if there is any icache presure.
Peeling is a pretty complex change to the aegraph visitor pass; it breaks some fundamental invariants and adds a new dimension of complexity (output CFG is not input CFG, with all the associated implications for SSA, the domtree, and otherwise).
The main benefit on its own is LICM-for-free (but then at the cost of a doubling of code size); additional benefits, like specializations for the first iteration, require other work like branch folding which we don't do at all currently. We could revisit peeling when we do branch folding as both are control flow/code motion optimizations.

In contrast, "better LICM" feels much more tractable to me: it doesn't require bulk CFG manipulation (only instruction placement/scheduling which we're already doing) and gets the same concrete benefits on e.g. repeated trap checks without paying the large associated cost (duplicating the loop body).

fitzgen · 2025-01-28T18:05:02Z

I'd be interested in working through some example IR you've see to motivate this

This is something @alexcrichton would have to share, as I believe what I saw was a snippet from when he was diving into some benchmark's preformance on Pulley.

my preference is actually the opposite, toward better LICM rather than general peeling

You bring up very fair points. I'm definitely not set in my opinion.

I guess peeling+GVN feels clean to me because we "just" add a little peeling and then a bunch of desirable properties fall out "for free". In comparison, improving the LICM for this case is basically adding a bunch of new checks for special cases and otherwise-unshared, manual IR manipulation code.

The other complication with improving LICM is that we don't necessarily have a single block dominating the loop but also that will only be executed if the loop will iterate at least once. We require such a block for the LICM, lest we introduce partially redundant code. We could, of course, ensure that such a block exists, inserting one if it is missing, but this is the kind of one-off, otherwise-unshared, manual IR manipulation that I prefer to avoid when possible... It also has the same CFG complexity as peeling with regards to aegraph pass integration.

I agree that code bloat is the biggest concern with peeling, and I don't have any silver bullet there. This may indeed outweigh everything else.

cfallin · 2025-01-28T18:51:48Z

Hmm, yeah, we do need preheaders ("single block dominating the loop") for the hoisting to be guaranteed in an LICM world. I guess our LICM today is incomplete/"best effort" in this way today as well: if we don't have a place to hoist to then we won't hoist. I think we get the proper block separation as a side-effect of the Wasm-to-CLIF translation, so in practice this may not matter for Wasm workloads, but that's an emergent property rather than a well-modularized guarantee so I agree it doesn't feel great.

a bunch of new checks for special cases and otherwise-unshared, manual IR manipulation code.

I guess my hope at least had been that we would extend the instruction predicates we have currently -- "pure", "idempotent" and the like. But actually when hoisting a side-effect we need to know that it won't cross any other side-effects so that does involve a sort of ad-hoc scan of the loop body and reasoning about side-effect colors.

... and now I think I'm coming around to the peeling side at least for the trapz case -- it does indeed feel like the most principled way of reasoning about side-effect motion given the above. Well, the code-bloat is a quantitative question -- we can evaluate that. Maybe what we could focus on is coming up with stable ("cliff-free") heuristics -- e.g., always peel if conditional trap instructions are present. Maybe we could also build this in tandem with some sort of redundancy elimination -- duplicate-block merging? -- that could clean up cases where we didn't get any benefit.

fitzgen · 2025-01-28T19:34:59Z

I think we get the proper block separation as a side-effect of the Wasm-to-CLIF translation, so in practice this may not matter for Wasm workloads

I am not sure about our specific Wasm-to-CLIF translation either, but I do not believe that Wasm's loop construct provides this guarantee.

Maybe what we could focus on is coming up with stable ("cliff-free") heuristics -- e.g., always peel if conditional trap instructions are present.

FWIW, the case I am remembering might have actually been a trusted load of something in the vmctx (heap base?) and not actually a conditional trap instruction. I don't think that changes anything we've been discussing tho.

Anyways, another "cliff-free" (not to be confused with CLIF-free!) heuristic would be to always and only peel innermost loops, probably in combination with something like "and if there is a trusted load or an idempotent side-effectful instruction in the loop body". If we didn't want to invest in subsequent passes to clean up any accidental redundancy, we could refine the heuristics even further with things like "and the instruction we are targeting for LICM via peeling doesn't depend on a block parameter defined within the loop."

Maybe we could also build this in tandem with some sort of redundancy elimination -- duplicate-block merging? -- that could clean up cases where we didn't get any benefit.

That would indeed be cool, although more work also so I wouldn't argue that we should block landing a hypothetical peeling pass on this existing too, unless the numbers show that bloat is unacceptable. Are you imagining that this would do some kind of renumbering to dedupe block params as well? Like, would you expect it to clean up the following duplicate blocks?

block1(v10: i32):
  v11 = iadd_imm v10, 1
  store v11, v456
  jump block789

block2(v12: i32):
  v13 = iadd_imm v12, 1
  store v12, v456
  jump block789

I guess we can start simple and then add more complexity if needed.

fitzgen added cranelift Issues related to the Cranelift code generator cranelift:goal:optimize-speed Focus area: the speed of the code produced by Cranelift. labels Jan 24, 2025

fitzgen mentioned this issue Jan 25, 2025

Pulley Performance Tracking #10102

Open

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Cranelift: consider adding a loop-peeling optimization pass #10111

Cranelift: consider adding a loop-peeling optimization pass #10111

fitzgen commented Jan 24, 2025

fitzgen commented Jan 25, 2025

cfallin commented Jan 25, 2025

fitzgen commented Jan 28, 2025

cfallin commented Jan 28, 2025

fitzgen commented Jan 28, 2025

Cranelift: consider adding a loop-peeling optimization pass #10111

Cranelift: consider adding a loop-peeling optimization pass #10111

Comments

fitzgen commented Jan 24, 2025

fitzgen commented Jan 25, 2025

cfallin commented Jan 25, 2025

fitzgen commented Jan 28, 2025

cfallin commented Jan 28, 2025

fitzgen commented Jan 28, 2025