-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Cranelift: consider adding a loop-peeling optimization pass #10111
Comments
If we don't care about cleaning up conditional code that only happens in the first iteration of the loop, and can statically be shown as such after the peeling, and we only care about being able to LICM side-effectful-but-idempotent instructions, then we could also just investigate extending aegraph elaboration to support LICM'ing the idempotent parts of the side-effectful skeleton when that wouldn't move one side-effectful instruction across another one. But that kinda feels a little too specific and one-off for my tastes, in comparison to the peeling which effectively gives us that and also some more stuff all at once (ie is more general/powerful and probably not very different in terms of engineering effort?). |
I'd be interested in working through some example IR you've see to motivate this -- the main reason being that my preference is actually the opposite, toward better LICM rather than general peeling. The main reasons are that:
In contrast, "better LICM" feels much more tractable to me: it doesn't require bulk CFG manipulation (only instruction placement/scheduling which we're already doing) and gets the same concrete benefits on e.g. repeated trap checks without paying the large associated cost (duplicating the loop body). |
This is something @alexcrichton would have to share, as I believe what I saw was a snippet from when he was diving into some benchmark's preformance on Pulley.
You bring up very fair points. I'm definitely not set in my opinion. I guess peeling+GVN feels clean to me because we "just" add a little peeling and then a bunch of desirable properties fall out "for free". In comparison, improving the LICM for this case is basically adding a bunch of new checks for special cases and otherwise-unshared, manual IR manipulation code. The other complication with improving LICM is that we don't necessarily have a single block dominating the loop but also that will only be executed if the loop will iterate at least once. We require such a block for the LICM, lest we introduce partially redundant code. We could, of course, ensure that such a block exists, inserting one if it is missing, but this is the kind of one-off, otherwise-unshared, manual IR manipulation that I prefer to avoid when possible... It also has the same CFG complexity as peeling with regards to aegraph pass integration. I agree that code bloat is the biggest concern with peeling, and I don't have any silver bullet there. This may indeed outweigh everything else. |
Hmm, yeah, we do need preheaders ("single block dominating the loop") for the hoisting to be guaranteed in an LICM world. I guess our LICM today is incomplete/"best effort" in this way today as well: if we don't have a place to hoist to then we won't hoist. I think we get the proper block separation as a side-effect of the Wasm-to-CLIF translation, so in practice this may not matter for Wasm workloads, but that's an emergent property rather than a well-modularized guarantee so I agree it doesn't feel great.
I guess my hope at least had been that we would extend the instruction predicates we have currently -- "pure", "idempotent" and the like. But actually when hoisting a side-effect we need to know that it won't cross any other side-effects so that does involve a sort of ad-hoc scan of the loop body and reasoning about side-effect colors. ... and now I think I'm coming around to the peeling side at least for the |
I am not sure about our specific Wasm-to-CLIF translation either, but I do not believe that Wasm's
FWIW, the case I am remembering might have actually been a trusted load of something in the vmctx (heap base?) and not actually a conditional trap instruction. I don't think that changes anything we've been discussing tho. Anyways, another "cliff-free" (not to be confused with CLIF-free!) heuristic would be to always and only peel innermost loops, probably in combination with something like "and if there is a trusted load or an idempotent side-effectful instruction in the loop body". If we didn't want to invest in subsequent passes to clean up any accidental redundancy, we could refine the heuristics even further with things like "and the instruction we are targeting for LICM via peeling doesn't depend on a block parameter defined within the loop."
That would indeed be cool, although more work also so I wouldn't argue that we should block landing a hypothetical peeling pass on this existing too, unless the numbers show that bloat is unacceptable. Are you imagining that this would do some kind of renumbering to dedupe block params as well? Like, would you expect it to clean up the following duplicate blocks?
I guess we can start simple and then add more complexity if needed. |
That is, a pass that unrolls one iteration of the loop body just before the actual loop. This is often beneficial because loop bodies might contain conditional code that only executes on the first iteration.
There could be additional benefit for Cranelift in particular: right now, Cranelift will deduplicate various kinds side-effecting-but-idempotent instructions (such as
trapz
s) but will not move those instructions outside of their original blocks and hoist them just above the loop even when it would be valid to do so. But if we have already unrolled the first iteration of a loop, then that unrolled first iteration will dominate the rest of the loop and when we dedupe those instructions from inside the loop body to their unrolled counterparts, we are effectively doing exactly that code motion we wanted to do before but could not.Random tidbits:
Could we do this at the same time as the aegraphs pass? I think so? But it would be in the same traversal, not actually in ISLE, similar to eg the alias analysis optimization.
We will need some heuristics for when to peel loops or not. Don't think we want the code bloat of peeling every single loop in the whole program. Maybe just innermost loops? Maybe just loops of up to a certain size? Lots of experimenting to be done in this regard.
The text was updated successfully, but these errors were encountered: