CSE for UPLC #5634
Conversation
I've reviewed everything apart from the actual algorithm. I'll create a ticket for myself to review it, since it's definitely going to take me a while. Great docs by the way.
Schnorr Signature G1

n Script size Gas usage CPU usage Memory usage
----------------------------------------------------------------------
- 425 (2.6%) 534466472 (5.3%) 265206 (1.9%)
- 420 (2.6%) 344615074 (3.4%) 264670 (1.9%)
Wow!
({cpu: 252572171 | mem: 884340})
So damn tedious to review those golden files. I wonder if we can do better somehow. Don't have any ideas though, there are so many changes here that even putting them into a single file and using `--word-diff` wouldn't help much.
If the filenames said `budget` somewhere we could probably write a shell script that you could locally pipe the git diff through to get some kind of summary.
({cpu: 536014965 | mem: 1909052})
3% improvement, not bad.
( ToBuiltinMeaning uni fun
, MonadQuote m
, HasUnique name TermUnique
, Ord name
I know it's unrelated to your PR, but I bet `Ord name` is here for the wrong reason. I'll take a quick look.
Yes, this has been discussed before but hasn't been fixed.
@@ -2,38 +2,38 @@
1.1.0
[
[
(lam s_800 [ s_800 s_800 ])
(lam s_836 [ s_836 s_836 ])
Kinda weird that only uniques change in this file and nothing else. CSE performs an additional renaming?
@@ -74,3 +89,10 @@ simplifyTerm opts = simplifyNTimes (_soMaxSimplifierIterations opts)
caseOfCase' = case eqT @fun @DefaultFun of
  Just Refl -> caseOfCase
  Nothing -> id

cseStep :: Int -> Term name uni fun a -> m (Term name uni fun a)
cseStep _ = case (eqT @name @Name, eqT @uni @PLC.DefaultUni) of
Hardcoding `Name` here is kinda weird. Any name should work as long as it has a `TermUnique`.
Agree, or maybe we just need `Hashable name`. In general, I think using `Typeable` to make certain passes only work when the type parameters are instantiated a particular way is pretty bad.
Maybe it is needed for Hashable? 🤷♂️
I think the current implementation of CSE only works correctly for `Name`. We can't just do the substitutions that we do for `DeBruijn` terms.
What about requiring `HasUnique`? That's usually what we do in such situations, and it's properly polymorphic instead of this abomination.
What about requiring HasUnique?
I also need `freshName`...
Okay, then just monomorphise to `Name` and `TyName`. The PIR compiler in fact runs on those types, and if that's the truth you might as well say it.
Bummer, and I was looking forward to generalizing PIR to work with any name/tyname. But of course this is not the only place where we fix to Name/TyName. Other places that I can think of are the PIR typechecker and the rewrite rules.
UPD: no, sorry, not everything, I didn't review all the golden files as it's infeasible.
, MonadQuote m
, HasUnique name TermUnique
, Ord name
, Typeable name
As I commented on your other PR, I don't see why we can't just have a `Hashable name` constraint?
Probably you are right; let me try and see if it works.
Oh, I remember why I didn't do that: the current implementation of CSE only works correctly for `Name`. We can't just do the substitutions that we do for `DeBruijn` terms.
HasUnique?
This looks like a clear win in terms of execution (cpu&mem), and has mostly a positive effect on the script size as well. Nice work!
we made are:

- No alpha equivalence check, i.e., `\x -> x` and `\y -> y` are considered different expressions.
- The builtin function arity information is approximate: rather than using the accurate arities,
Wouldn't it be easier to provide such arity information ourselves somewhere, for example, inside `BuiltinsInfo`?
See #5634 (comment)
The CSE candidates are then processed in descending order of their `termSize`s. For each CSE candidate, we generate a fresh variable, create a LamAbs for it under its path, and substitute it for all occurrences in the original term whose paths are descendants (or self) of the candidate's path. The order matters because a bigger expression may contain a smaller subexpression.
Does this really matter? In the below example, wouldn't it lead to a similar result irrespective of the order of CSE'ing?
No, because the bigger CS contains the smaller CS. You can do it the other way around, but then you not only need to perform substitution in the original term, but also in the CSs themselves.
You can do it the other way around, but then you not only need to perform substitution in the original term, but also in the CS themselves
But you do perform substitution in CSs? `1+(2+x)` is pulled out and then becomes `1+cse2`. It seems like it's a matter of where to add new bindings: with bigger CSs first you put them at the beginning of the sequence of new bindings, and with smaller CSs first you put them at the end.
But you do perform substitution in CSs?
No, what I meant to say: if we process `1+(2+x)` first, then we only need to perform substitution in the original term itself. On the other hand, if we process `2+x` first, we must also perform substitution in the CSs, i.e., turning `1+(2+x)` into `1+cse2`. Otherwise, when we process `1+(2+x)` later, we will not find any `2+x` in the original term.
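The ordering argument can be sketched on a toy expression type (everything below is hypothetical illustration, not the PR's code, and it only rewrites occurrences rather than creating the actual LamAbs bindings): processing candidates biggest-first means each substitution only ever runs against the current term.

```haskell
import Data.List (sortOn)
import Data.Ord (Down (..))

-- A toy expression type; `Add` stands in for a builtin application.
data Expr = Var String | Lit Integer | Add Expr Expr
  deriving (Eq, Show)

-- Term size, used to order the CSE candidates.
size :: Expr -> Int
size (Add a b) = 1 + size a + size b
size _ = 1

-- Replace every occurrence of `old` with `new`.
substitute :: Expr -> Expr -> Expr -> Expr
substitute old new e
  | e == old = new
substitute old new (Add a b) = Add (substitute old new a) (substitute old new b)
substitute _ _ e = e

-- Process candidates in descending size order, naming them cse1, cse2, ...
cseNaive :: [Expr] -> Expr -> Expr
cseNaive candidates t0 =
  foldl
    (\t (i, c) -> substitute c (Var ("cse" ++ show i)) t)
    t0
    (zip [1 :: Int ..] (sortOn (Down . size) candidates))
```

With candidates `2+x` and `1+(2+x)`, the larger one is rewritten to `cse1` first, after which the remaining `2+x` occurrences become `cse2`. Going smallest-first instead would also have to rewrite `2+x` inside the recorded larger candidate, exactly as described above.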
To be clear, when you say "perform substitution in the original term itself", you mean "the partially processed term with some candidates hoisted out"? Because you do want to hoist out the instance of `(2+x)` in the RHS of the newly created binding for the first CSE candidate, right?
you mean "the partially processed term with some candidates hoisted out"?
Yes.
of CSE, with a simplifier pass after each iteration of CSE (i.e., the simplifier is run for a
total of 16 times).

Finally, since CSE can change the order or the number of occurrences of effects, it is only run
I think there are certain builtins that are pure (don't throw). So the order or number of occurrences would not matter for those. I may be wrong though.
Yes, we can be more clever here, but the current implementation doesn't do that, since it assumes conservative-optimizations is off.
The exception effect is commutative and idempotent, so it shouldn't be a problem. Are you worried about changing the order of logs?
Yeah, we do worry about log order.
plutus-core/untyped-plutus-core/src/UntypedPlutusCore/Transform/Cse.hs
combinePaths t path = Map.fromListWith (\(x, y) (_, y') -> (x, y + y')) . go . Map.toList
  where
    go ::
      [(Path, (Term name uni fun (Path, ann), Int))] ->
      [(Path, (Term name uni fun (Path, ann), Int))]
    -- The new path is not a descendent-or-self of any existing path.
    go [] = [(path, (t, 1))]
    go ((path', (t', cnt)) : paths)
      -- The new path is an ancestor-or-self of an existing path.
      -- Take over all counts of the existing path, remove the existing path,
      -- and continue.
      | path `isAncestorOrSelf` path' = (path, (t, cnt)) : go paths
      -- The new path is a descendent-or-self of an existing path.
      -- Increment the count for the existing path. There can only be one such
      -- existing path, so we don't need to recurse here.
      | path' `isSuffixOf` path = (path', (t', cnt + 1)) : paths
It feels like going through List for the inner hashmap (`fromList . toList`) when combining paths defeats the purpose of using a hashmap for this inner structure.
The only reason to use a HashMap is the final step where we combine the counts of each path with `Map.fromListWith`.
-- | The first pass. See Note [CSE].
calcBuiltinArity ::
  forall name uni fun ann.
  (Hashable fun) =>
  Term name uni fun ann ->
  HashMap fun Int
calcBuiltinArity = foldrOf termSubtermsDeep go Map.empty
  where
    go :: Term name uni fun ann -> HashMap fun Int -> HashMap fun Int
    go = \case
      t@Apply{}
        | (Builtin _ fun, args) <- splitApplication t ->
            Map.insertWith max fun (length args)
      _ -> id
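For intuition, the approximation amounts to recording, per builtin, the largest number of arguments it is ever applied to anywhere in the term; this can under-approximate the true arity when a builtin is only ever partially applied. A self-contained sketch over a hypothetical toy term type (the real pass folds `go` over `termSubtermsDeep` and uses a `HashMap`; `Data.Map` stands in here):

```haskell
import qualified Data.Map.Strict as Map

-- Toy UPLC-like terms, just enough to show the arity approximation.
data Term fun = Var String | Builtin fun | Apply (Term fun) (Term fun)
  deriving Show

-- Split an application spine: ((f `Apply` a) `Apply` b) ~> (f, [a, b]).
splitApplication :: Term fun -> (Term fun, [Term fun])
splitApplication = go []
  where
    go args (Apply f a) = go (a : args) f
    go args t = (t, args)

-- For each builtin, record the largest number of arguments it is
-- applied to anywhere in the term.
calcBuiltinArity :: Ord fun => Term fun -> Map.Map fun Int
calcBuiltinArity t =
  let (hd, args) = splitApplication t
      sub = Map.unionsWith max (map calcBuiltinArity args)
   in case hd of
        Builtin f -> Map.insertWith max f (length args) sub
        _ -> sub
```

E.g., if `addInteger` appears once fully applied and once partially applied, the map records the larger count seen.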
Again, I would rather have a hardcoded DefaultFun builtin arity :P
Agree with Nikos.
Just reviewed the algorithm description (thanks for that!) first, since that may well spur a bunch of discussion.
since those TH-generated field accessors are not CSE-friendly
Since I guess you've looked at it, can you tell us why not? Note that the accessors are not generated by TH. The TH generates pattern synonyms, and then GHC generates the accessors. So I don't know how easily we can fix them.
This is a simplified (i.e., not fully optimal) implementation of CSE. The two simplifications
we made are:

- No alpha equivalence check, i.e., `\x -> x` and `\y -> y` are considered different expressions.
This seems like a big deal? I think that means we can never CSE any functions together?
Can we simply compute the hash after renaming the expression via `runQuote . rename`, so that the uniques for bound variables start with `0`? Haven't reviewed the algorithm description yet.
At least in this implementation, it doesn't CSE work-free expressions anyway.
Alpha-renaming may perhaps bring a huge benefit, but personally I am OK with bringing it later in a separate PR.
And it is difficult to do for several reasons: (1) what about non-closed terms; (2) substitution is also harder
At least in this implementation, it doesn't CSE work-free expressions anyway.
Well, size is even more important than cpu and mem as people keep telling us, so maybe we're optimizing for the wrong thing here.
(1) what about non-closed terms
What about them?
(2) substitution is also harder
We do uniqueness-preserving substitution at the type level for type checking purposes (I don't understand why though, I find our type checking strategy convoluted for no good reason, but perhaps I'm missing something), the term level isn't any more complicated as far as I understand. Although I appreciate that in CSE case substitution is "backwards" (variables for term, not term for variable) and that may complicate things, but then again you can just give every subterm a unique id and put into the annotation, so that you don't need to check equality of terms at substitution time.
I don't understand why though, I find our type checking strategy convoluted for no good reason, but perhaps I'm missing something
Didn't you write the typechecker? :p
I basically rewrote it, not just wrote it, and I was opposed to using global uniqueness there from before the beginning:
It took me a ridiculous amount of time to weed all the bugs out of the original implementation and to this day I still don't understand why we need global uniqueness there instead of having type values at the type level and reducing them using an abstract machine or whatever like people do.
Another thing that I forgot to comment on:
At least in this implementation, it doesn't CSE work-free expressions anyway.
Such an expression isn't necessarily work-free! You may have `f (\x -> x)`, for example.
we made are:

- No alpha equivalence check, i.e., `\x -> x` and `\y -> y` are considered different expressions.
- The builtin function arity information is approximate: rather than using the accurate arities,
I'll look at the code but this surprises me: we have all the machinery for this (in PIR at least)!
Yes but it's complicated and difficult to port over to UPLC, so I took the lazier approach. I think this might be a good first issue for a new hire.
It's not that complicated IMO. It is a little fiddly because we need to pass `BuiltinsInfo` around, but that's very mechanical.
1. Simplifications
-------------------------------------------------------------------------------

This is a simplified (i.e., not fully optimal) implementation of CSE. The two simplifications
Did you use any references for the implementation? Is this a standard algorithm? What variants do we know of? What does GHC do?
I mentioned the Plutonomy implementation in the doc: https://docs.google.com/document/d/1JsdfO0c4Vj6M04WKiWBespe8_J1DDKZi6mm9DOBmozs/edit. Mine is a very different approach and isn't similar to that at all.
I didn't look too closely at GHC's (since it is for a lazy language, so, just like inlining and other optimizations, its strategy may not be applicable), but the doc discusses it a little bit. GHC has a much more sophisticated way of identifying common subexpressions and testing equality of subexpressions. We can look more into it in the future, but for now my goal is to get a simple implementation in, so that `AsData` can be more performant.
I guess I think we might want to transfer a bit more of that reasoning into the notes?
Ok, I'll expand the notes a bit
The implementation makes several passes on the given term. The first pass collects builtin
arity information as described above.

In the second pass, we assign a unique ID to each `LamAbs`, `Delay`, and each `Case` branch.
Why do we need one for a case branch? A case branch can be an arbitrary term, and if it's a function then there will be a lambda inside it, which you will give an id to?
I guess I'm missing the "abstract picture". What are we trying to do here? Identify unique scopes? or what?
Yeah, I don't know how to describe this precisely, but the purpose is to avoid evaluating something that otherwise may not be evaluated. E.g.,
- If there's a common subexpression in a parent scope and a child scope (e.g., "0.1", and "0.1.2"), then it is safe to perform CSE
- On the other hand, if it is in two scopes that don't have parent-child relationship (e.g., "0.1.2" and "0.1.3"), then it is not safe to perform CSE. Because if you do perform CSE, the new "binding" has to be in scope "0.1", but it is possible that "0.1" is evaluated without evaluating either "0.1.2" or "0.1.3".
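One way to picture this (purely illustrative; the path representation below is an assumption, though the `combinePaths` snippet quoted earlier does use `isSuffixOf` for this check): if a path is the list of enclosing scope IDs with the innermost scope first, then "ancestor-or-self" is just a suffix test.

```haskell
import Data.List (isSuffixOf)

-- A path as the chain of enclosing scope IDs, innermost first,
-- e.g. scope "0.1.2" is represented as [2, 1, 0]. (Assumed
-- representation for illustration only.)
type Path = [Int]

isAncestorOrSelf :: Path -> Path -> Bool
isAncestorOrSelf p p' = p `isSuffixOf` p'

-- "0.1" encloses "0.1.2": hoisting a shared expression into "0.1"
-- is safe, since the occurrence in "0.1" guarantees it gets evaluated.
safeCase :: Bool
safeCase = [1, 0] `isAncestorOrSelf` [2, 1, 0]

-- "0.1.2" and "0.1.3" are sibling scopes: neither encloses the other,
-- so a shared binding would have to live in "0.1", and "0.1" may be
-- evaluated without entering either sibling.
unsafeCase :: Bool
unsafeCase = [2, 1, 0] `isAncestorOrSelf` [3, 1, 0]
```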
This is a great explanation, please add it to the docs. Also please add there a comment on why it's lambdas, delays and case branches (it's kinda obvious why if the reader thinks about it for a few seconds, but writing it down would save the reader contemplating it).
I think there are maybe three answers?
- They represent places where computation stops, i.e. subexpressions are not immediately evaluated, and as such they shouldn't be hoisted above that point lest they be evaluated too soon.
- They represent places where evaluation may happen multiple times with a different environment (lambdas) or not (delays) i.e. subexpressions may not always be evaluated to the same thing so shouldn't be shared between multiple evaluations.
- They may bind new variables, in which case it's actually incorrect to hoist a subexpression out of one.
I think point 1 is correct. I'll add it to the doc. I don't agree with 2 or 3: if it is incorrect to hoist something (because, e.g., it depends on the lambda-bound variable), then it can't be a common subexpression anyway.
if it is incorrect to hoist something (because, e.g., it depends on the lambda-bound variable), then it can't be a common subexpression anyway.
I guess the problem is the related notions of "common sub-expression" and "structurally identical term". In the presence of non-globally-unique names you might have structurally identical terms that are not in fact identical, which might require this kind of care. Are we assuming global uniqueness?
you might have structurally identical terms that are not in fact identical, which might require this kind of care. Are we assuming global uniqueness?
Hmm, I guess that is indeed possible. I'll make it do renaming before CSE.
And write something!!
The CSE candidates are then processed in descending order of their `termSize`s. For each CSE candidate, we generate a fresh variable, create a LamAbs for it under its path, and substitute it for all occurrences in the original term whose paths are descendants (or self) of
So we do a substitution pass for every CSE candidate? Sounds quadratic.
As discussed during the call it's probably worth saving the substitution in the annotation to process all of them at once in a subsequent pass. I'm fine with that not being in the initial version.
Or you go over the whole program building up a single substitution, but applying the substitution to the new substitution RHSs that you insert. Maybe that's also quadratic, though! It seems less quadratic...
I think it is quadratic only in very pathological cases; and as such I agree we can leave it as it is in the initial version.
Can we make a pathological test case to see?
Can we make a pathological test case to see?
Something like
((1+2) + (3+4) + (5+6) + ... + (999999 + 1000000))
+
((1+2) + (3+4) + (5+6) + ... + (999999 + 1000000))
? Do you have a worse pathological case in mind?
Yes, that looks like a good example!
I added a much smaller test case in 6b9e944.
It's not about CSE, but our test infrastructure simply can't handle anything close to the above term.
Even for a much smaller one, e.g., `(1+2) + (3+4) + ... + (99999 + 100000)` (i.e., more than 20 times smaller than the above term), the test times out without doing CSE or anything else!
the test times out without doing CSE or anything else!
Could you please create a ticket for that just so we have it recorded?
)

In this case, the first two occurrences of `1+2` can be CSE'd, but the third occurrence
cannot. This is ensured by checking the path when substituting `cse1` for `1+2`. The result is
Why not? I think the answer is:
- There is no scope enclosing both of them? If so, this is probably a bug, there should be a scope for "the top level" for sure.
- They have differing paths, corresponding to the two different delays. In this instance, since the scopes come from `delay`s, you could share the value across both (since there isn't a newly introduced variable that they might depend on). However, that might lead to the value being computed sooner than it would otherwise have been, which seems bad.
So, not at all obvious. And I don't know how to reason about it because you haven't told me what your algorithm is trying to do!
However, that might lead to the value being computed sooner than it would otherwise have been, which seems bad.
Ok, this is not a great example because (1+2) appears in both branches, so it will definitely be computed. A better example would be an expression that appears in two of the three case branches, in which case we definitely can't CSE it.
so it will definitely be computed.
But we only know that because we know what `ifThenElse` does. The compiler doesn't have this kind of info; for all it knows, `ifThenElse` takes some arguments and returns something that is potentially completely unrelated to the arguments, in which case the `delay`s may never get evaluated even though they appear in both the "branches". We can enhance the compiler with such information of course, but it isn't a given.
I wasn't assuming we knew it would definitely be computed, just that we could lift it anyway (and then get computation too soon).
total of 16 times).

Finally, since CSE can change the order or the number of occurrences of effects, it is only run
when conservative optimization is off.
Can't we make it conservative by only CSE-ing pure expressions?
Yeah we can do that later as an improvement.
Responded to some of the comments. Will address the other comments later.
)

In this case, the first two occurrences of `1+2` can be CSE'd, but the third occurrence
cannot. This is ensured by checking the path when substituting `cse1` for `1+2`. The result is
However, that might lead to the value being computed sooner than it would otherwise have been, which seems bad.
Ok, this is not a great example because (1+2) appears in both branches, so it will definitely be computed. A better example would be an expression that appears in two of the three case branches, in which case we definitely can't CSE it.
of CSE, with a simplifier pass after each iteration of CSE (i.e., the simplifier is run for a
total of 16 times).

Finally, since CSE can change the order or the number of occurrences of effects, it is only run
Yes, we can be more clever here, but the current implementation doesn't do that, since it assumes conservative optimization is off.
total of 16 times).

Finally, since CSE can change the order or the number of occurrences of effects, it is only run
when conservative optimization is off.
Yeah we can do that later as an improvement.
combinePaths t path = Map.fromListWith (\(x, y) (_, y') -> (x, y + y')) . go . Map.toList
  where
    go ::
      [(Path, (Term name uni fun (Path, ann), Int))] ->
      [(Path, (Term name uni fun (Path, ann), Int))]
    -- The new path is not a descendent-or-self of any existing path.
    go [] = [(path, (t, 1))]
    go ((path', (t', cnt)) : paths)
      -- The new path is an ancestor-or-self of an existing path.
      -- Take over all counts of the existing path, remove the existing path,
      -- and continue.
      | path `isAncestorOrSelf` path' = (path, (t, cnt)) : go paths
      -- The new path is a descendent-or-self of an existing path.
      -- Increment the count for the existing path. There can only be one such
      -- existing path, so we don't need to recurse here.
      | path' `isSuffixOf` path = (path', (t', cnt + 1)) : paths
The only reason to use a HashMap is the final step where we combine the counts of each path with `Map.fromListWith`.
plutus-core/untyped-plutus-core/src/UntypedPlutusCore/Transform/Cse.hs
1. Simplifications
-------------------------------------------------------------------------------

This is a simplified (i.e., not fully optimal) implementation of CSE. The two simplifications
I mentioned the Plutonomy implementation in the doc: https://docs.google.com/document/d/1JsdfO0c4Vj6M04WKiWBespe8_J1DDKZi6mm9DOBmozs/edit. Mine is a very different approach and isn't similar to that at all.
I didn't look too closely at GHC's (since it is for a lazy language, so, just like inlining and other optimizations, its strategy may not be applicable), but the doc discusses it a little bit. GHC has a much more sophisticated way of identifying common subexpressions and testing equality of subexpressions. We can look more into it in the future, but for now my goal is to get a simple implementation in, so that `AsData` can be more performant.
This is a simplified (i.e., not fully optimal) implementation of CSE. The two simplifications
we made are:

- No alpha equivalence check, i.e., `\x -> x` and `\y -> y` are considered different expressions.
At least in this implementation, it doesn't CSE work-free expressions anyway.
we made are:

- No alpha equivalence check, i.e., `\x -> x` and `\y -> y` are considered different expressions.
- The builtin function arity information is approximate: rather than using the accurate arities,
Yes but it's complicated and difficult to port over to UPLC, so I took the lazier approach. I think this might be a good first issue for a new hire.
Since I guess you've looked at it, can you tell us why not? Note that the accessors are not generated by TH. The TH generates pattern synonyms, and then GHC generates the accessors. So I don't know how easily we can fix them.
The field accessors from pattern synonyms are less inlined, and thus less CSE-friendly.
-- Manual
(\d -> (\tup -> ...something... (force (force fstPair) tup))
(unConstrData d)
+
(\tup -> ...something_else... (force (force fstPair) tup))
(unConstrData d)
)
-- From pattern synonyms
(\d -> (\f -> f something + f something_else)
(\tup x -> ...x... (force (force fstPair) tup) (unConstrData d))
)
In the latter, `f` is not inlined, and as a result, `unConstrData d` is not a common subexpression, and has to be evaluated multiple times.
GHC doesn't have this problem since it can perform full-laziness, but ours is not a lazy language and we can't do that (unless in some very clever way).
Also I verified that the improvements are due to CSE, not the additional simplifier passes. Increasing the number of simplifier passes from 12 to 16 alone has no effect whatsoever.
( ToBuiltinMeaning uni fun
, MonadQuote m
, HasUnique name TermUnique
, Ord name
Yes this has been discussed before but hasn't been fixed.
, MonadQuote m
, HasUnique name TermUnique
, Ord name
, Typeable name
Probably you are right; let me try and see if it works.
This is a simplified (i.e., not fully optimal) implementation of CSE. The two simplifications
we made are:

- No alpha equivalence check, i.e., `\x -> x` and `\y -> y` are considered different expressions.
And it is difficult to do for several reasons: (1) what about non-closed terms; (2) substitution is also harder
we made are:

- No alpha equivalence check, i.e., `\x -> x` and `\y -> y` are considered different expressions.
- The builtin function arity information is approximate: rather than using the accurate arities,
See #5634 (comment)
The implementation makes several passes on the given term. The first pass collects builtin
arity information as described above.

In the second pass, we assign a unique ID to each `LamAbs`, `Delay`, and each `Case` branch.
Yeah, I don't know how to describe this precisely, but the purpose is to avoid evaluating something that otherwise may not be evaluated. E.g.,
- If there's a common subexpression in a parent scope and a child scope (e.g., "0.1", and "0.1.2"), then it is safe to perform CSE
- On the other hand, if it is in two scopes that don't have parent-child relationship (e.g., "0.1.2" and "0.1.3"), then it is not safe to perform CSE. Because if you do perform CSE, the new "binding" has to be in scope "0.1", but it is possible that "0.1" is evaluated without evaluating either "0.1.2" or "0.1.3".
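This safety condition can be made concrete with a small sketch. The following is a hypothetical illustration (not the PR's exact code) of the reversed-path representation suggested by the `isSuffixOf` checks in the implementation: a path lists scope IDs from the current node back up to the root, innermost first, so "ancestor-or-self" reduces to a suffix test.

```haskell
import Data.List (isSuffixOf)

-- Hypothetical sketch: the scope "0.1.2" is represented as [2,1,0],
-- i.e. innermost scope ID first.
type Path = [Int]

isAncestorOrSelf :: Path -> Path -> Bool
isAncestorOrSelf ancestor descendant = ancestor `isSuffixOf` descendant

-- "0.1" encloses "0.1.2", so sharing a subexpression between them is safe:
--   [1,0] `isAncestorOrSelf` [2,1,0]  gives True
-- "0.1.2" and "0.1.3" are sibling scopes, so sharing is unsafe:
--   [2,1,0] `isAncestorOrSelf` [3,1,0]  gives False
```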
The CSE candidates are then processed in descending order of their `termSize`s. For each CSE
candidate, we generate a fresh variable, create a LamAbs for it under its path, and substitute
it for all occurrences in the original term whose paths are descendents (or self) of
the candidate's path. The order is because a bigger expression may contain a small subexpression.
No because the bigger CS contains the smaller CS.
You can do it the other way around, but then you not only need to perform substitution in the original term, but also in the CS themselves
And this makes me think that we should probably make the inliner "dumber" (that is, inline more things, to expose opportunities for CSE).
There are a lot of comments on this already so maybe I'll wait until those have been addressed before having a proper look.
Rogue thumb.
I also checked and I think it's not quite what you wrote. With the pattern synonym version, all the field accessors use the "matching function" (
See #5636 (comment)
By the way, I retract what I said about "CSE making the compilation of marlowe validator much slower". The marlowe validator already takes about 5s to compile on master on my laptop, and it takes a similar amount of time with CSE (I didn't measure precisely, though)
Only reviewed the description so far. Amazing docs, thank you. I need to meditate on it for a bit.
This is a simplified (i.e., not fully optimal) implementation of CSE. The two simplifications
we made are:

- No alpha equivalence check, i.e., `\x -> x` and `\y -> y` are considered different expressions.
At least in this implementation, it doesn't CSE work-free expressions anyway.
Well, size is even more important than cpu and mem as people keep telling us, so maybe we're optimizing for the wrong thing here.
(1) what about non-closed terms
What about them?
(2) substitution is also harder
We do uniqueness-preserving substitution at the type level for type checking purposes (I don't understand why though, I find our type checking strategy convoluted for no good reason, but perhaps I'm missing something), the term level isn't any more complicated as far as I understand. Although I appreciate that in CSE case substitution is "backwards" (variables for term, not term for variable) and that may complicate things, but then again you can just give every subterm a unique id and put into the annotation, so that you don't need to check equality of terms at substitution time.
- The builtin function arity information is approximate: rather than using the accurate arities,
  we simply use the maximum number of arguments applied to a builtin function in the program
  as the builtin function's arity. The arity information is used to determine whether a builtin
  application is possibly saturated.
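As a rough illustration of what such an approximation pass computes (using a cut-down, hypothetical term type — not the PR's code), one can walk each application spine and record, per builtin, the largest number of arguments it is ever applied to:

```haskell
import qualified Data.Map.Strict as Map

-- A cut-down UPLC-like term type, for illustration only.
data Term fun
  = Var String
  | Builtin fun
  | Apply (Term fun) (Term fun)
  | LamAbs String (Term fun)

-- For each builtin, the maximum number of arguments it is applied to
-- anywhere in the program, used as its approximate arity.
approxArities :: Ord fun => Term fun -> Map.Map fun Int
approxArities = go 0
  where
    -- n counts the arguments accumulated along the current application spine
    go n (Apply f a)   = Map.unionWith max (go (n + 1) f) (go 0 a)
    go n (Builtin fun) = Map.singleton fun n
    go _ (Var _)       = Map.empty
    go _ (LamAbs _ b)  = go 0 b
```

The sketch makes the reviewers' concern visible: the result depends only on how builtins happen to be applied in this particular program, so it can under- or over-estimate the real arity.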
That sounds so hacky.
There is a possibility of under or over estimating the arity of builtins. This makes it more difficult to reason about the correctness of the algorithm.
Good point. Will address this later: https://input-output.atlassian.net/browse/PLT-8659
The implementation makes several passes on the given term. The first pass collects builtin
arity information as described above.

In the second pass, we assign a unique ID to each `LamAbs`, `Delay`, and each `Case` branch.
This is a great explanation, please add it to the docs. Also please add there a comment on why it's lambdas, delays and case branches (it's kinda obvious why if the reader thinks about it for a few seconds, but writing it down would save the reader contemplating it).
In the third pass, we calculate a count for each `(term, path)` pair, where `term` is a
non-workfree term, and `path` is its path. If the same term has two paths, and one is an
ancestor (i.e., prefix) of the other, we increment the count for the ancestor path in both
instances.
Why is `path` a key in that map rather than a value (in a list)?
Because we need to aggregate the counts on each path via `Map.fromListWith`.
Sorry, still don't get it. Can you have a map from hashes of terms to lists of paths, without even any counts, and then post-process each list of paths? If you have `2+x :-> ["0.1", "0.1.2", "0.1.3"]`, then during post-processing you figure out that `"0.1"` is the root and insert a binding at that point, ignoring all the path extensions. Technically, it's not even a list of paths as a value, it's a trie of paths and "post-processing" means simply taking the root of each forest at the beginning of the trie (unless the forest consists only of the root, i.e. the subexpression only occurs once).

UPD: as per my other comment, you don't even need to have a trie there. It's just a list of paths with each path not being an extension of another. When inserting you make sure that the path being inserted isn't an extension of an existing path and if it is, then the shortest one is chosen.
Here's an example. Suppose for `2+x`, we currently have the following counts:

    "0.1.2" -> 10, "0.1.3" -> 20, "0.2.2" -> 30

Now we process path "0.1". After processing it, the counts will become:

    "0.1" -> 10, "0.1" -> 20, "0.2.2" -> 30, "0.1" -> 1

Finally we use `Map.fromListWith` to aggregate them, and get

    "0.1" -> 31, "0.2.2" -> 30

Does that make sense?
Does that make sense?
Yes, it's what I was suggesting as well. I just though you kept the old paths too. I suppose I misread the documentation, but maybe you could make it clear there that the old paths aren't kept.
Actually paths indeed don't need to be map keys... I'll update the code
The inner HashMap (with Path as key) is replaced with a list in b32cd3c, because in my example above, we can easily go from step 1 directly to step 3, bypassing step 2.
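The absorb-or-bump step discussed in this thread can be sketched as follows. This is a hypothetical reconstruction (not the PR's exact code) using the reversed-path representation, where e.g. "0.1" is `[1,0]`:

```haskell
import Data.List (isSuffixOf)
import qualified Data.Map.Strict as Map

-- Paths are innermost-scope-first, e.g. "0.1" is [1,0].
type Path = [Int]

-- Record one more occurrence at `path`: absorb the counts of any existing
-- descendant paths, or bump the count of an existing enclosing path.
addOccurrence :: Path -> [(Path, Int)] -> [(Path, Int)]
addOccurrence path = go
  where
    go [] = [(path, 1)]
    go ((path', cnt) : rest)
      | path `isSuffixOf` path' = (path, cnt) : go rest   -- absorb a descendant
      | path' `isSuffixOf` path = (path', cnt + 1) : rest -- bump an ancestor
      | otherwise               = (path', cnt) : go rest

-- Sum up the counts that ended up on the same path.
aggregate :: [(Path, Int)] -> Map.Map Path Int
aggregate = Map.fromListWith (+)
```

Running this on the counts from the example above ("0.1.2" -> 10, "0.1.3" -> 20, "0.2.2" -> 30) with a new occurrence at "0.1" and then aggregating yields "0.1" -> 31 and "0.2.2" -> 30, matching the worked example.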
The CSE candidates are then processed in descending order of their `termSize`s. For each CSE
candidate, we generate a fresh variable, create a LamAbs for it under its path, and substitute
it for all occurrences in the original term whose paths are descendents (or self) of
As discussed during the call it's probably worth saving the substitution in the annotation to process all of them at once in a subsequent pass. I'm fine with that not being in the initial version.
, (2+x) + (3+x)
, 4+x
]
) (1+(2+x))
I think this demonstrates a deficiency with the approach. You're creating an extra lambda and then you dive inside of it and pull a subexpression out without really accounting for the lambda (or at least that's what I got from the description of the algorithm). Which is entirely fine! Because the lambda is an immediately applied one, it's always going to be evaluated, hence we can simply ignore it. But then shouldn't we handle such lambdas the same way regardless of whether they were originally in the program or were added by CSE?
Yeah, this seems a little tricky. What if the new CSE term now includes `cse1`? The term below hoists `cse2` outside the scope of `cse1`! At least it should probably be inside. I notice the examples are inconsistent: below there is an example where we do put `cse2` inside, which seems more correct.
I guess that if we do that it's safe, but the argument is non-trivial:

- Suppose we have two CSE candidates `cse1` and `cse2`, where `cse1` occurs within `cse2`.
- Then if we decide to hoist `cse1`, we must do so to the level indicated by the path to `cse2`, since `cse1` is a subexpression of `cse2`.
- We thus in theory need to extend the path for `cse2` by one step for the new binding, but since it occurs right under the previous last step, we can just use the old path so long as we put the binding for `cse2` under the other CSE bindings.
I'm not even sure if that is correct. I hope we have some tests for this and we need a careful argument!
The term below hoists cse2 outside the scope of cse1! At least it should probably be inside.
Seems correct to me? Not sure which term you're referring to. I've been ignoring indices in the names, only looking at semantics.
But then shouldn't we handle such lambdas the same way regardless of whether they were originally in the program or were added by CSE?
Good point. What I'm not sure is: whether CSE should handle this, or should we implement another optimization, that floats terms inside an immediately-applied lambda that don't depend on the lambda-bound variable, outside of the lambda (i.e., similar to full-laziness).
Suppose we have two CSE candidates cse1 and cse2, where cse1 occurs within cse2.
That's why the CSE candidates are processed in descending order of size. If we process cse2 first, then there's no problem.
What I'm not sure is: whether CSE should handle this, or should we implement another optimization, that floats terms inside an immediately-applied lambda that don't depend on the lambda-bound variable, outside of the lambda (i.e., similar to full-laziness).
I don't believe it's going to fly. Plutus Core programs tend to have a long prefix of immediately applied lambdas, are you going to extract half a program outside those by picking every single subexpression that doesn't happen to reference some of the variables? Plus it'll create bindings for things that we don't really want to create bindings for, like what's the point of turning `(\x -> 1 + 2 + x) e` into `(\y -> (\x -> y + x) e) (1 + 2)` -- no work is saved and some evaluation cycles are wasted on creating a variable. And each such transformation only creates more immediately applied lambdas.
I suspect that full-laziness-like transformations only help when you lift a subexpression past a non-applied lambda. CSE is different in that regard: it helps whenever a subexpression occurs more than one time, regardless of whether you're inside an applied lambda or not.
That's why the CSE candidates are processed in descending order of size. If we process cse2 first, then there's no problem.
That's a bit subtle and needs to be written down. I had to think about it for a while. Here's my argument:

By processing CSE candidates in size order, we also implicitly process them in dependency order: if we have two CSE candidates `cse1` and `cse2`, then we cannot end up in a situation where `cse2` would appear in `cse1`, because then the original `cse1` would have had to have been bigger than `cse2`.

I'm actually not sure this is a sound argument, because doing some rounds of CSE can change the size ordering. Can we end up with a situation like this?

- `cse1` < `cse2` < `cse3`
- Pull out `cse1`
- Now `cse3` < `cse2`, and `cse3` appears as a subexpression of `cse2`
- Pull out `cse2`
- Pull out `cse3`, breaking the RHS of `cse2`

Looking at that, in fact I think what we need to be safe is just:

- Bindings for subsequent CSE candidates go inside the previous ones
- We only substitute for a CSE candidate inside the scope of the binding

Which is much simpler and probably holds.

That still implies that this example is wrong and `cse2` should be bound inside the scope of `cse1`.
@michaelpj I don't get your point at all. What do you think is the correct output for this example?

`cse2` is smaller than `cse1`. Shouldn't the smaller one be in the outer scope? I.e., you'd want

    let cse2 = 2 + x in
    let cse1 = 1 + cse2 in body

If you flip it you'd get

    let cse1 = 1 + (2 + x) in
    let cse2 = 2 + x in body

Can we end up with a situation like this?

In your example you are processing CSE candidates in ascending order of size, while I'm proposing descending order.
Okay, I just had it backwards. You're doing them in descending order of size 🤦
Then yes, you need to put each subsequent one outside the other ones. And putting the new ones outside should always be safe.
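In Haskell terms (standing in for UPLC; the binding layout is only a sketch of the ordering agreed on here, with hypothetical `cse1`/`cse2` names), processing the bigger candidate first and placing each subsequent binding outside it gives:

```haskell
-- 1+(2+x) is processed first (bigger), 2+x second (smaller); the smaller
-- candidate's binding therefore ends up outermost, and the bigger one's
-- definition can refer to it.
example :: Int -> Int
example x =
  let cse2 = 2 + x     -- smaller candidate, bound outside
      cse1 = 1 + cse2  -- bigger candidate, rewritten to use cse2
   in cse1 + cse1 + cse2
```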
The CSE candidates are then processed in descending order of their `termSize`s. For each CSE
candidate, we generate a fresh variable, create a LamAbs for it under its path, and substitute
it for all occurrences in the original term whose paths are descendents (or self) of
the candidate's path. The order is because a bigger expression may contain a small subexpression.
You can do it the other way around, but then you not only need to perform substitution in the original term, but also in the CS themselves
But you do perform substitution in CSs? `1+(2+x)` is pulled out and then becomes `1+cse2`. It seems like it's a matter of where to add new bindings: with bigger CS first you put them at the beginning of the sequence of new bindings and with smaller CS first you put them at the end.
)

In this case, the first two occurrences of `1+2` can be CSE'd, but the third occurrence
cannot. This is ensured by checking the path when substituting `cse1` for `1+2`. The result is
so it will definitely be computed.
But we only know that because we know what `ifThenElse` does. The compiler doesn't have this kind of info; for all it knows, `ifThenElse` takes some arguments and returns something that is potentially completely unrelated to the arguments, in which case the `delay`s may never get evaluated even though they appear in both the "branches". We can enhance the compiler with such information of course, but it isn't a given.
Looks good to me. I appreciate the long explanation at the beginning.
I would definitely use the real builtin arity. Calculating it in this way seems messy and adds unnecessary complexity.
The path looks like a good abstraction for the boundaries where CSE may happen. I wonder if having an explicit transformation into a tree of top level, LamAbs, Delay, and Case nodes would make the correctness easier to see, but I guess that reasoning about prefixes is easy enough.
of CSE, with a simplifier pass after each iteration of CSE (i.e., the simplifier is run for a
total of 16 times).

Finally, since CSE can change the order or the number of occurrences of effects, it is only run
The exception effect is commutative and idempotent, so it shouldn't be a problem. Are you worried about changing the order of logs?
- The builtin function arity information is approximate: rather than using the accurate arities,
  we simply use the maximum number of arguments applied to a builtin function in the program
  as the builtin function's arity. The arity information is used to determine whether a builtin
  application is possibly saturated.
There is a possibility of under or over estimating the arity of builtins. This makes it more difficult to reason about the correctness of the algorithm.
Addressed/responded to some comments.
we made are:

- No alpha equivalence check, i.e., `\x -> x` and `\y -> y` are considered different expressions.
- The builtin function arity information is approximate: rather than using the accurate arities,
- The builtin function arity information is approximate: rather than using the accurate arities,
  we simply use the maximum number of arguments applied to a builtin function in the program
  as the builtin function's arity. The arity information is used to determine whether a builtin
  application is possibly saturated.
Good point. Will address this later: https://input-output.atlassian.net/browse/PLT-8659
In the third pass, we calculate a count for each `(term, path)` pair, where `term` is a
non-workfree term, and `path` is its path. If the same term has two paths, and one is an
ancestor (i.e., prefix) of the other, we increment the count for the ancestor path in both
instances.
Because we need to aggregate the counts on each path via `Map.fromListWith`.
Now, each `(term, path)` pair whose count is greater than 1 is a CSE candidate.
In the above example, the CSE candidates are `(2+x, "0.1")` and `(1+(2+x), "0.1")`.
Note that `3+x` is not a CSE candidate, because it has two paths, and neither has a count
greater than 1. `2+` is also not a CSE candidate, because it is workfree.
We could implement a version of CSE that optimizes for size in the future. Some users value size more than cost.
, (2+x) + (3+x)
, 4+x
]
) (1+(2+x))
But then shouldn't we handle such lambdas the same way regardless of whether they were originally in the program or were added by CSE?
Good point. What I'm not sure is: whether CSE should handle this, or should we implement another optimization, that floats terms inside an immediately-applied lambda that don't depend on the lambda-bound variable, outside of the lambda (i.e., similar to full-laziness).
Suppose we have two CSE candidates cse1 and cse2, where cse1 occurs within cse2.
That's why the CSE candidates are processed in descending order of size. If we process cse2 first, then there's no problem.
, MonadQuote m
, HasUnique name TermUnique
, Ord name
, Typeable name
Oh, I remember why I didn't do that: the current implementation of CSE only works correctly for `Name`. We can't just do the substitutions that we do for `DeBruijn` terms.
@@ -74,3 +89,10 @@ simplifyTerm opts = simplifyNTimes (_soMaxSimplifierIterations opts)
    caseOfCase' = case eqT @fun @DefaultFun of
      Just Refl -> caseOfCase
      Nothing -> id

    cseStep :: Int -> Term name uni fun a -> m (Term name uni fun a)
    cseStep _ = case (eqT @name @Name, eqT @uni @PLC.DefaultUni) of
I think the current implementation of CSE only works correctly for `Name`. We can't just do the substitutions that we do for `DeBruijn` terms.
The implementation makes several passes on the given term. The first pass collects builtin
arity information as described above.

In the second pass, we assign a unique ID to each `LamAbs`, `Delay`, and each `Case` branch.
I think point 1 is correct. I'll add it to the doc. I don't agree with 2 or 3: if it is incorrect to hoist something (because, e.g., it depends on the lambda-bound variable), then it can't be a common subexpression anyway.
I've spent some time pondering the general algorithm and I think what this PR does is possible to do better performance-wise. I don't know if we care about that though; maybe for a typical Plutus program the current algorithm is just fine, so feel free to ignore. You check whether one path is an extension of another and do different things depending on the answer, but I think we can avoid this logic entirely. We're going to construct a DAG where each node is a …

\x y -> (1+(2+x))
        +
        (case y [ (1+(2+x)) + (3+x)
                , (2+x) + (3+x)
                , 4+x
                ])

… Similarly, … Each time you enter a new scope, you start from an empty … The actual implementation may avoid materializing DAGs and just add some in-place logic doing the same.

UPD: well, I guess it's not even a DAG, just a rose tree. Fundamentally, the idea of your algorithm is reader-based, but you handle it with a state-based implementation, and then you need to check path prefixes to account for the extra power of the implementation that you don't use. But as per other comments, maybe that's fine; maybe we want a more general algorithm, so that in the future we can optimize not only for evaluation costs but also for the size of the script, which requires pulling common subexpressions occurring at different paths.
Responded to more comments.
, (2+x) + (3+x)
, 4+x
]
) (1+(2+x))
@michaelpj I don't get your point at all. What do you think is the correct output for this example?
`cse2` is smaller than `cse1`. Shouldn't the smaller one be in the outer scope? I.e., you'd want

let cse2 = 2 + x in
let cse1 = 1 + cse2 in body

If you flip it you'd get

let cse1 = 1 + (2 + x) in
let cse2 = 2 + x in body

Can we end up with a situation like this?
In your example you are processing CSE candidates in ascending order of size, while I'm proposing descending order.
The CSE candidates are then processed in descending order of their `termSize`s. For each CSE
candidate, we generate a fresh variable, create a LamAbs for it under its path, and substitute
it for all occurrences in the original term whose paths are descendants (or self) of
the candidate's path.
Can we make a pathological test case to see?
Something like
((1+2) + (3+4) + (5+6) + ... + (999999 + 1000000))
+
((1+2) + (3+4) + (5+6) + ... + (999999 + 1000000))
? Do you have a worse pathological case in mind?
In the third pass, we calculate a count for each `(term, path)` pair, where `term` is a
non-workfree term, and `path` is its path. If the same term has two paths, and one is an
ancestor (i.e., prefix) of the other, we increment the count for the ancestor path in both
instances.
Here's an example. Suppose for `2+x`, we currently have the following counts:

"0.1.2" -> 10, "0.1.3" -> 20, "0.2.2" -> 30

Now we process path "0.1". After processing it, the counts will become:

"0.1" -> 10, "0.1" -> 20, "0.2.2" -> 30, "0.1" -> 1

Finally we use `Map.fromListWith` to aggregate them, and get

"0.1" -> 31, "0.2.2" -> 30

Does that make sense?
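For concreteness, the aggregation step in this example is just `Map.fromListWith (+)` over path-keyed counts. A minimal sketch (paths as plain strings, counts taken from the example above; names are illustrative, not the PR's code):

```haskell
import qualified Data.Map.Strict as Map

-- The per-occurrence counts for `2+x` after re-keying "0.1.2" and "0.1.3"
-- to their common ancestor path "0.1", plus the fresh count of 1 for the
-- occurrence at "0.1" itself.
countsFor2PlusX :: [(String, Int)]
countsFor2PlusX = [("0.1", 10), ("0.1", 20), ("0.2.2", 30), ("0.1", 1)]

main :: IO ()
main =
  -- fromListWith (+) merges entries that share a key by summing their counts.
  print (Map.toList (Map.fromListWith (+) countsFor2PlusX))
  -- prints [("0.1",31),("0.2.2",30)]
```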
The implementation makes several passes on the given term. The first pass collects builtin
arity information as described above.

In the second pass, we assign a unique ID to each `LamAbs`, `Delay`, and each `Case` branch.
you might have structurally identical terms that are not in fact identical, which might require this kind of care. Are we assuming global uniqueness?

Hmm, I guess that is indeed possible. I'll make it do renaming before CSE.

How exactly does this work? Say a common subexpression …

My implementation can be slightly improved asymptotically: currently for each expression, it is …
1. Simplifications
-------------------------------------------------------------------------------

This is a simplified (i.e., not fully optimal) implementation of CSE. The two simplifications
I guess I think we might want to transfer a bit more of that reasoning into the notes?
The implementation makes several passes on the given term. The first pass collects builtin
arity information as described above.

In the second pass, we assign a unique ID to each `LamAbs`, `Delay`, and each `Case` branch.
And write something!!
The CSE candidates are then processed in descending order of their `termSize`s. For each CSE
candidate, we generate a fresh variable, create a LamAbs for it under its path, and substitute
it for all occurrences in the original term whose paths are descendants (or self) of
the candidate's path.
Yes, that looks like a good example!
, (2+x) + (3+x)
, 4+x
]
) (1+(2+x))
Okay, I just had it backwards. You're doing them in descending order of size 🤦
Then yes, you need to put each subsequent one outside the other ones. And putting the new ones outside should always be safe.
The CSE candidates are then processed in descending order of their `termSize`s. For each CSE
candidate, we generate a fresh variable, create a LamAbs for it under its path, and substitute
it for all occurrences in the original term whose paths are descendants (or self) of
the candidate's path. The order is because a bigger expression may contain a small subexpression.
To be clear, when you say "perform substitution in the original term itself", you mean "the partially processed term with some candidates hoisted out"? Because you do want to hoist out the instance of `(2+x)` in the RHS of the newly created binding for the first CSE candidate, right?
@@ -226,6 +226,35 @@ multiApp = runQuote $ do
  app = mkIterAppNoAnn lam [mkConstant @Integer () 1, mkConstant @Integer () 2, mkConstant @Integer () 3]
  pure app

-- | This is the first example in Note [CSE].
cse1 :: Term Name PLC.DefaultUni PLC.DefaultFun ()
I feel like we've discussed quite a number of examples so far - this should have more than two tests!
I think we need to have a property test that `cse term` is alpha-beta-equal to `term`. I feel like it wouldn't be too hard to extend our alpha-equality checking algorithm to support beta-equality as well, but maybe I'm being foolish. In any case, it's not worth blocking this PR on property tests and I'd be happy to work on those if we think it's a good idea. @zliu41 should I create a ticket?

I do agree with Michael we should have more golden tests here though.
I'll add the third example in the Note, as well as the pathological example.

When you reach a leaf you record that …

Yes, what I propose is one way to check only ancestors without iterating through all possible paths. I used a DAG for the model to be "tangible" (easier to think about, easier to explain), but if you do it directly, then that'll be great too.
The code looks somewhat intimidating, but it's actually surprisingly straightforward. I've reviewed everything and it looks good to me, even though it's not optimal as we've discussed a lot here, but maybe it's efficient enough and we don't care about optimizing the whole thing.
I may have another run to reconcile the description of the algorithm with the implementation, but that shouldn't block the PR.
addToMap ::
  Term name uni fun (Path, ann) ->
  HashMap (Term name uni fun ()) (HashMap Path (Term name uni fun (Path, ann), Int)) ->
  HashMap (Term name uni fun ()) (HashMap Path (Term name uni fun (Path, ann), Int))
Oh, so you never actually put `Term` and `Path` together as a single key. I got confused about that when reading the description of the algorithm. It's probably me, but maybe you can make it more clear there.
-- `isWorkFree` currently doesn't check whether a builtin application is saturated,
-- or whether a term is the (possibly repeated) forcing of a builtin (which should
-- be workfree), so we check it separately.
| isWorkFree t0 || not (isBuiltinSaturated t0) || isForcingBuiltin t0 = id
Forcing isn't workfree! We have deferred unlifting, but not deferred interleaving (see this ticket). It does make sense to pull `force someBuiltin` out, since it does actually compute, and we seem to have given up on the deferred interleaving idea. I think Plutonomy had an optimization of this kind.
^ this comment is an important one. Perhaps you can even remove the entire first pass?
I initially did this and it made things worse, I think because on the one hand we avoid evaluating `force someBuiltin` multiple times, but on the other hand we evaluate `LamAbs` and `Apply` more times, which is more expensive!
we evaluate `LamAbs` and `Apply` more times, which is more expensive!

How come? Were you eta-expanding builtin applications? Or is it just the cost of creating a binding? Sounds strange to me.
How come? Were you eta-expanding builtin applications? Or is it just the cost of creating a binding? Sounds strange to me.
No, it's because when we CSE something, it creates an additional LamAbs and Apply.
plutus-core/untyped-plutus-core/src/UntypedPlutusCore/Transform/Cse.hs
Replied to a few new review comments.
1. Simplifications
-------------------------------------------------------------------------------

This is a simplified (i.e., not fully optimal) implementation of CSE. The two simplifications
Ok, I'll expand the notes a bit
In the third pass, we calculate a count for each `(term, path)` pair, where `term` is a
non-workfree term, and `path` is its path. If the same term has two paths, and one is an
ancestor (i.e., prefix) of the other, we increment the count for the ancestor path in both
instances.
Actually paths indeed don't need to be map keys... I'll update the code
The CSE candidates are then processed in descending order of their `termSize`s. For each CSE
candidate, we generate a fresh variable, create a LamAbs for it under its path, and substitute
it for all occurrences in the original term whose paths are descendants (or self) of
the candidate's path. The order is because a bigger expression may contain a small subexpression.
you mean "the partially processed term with some candidates hoisted out"?
Yes.
I didn't end up doing it, because (1) it isn't enough to check for ancestors, but we also need to check for descendants; (2) in some cases, such as the pathological example I added, the number of paths per expression (two) is much smaller than the depth of the tree, so …

I see. Time complexity wise, for each subexpression …
({cpu: 9927795 |
If only we could do something like this to the Marlowe validators!
@@ -74,3 +89,10 @@ simplifyTerm opts = simplifyNTimes (_soMaxSimplifierIterations opts)
  caseOfCase' = case eqT @fun @DefaultFun of
    Just Refl -> caseOfCase
    Nothing -> id

  cseStep :: Int -> Term name uni fun a -> m (Term name uni fun a)
  cseStep _ = case (eqT @name @Name, eqT @uni @PLC.DefaultUni) of
What about requiring `HasUnique`? That's usually what we do in some situations, and it's properly polymorphic instead of this abomination.
, MonadQuote m
, HasUnique name TermUnique
, Ord name
, Typeable name
HasUnique?
addInteger
  (addInteger
    (addInteger
      (addInteger (addInteger cse cse) cse)
What is going on here? We're not CSE-ing single variables are we? That would definitely not be a win...
We're not CSE-ing single variables are we?
I had it written in my notes to check specifically for that and then failed to actually do that, it's been a while since I felt like a complete moron so much...
Hmm, this seems like a bug, since it applies CSE on partially applied `addInteger`, which it isn't supposed to do.
Actually not a bug - the partially applied `addInteger` is in fact not a builtin.
(\ds -> `$mInts` ds (\ds ds ds ds -> ds) (\void -> error)))
  (int d))
(\ds -> `$mInts` ds (\ds ds ds ds -> ds) (\void -> error)))
(\scrut cont fail ->
This is actually very nice: we CSE `$mInts ds`, i.e. the partial application.
Oh yeah, my approval only holds as long as …
See Note [CSE] in `Cse.hs` for how it works.

This currently doesn't help `AsData`'s field accessors much, since those TH-generated field accessors are not CSE-friendly. I added a test case with manually written field accessors (see `recordFieldsManual` in `AsData.Budget.Spec`), and its performance is now comparable with `patternMatching`. We can easily generate those more efficient field accessors instead.

Besides `recordFieldsManual`, two unit tests are added in `plutus-core/untyped-plutus-core/test/Transform/Simplify.hs`.