
CSE for UPLC #5634

Merged
merged 12 commits into from
Dec 8, 2023

Conversation

zliu41
Member

@zliu41 zliu41 commented Nov 14, 2023

See Note [CSE] in Cse.hs for how it works.

This currently doesn't help AsData's field accessors much, since those TH-generated field accessors are not CSE-friendly. I added a test case with manually written field accessors (See recordFieldsManual in AsData.Budget.Spec), and its performance is now comparable with patternMatching. We can easily generate those more efficient field accessors instead.

Besides recordFieldsManual, two unit tests are added in plutus-core/untyped-plutus-core/test/Transform/Simplify.hs.

Contributor

@effectfully effectfully left a comment

I've reviewed everything apart from the actual algorithm. I'll create a ticket for myself to review it, since it's definitely going to take me a while. Great docs by the way.


Schnorr Signature G1

n Script size CPU usage Memory usage
----------------------------------------------------------------------
- 425 (2.6%) 534466472 (5.3%) 265206 (1.9%)
- 420 (2.6%) 344615074 (3.4%) 264670 (1.9%)
Contributor

Wow!

({cpu: 252572171
| mem: 884340})
Contributor

So damn tedious to review those golden files. I wonder if we can do better somehow. Don't have any ideas though, there are so many changes here that even putting them into a single file and using --word-diff wouldn't help much.

Contributor

If the filenames said budget somewhere, we could probably write a shell script that you could locally pipe the git diff through to get some kind of summary.

({cpu: 536014965
| mem: 1909052})
Contributor

3% improvement, not bad.

( ToBuiltinMeaning uni fun
, MonadQuote m
, HasUnique name TermUnique
, Ord name
Contributor

I know it's unrelated to your PR, but I bet Ord name is for the wrong reason here. I'll take a quick look.

Member Author

Yes this has been discussed before but hasn't been fixed.

@@ -2,38 +2,38 @@
1.1.0
[
[
(lam s_800 [ s_800 s_800 ])
(lam s_836 [ s_836 s_836 ])
Contributor

Kinda weird that only uniques change in this file and nothing else. CSE performs an additional renaming?

@@ -74,3 +89,10 @@ simplifyTerm opts = simplifyNTimes (_soMaxSimplifierIterations opts)
caseOfCase' = case eqT @fun @DefaultFun of
Just Refl -> caseOfCase
Nothing -> id

cseStep :: Int -> Term name uni fun a -> m (Term name uni fun a)
cseStep _ = case (eqT @name @Name, eqT @uni @PLC.DefaultUni) of
Contributor

Hardcoding Name here is kinda weird. Any name should work as long as it has a TermUnique.

Contributor

Agree, or maybe we just need Hashable name. In general, I think using Typeable to make certain passes only work when the type parameters are instantiated a particular way is pretty bad.

Contributor

Maybe it is needed for Hashable? 🤷‍♂️

Member Author

I think the current implementation of CSE only works correctly for Name. We can't just do the substitutions that we do for DeBruijn terms.

Contributor

What about requiring HasUnique? That's usually what we do in some situations, and it's properly polymorphic instead of this abomination.

Member Author

What about requiring HasUnique?

I also need freshName...

Contributor

Okay, then just monomorphise to Name and TyName. The PIR compiler in fact runs on those types, and if that's the truth you might as well say it.

Contributor

Bummer, and I was looking forward to generalizing PIR to work with any name/tyname. But of course this is not the only place where we fix to Name/TyName. Other places that I can think of are the PIR typechecker and the rewrite rules.

@effectfully
Contributor

UPD: no, sorry, not everything, I didn't review all the golden files as it's infeasible.

, MonadQuote m
, HasUnique name TermUnique
, Ord name
, Typeable name
Contributor

As I commented on your other PR, I don't see why we can't just have a Hashable name constraint?

Member Author

Probably you are right; let me try and see if it works.

Member Author

Oh, I remember why I didn't do that: the current implementation of CSE only works correctly for Name. We can't just do the substitutions that we do for DeBruijn terms.

Contributor

HasUnique?

Contributor

@bezirg bezirg left a comment

This looks like a clear win in terms of execution (cpu&mem), and has mostly a positive effect on the script size as well. Nice work!

we made are:

- No alpha equivalence check, i.e., `\x -> x` and `\y -> y` are considered different expressions.
- The builtin function arity information is approximate: rather than using the accurate arities,
Contributor

Wouldn't it be easier to provide such arity information ourselves somewhere, for example, inside BuiltinsInfo?

Member Author

Comment on lines +84 to +87
The CSE candidates are then processed in descending order of their `termSize`s. For each CSE
candidate, we generate a fresh variable, create a LamAbs for it under its path, and substitute
it for all occurrences in the original term whose paths are descendents (or self) of
the candidate's path. The order is because a bigger expression may contain a small subexpression.
Contributor

Does this really matter? In the below example, wouldn't it lead to a similar result irrespective of the order of CSE'ing?

Member Author

No, because the bigger CS contains the smaller CS.
You can do it the other way around, but then you need to perform substitution not only in the original term, but also in the CSs themselves.

Contributor

You can do it the other way around, but then you not only need to perform substitution in the original term, but also in the CS themselves

But you do perform substitution in CSs? 1+(2+x) is pulled out and then becomes 1+cse2. It seems like it's a matter of where to add new bindings: with bigger CS first you put them at the beginning of the sequence of new bindings and with smaller CS first you put them at the end.

Member Author

But you do perform substitution in CSs?

No, what I meant to say: if we process 1+(2+x) first, then we only need to perform substitution in the original term itself. On the other hand, if we process 2+x first, we must also perform substitution in the CSs, i.e., turning 1+(2+x) into 1+cse2. Otherwise, when we process 1+(2+x) later, we will not find any 2+x in the original term.
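The ordering argument above can be sketched with a toy expression type (a hypothetical AST for illustration, not the plutus-core types):

```haskell
-- Hypothetical toy AST, not the plutus-core types.
data E = N Int | V String | Add E E | CseVar Int
  deriving (Eq, Show)

-- Replace every occurrence of pat in a term with CseVar i.
subst :: E -> Int -> E -> E
subst pat i e
  | e == pat = CseVar i
subst pat i (Add a b) = Add (subst pat i a) (subst pat i b)
subst _ _ e = e

-- (1+(2+x)) + ((1+(2+x)) + (2+x))
small, big, term :: E
small = Add (N 2) (V "x")
big   = Add (N 1) small
term  = Add big (Add big small)

-- Processing the bigger candidate first leaves the smaller one still
-- present verbatim in the residual term, so a later pass can find it:
afterBig :: E
afterBig = subst big 1 term
-- afterBig == Add (CseVar 1) (Add (CseVar 1) (Add (N 2) (V "x")))
```

Had `small` been substituted first, the literal `big` would no longer occur anywhere, which is why substituting small-first would also require rewriting the CSs themselves.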

Contributor

To be clear, when you say "perform substitution in the original term itself", you mean "the partially processed term with some candidates hoisted out"? Because you do want to hoist out the instance of (2+x) in the RHS of the newly created binding for the first CSE candidate, right?

Member Author

you mean "the partially processed term with some candidates hoisted out"?

Yes.

of CSE, with a simplifier pass after each iteration of CSE (i.e., the simplifier is run for a
total of 16 times).

Finally, since CSE can change the order or the number of occurrences of effects, it is only run
Contributor

I think there are certain builtins that are pure (don't throw). So the order or number of occurrences would not matter for those. I may be wrong though.

Member Author

Yes, we can be more clever here, but the current implementation doesn't do that, since it assumes conservative-optimizations is off.

Contributor

The exception effect is commutative and idempotent, so it shouldn't be a problem. Are you worried about changing the order of logs?

Contributor

Yeah, we do worry about log order.

Comment on lines 322 to 337
combinePaths t path = Map.fromListWith (\(x, y) (_, y') -> (x, y + y')) . go . Map.toList
where
go ::
[(Path, (Term name uni fun (Path, ann), Int))] ->
[(Path, (Term name uni fun (Path, ann), Int))]
-- The new path is not a descendent-or-self of any existing path.
go [] = [(path, (t, 1))]
go ((path', (t', cnt)) : paths)
-- The new path is an ancestor-or-self of an existing path.
-- Take over all counts of the existing path, remove the existing path,
-- and continue.
| path `isAncestorOrSelf` path' = (path, (t, cnt)) : go paths
-- The new path is a descendent-or-self of an existing path.
-- Increment the count for the existing path. There can only be one such
-- existing path, so we don't need to recurse here.
| path' `isSuffixOf` path = (path', (t', cnt + 1)) : paths
Contributor

It feels like going through a list for the inner hashmap (fromList . toList) when combining paths defeats the purpose of using a hashmap for this inner structure.

Member Author

The only reason to use a HashMap is the final step where we combine the counts of each path with Map.fromListWith.

Comment on lines +219 to +232
-- | The first pass. See Note [CSE].
calcBuiltinArity ::
forall name uni fun ann.
(Hashable fun) =>
Term name uni fun ann ->
HashMap fun Int
calcBuiltinArity = foldrOf termSubtermsDeep go Map.empty
where
go :: Term name uni fun ann -> HashMap fun Int -> HashMap fun Int
go = \case
t@Apply{}
| (Builtin _ fun, args) <- splitApplication t ->
Map.insertWith max fun (length args)
_ -> id
Contributor

Again, I would rather have a hardcoded DefaultFun builtinarity :P

Contributor

Agree with Nikos.
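The approximation computed by the quoted `calcBuiltinArity` pass (for each builtin, the maximum number of arguments it is applied to anywhere in the term) can be sketched over a flat list of call sites; a toy version, using only `Data.Map` from containers (hypothetical names, not the plutus-core code):

```haskell
import qualified Data.Map as Map

-- Toy sketch of the arity approximation: for each builtin name,
-- keep the maximum argument count seen at any call site.
approxArities :: [(String, Int)] -> Map.Map String Int
approxArities = foldr (\(fun, n) -> Map.insertWith max fun n) Map.empty

-- approxArities [("addInteger", 2), ("addInteger", 1), ("ifThenElse", 3)]
--   == Map.fromList [("addInteger", 2), ("ifThenElse", 3)]
```

This mirrors the `Map.insertWith max` step in the real pass; the real one folds over term subterms via `splitApplication` instead of a prepared list.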

plutus-tx-plugin/plutus-tx-plugin.cabal (outdated, resolved)
plutus-benchmark/plutus-benchmark.cabal (outdated, resolved)
Contributor

@michaelpj michaelpj left a comment

Just reviewed the algorithm description (thanks for that!) first, since that may well spur a bunch of discussion.

since those TH-generated field accessors are not CSE-friendly

Since I guess you've looked at it, can you tell us why not? Note that the accessors are not generated by TH. The TH generates pattern synonyms, and then GHC generates the accessors. So I don't know how easily we can fix them.


This is a simplified (i.e., not fully optimal) implementation of CSE. The two simplifications
we made are:

- No alpha equivalence check, i.e., `\x -> x` and `\y -> y` are considered different expressions.
Contributor

This seems like a big deal? I think that means we can never CSE any functions together?

Contributor

@effectfully effectfully Nov 14, 2023

Can we simply compute the hash after renaming the expression via runQuote . rename so that the uniques for bound variables start with 0? Haven't reviewed the algorithm description yet.
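The renaming idea can be sketched on a toy lambda calculus (hypothetical types for illustration, not `runQuote`/`rename` from plutus-core): canonicalise bound names by binding depth so that alpha-equivalent terms compare, and therefore hash, equal.

```haskell
data Expr = Var String | Lam String Expr | App Expr Expr
  deriving (Eq, Show)

-- Rename each binder to a canonical name derived from its depth,
-- so `\x -> x` and `\y -> y` normalise to the same term.
-- Free variables keep their original names.
alphaNorm :: Expr -> Expr
alphaNorm = go []
  where
    go env (Var x)   = Var (maybe x id (lookup x env))
    go env (Lam x b) = let x' = "v" ++ show (length env)
                       in Lam x' (go ((x, x') : env) b)
    go env (App f a) = App (go env f) (go env a)

-- alphaNorm (Lam "x" (Var "x")) == alphaNorm (Lam "y" (Var "y"))
```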

Member Author

At least in this implementation, it doesn't CSE work-free expressions anyway.

Contributor

Alpha-renaming may well bring a huge benefit, but personally I am OK with bringing it in later in a separate PR.

Member Author

And it is difficult to do for several reasons: (1) what about non-closed terms; (2) substitution is also harder

Contributor

At least in this implementation, it doesn't CSE work-free expressions anyway.

Well, size is even more important than cpu and mem as people keep telling us, so maybe we're optimizing for the wrong thing here.

(1) what about non-closed terms

What about them?

(2) substitution is also harder

We do uniqueness-preserving substitution at the type level for type checking purposes (I don't understand why though, I find our type checking strategy convoluted for no good reason, but perhaps I'm missing something), the term level isn't any more complicated as far as I understand. Although I appreciate that in CSE case substitution is "backwards" (variables for term, not term for variable) and that may complicate things, but then again you can just give every subterm a unique id and put into the annotation, so that you don't need to check equality of terms at substitution time.

Contributor

I don't understand why though, I find our type checking strategy convoluted for no good reason, but perhaps I'm missing something

Didn't you write the typechecker? :p

Contributor

I basically rewrote it, not just wrote it, and I was opposed to using global uniqueness there from before the beginning:

[Screenshot from 2023-11-16 11-45-15]

It took me a ridiculous amount of time to weed all the bugs out of the original implementation and to this day I still don't understand why we need global uniqueness there instead of having type values at the type level and reducing them using an abstract machine or whatever like people do.

Contributor

Another thing that I forgot to comment on:

At least in this implementation, it doesn't CSE work-free expressions anyway.

Such an expression isn't necessarily work-free! You may have f (\x -> x) for example.

we made are:

- No alpha equivalence check, i.e., `\x -> x` and `\y -> y` are considered different expressions.
- The builtin function arity information is approximate: rather than using the accurate arities,
Contributor

I'll look at the code but this surprises me: we have all the machinery for this (in PIR at least)!

Member Author

Yes but it's complicated and difficult to port over to UPLC, so I took the lazier approach. I think this might be a good first issue for a new hire.

Contributor

It's not that complicated IMO. It is a little fiddly because we need to pass BuiltinsInfo around, but that's very mechanical.

Member Author

1. Simplifications
-------------------------------------------------------------------------------

This is a simplified (i.e., not fully optimal) implementation of CSE. The two simplifications
Contributor

Did you use any references for the implementation? Is this a standard algorithm? What variants do we know of? What does GHC do?

Member Author

I mentioned the Plutonomy implementation in the doc: https://docs.google.com/document/d/1JsdfO0c4Vj6M04WKiWBespe8_J1DDKZi6mm9DOBmozs/edit. Mine is a very different approach and isn't similar to that at all.

I didn't look too closely at GHC's (since it is for a lazy language, so, just like inlining and other optimizations, its strategy may not be applicable), but the doc discusses it a little bit. GHC has a much more sophisticated way of identifying common subexpressions and testing equality of subexpressions. We can look more into it in the future, but for now my goal is to get a simple implementation in, so that AsData can be more performant.

Contributor

I guess I think we might want to transfer a bit more of that reasoning into the notes?

Member Author

Ok, I'll expand the notes a bit

The implementation makes several passes on the given term. The first pass collects builtin
arity information as described above.

In the second pass, we assign a unique ID to each `LamAbs`, `Delay`, and each `Case` branch.
Contributor

Why do we need one for a case branch? A case branch can be an arbitrary term, and if it's a function then there will be a lambda inside it, which you will give an id to?

Contributor

I guess I'm missing the "abstract picture". What are we trying to do here? Identify unique scopes? or what?

Member Author

Yeah, I don't know how to describe this precisely, but the purpose is to avoid evaluating something that otherwise may not be evaluated. E.g.,

  • If there's a common subexpression in a parent scope and a child scope (e.g., "0.1", and "0.1.2"), then it is safe to perform CSE
  • On the other hand, if it is in two scopes that don't have parent-child relationship (e.g., "0.1.2" and "0.1.3"), then it is not safe to perform CSE. Because if you do perform CSE, the new "binding" has to be in scope "0.1", but it is possible that "0.1" is evaluated without evaluating either "0.1.2" or "0.1.3".
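The scope condition above can be sketched with paths as lists of scope IDs, innermost first, so that "is ancestor" becomes a suffix check (a toy mirror of the `isAncestorOrSelf`/`isSuffixOf` logic in the quoted code, not the actual `Path` type):

```haskell
import Data.List (isSuffixOf)

-- Toy paths: innermost scope ID first, e.g. [2,1,0] for scope "0.1.2".
type Path = [Int]

-- Sharing two occurrences is safe only when one scope encloses the
-- other (or they coincide); for sibling scopes the shared binding
-- would have to live in the parent, which may be evaluated without
-- evaluating either sibling.
safeToShare :: Path -> Path -> Bool
safeToShare p q = p `isSuffixOf` q || q `isSuffixOf` p

-- safeToShare [1,0] [2,1,0]   == True   ("0.1" encloses "0.1.2")
-- safeToShare [2,1,0] [3,1,0] == False  (siblings "0.1.2" and "0.1.3")
```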

Contributor

This is a great explanation, please add it to the docs. Also please add there a comment on why it's lambdas, delays and case branches (it's kinda obvious why if the reader thinks about it for a few seconds, but writing it down would save the reader contemplating it).

Contributor

I think there are maybe three answers?

  1. They represent places where computation stops, i.e. subexpressions are not immediately evaluated, and as such they shouldn't be hoisted above that point lest they be evaluated too soon.
  2. They represent places where evaluation may happen multiple times with a different environment (lambdas) or not (delays) i.e. subexpressions may not always be evaluated to the same thing so shouldn't be shared between multiple evaluations.
  3. They may bind new variables, in which case it's actually incorrect to hoist a subexpression out of one.

Member Author

I think point 1 is correct. I'll add it to the doc. I don't agree with 2 or 3: if it is incorrect to hoist something (because, e.g., it depends on the lambda-bound variable), then it can't be a common subexpression anyway.

Contributor

if it is incorrect to hoist something (because, e.g., it depends on the lambda-bound variable), then it can't be a common subexpression anyway.

I guess the problem is the related notions of "common sub-expression" and "structurally identical term". In the presence of non-globally-unique names you might have structurally identical terms that are not in fact identical, which might require this kind of care. Are we assuming global uniqueness?

Member Author

you might have structurally identical terms that are not in fact identical, which might require this kind of care. Are we assuming global uniqueness?

Hmm, I guess that is indeed possible. I'll make it do renaming before CSE.

Contributor

And write something!!


The CSE candidates are then processed in descending order of their `termSize`s. For each CSE
candidate, we generate a fresh variable, create a LamAbs for it under its path, and substitute
it for all occurrences in the original term whose paths are descendents (or self) of
Contributor

So we do a substitution pass for every CSE candidate? Sounds quadratic.

Contributor

As discussed during the call it's probably worth saving the substitution in the annotation to process all of them at once in a subsequent pass. I'm fine with that not being in the initial version.

Contributor

Or you go over the whole program building up a single substitution, but applying the substitution to the new substitution RHSs that you insert. Maybe that's also quadratic, though! It seems less quadratic...

Member Author

I think it is quadratic only in very pathological cases; and as such I agree we can leave it as it is in the initial version.

Contributor

Can we make a pathological test case to see?

Member Author

Can we make a pathological test case to see?

Something like

((1+2) + (3+4) + (5+6) + ... + (999999 + 1000000))
+
((1+2) + (3+4) + (5+6) + ... + (999999 + 1000000))

? Do you have a worse pathological case in mind?

Contributor

Yes, that looks like a good example!

Member Author

@zliu41 zliu41 Nov 22, 2023

I added a much smaller test case in 6b9e944.

It's not about CSE, but our test infrastructure simply can't handle anything close to the above term.
Even for a much smaller one, e.g., (1+2) + (3+4) + ... + (99999 + 100000) (i.e., more than 20 times smaller than the above term), the test times out without doing CSE or anything else!

Contributor

the test times out without doing CSE or anything else!

Could you please create a ticket for that just so we have it recorded?

)

In this case, the first two occurrences of `1+2` can be CSE'd, but the third occurrence
can not. This is ensured by checking the path when substituting `cse1` for `1+2`. The result is
Contributor

Why not? I think the answer is:

  1. There is no scope enclosing both of them? If so, this is probably a bug, there should be a scope for "the top level" for sure.
  2. They have differing paths, corresponding to the two different delays. In this instance, since the scopes come from delays, you could share the value across both (since there isn't a newly introduced variable that they might depend on). However, that might lead to the value being computed sooner than it would otherwise have been, which seems bad.

So, not at all obvious. And I don't know how to reason about it because you haven't told me what your algorithm is trying to do!

Member Author

However, that might lead to the value being computed sooner than it would otherwise have been, which seems bad.

Ok, this is not a great example because (1+2) appears in both branches, so it will definitely be computed. A better example would be an expression that appears in two of the three case branches, in which case we definitely can't CSE it.

Contributor

so it will definitely be computed.

But we only know that because we know what ifThenElse does. The compiler doesn't have this kind of info; for all it knows, ifThenElse takes some arguments and returns something that is potentially completely unrelated to the arguments, in which case the delays may never get evaluated even though they appear in both of the "branches". We can enhance the compiler with such information of course, but it isn't a given.

Contributor

I wasn't assuming we knew it would definitely be computed, just that we could lift it anyway (and then get computation too soon).

total of 16 times).

Finally, since CSE can change the order or the number of occurrences of effects, it is only run
when conservative optimization is off.
Contributor

Can't we make it conservative by only CSE-ing pure expressions?

Copy link
Member Author


Yeah we can do that later as an improvement.

@michaelpj
Copy link
Contributor

Copy link
Member Author

@zliu41 zliu41 left a comment


Responded to some of the comments. Will address the other comments later.

)

In this case, the first two occurrences of `1+2` can be CSE'd, but the third occurrence
can not. This is ensured by checking the path when substituting `cse1` for `1+2`. The result is
Copy link
Member Author


However, that might lead to the value being computed sooner than it would otherwise have been, which seems bad.

Ok, this is not a great example because (1+2) appears in both branches, so it will definitely be computed. A better example would be an expression that appears in two of the three case branches, in which case we definitely can't CSE it.

of CSE, with a simplifier pass after each iteration of CSE (i.e., the simplifier is run for a
total of 16 times).

Finally, since CSE can change the order or the number of occurrences of effects, it is only run
Copy link
Member Author


Yes we can be more clever here but the current implementation doesn't do that since it assumes conservative-optimizations is off

total of 16 times).

Finally, since CSE can change the order or the number of occurrences of effects, it is only run
when conservative optimization is off.
Copy link
Member Author


Yeah we can do that later as an improvement.

Comment on lines 322 to 337
combinePaths t path = Map.fromListWith (\(x, y) (_, y') -> (x, y + y')) . go . Map.toList
where
go ::
[(Path, (Term name uni fun (Path, ann), Int))] ->
[(Path, (Term name uni fun (Path, ann), Int))]
-- The new path is not a descendent-or-self of any existing path.
go [] = [(path, (t, 1))]
go ((path', (t', cnt)) : paths)
-- The new path is an ancestor-or-self of an existing path.
-- Take over all counts of the existing path, remove the existing path,
-- and continue.
| path `isAncestorOrSelf` path' = (path, (t, cnt)) : go paths
-- The new path is a descendent-or-self of an existing path.
-- Increment the count for the existing path. There can only be one such
-- existing path, so we don't need to recurse here.
| path' `isSuffixOf` path = (path', (t', cnt + 1)) : paths
Copy link
Member Author


The only reason to use a HashMap is the final step where we combine the counts of each path with Map.fromListWith.

1. Simplifications
-------------------------------------------------------------------------------

This is a simplified (i.e., not fully optimal) implementation of CSE. The two simplifications
Copy link
Member Author


I mentioned the Plutonomy implementation in the doc: https://docs.google.com/document/d/1JsdfO0c4Vj6M04WKiWBespe8_J1DDKZi6mm9DOBmozs/edit. Mine is a very different approach and isn't similar to that at all.

I didn't look too closely to GHC's (since it is for a lazy language, so, just like inlining and other opimitzations, its strategy may not be applicable), but the doc discusses it a little bit. GHC has a much more sophisticated way of identifying common subexpressions and testing equality of subexpressions. We can look more into it in the future, but for now my goal is to get a simple implementation in, so that AsData can be more performant.

This is a simplified (i.e., not fully optimal) implementation of CSE. The two simplifications
we made are:

- No alpha equivalence check, i.e., `\x -> x` and `\y -> y` are considered different expressions.
Copy link
Member Author


At least in this implementation, it doesn't CSE work-free expressions anyway.
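For illustration, the "work-free" notion discussed here can be sketched on a toy AST. Everything below is hypothetical (the constructor names and the exact set of work-free forms are illustrative, not the real Plutus Core types or predicate):

```haskell
-- Toy UPLC-like AST, purely for illustration (not the real Plutus Core types).
data Term
  = Var String
  | Constant Int
  | Builtin String
  | LamAbs String Term
  | Delay Term
  | Apply Term Term
  | Force Term

-- Evaluating a work-free term does essentially no work, so sharing it saves
-- nothing; such terms are excluded as CSE candidates.
isWorkFree :: Term -> Bool
isWorkFree Var{}      = True
isWorkFree Constant{} = True
isWorkFree Builtin{}  = True
isWorkFree LamAbs{}   = True
isWorkFree Delay{}    = True
isWorkFree _          = False

main :: IO ()
main = print [ isWorkFree (Var "x")
             , isWorkFree (Delay (Var "x"))
             , isWorkFree (Apply (Builtin "addInteger") (Constant 1))
             ]
```

A builtin application is the typical non-work-free case: evaluating it again repeats real work, which is exactly what CSE wants to avoid.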

we made are:

- No alpha equivalence check, i.e., `\x -> x` and `\y -> y` are considered different expressions.
- The builtin function arity information is approximate: rather than using the accurate arities,
Copy link
Member Author


Yes but it's complicated and difficult to port over to UPLC, so I took the lazier approach. I think this might be a good first issue for a new hire.

@zliu41 zliu41 requested a review from mjaskelioff November 14, 2023 21:17
Copy link
Member Author

@zliu41 zliu41 left a comment


Since I guess you've looked at it, can you tell us why not? Note that the accessors are not generated by TH. The TH generates pattern synonyms, and then GHC generates the accessors. So I don't know how easily we can fix them.

The field accessors from pattern synonyms are less inlined, and thus less CSE-friendly.

-- Manual
(\d -> (\tup -> ...something... (force (force fstPair) tup))
       (unConstrData d)
       +
       (\tup -> ...something_else... (force (force fstPair) tup))
       (unConstrData d)
)

-- From pattern synonyms

(\d -> (\f -> f something + f something_else)
       (\tup x -> ...x... (force (force fstPair) tup) (unConstrData d))
)

In the latter, f is not inlined, and as a result, unConstrData d is not a common subexpression, and has to be evaluated multiple times.

GHC doesn't have this problem since it can perform full-laziness, but ours is not a lazy language and we can't do that (unless in some very clever way).

Also I verified that the improvements are due to CSE, not the additional simplifier passes. Increasing the number of simplifier passes from 12 to 16 alone has no effect whatsoever.

plutus-benchmark/plutus-benchmark.cabal (outdated, resolved)
( ToBuiltinMeaning uni fun
, MonadQuote m
, HasUnique name TermUnique
, Ord name
Copy link
Member Author


Yes this has been discussed before but hasn't been fixed.

, MonadQuote m
, HasUnique name TermUnique
, Ord name
, Typeable name
Copy link
Member Author


Probably you are right; let me try and see if it works.

This is a simplified (i.e., not fully optimal) implementation of CSE. The two simplifications
we made are:

- No alpha equivalence check, i.e., `\x -> x` and `\y -> y` are considered different expressions.
Copy link
Member Author


And it is difficult to do for several reasons: (1) what about non-closed terms; (2) substitution is also harder

we made are:

- No alpha equivalence check, i.e., `\x -> x` and `\y -> y` are considered different expressions.
- The builtin function arity information is approximate: rather than using the accurate arities,
Copy link
Member Author


The implementation makes several passes on the given term. The first pass collects builtin
arity information as described above.

In the second pass, we assign a unique ID to each `LamAbs`, `Delay`, and each `Case` branch.
Copy link
Member Author


Yeah, I don't know how to describe this precisely, but the purpose is to avoid evaluating something that otherwise may not be evaluated. E.g.,

  • If there's a common subexpression in a parent scope and a child scope (e.g., "0.1", and "0.1.2"), then it is safe to perform CSE
  • On the other hand, if it is in two scopes that don't have parent-child relationship (e.g., "0.1.2" and "0.1.3"), then it is not safe to perform CSE. Because if you do perform CSE, the new "binding" has to be in scope "0.1", but it is possible that "0.1" is evaluated without evaluating either "0.1.2" or "0.1.3".
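The parent/child-scope condition above can be sketched as a suffix check on paths. This is a minimal sketch under one assumption: paths are stored innermost-scope-first (so the `isSuffixOf` test matches the quoted code); names like `safeToShare` are illustrative, not from the PR:

```haskell
import Data.List (isSuffixOf)

-- A path is a list of scope IDs; assuming innermost-first storage,
-- the path "0.1.2" from the discussion is represented as [2,1,0].
type Path = [Int]

-- An ancestor-or-self path is then a suffix.
isAncestorOrSelf :: Path -> Path -> Bool
isAncestorOrSelf = isSuffixOf

-- Two occurrences may share a binding only if one scope encloses the other;
-- the binding then goes into the enclosing (ancestor) scope.
safeToShare :: Path -> Path -> Bool
safeToShare p q = p `isAncestorOrSelf` q || q `isAncestorOrSelf` p

main :: IO ()
main = do
  print (safeToShare [1,0] [2,1,0])   -- "0.1" vs "0.1.2": safe
  print (safeToShare [2,1,0] [3,1,0]) -- "0.1.2" vs "0.1.3": not safe
```

Sibling scopes ("0.1.2" and "0.1.3") fail the check because the only common ancestor ("0.1") may be evaluated without either sibling being evaluated.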

Comment on lines +84 to +87
The CSE candidates are then processed in descending order of their `termSize`s. For each CSE
candidate, we generate a fresh variable, create a LamAbs for it under its path, and substitute
it for all occurrences in the original term whose paths are descendants (or self) of
the candidate's path. The order is because a bigger expression may contain a small subexpression.
Copy link
Member Author


No because the bigger CS contains the smaller CS.
You can do it the other way around, but then you not only need to perform substitution in the original term, but also in the CS themselves

@zliu41
Copy link
Member Author

zliu41 commented Nov 14, 2023

In the latter, f is not inlined, and as a result, unConstrData d is not a common subexpression, and has to be evaluated multiple times.

And this makes me think that we should probably make the inliner "dumber" (that is, inline more things, to expose opportunities for CSE).

@kwxm
Copy link
Contributor

kwxm commented Nov 15, 2023

There are a lot of comments on this already so maybe I'll wait until those have been addressed before having a proper look.

@kwxm kwxm closed this Nov 15, 2023
@kwxm
Copy link
Contributor

kwxm commented Nov 15, 2023

Rogue thumb.

@kwxm kwxm reopened this Nov 15, 2023
@michaelpj
Copy link
Contributor

The field accessors from pattern synonyms are less inlined, and thus less CSE-friendly.

I also checked and I think it's not quite what you wrote. With the pattern synonym version, all the field accessors use the "matching function" (mInts in this case). The field accessors themselves are inlined, but the matching function is not, because it is used several times, and is quite large. I suspect that's the same situation as we would see with your manually written accessors if they were used multiple times in the program.

@zliu41
Copy link
Member Author

zliu41 commented Nov 15, 2023

I suspect that's the same situation as we would see with your manually written accessors if they were used multiple times in the program.

See #5636 (comment)
Looks like CSE is more effective than I thought!

@zliu41
Copy link
Member Author

zliu41 commented Nov 16, 2023

By the way, I retract what I said about "CSE making the compilation of marlowe validator much slower". The marlowe validator already takes about 5s to compile on master on my laptop, and it takes a similar amount of time with CSE (I didn't measure precisely, though)

Copy link
Contributor

@effectfully effectfully left a comment


Only reviewed the description so far. Amazing docs, thank you. I need to meditate on it for a bit.

This is a simplified (i.e., not fully optimal) implementation of CSE. The two simplifications
we made are:

- No alpha equivalence check, i.e., `\x -> x` and `\y -> y` are considered different expressions.
Copy link
Contributor


At least in this implementation, it doesn't CSE work-free expressions anyway.

Well, size is even more important than cpu and mem as people keep telling us, so maybe we're optimizing for the wrong thing here.

(1) what about non-closed terms

What about them?

(2) substitution is also harder

We do uniqueness-preserving substitution at the type level for type checking purposes (I don't understand why though, I find our type checking strategy convoluted for no good reason, but perhaps I'm missing something), the term level isn't any more complicated as far as I understand. Although I appreciate that in CSE case substitution is "backwards" (variables for term, not term for variable) and that may complicate things, but then again you can just give every subterm a unique id and put into the annotation, so that you don't need to check equality of terms at substitution time.

- The builtin function arity information is approximate: rather than using the accurate arities,
we simply use the maximum number of arguments applied to a builtin function in the program
as the builtin function's arity. The arity information is used to determine whether a builtin
application is possibly saturated.
Copy link
Contributor


That sounds so hacky.

Copy link
Contributor


There is a possibility of under- or over-estimating the arity of builtins. This makes it more difficult to reason about the correctness of the algorithm.

Copy link
Member Author


Good point. Will address this later: https://input-output.atlassian.net/browse/PLT-8659
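The arity approximation being discussed ("the maximum number of arguments applied to a builtin anywhere in the program") can be sketched like this. This is a toy model, not the Cse.hs implementation; the AST and the function name are made up:

```haskell
import qualified Data.Map.Strict as Map

-- Toy AST, just enough to show the idea (not the real UPLC types).
data Term = Var String | Builtin String | Apply Term Term | LamAbs String Term

-- Approximate arity: the longest application spine seen for each builtin
-- anywhere in the program. This can over- or under-estimate the true arity.
approxArities :: Term -> Map.Map String Int
approxArities = go 0
  where
    go n (Apply f a)  = Map.unionWith max (go (n + 1) f) (go 0 a)
    go n (Builtin b)  = Map.singleton b n
    go _ (LamAbs _ b) = go 0 b
    go _ (Var _)      = Map.empty

-- addInteger is applied to two arguments in one place and to one argument in
-- another; the approximation records the maximum, 2.
example :: Term
example =
  Apply (Apply (Builtin "addInteger") (Var "x"))
        (Apply (Builtin "addInteger") (Var "y"))

main :: IO ()
main = print (Map.lookup "addInteger" (approxArities example))
```

This makes the concern concrete: if every application site of a builtin is undersaturated, the estimate is too low, which is why using the real arities (as later tracked in PLT-8659) is preferable.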

The implementation makes several passes on the given term. The first pass collects builtin
arity information as described above.

In the second pass, we assign a unique ID to each `LamAbs`, `Delay`, and each `Case` branch.
Copy link
Contributor


This is a great explanation, please add it to the docs. Also please add there a comment on why it's lambdas, delays and case branches (it's kinda obvious why if the reader thinks about it for a few seconds, but writing it down would save the reader contemplating it).

In the third pass, we calculate a count for each `(term, path)` pair, where `term` is a
non-workfree term, and `path` is its path. If the same term has two paths, and one is an
ancestor (i.e., prefix) of the other, we increment the count for the ancestor path in both
instances.
Copy link
Contributor


Why is path a key in that map rather than a value (in a list)?

Copy link
Member Author


Because we need to aggregate the counts on each path via Map.fromListWith.

Copy link
Contributor

@effectfully effectfully Nov 17, 2023


Sorry, still don't get it. Can you have a map from hashes of terms to lists of paths, without even any counts, and then post-process each list of paths? If you have 2+x :-> ["0.1", "0.1.2", "0.1.3"], then during post-processing you figure out that "0.1" is the root and insert a binding at that point, ignoring all the path extensions. Technically, it's not even a list of paths as a value, it's a trie of paths and "post-processing" means simply taking the root of each forest at the beginning of the trie (unless the forest consists only of the root, i.e. the subexpression only occurs once).

UPD: as per my other comment, you don't even need to have a trie there. It's just a list of paths with each path not being an extension of another. When inserting you make sure that the path being inserted isn't an extension of an existing path and if it is, then the shortest one is chosen.

Copy link
Member Author


Here's an example. Suppose for 2+x, we currently have the following counts:

"0.1.2" -> 10, "0.1.3" -> 20, "0.2.2" -> 30

Now we process path "0.1". After processing it, the counts will become:

"0.1" -> 10, "0.1" -> 20, "0.2.2" -> 30`, "0.1" -> 1

Finally we use Map.fromListWith to aggregate them, and get

"0.1" -> 31, "0.2.2" -> 30

Does that make sense?
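The aggregation step in this example is just `Map.fromListWith (+)`; a runnable version of exactly the numbers above:

```haskell
import qualified Data.Map.Strict as Map

-- Counts for the subexpression 2+x after the descendant paths of "0.1"
-- have been replaced by "0.1" itself (the numbers from the example above).
counts :: [(String, Int)]
counts = [("0.1", 10), ("0.1", 20), ("0.2.2", 30), ("0.1", 1)]

-- fromListWith (+) sums the counts that share a path.
aggregated :: Map.Map String Int
aggregated = Map.fromListWith (+) counts

main :: IO ()
main = print (Map.toList aggregated)  -- prints [("0.1",31),("0.2.2",30)]
```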

Copy link
Contributor


Does that make sense?

Yes, it's what I was suggesting as well. I just though you kept the old paths too. I suppose I misread the documentation, but maybe you could make it clear there that the old paths aren't kept.

Copy link
Member Author


Actually paths indeed don't need to be map keys... I'll update the code

Copy link
Member Author


The inner HashMap (with Path as key) is replaced with a list in b32cd3c, because in my example above, we can easily go from step 1 directly to step 3, bypassing step 2.


The CSE candidates are then processed in descending order of their `termSize`s. For each CSE
candidate, we generate a fresh variable, create a LamAbs for it under its path, and substitute
it for all occurrences in the original term whose paths are descendants (or self) of
Copy link
Contributor


As discussed during the call it's probably worth saving the substitution in the annotation to process all of them at once in a subsequent pass. I'm fine with that not being in the initial version.

, (2+x) + (3+x)
, 4+x
]
) (1+(2+x))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this demonstrates a deficiency with the approach. You're creating an extra lambda and then you dive inside of it and pull a subexpression out without really accounting for the lambda (or at least that's what I got from the description of the algorithm). Which is entirely fine! Because the lambda is an immediately applied one, it's always going to be evaluated, hence we can simply ignore it. But then shouldn't we handle such lambdas the same way regardless of whether they were originally in the program or were added by CSE?

Copy link
Contributor


Yeah, this seems a little tricky. What if the new CSE term now includes cse1? The term below hoists cse2 outside the scope of cse1! At least it should probably be inside. I notice the the examples are inconsistent: below there is an example where we do put cse2 inside, which seems more correct.

I guess that if we do that it's safe, but the argument is non-trivial:

  • Suppose we have two CSE candidates cse1 and cse2, where cse1 occurs within cse2.
  • Then if we decide to hoist cse1, we must do so to the level indicated by the path to cse2, since cse1 is a subexpression of cse2
  • We thus in theory need to extend the path for cse2 by one step for the new binding, but since it occurs right under the previous last step, we can just use the old path so long as we put the binding for cse2 under the other CSE bindings.

I'm not even sure if that is correct. I hope we have some tests for this and we need a careful argument!

Copy link
Contributor


The term below hoists cse2 outside the scope of cse1! At least it should probably be inside.

Seems correct to me? Not sure which term you're referring to. I've been ignoring indices in the names, only looking at semantics.

Copy link
Member Author


But then shouldn't we handle such lambdas the same way regardless of whether they were originally in the program or were added by CSE?

Good point. What I'm not sure about is whether CSE should handle this, or whether we should implement another optimization that floats terms which are inside an immediately-applied lambda, and don't depend on the lambda-bound variable, out of that lambda (i.e., similar to full-laziness).

Suppose we have two CSE candidates cse1 and cse2, where cse1 occurs within cse2.

That's why the CSE candidates are processed in descending order of size. If we process cse2 first, then there's no problem.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I'm not sure is: whether CSE should handle this, or should we implement another optimization, that floats terms inside an immediately-applied lambda that don't depend on the lambda-bound variable, outside of the lambda (i.e., similar to full-laziness).

I don't believe it's going to fly. Plutus Core programs tend to have a long prefix of immediately applied lambdas, are you going to extract half a program outside those by picking every single subexpression that doesn't happen to reference some of the variables? Plus it'll create bindings for things that we don't really want to create bindings for, like what's the point of turning (\x -> 1 + 2 + x) e into (\y -> (\x -> y + x) e) (1 + 2) -- no work is saved and some evaluation cycles are wasted on creating a variable. And each such transformation only creates more immediately applied lambdas.

I suspect that full-laziness-like transformations only help when you lift a subexpression past a non-applied lambda. CSE is different in that regard: it helps whenever a subexpression occurs more than one time, regardless of whether you're inside an applied lambda or not.

Copy link
Contributor


That's why the CSE candidates are processed in descending order of size. If we process cse2 first, then there's no problem.

That's a bit subtle and needs to be written down. I had to think about it for a while. Here's my argument:

By processing CSE candidates in size order, we also implicitly process them in dependency order: if we have two CSE candidates cse1 and cse2, then we cannot end up in a situation where cse2 would appear in cse1, because then the original cse1 would have had to have been bigger than cse2.

I'm actually not sure this is a sound argument, because doing some rounds of CSE can change the size ordering. Can we end up with a situation like this?

  1. cse1 < cse2 < cse3
  2. Pull out cse1
  3. cse3 < cse2, and cse3 appears a subexpression of cse2
  4. Pull out cse2
  5. Pull out cse3, break the RHS of cse2

Looking at that, in fact I think what we need to be safe is just:

  1. Bindings for subsequent CSE candidates go inside the previous ones
  2. We only substitute for a CSE candidate inside the scope of the binding

Which is much simpler and probably holds.

That still implies that this example is wrong and cse2 should be bound inside the scope of cse1.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@michaelpj I don't get your point at all. What do you think is the correct output for this example?

cse2 is smaller than cse1. Shouldn't the smaller one be in the outer scope? I.e., you'd want

let cse2 = 2 + x in
let cse1 = 1 + cse2 in body

If you flip it you'd get

let cse1 = 1 + (2 + x) in
let cse2 = 2 + x in body

Can we end up with a situation like this?

In your example you are processing CSE candidates in ascending order of size, while I'm proposing descending order.
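The ordering argument can be made concrete with a small sketch. This is only a toy model (right-hand sides as strings, a hypothetical `Candidate` type); in the real pass the candidates are terms and the smaller candidate also gets substituted inside the bigger one's right-hand side:

```haskell
import Data.List (sortOn)
import Data.Ord (Down (..))

-- Hypothetical candidate representation, for illustration only.
data Candidate = Candidate { candName :: String, candRhs :: String, candSize :: Int }

-- Bigger candidates are processed first; each later (smaller) binding wraps
-- *outside* the earlier ones, so the smaller term ends up in the outer scope.
bindAll :: [Candidate] -> String -> String
bindAll cands body = foldl wrap body (sortOn (Down . candSize) cands)
  where
    wrap acc c = "let " ++ candName c ++ " = " ++ candRhs c ++ " in " ++ acc

main :: IO ()
main = putStrLn $ bindAll
  [ Candidate "cse2" "2+x" 3
  , Candidate "cse1" "1+(2+x)" 5
  ] "body"
```

The output places `cse2` in the outer scope and `cse1` inside it, matching the `let cse2 = 2 + x in let cse1 = 1 + cse2 in body` shape discussed above (modulo substituting `cse2` into `cse1`'s right-hand side, which this sketch omits).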

Copy link
Contributor


Okay, I just had it backwards. You're doing them in descending order of size 🤦

Then yes, you need to put each subsequent one outside the other ones. And putting the new ones outside should always be safe.

Comment on lines +84 to +87
The CSE candidates are then processed in descending order of their `termSize`s. For each CSE
candidate, we generate a fresh variable, create a LamAbs for it under its path, and substitute
it for all occurrences in the original term whose paths are descendants (or self) of
the candidate's path. The order is because a bigger expression may contain a small subexpression.
Copy link
Contributor


You can do it the other way around, but then you not only need to perform substitution in the original term, but also in the CS themselves

But you do perform substitution in CSs? 1+(2+x) is pulled out and then becomes 1+cse2. It seems like it's a matter of where to add new bindings: with bigger CS first you put them at the beginning of the sequence of new bindings and with smaller CS first you put them at the end.

)

In this case, the first two occurrences of `1+2` can be CSE'd, but the third occurrence
can not. This is ensured by checking the path when substituting `cse1` for `1+2`. The result is
Copy link
Contributor


so it will definitely be computed.

But we only know that because we know what ifThenElse does. The compiler doesn't have this kind of info; for all it knows, ifThenElse takes some arguments and returns something that is potentially completely unrelated to the arguments, in which case the delays may never get evaluated even though they appear in both the "branches". We can enhance the compiler with such information of course, but it isn't a given.

Copy link
Contributor

@mjaskelioff mjaskelioff left a comment


Looks good to me. I appreciate the long explanation at the beginning.
I would definitely use the real builtin arity. Calculating it in this way seems messy and adds unnecessary complexity.
The path looks like a good abstraction for the boundaries where CSE may happen. I wonder if having an explicit transformation into a tree of top level, LamAbs, Delay, and Case nodes would make the correctness easier to see, but I guess that reasoning about prefixes is easy enough.

of CSE, with a simplifier pass after each iteration of CSE (i.e., the simplifier is run for a
total of 16 times).

Finally, since CSE can change the order or the number of occurrences of effects, it is only run
Copy link
Contributor


The exception effect is commutative and idempotent, so it shouldn't be a problem. Are you worried about changing the order of logs?

- The builtin function arity information is approximate: rather than using the accurate arities,
we simply use the maximum number of arguments applied to a builtin function in the program
as the builtin function's arity. The arity information is used to determine whether a builtin
application is possibly saturated.
Copy link
Contributor


There is a possibility of under- or over-estimating the arity of builtins. This makes it more difficult to reason about the correctness of the algorithm.

Copy link
Member Author

@zliu41 zliu41 left a comment


Addressed/responded to some comments.

we made are:

- No alpha equivalence check, i.e., `\x -> x` and `\y -> y` are considered different expressions.
- The builtin function arity information is approximate: rather than using the accurate arities,
Copy link
Member Author


- The builtin function arity information is approximate: rather than using the accurate arities,
we simply use the maximum number of arguments applied to a builtin function in the program
as the builtin function's arity. The arity information is used to determine whether a builtin
application is possibly saturated.
Copy link
Member Author


Good point. Will address this later: https://input-output.atlassian.net/browse/PLT-8659

In the third pass, we calculate a count for each `(term, path)` pair, where `term` is a
non-workfree term, and `path` is its path. If the same term has two paths, and one is an
ancestor (i.e., prefix) of the other, we increment the count for the ancestor path in both
instances.
Copy link
Member Author


Because we need to aggregate the counts on each path via Map.fromListWith.

Now, each `(term, path)` pair whose count is greater than 1 is a CSE candidate.
In the above example, the CSE candidates are `(2+x, "0.1")` and `(1+(2+x), "0.1")`.
Note that `3+x` is not a CSE candidate, because it has two paths, and neither has a count
greater than 1. `2+` is also not a CSE candidate, because it is workfree.
Copy link
Member Author


We could implement a version of CSE that optimizes for size in the future. Some users value size more than cost.

, (2+x) + (3+x)
, 4+x
]
) (1+(2+x))
Copy link
Member Author


But then shouldn't we handle such lambdas the same way regardless of whether they were originally in the program or were added by CSE?

Good point. What I'm not sure about is whether CSE should handle this, or whether we should implement another optimization that floats terms which are inside an immediately-applied lambda, and don't depend on the lambda-bound variable, out of that lambda (i.e., similar to full-laziness).

Suppose we have two CSE candidates cse1 and cse2, where cse1 occurs within cse2.

That's why the CSE candidates are processed in descending order of size. If we process cse2 first, then there's no problem.

, MonadQuote m
, HasUnique name TermUnique
, Ord name
, Typeable name
Copy link
Member Author


Oh, I remember why I didn't do that: the current implementation of CSE only works correctly for Name. We can't just do the substitutions that we do for DeBruijn terms.

@@ -74,3 +89,10 @@ simplifyTerm opts = simplifyNTimes (_soMaxSimplifierIterations opts)
caseOfCase' = case eqT @fun @DefaultFun of
Just Refl -> caseOfCase
Nothing -> id

cseStep :: Int -> Term name uni fun a -> m (Term name uni fun a)
cseStep _ = case (eqT @name @Name, eqT @uni @PLC.DefaultUni) of
Copy link
Member Author


I think the current implementation of CSE only works correctly for Name. We can't just do the substitutions that we do for DeBruijn terms.
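The problem above can be seen on a toy model (the `DB` type below is an assumption for illustration, far simpler than the real UPLC AST): with de Bruijn indices, occurrences of the same variable are represented differently at different binder depths, so a map keyed on term structure misidentifies common subexpressions.

```haskell
-- Toy de Bruijn terms (an assumption, not the Plutus AST): the same
-- variable is represented by different indices at different depths, so
-- structural equality does not coincide with "same subexpression".
data DB = DVar Int | DLam DB | DApp DB DB
  deriving (Eq, Show)

-- \x -> x (\y -> x): the two references to x are DVar 0 and DVar 1.
term :: DB
term = DLam (DApp (DVar 0) (DLam (DVar 1)))
```

Here `DVar 0 /= DVar 1` even though both mean `x`; conversely, structurally equal de Bruijn subterms under different binders can mean different variables. With `Name`-based terms and global uniqueness, structural comparison is reliable.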

The implementation makes several passes on the given term. The first pass collects builtin
arity information as described above.

In the second pass, we assign a unique ID to each `LamAbs`, `Delay`, and each `Case` branch.
Member Author

I think point 1 is correct. I'll add it to the doc. I don't agree with 2 or 3: if it is incorrect to hoist something (because, e.g., it depends on the lambda-bound variable), then it can't be a common subexpression anyway.

@effectfully
Contributor

effectfully commented Nov 17, 2023

I've spent some time pondering the general algorithm and I think what this PR does is possible to do better performance-wise. I don't know if we care about that though; maybe for a typical Plutus program the current algorithm is just fine, so feel free to ignore.

You check whether one path is an extension of another and do different things depending on the answer, but I think we can avoid this logic entirely. We're going to construct a DAG where each node is a HashMap from subexpressions to their counts and each edge is an ID (i.e. a step in a path). If we take your example

\x y -> (1+(2+x))
        +
        (case y [ (1+(2+x)) + (3+x)
                , (2+x) + (3+x)
                , 4+x
                ]
        )

then 2+x, 1+(2+x), case y <...> and 1+(2+x) + case y <...> are all going to be keys in a HashMap stored in a node, with values all being 1.

Similarly, 3+x is going to be stored together with 2+x, 1+(2+x) and (1+(2+x)) + (3+x) in another node, with the ID of the first branch of the case expression being an edge between the previously described node and this one.

Each time you enter a new scope, you start from an empty HashMap and populate it with all the subexpressions appearing at the current path. Once this pass is completed, you have a DAG where all common subexpressions appearing at the same path are accounted for by their count being greater than 1 in the nodes of the graph. It only remains to process common subexpressions that occur at different paths, but since you have a DAG that guarantees that travelling by its edges always gives you an extension of the current path you can do exactly that: travel by the edges, merge the accumulator hashmap with the hashmap of each of the nodes and when you reach a leaf of the AST compute which subexpressions occurring at any point in the current path to pull out and where to put them.

The actual implementation may avoid materializing DAGs and just add some in-place logic doing the same. UPD: well, I guess it's not even a DAG, just a rose tree.
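The rose tree just described can be sketched as follows (all names here are assumptions, and `Data.Map` stands in for the `HashMap` in the discussion):

```haskell
import qualified Data.Map.Strict as Map

-- Each node maps subexpressions (represented as Strings here for
-- simplicity) to their counts at that path; each edge is the ID of the
-- LamAbs/Delay/Case-branch entered to reach the child.
data ScopeTree e = ScopeTree
  { nodeCounts :: Map.Map e Int
  , children   :: Map.Map Int (ScopeTree e)
  }

-- While travelling down an edge, the accumulator is merged with the
-- child's counts, so path extensions never need to be compared explicitly.
mergeCounts :: Ord e => Map.Map e Int -> Map.Map e Int -> Map.Map e Int
mergeCounts = Map.unionWith (+)
```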

Fundamentally, the idea of your algorithm is reader-based, but you handle it with a state-based implementation and then you need to check path prefixes to account for the extra power of the implementation that you don't use.

But as per other comments, maybe that's fine, maybe we want a more general algorithm, so that in future we can optimize not only for evaluation costs, but also for the size of the script, which requires pulling common subexpressions occurring at different paths.

Member Author

@zliu41 zliu41 left a comment

Responded to more comments.

, (2+x) + (3+x)
, 4+x
]
) (1+(2+x))
Member Author

@michaelpj I don't get your point at all. What do you think is the correct output for this example?

cse2 is smaller than cse1. Shouldn't the smaller one be in the outer scope? I.e., you'd want

let cse2 = 2 + x in
let cse1 = 1 + cse2 in body

If you flip it you'd get

let cse1 = 1 + (2 + x) in
let cse2 = 2 + x in body

Can we end up with a situation like this?

In your example you are processing CSE candidates in ascending order of size, while I'm proposing descending order.


The CSE candidates are then processed in descending order of their `termSize`s. For each CSE
candidate, we generate a fresh variable, create a LamAbs for it under its path, and substitute
it for all occurrences in the original term whose paths are descendants (or self) of
Member Author

Can we make a pathological test case to see?

Something like

((1+2) + (3+4) + (5+6) + ... + (999999 + 1000000))
+
((1+2) + (3+4) + (5+6) + ... + (999999 + 1000000))

? Do you have a worse pathological case in mind?
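A hypothetical generator for that pathological case, rendered as a string for illustration (a real test would build a UPLC term instead of a string):

```haskell
-- Build "(1+2) + (3+4) + ... + ((n-1)+n)" for an even n.
halfSum :: Int -> String
halfSum n =
  foldr1 (\a b -> a ++ " + " ++ b)
    [ "(" ++ show i ++ "+" ++ show (i + 1) ++ ")" | i <- [1, 3 .. n - 1] ]

-- Duplicate it: one large common subexpression with exactly two paths.
pathological :: Int -> String
pathological n = "(" ++ h ++ ") + (" ++ h ++ ")" where h = halfSum n
```

For example, `pathological 6` produces `((1+2) + (3+4) + (5+6)) + ((1+2) + (3+4) + (5+6))`; the interesting property is that each shared subexpression occurs at only two paths while the term itself is deep.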

In the third pass, we calculate a count for each `(term, path)` pair, where `term` is a
non-workfree term, and `path` is its path. If the same term has two paths, and one is an
ancestor (i.e., prefix) of the other, we increment the count for the ancestor path in both
instances.
Member Author

Here's an example. Suppose for 2+x, we currently have the following counts:

"0.1.2" -> 10, "0.1.3" -> 20, "0.2.2" -> 30

Now we process path "0.1". After processing it, the counts will become:

"0.1" -> 10, "0.1" -> 20, "0.2.2" -> 30`, "0.1" -> 1

Finally we use Map.fromListWith to aggregate them, and get

"0.1" -> 31, "0.2.2" -> 30

Does that make sense?
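That aggregation can be sketched under the assumption that a path is a list of IDs (`combinePaths` here is an illustration, not the real implementation): reassign each count to its outermost ancestor within the set of paths, then sum with `Map.fromListWith`.

```haskell
import Data.List (isPrefixOf)
import qualified Data.Map.Strict as Map

type Path = [Int]  -- assumption: root-first list of scope IDs

combinePaths :: [(Path, Int)] -> Map.Map Path Int
combinePaths pcs = Map.fromListWith (+) [ (anc p, c) | (p, c) <- pcs ]
  where
    -- The shortest path in the set that is a prefix of p; p itself is
    -- always a candidate, so the list is never empty.
    anc p = minimum [ q | (q, _) <- pcs, q `isPrefixOf` p ]
```

On the example above, `combinePaths [([0,1,2],10),([0,1,3],20),([0,2,2],30),([0,1],1)]` yields 31 at `[0,1]` and 30 at `[0,2,2]`, matching the `Map.fromListWith` result described.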

The implementation makes several passes on the given term. The first pass collects builtin
arity information as described above.

In the second pass, we assign a unique ID to each `LamAbs`, `Delay`, and each `Case` branch.
Member Author

you might have structurally identical terms that are not in fact identical, which might require this kind of care. Are we assuming global uniqueness?

Hmm, I guess that is indeed possible. I'll make it do renaming before CSE.

@zliu41
Member Author

zliu41 commented Nov 20, 2023

travel by the edges, merge the accumulator hashmap with the hashmap of each of the nodes and when you reach a leaf of the AST compute which subexpressions occurring at any point in the current path to pull out and where to put them.

How exactly does this work? Say a common subexpression e occurs at 0.1, 0.1.2 and 0.1.3. When you travel from 0.1 to 0.1.2, you decide to pull out e. That's fine. What happens if you then travel from 0.1 to 0.1.3, and also want to pull out e?

I think what this PR does is possible to do better performance-wise.

My implementation can be slightly improved asymptotically: currently, for each expression, it is O(p^2) due to combinePaths, where p is the number of paths the expression occurs at. But it can be changed to: for each path, simply check all its ancestors, in which case it becomes O(p * depth). If I understand what you propose correctly, it is not asymptotically better than that.

1. Simplifications
-------------------------------------------------------------------------------

This is a simplified (i.e., not fully optimal) implementation of CSE. The two simplifications
Contributor

I guess I think we might want to transfer a bit more of that reasoning into the notes?

The implementation makes several passes on the given term. The first pass collects builtin
arity information as described above.

In the second pass, we assign a unique ID to each `LamAbs`, `Delay`, and each `Case` branch.
Contributor

And write something!!


The CSE candidates are then processed in descending order of their `termSize`s. For each CSE
candidate, we generate a fresh variable, create a LamAbs for it under its path, and substitute
it for all occurrences in the original term whose paths are descendants (or self) of
Contributor

Yes, that looks like a good example!

, (2+x) + (3+x)
, 4+x
]
) (1+(2+x))
Contributor

Okay, I just had it backwards. You're doing them in descending order of size 🤦

Then yes, you need to put each subsequent one outside the other ones. And putting the new ones outside should always be safe.

Comment on lines +84 to +87
The CSE candidates are then processed in descending order of their `termSize`s. For each CSE
candidate, we generate a fresh variable, create a LamAbs for it under its path, and substitute
it for all occurrences in the original term whose paths are descendants (or self) of
the candidate's path. This order is used because a bigger expression may contain a smaller subexpression.
Contributor

To be clear, when you say "perform substitution in the original term itself", you mean "the partially processed term with some candidates hoisted out"? Because you do want to hoist out the instance of (2+x) in the RHS of the newly created binding for the first CSE candidate, right?

@@ -226,6 +226,35 @@ multiApp = runQuote $ do
app = mkIterAppNoAnn lam [mkConstant @Integer () 1, mkConstant @Integer () 2, mkConstant @Integer () 3]
pure app

-- | This is the first example in Note [CSE].
cse1 :: Term Name PLC.DefaultUni PLC.DefaultFun ()
Contributor

I feel like we've discussed quite a number of examples so far - this should have more than two tests!

Contributor

I think we need to have a property test that cse term is alpha-beta-equal to term. I feel like it wouldn't be too hard to extend our alpha-equality checking algorithm to support beta-equality as well, but maybe I'm being foolish. In any case, it's not worth blocking this PR on property tests and I'd be happy to work on those if we think it's a good idea. @zliu41 should I create a ticket?

I do agree with Michael we should have more golden tests here though.
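The proposed property can be illustrated on a toy expression language (an assumption; not the Plutus AST and not the real `cse`): hoisting a repeated pure subexpression into a binding must not change the evaluated result.

```haskell
-- Toy expression language with let-bindings.
data Expr
  = Lit Int
  | Var String
  | Add Expr Expr
  | Let String Expr Expr
  deriving (Eq, Show)

eval :: [(String, Int)] -> Expr -> Int
eval _   (Lit n)     = n
eval env (Var x)     = maybe (error ("free: " ++ x)) id (lookup x env)
eval env (Add a b)   = eval env a + eval env b
eval env (Let x r b) = eval ((x, eval env r) : env) b

-- One hand-rolled "CSE" step on a concrete term: bind the shared
-- subterm once and refer to it twice.
shared, before, after :: Expr
shared = Add (Lit 2) (Lit 40)
before = Add shared shared
after  = Let "cse" shared (Add (Var "cse") (Var "cse"))
```

Here `eval [] before` and `eval [] after` agree; a property test would generalise this to generated terms, comparing `cse term` against `term` up to evaluation (or alpha-beta-equality).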

Member Author

I'll add the third example in the Note, as well as the pathological example.

@effectfully
Contributor

How exactly does this work? Say a common subexpression e occurs at 0.1, 0.1.2 and 0.1.3. When you travel from 0.1 to 0.1.2, you decide to pull out e. That's fine. What happens if you then travel from 0.1 to 0.1.3, and also want to pull out e?

When you reach a leaf you record that e is chosen to be pulled out and the same happens in the second case, then you merge all the decisions into a single map (or whatever) from subexpressions to where to insert new bindings for them and then in one fell swoop you insert the bindings and replace all subexpressions with their respective bindings in the original term. Maybe it's more passes than it could be, but I'd rather have more passes for testability and debugging purposes than lump all the logic into as few passes as possible.

But it can be changed to: for each path, simply check all its ancestors, in which case it becomes O(p * depth). If I understand what you propose correctly, it is not asymptotically better than that.

Yes, what I propose is one way to check only ancestors without iterating through all possible paths. I used a DAG for the model to be "tangible" (easier to think about, easier to explain), but if you do it directly, then that'll be great too.

Contributor

@effectfully effectfully left a comment

The code looks somewhat intimidating, but it's actually surprisingly straightforward. I've reviewed everything and it looks good to me, even though it's not optimal, as we've discussed at length here; maybe it's efficient enough and we don't care about optimizing the whole thing.

I may have another run to reconcile the description of the algorithm with the implementation, but that shouldn't block the PR.

addToMap ::
Term name uni fun (Path, ann) ->
HashMap (Term name uni fun ()) (HashMap Path (Term name uni fun (Path, ann), Int)) ->
HashMap (Term name uni fun ()) (HashMap Path (Term name uni fun (Path, ann), Int))
Contributor

Oh, so you never actually put Term and Path together as a single key. I got confused about that when reading the description of the algorithm. It's probably me, but maybe you can make it more clear there.


-- `isWorkFree` currently doesn't check whether a builtin application is saturated,
-- or whether a term is the (possibly repeated) forcing of a builtin (which should
-- be workfree), so we check it separately.
| isWorkFree t0 || not (isBuiltinSaturated t0) || isForcingBuiltin t0 = id
Contributor

Forcing isn't workfree! We have deferred unlifting, but not deferred interleaving (see this ticket). It does make sense to pull force someBuiltin out, since it does actually compute, and we seem to have given up on the deferred interleaving idea. I think Plutonomy had an optimization of this kind.

Contributor

^ this comment is an important one. Perhaps you can even remove the entire first pass?

Member Author

I initially did this and it made things worse, I think because, on the one hand, we avoid evaluating force someBuiltin multiple times, but on the other hand, we evaluate LamAbs and Apply more times, which is more expensive!

Contributor

we evaluate LamAbs and Apply more times, which is more expensive!

How come? Were you eta-expanding builtin applications? Or is it just the cost of creating a binding? Sounds strange to me.

Member Author

How come? Were you eta-expanding builtin applications? Or is it just the cost of creating a binding? Sounds strange to me.

No, it's because when we CSE something, it creates an additional LamAbs and Apply.

Member Author

@zliu41 zliu41 left a comment

Replied to a few new review comments.

1. Simplifications
-------------------------------------------------------------------------------

This is a simplified (i.e., not fully optimal) implementation of CSE. The two simplifications
Member Author

Ok, I'll expand the notes a bit

In the third pass, we calculate a count for each `(term, path)` pair, where `term` is a
non-workfree term, and `path` is its path. If the same term has two paths, and one is an
ancestor (i.e., prefix) of the other, we increment the count for the ancestor path in both
instances.
Member Author

Actually paths indeed don't need to be map keys... I'll update the code

Comment on lines +84 to +87
The CSE candidates are then processed in descending order of their `termSize`s. For each CSE
candidate, we generate a fresh variable, create a LamAbs for it under its path, and substitute
it for all occurrences in the original term whose paths are descendants (or self) of
the candidate's path. This order is used because a bigger expression may contain a smaller subexpression.
Member Author

you mean "the partially processed term with some candidates hoisted out"?

Yes.



@zliu41
Member Author

zliu41 commented Nov 22, 2023

but if you do it directly, then that'll be great too.

I didn't end up doing it, because (1) it isn't enough to check for ancestors, but we also need to check for descendants; (2) in some cases, such as the pathological example I added, the number of paths per expression (two) is much smaller than the depth of the tree, so path * depth is worse than path * path. And in reality, deep ASTs may not be uncommon.

then you merge all the decisions into a single map

I see. Time complexity wise, for each subexpression e that has p paths, in the worst case each path is in a different leaf node, so this is O(p * depth). Like I said above, this is not necessarily better than O(p * p).

({cpu: 9927795
Member Author

If only we could do something like this to the Marlowe validators!

@@ -74,3 +89,10 @@ simplifyTerm opts = simplifyNTimes (_soMaxSimplifierIterations opts)
caseOfCase' = case eqT @fun @DefaultFun of
Just Refl -> caseOfCase
Nothing -> id

cseStep :: Int -> Term name uni fun a -> m (Term name uni fun a)
cseStep _ = case (eqT @name @Name, eqT @uni @PLC.DefaultUni) of
Contributor

What about requiring HasUnique? That's usually what we do in some situations, and it's properly polymorphic instead of this abomination.

, MonadQuote m
, HasUnique name TermUnique
, Ord name
, Typeable name
Contributor

HasUnique?

addInteger
(addInteger
(addInteger
(addInteger (addInteger cse cse) cse)
Contributor

What is going on here? We're not CSE-ing single variables are we? That would definitely not be a win...

Contributor

We're not CSE-ing single variables are we?

I had it written in my notes to check specifically for that and then failed to actually do it; it's been a while since I felt this much like a complete moron...

Member Author

Hmm, this seems like a bug, since it applies CSE to the partially applied addInteger, which it isn't supposed to do.

Member Author

Actually not a bug - the partially applied addInteger is in fact not a builtin.

(\ds -> `$mInts` ds (\ds ds ds ds -> ds) (\void -> error)))
(int d))
(\ds -> `$mInts` ds (\ds ds ds ds -> ds) (\void -> error)))
(\scrut cont fail ->
Contributor

This is actually very nice: we CSE $mInts ds, i.e. the partial application.

@effectfully
Contributor

Oh yeah, my approval only holds as long as cse is changed to perform renaming before processing the term as per Michael's comment.

@zliu41 zliu41 merged commit 1a4b503 into master Dec 8, 2023
4 checks passed
@zliu41 zliu41 deleted the zliu41/cse branch December 8, 2023 22:55
v0d1ch pushed a commit to v0d1ch/plutus that referenced this pull request Dec 6, 2024