
CSE for UPLC #5634

Merged
merged 12 commits into from
Dec 8, 2023

Conversation

zliu41
Member

@zliu41 zliu41 commented Nov 14, 2023

See Note [CSE] in Cse.hs for how it works.

This currently doesn't help AsData's field accessors much, since those TH-generated field accessors are not CSE-friendly. I added a test case with manually written field accessors (See recordFieldsManual in AsData.Budget.Spec), and its performance is now comparable with patternMatching. We can easily generate those more efficient field accessors instead.

Besides recordFieldsManual, two unit tests are added in plutus-core/untyped-plutus-core/test/Transform/Simplify.hs.

Contributor

@effectfully effectfully left a comment

I've reviewed everything apart from the actual algorithm. I'll create a ticket for myself to review it, since it's definitely going to take me a while. Great docs by the way.


Schnorr Signature G1

n Script size CPU usage Memory usage
----------------------------------------------------------------------
- 425 (2.6%) 534466472 (5.3%) 265206 (1.9%)
- 420 (2.6%) 344615074 (3.4%) 264670 (1.9%)
Contributor

Wow!

({cpu: 252572171
| mem: 884340})
Contributor

So damn tedious to review those golden files. I wonder if we can do better somehow. Don't have any ideas though, there are so many changes here that even putting them into a single file and using --word-diff wouldn't help much.

Contributor

If the filenames said budget somewhere, we could probably write a shell script that you could locally pipe the git diff through to get some kind of summary.

({cpu: 536014965
| mem: 1909052})
Contributor

3% improvement, not bad.

( ToBuiltinMeaning uni fun
, MonadQuote m
, HasUnique name TermUnique
, Ord name
Contributor

I know it's unrelated to your PR, but I bet Ord name is for the wrong reason here. I'll take a quick look.

Member Author

Yes this has been discussed before but hasn't been fixed.

@@ -2,38 +2,38 @@
1.1.0
[
[
(lam s_800 [ s_800 s_800 ])
(lam s_836 [ s_836 s_836 ])
Contributor

Kinda weird that only uniques change in this file and nothing else. CSE performs an additional renaming?

@@ -74,3 +89,10 @@ simplifyTerm opts = simplifyNTimes (_soMaxSimplifierIterations opts)
caseOfCase' = case eqT @fun @DefaultFun of
Just Refl -> caseOfCase
Nothing -> id

cseStep :: Int -> Term name uni fun a -> m (Term name uni fun a)
cseStep _ = case (eqT @name @Name, eqT @uni @PLC.DefaultUni) of
Contributor

Hardcoding Name here is kinda weird. Any name should work as long as it has a TermUnique.

Contributor

Agree, or maybe we just need Hashable name. In general, I think using Typeable to make certain passes only work when the type parameters are instantiated a particular way is pretty bad.

Contributor

Maybe it is needed for Hashable? 🤷‍♂️

Member Author

I think the current implementation of CSE only works correctly for Name. We can't just do the substitutions that we do for DeBruijn terms.

Contributor

What about requiring HasUnique? That's usually what we do in some situations, and it's properly polymorphic instead of this abomination.

Member Author

What about requiring HasUnique?

I also need freshName...

Contributor

Okay, then just monomorphise to Name and TyName. The PIR compiler in fact runs on those types, and if that's the truth you might as well say it.

Contributor

Bummer, and I was looking forward to generalizing PIR to work with any name/tyname. But of course this is not the only place where we fix to Name/TyName. Other places that I can think of are the PIR typechecker and the rewrite rules.

@effectfully
Contributor

UPD: no, sorry, not everything, I didn't review all the golden files as it's infeasible.

, MonadQuote m
, HasUnique name TermUnique
, Ord name
, Typeable name
Contributor

As I commented on your other PR, I don't see why we can't just have a Hashable name constraint?

Member Author

Probably you are right; let me try and see if it works.

Member Author

Oh, I remember why I didn't do that: the current implementation of CSE only works correctly for Name. We can't just do the substitutions that we do for DeBruijn terms.

Contributor

HasUnique?

Contributor

@bezirg bezirg left a comment

This looks like a clear win in terms of execution (cpu&mem), and has mostly a positive effect on the script size as well. Nice work!

we made are:

- No alpha equivalence check, i.e., `\x -> x` and `\y -> y` are considered different expressions.
- The builtin function arity information is approximate: rather than using the accurate arities,
Contributor

Wouldn't it be easier to provide such arity information ourselves somewhere, for example, inside BuiltinsInfo?

Member Author

Comment on lines +84 to +87
The CSE candidates are then processed in descending order of their `termSize`s. For each CSE
candidate, we generate a fresh variable, create a LamAbs for it under its path, and substitute
it for all occurrences in the original term whose paths are descendents (or self) of
the candidate's path. The order is because a bigger expression may contain a small subexpression.
Contributor

Does this really matter? In the below example, wouldn't it lead to a similar result irrespective of the order of CSE'ing?

Member Author

No, because the bigger CS contains the smaller CS.
You can do it the other way around, but then you need to perform substitution not only in the original term, but also in the CSs themselves.

Contributor

You can do it the other way around, but then you not only need to perform substitution in the original term, but also in the CS themselves

But you do perform substitution in CSs? 1+(2+x) is pulled out and then becomes 1+cse2. It seems like it's a matter of where to add new bindings: with bigger CS first you put them at the beginning of the sequence of new bindings and with smaller CS first you put them at the end.

Member Author

But you do perform substitution in CSs?

No, what I meant to say: if we process 1+(2+x) first, then we only need to perform substitution in the original term itself. On the other hand, if we process 2+x first, we must also perform substitution in the CSs, i.e., turning 1+(2+x) into 1+cse2. Otherwise, when we process 1+(2+x) later, we will not find any 2+x in the original term.
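The ordering argument above can be sketched with a toy expression type (a hypothetical AST for illustration, not the plutus-core types):

```haskell
-- Hypothetical toy AST, not the plutus-core types.
data E = N Int | V String | Add E E | CseVar Int
  deriving (Eq, Show)

-- Replace every occurrence of pat in a term with CseVar i.
subst :: E -> Int -> E -> E
subst pat i e
  | e == pat = CseVar i
subst pat i (Add a b) = Add (subst pat i a) (subst pat i b)
subst _ _ e = e

-- (1+(2+x)) + ((1+(2+x)) + (2+x))
small, big, term :: E
small = Add (N 2) (V "x")
big   = Add (N 1) small
term  = Add big (Add big small)

-- Processing the bigger candidate first leaves the smaller one still
-- present verbatim in the residual term, so a later pass can find it:
afterBig :: E
afterBig = subst big 1 term
-- afterBig == Add (CseVar 1) (Add (CseVar 1) (Add (N 2) (V "x")))
```

Had `small` been substituted first, the literal `big` would no longer occur anywhere, which is why substituting small-first would also require rewriting the CSs themselves.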

Contributor

To be clear, when you say "perform substitution in the original term itself", you mean "the partially processed term with some candidates hoisted out"? Because you do want to hoist out the instance of (2+x) in the RHS of the newly created binding for the first CSE candidate, right?

Member Author

you mean "the partially processed term with some candidates hoisted out"?

Yes.

of CSE, with a simplifier pass after each iteration of CSE (i.e., the simplifier is run for a
total of 16 times).

Finally, since CSE can change the order or the number of occurrences of effects, it is only run
Contributor

I think there are certain builtins that are pure (don't throw). So the order or number of occurrences would not matter for those. I may be wrong though.

Member Author

Yes, we can be more clever here, but the current implementation doesn't do that, since it assumes conservative-optimizations is off.

Contributor

The exception effect is commutative and idempotent, so it shouldn't be a problem. Are you worried about changing the order of logs?

Contributor

Yeah, we do worry about log order.

Comment on lines 322 to 337
combinePaths t path = Map.fromListWith (\(x, y) (_, y') -> (x, y + y')) . go . Map.toList
where
go ::
[(Path, (Term name uni fun (Path, ann), Int))] ->
[(Path, (Term name uni fun (Path, ann), Int))]
-- The new path is not a descendent-or-self of any existing path.
go [] = [(path, (t, 1))]
go ((path', (t', cnt)) : paths)
-- The new path is an ancestor-or-self of an existing path.
-- Take over all counts of the existing path, remove the existing path,
-- and continue.
| path `isAncestorOrSelf` path' = (path, (t, cnt)) : go paths
-- The new path is a descendent-or-self of an existing path.
-- Increment the count for the existing path. There can only be one such
-- existing path, so we don't need to recurse here.
| path' `isSuffixOf` path = (path', (t', cnt + 1)) : paths
Contributor

It feels like going through a list for the inner hashmap (fromList . toList) when combining paths defeats the purpose of using a hashmap for this inner structure.

Member Author

The only reason to use a HashMap is the final step where we combine the counts of each path with Map.fromListWith.

Comment on lines +219 to +232
-- | The first pass. See Note [CSE].
calcBuiltinArity ::
forall name uni fun ann.
(Hashable fun) =>
Term name uni fun ann ->
HashMap fun Int
calcBuiltinArity = foldrOf termSubtermsDeep go Map.empty
where
go :: Term name uni fun ann -> HashMap fun Int -> HashMap fun Int
go = \case
t@Apply{}
| (Builtin _ fun, args) <- splitApplication t ->
Map.insertWith max fun (length args)
_ -> id
Contributor

Again, I would rather have a hardcoded DefaultFun builtinarity :P

Contributor

Agree with Nikos.
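The approximation computed by the quoted `calcBuiltinArity` pass (for each builtin, the maximum number of arguments it is applied to anywhere in the term) can be sketched over a flat list of call sites; a toy version, using only `Data.Map` from containers (hypothetical names, not the plutus-core code):

```haskell
import qualified Data.Map as Map

-- Toy sketch of the arity approximation: for each builtin name,
-- keep the maximum argument count seen at any call site.
approxArities :: [(String, Int)] -> Map.Map String Int
approxArities = foldr (\(fun, n) -> Map.insertWith max fun n) Map.empty

-- approxArities [("addInteger", 2), ("addInteger", 1), ("ifThenElse", 3)]
--   == Map.fromList [("addInteger", 2), ("ifThenElse", 3)]
```

This mirrors the `Map.insertWith max` step in the real pass; the real one folds over term subterms via `splitApplication` instead of a prepared list.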

plutus-tx-plugin/plutus-tx-plugin.cabal (outdated, resolved)
plutus-benchmark/plutus-benchmark.cabal (outdated, resolved)
Contributor

@michaelpj michaelpj left a comment

Just reviewed the algorithm description (thanks for that!) first, since that may well spur a bunch of discussion.

since those TH-generated field accessors are not CSE-friendly

Since I guess you've looked at it, can you tell us why not? Note that the accessors are not generated by TH. The TH generates pattern synonyms, and then GHC generates the accessors. So I don't know how easily we can fix them.


This is a simplified (i.e., not fully optimal) implementation of CSE. The two simplifications
we made are:

- No alpha equivalence check, i.e., `\x -> x` and `\y -> y` are considered different expressions.
Contributor

This seems like a big deal? I think that means we can never CSE any functions together?

Contributor

@effectfully effectfully Nov 14, 2023

Can we simply compute the hash after renaming the expression via runQuote . rename so that the uniques for bound variables start with 0? Haven't reviewed the algorithm description yet.
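The renaming idea can be sketched on a toy lambda calculus (hypothetical types for illustration, not `runQuote`/`rename` from plutus-core): canonicalise bound names by binding depth so that alpha-equivalent terms compare, and therefore hash, equal.

```haskell
data Expr = Var String | Lam String Expr | App Expr Expr
  deriving (Eq, Show)

-- Rename each binder to a canonical name derived from its depth,
-- so `\x -> x` and `\y -> y` normalise to the same term.
-- Free variables keep their original names.
alphaNorm :: Expr -> Expr
alphaNorm = go []
  where
    go env (Var x)   = Var (maybe x id (lookup x env))
    go env (Lam x b) = let x' = "v" ++ show (length env)
                       in Lam x' (go ((x, x') : env) b)
    go env (App f a) = App (go env f) (go env a)

-- alphaNorm (Lam "x" (Var "x")) == alphaNorm (Lam "y" (Var "y"))
```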

Member Author

At least in this implementation, it doesn't CSE work-free expressions anyway.

Contributor

Alpha-renaming may well bring a huge benefit, but personally I am OK with bringing it in later in a separate PR.

Member Author

And it is difficult to do for several reasons: (1) what about non-closed terms; (2) substitution is also harder

Contributor

At least in this implementation, it doesn't CSE work-free expressions anyway.

Well, size is even more important than cpu and mem as people keep telling us, so maybe we're optimizing for the wrong thing here.

(1) what about non-closed terms

What about them?

(2) substitution is also harder

We do uniqueness-preserving substitution at the type level for type checking purposes (I don't understand why though, I find our type checking strategy convoluted for no good reason, but perhaps I'm missing something), the term level isn't any more complicated as far as I understand. Although I appreciate that in CSE case substitution is "backwards" (variables for term, not term for variable) and that may complicate things, but then again you can just give every subterm a unique id and put into the annotation, so that you don't need to check equality of terms at substitution time.

Contributor

I don't understand why though, I find our type checking strategy convoluted for no good reason, but perhaps I'm missing something

Didn't you write the typechecker? :p

Contributor

I basically rewrote it, not just wrote it, and I was opposed to using global uniqueness there from before the beginning:

[Screenshot from 2023-11-16 11-45-15]

It took me a ridiculous amount of time to weed all the bugs out of the original implementation and to this day I still don't understand why we need global uniqueness there instead of having type values at the type level and reducing them using an abstract machine or whatever like people do.

Contributor

Another thing that I forgot to comment on:

At least in this implementation, it doesn't CSE work-free expressions anyway.

Such an expression isn't necessarily work-free! You may have f (\x -> x) for example.

we made are:

- No alpha equivalence check, i.e., `\x -> x` and `\y -> y` are considered different expressions.
- The builtin function arity information is approximate: rather than using the accurate arities,
Contributor

I'll look at the code but this surprises me: we have all the machinery for this (in PIR at least)!

Member Author

Yes but it's complicated and difficult to port over to UPLC, so I took the lazier approach. I think this might be a good first issue for a new hire.

Contributor

It's not that complicated IMO. It is a little fiddly because we need to pass BuiltinsInfo around, but that's very mechanical.

Member Author

1. Simplifications
-------------------------------------------------------------------------------

This is a simplified (i.e., not fully optimal) implementation of CSE. The two simplifications
Contributor

Did you use any references for the implementation? Is this a standard algorithm? What variants do we know of? What does GHC do?

Member Author

I mentioned the Plutonomy implementation in the doc: https://docs.google.com/document/d/1JsdfO0c4Vj6M04WKiWBespe8_J1DDKZi6mm9DOBmozs/edit. Mine is a very different approach and isn't similar to that at all.

I didn't look too closely at GHC's (since it is for a lazy language, so, just like inlining and other optimizations, its strategy may not be applicable), but the doc discusses it a little bit. GHC has a much more sophisticated way of identifying common subexpressions and testing equality of subexpressions. We can look more into it in the future, but for now my goal is to get a simple implementation in, so that AsData can be more performant.

Contributor

I guess I think we might want to transfer a bit more of that reasoning into the notes?

Member Author

Ok, I'll expand the notes a bit

The implementation makes several passes on the given term. The first pass collects builtin
arity information as described above.

In the second pass, we assign a unique ID to each `LamAbs`, `Delay`, and each `Case` branch.
Contributor

Why do we need one for a case branch? A case branch can be an arbitrary term, and if it's a function then there will be a lambda inside it, which you will give an id to?

Contributor

I guess I'm missing the "abstract picture". What are we trying to do here? Identify unique scopes? or what?

Member Author

Yeah, I don't know how to describe this precisely, but the purpose is to avoid evaluating something that otherwise may not be evaluated. E.g.,

  • If there's a common subexpression in a parent scope and a child scope (e.g., "0.1", and "0.1.2"), then it is safe to perform CSE
  • On the other hand, if it is in two scopes that don't have parent-child relationship (e.g., "0.1.2" and "0.1.3"), then it is not safe to perform CSE. Because if you do perform CSE, the new "binding" has to be in scope "0.1", but it is possible that "0.1" is evaluated without evaluating either "0.1.2" or "0.1.3".
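The scope condition above can be sketched with paths as lists of scope IDs, innermost first, so that "is ancestor" becomes a suffix check (a toy mirror of the `isAncestorOrSelf`/`isSuffixOf` logic in the quoted code, not the actual `Path` type):

```haskell
import Data.List (isSuffixOf)

-- Toy paths: innermost scope ID first, e.g. [2,1,0] for scope "0.1.2".
type Path = [Int]

-- Sharing two occurrences is safe only when one scope encloses the
-- other (or they coincide); for sibling scopes the shared binding
-- would have to live in the parent, which may be evaluated without
-- evaluating either sibling.
safeToShare :: Path -> Path -> Bool
safeToShare p q = p `isSuffixOf` q || q `isSuffixOf` p

-- safeToShare [1,0] [2,1,0]   == True   ("0.1" encloses "0.1.2")
-- safeToShare [2,1,0] [3,1,0] == False  (siblings "0.1.2" and "0.1.3")
```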

Contributor

This is a great explanation, please add it to the docs. Also please add there a comment on why it's lambdas, delays and case branches (it's kinda obvious why if the reader thinks about it for a few seconds, but writing it down would save the reader contemplating it).

Contributor

I think there are maybe three answers?

  1. They represent places where computation stops, i.e. subexpressions are not immediately evaluated, and as such they shouldn't be hoisted above that point lest they be evaluated too soon.
  2. They represent places where evaluation may happen multiple times with a different environment (lambdas) or not (delays) i.e. subexpressions may not always be evaluated to the same thing so shouldn't be shared between multiple evaluations.
  3. They may bind new variables, in which case it's actually incorrect to hoist a subexpression out of one.

Member Author

I think point 1 is correct. I'll add it to the doc. I don't agree with 2 or 3: if it is incorrect to hoist something (because, e.g., it depends on the lambda-bound variable), then it can't be a common subexpression anyway.

Contributor

if it is incorrect to hoist something (because, e.g., it depends on the lambda-bound variable), then it can't be a common subexpression anyway.

I guess the problem is the related notions of "common sub-expression" and "structurally identical term". In the presence of non-globally-unique names you might have structurally identical terms that are not in fact identical, which might require this kind of care. Are we assuming global uniqueness?

Member Author

you might have structurally identical terms that are not in fact identical, which might require this kind of care. Are we assuming global uniqueness?

Hmm, I guess that is indeed possible. I'll make it do renaming before CSE.

Contributor

And write something!!


The CSE candidates are then processed in descending order of their `termSize`s. For each CSE
candidate, we generate a fresh variable, create a LamAbs for it under its path, and substitute
it for all occurrences in the original term whose paths are descendents (or self) of
Contributor

So we do a substitution pass for every CSE candidate? Sounds quadratic.

Contributor

As discussed during the call it's probably worth saving the substitution in the annotation to process all of them at once in a subsequent pass. I'm fine with that not being in the initial version.

Contributor

Or you go over the whole program building up a single substitution, but applying the substitution to the new substitution RHSs that you insert. Maybe that's also quadratic, though! It seems less quadratic...

Member Author

I think it is quadratic only in very pathological cases; and as such I agree we can leave it as it is in the initial version.

Contributor

Can we make a pathological test case to see?

Member Author

Can we make a pathological test case to see?

Something like

((1+2) + (3+4) + (5+6) + ... + (999999 + 1000000))
+
((1+2) + (3+4) + (5+6) + ... + (999999 + 1000000))

? Do you have a worse pathological case in mind?

Contributor

Yes, that looks like a good example!

Member Author

@zliu41 zliu41 Nov 22, 2023

I added a much smaller test case in 6b9e944.

It's not about CSE, but our test infrastructure simply can't handle anything close to the above term.
Even for a much smaller one, e.g., (1+2) + (3+4) + ... + (99999 + 100000) (i.e., more than 20 times smaller than the above term), the test times out without doing CSE or anything else!

Contributor

the test times out without doing CSE or anything else!

Could you please create a ticket for that just so we have it recorded?

)

In this case, the first two occurrences of `1+2` can be CSE'd, but the third occurrence
can not. This is ensured by checking the path when substituting `cse1` for `1+2`. The result is
Contributor

Why not? I think the answer is:

  1. There is no scope enclosing both of them? If so, this is probably a bug, there should be a scope for "the top level" for sure.
  2. They have differing paths, corresponding to the two different delays. In this instance, since the scopes come from delays, you could share the value across both (since there isn't a newly introduced variable that they might depend on). However, that might lead to the value being computed sooner than it would otherwise have been, which seems bad.

So, not at all obvious. And I don't know how to reason about it because you haven't told me what your algorithm is trying to do!

Member Author

However, that might lead to the value being computed sooner than it would otherwise have been, which seems bad.

Ok, this is not a great example because (1+2) appears in both branches, so it will definitely be computed. A better example would be an expression that appears in two of the three case branches, in which case we definitely can't CSE it.

Contributor

so it will definitely be computed.

But we only know that because we know what ifThenElse does. The compiler doesn't have this kind of info; for all it knows, ifThenElse takes some arguments and returns something that is potentially completely unrelated to the arguments, in which case the delays may never get evaluated even though they appear in both of the "branches". We can enhance the compiler with such information of course, but it isn't a given.

Contributor

I wasn't assuming we knew it would definitely be computed, just that we could lift it anyway (and then get computation too soon).

total of 16 times).

Finally, since CSE can change the order or the number of occurrences of effects, it is only run
when conservative optimization is off.
Contributor

Can't we make it conservative by only CSE-ing pure expressions?

Copy link
Member Author


Yeah we can do that later as an improvement.

@michaelpj
Copy link
Contributor

Copy link
Member Author

@zliu41 zliu41 left a comment


Responded to some of the comments. Will address the other comments later.

)

In this case, the first two occurrences of `1+2` can be CSE'd, but the third occurrence
can not. This is ensured by checking the path when substituting `cse1` for `1+2`. The result is
Copy link
Member Author


However, that might lead to the value being computed sooner than it would otherwise have been, which seems bad.

Ok, this is not a great example because (1+2) appears in both branches, so it will definitely be computed. A better example would be an expression that appears in two of the three case branches, in which case we definitely can't CSE it.

of CSE, with a simplifier pass after each iteration of CSE (i.e., the simplifier is run for a
total of 16 times).

Finally, since CSE can change the order or the number of occurrences of effects, it is only run
Copy link
Member Author


Yes we can be more clever here but the current implementation doesn't do that since it assumes conservative-optimizations is off

total of 16 times).

Finally, since CSE can change the order or the number of occurrences of effects, it is only run
when conservative optimization is off.
Copy link
Member Author


Yeah we can do that later as an improvement.

Comment on lines 322 to 337
combinePaths t path = Map.fromListWith (\(x, y) (_, y') -> (x, y + y')) . go . Map.toList
where
go ::
[(Path, (Term name uni fun (Path, ann), Int))] ->
[(Path, (Term name uni fun (Path, ann), Int))]
-- The new path is not a descendent-or-self of any existing path.
go [] = [(path, (t, 1))]
go ((path', (t', cnt)) : paths)
-- The new path is an ancestor-or-self of an existing path.
-- Take over all counts of the existing path, remove the existing path,
-- and continue.
| path `isAncestorOrSelf` path' = (path, (t, cnt)) : go paths
-- The new path is a descendent-or-self of an existing path.
-- Increment the count for the existing path. There can only be one such
-- existing path, so we don't need to recurse here.
| path' `isSuffixOf` path = (path', (t', cnt + 1)) : paths
Copy link
Member Author


The only reason to use a HashMap is the final step where we combine the counts of each path with Map.fromListWith.

1. Simplifications
-------------------------------------------------------------------------------

This is a simplified (i.e., not fully optimal) implementation of CSE. The two simplifications
Copy link
Member Author


I mentioned the Plutonomy implementation in the doc: https://docs.google.com/document/d/1JsdfO0c4Vj6M04WKiWBespe8_J1DDKZi6mm9DOBmozs/edit. Mine is a very different approach and isn't similar to that at all.

I didn't look too closely to GHC's (since it is for a lazy language, so, just like inlining and other opimitzations, its strategy may not be applicable), but the doc discusses it a little bit. GHC has a much more sophisticated way of identifying common subexpressions and testing equality of subexpressions. We can look more into it in the future, but for now my goal is to get a simple implementation in, so that AsData can be more performant.

This is a simplified (i.e., not fully optimal) implementation of CSE. The two simplifications
we made are:

- No alpha equivalence check, i.e., `\x -> x` and `\y -> y` are considered different expressions.
Copy link
Member Author


At least in this implementation, it doesn't CSE work-free expressions anyway.
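For illustration, the "work-free" notion discussed here can be sketched on a toy AST. Everything below is hypothetical (the constructor names and the exact set of work-free forms are illustrative, not the real Plutus Core types or predicate):

```haskell
-- Toy UPLC-like AST, purely for illustration (not the real Plutus Core types).
data Term
  = Var String
  | Constant Int
  | Builtin String
  | LamAbs String Term
  | Delay Term
  | Apply Term Term
  | Force Term

-- Evaluating a work-free term does essentially no work, so sharing it saves
-- nothing; such terms are excluded as CSE candidates.
isWorkFree :: Term -> Bool
isWorkFree Var{}      = True
isWorkFree Constant{} = True
isWorkFree Builtin{}  = True
isWorkFree LamAbs{}   = True
isWorkFree Delay{}    = True
isWorkFree _          = False

main :: IO ()
main = print [ isWorkFree (Var "x")
             , isWorkFree (Delay (Var "x"))
             , isWorkFree (Apply (Builtin "addInteger") (Constant 1))
             ]
```

A builtin application is the typical non-work-free case: evaluating it again repeats real work, which is exactly what CSE wants to avoid.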

we made are:

- No alpha equivalence check, i.e., `\x -> x` and `\y -> y` are considered different expressions.
- The builtin function arity information is approximate: rather than using the accurate arities,
Copy link
Member Author


Yes but it's complicated and difficult to port over to UPLC, so I took the lazier approach. I think this might be a good first issue for a new hire.

@zliu41 zliu41 requested a review from mjaskelioff November 14, 2023 21:17
Copy link
Member Author

@zliu41 zliu41 left a comment


Since I guess you've looked at it, can you tell us why not? Note that the accessors are not generated by TH. The TH generates pattern synonyms, and then GHC generates the accessors. So I don't know how easily we can fix them.

The field accessors from pattern synonyms are less inlined, and thus less CSE-friendly.

-- Manual
(\d -> (\tup -> ...something... (force (force fstPair) tup))
       (unConstrData d)
       +
       (\tup -> ...something_else... (force (force fstPair) tup))
       (unConstrData d)
)

-- From pattern synonyms

(\d -> (\f -> f something + f something_else)
       (\tup x -> ...x... (force (force fstPair) tup) (unConstrData d))
)

In the latter, f is not inlined, and as a result, unConstrData d is not a common subexpression, and has to be evaluated multiple times.

GHC doesn't have this problem since it can perform full-laziness, but ours is not a lazy language and we can't do that (unless in some very clever way).

Also I verified that the improvements are due to CSE, not the additional simplifier passes. Increasing the number of simplifier passes from 12 to 16 alone has no effect whatsoever.

plutus-benchmark/plutus-benchmark.cabal (outdated, resolved)
( ToBuiltinMeaning uni fun
, MonadQuote m
, HasUnique name TermUnique
, Ord name
Copy link
Member Author


Yes this has been discussed before but hasn't been fixed.

, MonadQuote m
, HasUnique name TermUnique
, Ord name
, Typeable name
Copy link
Member Author


Probably you are right; let me try and see if it works.

This is a simplified (i.e., not fully optimal) implementation of CSE. The two simplifications
we made are:

- No alpha equivalence check, i.e., `\x -> x` and `\y -> y` are considered different expressions.
Copy link
Member Author


And it is difficult to do for several reasons: (1) what about non-closed terms; (2) substitution is also harder

we made are:

- No alpha equivalence check, i.e., `\x -> x` and `\y -> y` are considered different expressions.
- The builtin function arity information is approximate: rather than using the accurate arities,
Copy link
Member Author


The implementation makes several passes on the given term. The first pass collects builtin
arity information as described above.

In the second pass, we assign a unique ID to each `LamAbs`, `Delay`, and each `Case` branch.
Copy link
Member Author


Yeah, I don't know how to describe this precisely, but the purpose is to avoid evaluating something that otherwise may not be evaluated. E.g.,

  • If there's a common subexpression in a parent scope and a child scope (e.g., "0.1", and "0.1.2"), then it is safe to perform CSE
  • On the other hand, if it is in two scopes that don't have parent-child relationship (e.g., "0.1.2" and "0.1.3"), then it is not safe to perform CSE. Because if you do perform CSE, the new "binding" has to be in scope "0.1", but it is possible that "0.1" is evaluated without evaluating either "0.1.2" or "0.1.3".
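The parent/child-scope condition above can be sketched as a suffix check on paths. This is a minimal sketch under one assumption: paths are stored innermost-scope-first (so the `isSuffixOf` test matches the quoted code); names like `safeToShare` are illustrative, not from the PR:

```haskell
import Data.List (isSuffixOf)

-- A path is a list of scope IDs; assuming innermost-first storage,
-- the path "0.1.2" from the discussion is represented as [2,1,0].
type Path = [Int]

-- An ancestor-or-self path is then a suffix.
isAncestorOrSelf :: Path -> Path -> Bool
isAncestorOrSelf = isSuffixOf

-- Two occurrences may share a binding only if one scope encloses the other;
-- the binding then goes into the enclosing (ancestor) scope.
safeToShare :: Path -> Path -> Bool
safeToShare p q = p `isAncestorOrSelf` q || q `isAncestorOrSelf` p

main :: IO ()
main = do
  print (safeToShare [1,0] [2,1,0])   -- "0.1" vs "0.1.2": safe
  print (safeToShare [2,1,0] [3,1,0]) -- "0.1.2" vs "0.1.3": not safe
```

Sibling scopes ("0.1.2" and "0.1.3") fail the check because the only common ancestor ("0.1") may be evaluated without either sibling being evaluated.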

Comment on lines +84 to +87
The CSE candidates are then processed in descending order of their `termSize`s. For each CSE
candidate, we generate a fresh variable, create a LamAbs for it under its path, and substitute
it for all occurrences in the original term whose paths are descendants (or self) of
the candidate's path. The order is because a bigger expression may contain a small subexpression.
Copy link
Member Author


No because the bigger CS contains the smaller CS.
You can do it the other way around, but then you not only need to perform substitution in the original term, but also in the CS themselves

@zliu41
Copy link
Member Author

zliu41 commented Nov 14, 2023

In the latter, f is not inlined, and as a result, unConstrData d is not a common subexpression, and has to be evaluated multiple times.

And this makes me think that we should probably make the inliner "dumber" (that is, inline more things, to expose opportunities for CSE).

@kwxm
Copy link
Contributor

kwxm commented Nov 15, 2023

There are a lot of comments on this already so maybe I'll wait until those have been addressed before having a proper look.

@kwxm kwxm closed this Nov 15, 2023
@kwxm
Copy link
Contributor

kwxm commented Nov 15, 2023

Rogue thumb.

@kwxm kwxm reopened this Nov 15, 2023
@michaelpj
Copy link
Contributor

The field accessors from pattern synonyms are less inlined, and thus less CSE-friendly.

I also checked and I think it's not quite what you wrote. With the pattern synonym version, all the field accessors use the "matching function" (mInts in this case). The field accessors themselves are inlined, but the matching function is not, because it is used several times, and is quite large. I suspect that's the same situation as we would see with your manually written accessors if they were used multiple times in the program.

@zliu41
Copy link
Member Author

zliu41 commented Nov 15, 2023

I suspect that's the same situation as we would see with your manually written accessors if they were used multiple times in the program.

See #5636 (comment)
Looks like CSE is more effective than I thought!

@zliu41
Copy link
Member Author

zliu41 commented Nov 16, 2023

By the way, I retract what I said about "CSE making the compilation of marlowe validator much slower". The marlowe validator already takes about 5s to compile on master on my laptop, and it takes a similar amount of time with CSE (I didn't measure precisely, though)

Copy link
Contributor

@effectfully effectfully left a comment


Only reviewed the description so far. Amazing docs, thank you. I need to meditate on it for a bit.

This is a simplified (i.e., not fully optimal) implementation of CSE. The two simplifications
we made are:

- No alpha equivalence check, i.e., `\x -> x` and `\y -> y` are considered different expressions.
Copy link
Contributor


At least in this implementation, it doesn't CSE work-free expressions anyway.

Well, size is even more important than cpu and mem as people keep telling us, so maybe we're optimizing for the wrong thing here.

(1) what about non-closed terms

What about them?

(2) substitution is also harder

We do uniqueness-preserving substitution at the type level for type checking purposes (I don't understand why though, I find our type checking strategy convoluted for no good reason, but perhaps I'm missing something), the term level isn't any more complicated as far as I understand. Although I appreciate that in CSE case substitution is "backwards" (variables for term, not term for variable) and that may complicate things, but then again you can just give every subterm a unique id and put into the annotation, so that you don't need to check equality of terms at substitution time.

- The builtin function arity information is approximate: rather than using the accurate arities,
we simply use the maximum number of arguments applied to a builtin function in the program
as the builtin function's arity. The arity information is used to determine whether a builtin
application is possibly saturated.
Copy link
Contributor


That sounds so hacky.

Copy link
Contributor


There is a possibility of under- or over-estimating the arity of builtins. This makes it more difficult to reason about the correctness of the algorithm.

Copy link
Member Author


Good point. Will address this later: https://input-output.atlassian.net/browse/PLT-8659
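The arity approximation being discussed ("the maximum number of arguments applied to a builtin anywhere in the program") can be sketched like this. This is a toy model, not the Cse.hs implementation; the AST and the function name are made up:

```haskell
import qualified Data.Map.Strict as Map

-- Toy AST, just enough to show the idea (not the real UPLC types).
data Term = Var String | Builtin String | Apply Term Term | LamAbs String Term

-- Approximate arity: the longest application spine seen for each builtin
-- anywhere in the program. This can over- or under-estimate the true arity.
approxArities :: Term -> Map.Map String Int
approxArities = go 0
  where
    go n (Apply f a)  = Map.unionWith max (go (n + 1) f) (go 0 a)
    go n (Builtin b)  = Map.singleton b n
    go _ (LamAbs _ b) = go 0 b
    go _ (Var _)      = Map.empty

-- addInteger is applied to two arguments in one place and to one argument in
-- another; the approximation records the maximum, 2.
example :: Term
example =
  Apply (Apply (Builtin "addInteger") (Var "x"))
        (Apply (Builtin "addInteger") (Var "y"))

main :: IO ()
main = print (Map.lookup "addInteger" (approxArities example))
```

This makes the concern concrete: if every application site of a builtin is undersaturated, the estimate is too low, which is why using the real arities (as later tracked in PLT-8659) is preferable.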

The implementation makes several passes on the given term. The first pass collects builtin
arity information as described above.

In the second pass, we assign a unique ID to each `LamAbs`, `Delay`, and each `Case` branch.
Copy link
Contributor


This is a great explanation, please add it to the docs. Also please add there a comment on why it's lambdas, delays and case branches (it's kinda obvious why if the reader thinks about it for a few seconds, but writing it down would save the reader contemplating it).

In the third pass, we calculate a count for each `(term, path)` pair, where `term` is a
non-workfree term, and `path` is its path. If the same term has two paths, and one is an
ancestor (i.e., prefix) of the other, we increment the count for the ancestor path in both
instances.
Copy link
Contributor


Why is path a key in that map rather than a value (in a list)?

Copy link
Member Author


Because we need to aggregate the counts on each path via Map.fromListWith.

Copy link
Contributor

@effectfully effectfully Nov 17, 2023


Sorry, still don't get it. Can you have a map from hashes of terms to lists of paths, without even any counts, and then post-process each list of paths? If you have 2+x :-> ["0.1", "0.1.2", "0.1.3"], then during post-processing you figure out that "0.1" is the root and insert a binding at that point, ignoring all the path extensions. Technically, it's not even a list of paths as a value, it's a trie of paths and "post-processing" means simply taking the root of each forest at the beginning of the trie (unless the forest consists only of the root, i.e. the subexpression only occurs once).

UPD: as per my other comment, you don't even need to have a trie there. It's just a list of paths with each path not being an extension of another. When inserting you make sure that the path being inserted isn't an extension of an existing path and if it is, then the shortest one is chosen.

Copy link
Member Author


Here's an example. Suppose for 2+x, we currently have the following counts:

"0.1.2" -> 10, "0.1.3" -> 20, "0.2.2" -> 30

Now we process path "0.1". After processing it, the counts will become:

"0.1" -> 10, "0.1" -> 20, "0.2.2" -> 30`, "0.1" -> 1

Finally we use Map.fromListWith to aggregate them, and get

"0.1" -> 31, "0.2.2" -> 30

Does that make sense?
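The aggregation step in this example is just `Map.fromListWith (+)`; a runnable version of exactly the numbers above:

```haskell
import qualified Data.Map.Strict as Map

-- Counts for the subexpression 2+x after the descendant paths of "0.1"
-- have been replaced by "0.1" itself (the numbers from the example above).
counts :: [(String, Int)]
counts = [("0.1", 10), ("0.1", 20), ("0.2.2", 30), ("0.1", 1)]

-- fromListWith (+) sums the counts that share a path.
aggregated :: Map.Map String Int
aggregated = Map.fromListWith (+) counts

main :: IO ()
main = print (Map.toList aggregated)  -- prints [("0.1",31),("0.2.2",30)]
```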

Copy link
Contributor


Does that make sense?

Yes, it's what I was suggesting as well. I just though you kept the old paths too. I suppose I misread the documentation, but maybe you could make it clear there that the old paths aren't kept.

Copy link
Member Author


Actually paths indeed don't need to be map keys... I'll update the code

Copy link
Member Author


The inner HashMap (with Path as key) is replaced with a list in b32cd3c, because in my example above, we can easily go from step 1 directly to step 3, bypassing step 2.


The CSE candidates are then processed in descending order of their `termSize`s. For each CSE
candidate, we generate a fresh variable, create a LamAbs for it under its path, and substitute
it for all occurrences in the original term whose paths are descendants (or self) of
Copy link
Contributor


As discussed during the call it's probably worth saving the substitution in the annotation to process all of them at once in a subsequent pass. I'm fine with that not being in the initial version.

, (2+x) + (3+x)
, 4+x
]
) (1+(2+x))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this demonstrates a deficiency with the approach. You're creating an extra lambda and then you dive inside of it and pull a subexpression out without really accounting for the lambda (or at least that's what I got from the description of the algorithm). Which is entirely fine! Because the lambda is an immediately applied one, it's always going to be evaluated, hence we can simply ignore it. But then shouldn't we handle such lambdas the same way regardless of whether they were originally in the program or were added by CSE?

Copy link
Contributor


Yeah, this seems a little tricky. What if the new CSE term now includes cse1? The term below hoists cse2 outside the scope of cse1! At least it should probably be inside. I notice the the examples are inconsistent: below there is an example where we do put cse2 inside, which seems more correct.

I guess that if we do that it's safe, but the argument is non-trivial:

  • Suppose we have two CSE candidates cse1 and cse2, where cse1 occurs within cse2.
  • Then if we decide to hoist cse1, we must do so to the level indicated by the path to cse2, since cse1 is a subexpression of cse2
  • We thus in theory need to extend the path for cse2 by one step for the new binding, but since it occurs right under the previous last step, we can just use the old path so long as we put the binding for cse2 under the other CSE bindings.

I'm not even sure if that is correct. I hope we have some tests for this and we need a careful argument!

Copy link
Contributor


The term below hoists cse2 outside the scope of cse1! At least it should probably be inside.

Seems correct to me? Not sure which term you're referring to. I've been ignoring indices in the names, only looking at semantics.

Copy link
Member Author


But then shouldn't we handle such lambdas the same way regardless of whether they were originally in the program or were added by CSE?

Good point. What I'm not sure about is whether CSE should handle this, or whether we should implement another optimization that floats terms which are inside an immediately-applied lambda, and don't depend on the lambda-bound variable, out of that lambda (i.e., similar to full-laziness).

Suppose we have two CSE candidates cse1 and cse2, where cse1 occurs within cse2.

That's why the CSE candidates are processed in descending order of size. If we process cse2 first, then there's no problem.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I'm not sure is: whether CSE should handle this, or should we implement another optimization, that floats terms inside an immediately-applied lambda that don't depend on the lambda-bound variable, outside of the lambda (i.e., similar to full-laziness).

I don't believe it's going to fly. Plutus Core programs tend to have a long prefix of immediately applied lambdas, are you going to extract half a program outside those by picking every single subexpression that doesn't happen to reference some of the variables? Plus it'll create bindings for things that we don't really want to create bindings for, like what's the point of turning (\x -> 1 + 2 + x) e into (\y -> (\x -> y + x) e) (1 + 2) -- no work is saved and some evaluation cycles are wasted on creating a variable. And each such transformation only creates more immediately applied lambdas.

I suspect that full-laziness-like transformations only help when you lift a subexpression past a non-applied lambda. CSE is different in that regard: it helps whenever a subexpression occurs more than one time, regardless of whether you're inside an applied lambda or not.

Copy link
Contributor


That's why the CSE candidates are processed in descending order of size. If we process cse2 first, then there's no problem.

That's a bit subtle and needs to be written down. I had to think about it for a while. Here's my argument:

By processing CSE candidates in size order, we also implicitly process them in dependency order: if we have two CSE candidates cse1 and cse2, then we cannot end up in a situation where cse2 would appear in cse1, because then the original cse1 would have had to have been bigger than cse2.

I'm actually not sure this is a sound argument, because doing some rounds of CSE can change the size ordering. Can we end up with a situation like this?

  1. cse1 < cse2 < cse3
  2. Pull out cse1
  3. cse3 < cse2, and cse3 appears a subexpression of cse2
  4. Pull out cse2
  5. Pull out cse3, break the RHS of cse2

Looking at that, in fact I think what we need to be safe is just:

  1. Bindings for subsequent CSE candidates go inside the previous ones
  2. We only substitute for a CSE candidate inside the scope of the binding

Which is much simpler and probably holds.

That still implies that this example is wrong and cse2 should be bound inside the scope of cse1.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@michaelpj I don't get your point at all. What do you think is the correct output for this example?

cse2 is smaller than cse1. Shouldn't the smaller one be in the outer scope? I.e., you'd want

let cse2 = 2 + x in
let cse1 = 1 + cse2 in body

If you flip it you'd get

let cse1 = 1 + (2 + x) in
let cse2 = 2 + x in body

Can we end up with a situation like this?

In your example you are processing CSE candidates in ascending order of size, while I'm proposing descending order.
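The ordering argument can be made concrete with a small sketch. This is only a toy model (right-hand sides as strings, a hypothetical `Candidate` type); in the real pass the candidates are terms and the smaller candidate also gets substituted inside the bigger one's right-hand side:

```haskell
import Data.List (sortOn)
import Data.Ord (Down (..))

-- Hypothetical candidate representation, for illustration only.
data Candidate = Candidate { candName :: String, candRhs :: String, candSize :: Int }

-- Bigger candidates are processed first; each later (smaller) binding wraps
-- *outside* the earlier ones, so the smaller term ends up in the outer scope.
bindAll :: [Candidate] -> String -> String
bindAll cands body = foldl wrap body (sortOn (Down . candSize) cands)
  where
    wrap acc c = "let " ++ candName c ++ " = " ++ candRhs c ++ " in " ++ acc

main :: IO ()
main = putStrLn $ bindAll
  [ Candidate "cse2" "2+x" 3
  , Candidate "cse1" "1+(2+x)" 5
  ] "body"
```

The output places `cse2` in the outer scope and `cse1` inside it, matching the `let cse2 = 2 + x in let cse1 = 1 + cse2 in body` shape discussed above (modulo substituting `cse2` into `cse1`'s right-hand side, which this sketch omits).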

Copy link
Contributor


Okay, I just had it backwards. You're doing them in descending order of size 🤦

Then yes, you need to put each subsequent one outside the other ones. And putting the new ones outside should always be safe.

Comment on lines +84 to +87
The CSE candidates are then processed in descending order of their `termSize`s. For each CSE
candidate, we generate a fresh variable, create a LamAbs for it under its path, and substitute
it for all occurrences in the original term whose paths are descendants (or self) of
the candidate's path. The order is because a bigger expression may contain a small subexpression.
Copy link
Contributor


You can do it the other way around, but then you not only need to perform substitution in the original term, but also in the CS themselves

But you do perform substitution in CSs? 1+(2+x) is pulled out and then becomes 1+cse2. It seems like it's a matter of where to add new bindings: with bigger CS first you put them at the beginning of the sequence of new bindings and with smaller CS first you put them at the end.

)

In this case, the first two occurrences of `1+2` can be CSE'd, but the third occurrence
can not. This is ensured by checking the path when substituting `cse1` for `1+2`. The result is
Copy link
Contributor


so it will definitely be computed.

But we only know that because we know what ifThenElse does. The compiler doesn't have this kind of info; for all it knows, ifThenElse takes some arguments and returns something that is potentially completely unrelated to the arguments, in which case the delays may never get evaluated even though they appear in both the "branches". We can enhance the compiler with such information of course, but it isn't a given.

Copy link
Contributor

@mjaskelioff mjaskelioff left a comment


Looks good to me. I appreciate the long explanation at the beginning.
I would definitely use the real builtin arity. Calculating it in this way seems messy and adds unnecessary complexity.
The path looks like a good abstraction for the boundaries where CSE may happen. I wonder if having an explicit transformation into a tree of top level, LamAbs, Delay, and Case nodes would make the correctness easier to see, but I guess that reasoning about prefixes is easy enough.

of CSE, with a simplifier pass after each iteration of CSE (i.e., the simplifier is run for a
total of 16 times).

Finally, since CSE can change the order or the number of occurrences of effects, it is only run
Copy link
Contributor


The exception effect is commutative and idempotent, so it shouldn't be a problem. Are you worried about changing the order of logs?

- The builtin function arity information is approximate: rather than using the accurate arities,
we simply use the maximum number of arguments applied to a builtin function in the program
as the builtin function's arity. The arity information is used to determine whether a builtin
application is possibly saturated.
Copy link
Contributor


There is a possibility of under- or over-estimating the arity of builtins. This makes it more difficult to reason about the correctness of the algorithm.

Copy link
Member Author

@zliu41 zliu41 left a comment


Addressed/responded to some comments.

we made are:

- No alpha equivalence check, i.e., `\x -> x` and `\y -> y` are considered different expressions.
- The builtin function arity information is approximate: rather than using the accurate arities,
Copy link
Member Author


- The builtin function arity information is approximate: rather than using the accurate arities,
we simply use the maximum number of arguments applied to a builtin function in the program
as the builtin function's arity. The arity information is used to determine whether a builtin
application is possibly saturated.
Copy link
Member Author


Good point. Will address this later: https://input-output.atlassian.net/browse/PLT-8659

In the third pass, we calculate a count for each `(term, path)` pair, where `term` is a
non-workfree term, and `path` is its path. If the same term has two paths, and one is an
ancestor (i.e., prefix) of the other, we increment the count for the ancestor path in both
instances.
Copy link
Member Author


Because we need to aggregate the counts on each path via Map.fromListWith.

Now, each `(term, path)` pair whose count is greater than 1 is a CSE candidate.
In the above example, the CSE candidates are `(2+x, "0.1")` and `(1+(2+x), "0.1")`.
Note that `3+x` is not a CSE candidate, because it has two paths, and neither has a count
greater than 1. `2+` is also not a CSE candidate, because it is workfree.
Copy link
Member Author


We could implement a version of CSE that optimizes for size in the future. Some users value size more than cost.

, (2+x) + (3+x)
, 4+x
]
) (1+(2+x))
Copy link
Member Author


But then shouldn't we handle such lambdas the same way regardless of whether they were originally in the program or were added by CSE?

Good point. What I'm not sure about is whether CSE should handle this, or whether we should implement another optimization that floats terms which are inside an immediately-applied lambda, and don't depend on the lambda-bound variable, out of that lambda (i.e., similar to full-laziness).

Suppose we have two CSE candidates cse1 and cse2, where cse1 occurs within cse2.

That's why the CSE candidates are processed in descending order of size. If we process cse2 first, then there's no problem.

, MonadQuote m
, HasUnique name TermUnique
, Ord name
, Typeable name
Copy link
Member Author


Oh, I remember why I didn't do that: the current implementation of CSE only works correctly for Name. We can't just do the substitutions that we do for DeBruijn terms.

@@ -74,3 +89,10 @@ simplifyTerm opts = simplifyNTimes (_soMaxSimplifierIterations opts)
caseOfCase' = case eqT @fun @DefaultFun of
Just Refl -> caseOfCase
Nothing -> id

cseStep :: Int -> Term name uni fun a -> m (Term name uni fun a)
cseStep _ = case (eqT @name @Name, eqT @uni @PLC.DefaultUni) of
Copy link
Member Author


I think the current implementation of CSE only works correctly for Name. We can't just do the substitutions that we do for DeBruijn terms.
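The problem above can be seen on a toy model (the `DB` type below is an assumption for illustration, far simpler than the real UPLC AST): with de Bruijn indices, occurrences of the same variable are represented differently at different binder depths, so a map keyed on term structure misidentifies common subexpressions.

```haskell
-- Toy de Bruijn terms (an assumption, not the Plutus AST): the same
-- variable is represented by different indices at different depths, so
-- structural equality does not coincide with "same subexpression".
data DB = DVar Int | DLam DB | DApp DB DB
  deriving (Eq, Show)

-- \x -> x (\y -> x): the two references to x are DVar 0 and DVar 1.
term :: DB
term = DLam (DApp (DVar 0) (DLam (DVar 1)))
```

Here `DVar 0 /= DVar 1` even though both mean `x`; conversely, structurally equal de Bruijn subterms under different binders can mean different variables. With `Name`-based terms and global uniqueness, structural comparison is reliable.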

The implementation makes several passes on the given term. The first pass collects builtin
arity information as described above.

In the second pass, we assign a unique ID to each `LamAbs`, `Delay`, and each `Case` branch.
Member Author

I think point 1 is correct. I'll add it to the doc. I don't agree with 2 or 3: if it is incorrect to hoist something (because, e.g., it depends on the lambda-bound variable), then it can't be a common subexpression anyway.

@effectfully
Contributor

effectfully commented Nov 17, 2023

I've spent some time pondering the general algorithm and I think what this PR does is possible to do better performance-wise. I don't know if we care about that though; maybe for a typical Plutus program the current algorithm is just fine, so feel free to ignore.

You check whether one path is an extension of another and do different things depending on the answer, but I think we can avoid this logic entirely. We're going to construct a DAG where each node is a HashMap from subexpressions to their counts and each edge is an ID (i.e. a step in a path). If we take your example

\x y -> (1+(2+x))
        +
        (case y [ (1+(2+x)) + (3+x)
                , (2+x) + (3+x)
                , 4+x
                ]
        )

then 2+x, 1+(2+x), case y <...> and 1+(2+x) + case y <...> are all going to be keys in a HashMap stored in a node, with values all being 1.

Similarly, 3+x is going to be stored together with 2+x, 1+(2+x) and (1+(2+x)) + (3+x) in another node, with the ID of the first branch of the case expression being an edge between the previously described node and this one.

Each time you enter a new scope, you start from an empty HashMap and populate it with all the subexpressions appearing at the current path. Once this pass is completed, you have a DAG where all common subexpressions appearing at the same path are accounted for by their count being greater than 1 in the nodes of the graph. It only remains to process common subexpressions that occur at different paths, but since you have a DAG that guarantees that travelling by its edges always gives you an extension of the current path you can do exactly that: travel by the edges, merge the accumulator hashmap with the hashmap of each of the nodes and when you reach a leaf of the AST compute which subexpressions occurring at any point in the current path to pull out and where to put them.

The actual implementation may avoid materializing DAGs and just add some in-place logic doing the same. UPD: well, I guess it's not even a DAG, just a rose tree.
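The rose tree just described can be sketched as follows (all names here are assumptions, and `Data.Map` stands in for the `HashMap` in the discussion):

```haskell
import qualified Data.Map.Strict as Map

-- Each node maps subexpressions (represented as Strings here for
-- simplicity) to their counts at that path; each edge is the ID of the
-- LamAbs/Delay/Case-branch entered to reach the child.
data ScopeTree e = ScopeTree
  { nodeCounts :: Map.Map e Int
  , children   :: Map.Map Int (ScopeTree e)
  }

-- While travelling down an edge, the accumulator is merged with the
-- child's counts, so path extensions never need to be compared explicitly.
mergeCounts :: Ord e => Map.Map e Int -> Map.Map e Int -> Map.Map e Int
mergeCounts = Map.unionWith (+)
```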

Fundamentally, the idea of your algorithm is reader-based, but you handle it with a state-based implementation and then you need to check path prefixes to account for the extra power of the implementation that you don't use.

But as per other comments, maybe that's fine, maybe we want a more general algorithm, so that in future we can optimize not only for evaluation costs, but also for the size of the script, which requires pulling common subexpressions occurring at different paths.

Member Author

@zliu41 zliu41 left a comment

Responded to more comments.

, (2+x) + (3+x)
, 4+x
]
) (1+(2+x))
Member Author

@michaelpj I don't get your point at all. What do you think is the correct output for this example?

cse2 is smaller than cse1. Shouldn't the smaller one be in the outer scope? I.e., you'd want

let cse2 = 2 + x in
let cse1 = 1 + cse2 in body

If you flip it you'd get

let cse1 = 1 + (2 + x) in
let cse2 = 2 + x in body

Can we end up with a situation like this?

In your example you are processing CSE candidates in ascending order of size, while I'm proposing descending order.


The CSE candidates are then processed in descending order of their `termSize`s. For each CSE
candidate, we generate a fresh variable, create a LamAbs for it under its path, and substitute
it for all occurrences in the original term whose paths are descendants (or self) of
Member Author

Can we make a pathological test case to see?

Something like

((1+2) + (3+4) + (5+6) + ... + (999999 + 1000000))
+
((1+2) + (3+4) + (5+6) + ... + (999999 + 1000000))

? Do you have a worse pathological case in mind?
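A hypothetical generator for that pathological case, rendered as a string for illustration (a real test would build a UPLC term instead of a string):

```haskell
-- Build "(1+2) + (3+4) + ... + ((n-1)+n)" for an even n.
halfSum :: Int -> String
halfSum n =
  foldr1 (\a b -> a ++ " + " ++ b)
    [ "(" ++ show i ++ "+" ++ show (i + 1) ++ ")" | i <- [1, 3 .. n - 1] ]

-- Duplicate it: one large common subexpression with exactly two paths.
pathological :: Int -> String
pathological n = "(" ++ h ++ ") + (" ++ h ++ ")" where h = halfSum n
```

For example, `pathological 6` produces `((1+2) + (3+4) + (5+6)) + ((1+2) + (3+4) + (5+6))`; the interesting property is that each shared subexpression occurs at only two paths while the term itself is deep.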

In the third pass, we calculate a count for each `(term, path)` pair, where `term` is a
non-workfree term, and `path` is its path. If the same term has two paths, and one is an
ancestor (i.e., prefix) of the other, we increment the count for the ancestor path in both
instances.
Member Author

Here's an example. Suppose for 2+x, we currently have the following counts:

"0.1.2" -> 10, "0.1.3" -> 20, "0.2.2" -> 30

Now we process path "0.1". After processing it, the counts will become:

"0.1" -> 10, "0.1" -> 20, "0.2.2" -> 30`, "0.1" -> 1

Finally we use Map.fromListWith to aggregate them, and get

"0.1" -> 31, "0.2.2" -> 30

Does that make sense?
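That aggregation can be sketched under the assumption that a path is a list of IDs (`combinePaths` here is an illustration, not the real implementation): reassign each count to its outermost ancestor within the set of paths, then sum with `Map.fromListWith`.

```haskell
import Data.List (isPrefixOf)
import qualified Data.Map.Strict as Map

type Path = [Int]  -- assumption: root-first list of scope IDs

combinePaths :: [(Path, Int)] -> Map.Map Path Int
combinePaths pcs = Map.fromListWith (+) [ (anc p, c) | (p, c) <- pcs ]
  where
    -- The shortest path in the set that is a prefix of p; p itself is
    -- always a candidate, so the list is never empty.
    anc p = minimum [ q | (q, _) <- pcs, q `isPrefixOf` p ]
```

On the example above, `combinePaths [([0,1,2],10),([0,1,3],20),([0,2,2],30),([0,1],1)]` yields 31 at `[0,1]` and 30 at `[0,2,2]`, matching the `Map.fromListWith` result described.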

The implementation makes several passes on the given term. The first pass collects builtin
arity information as described above.

In the second pass, we assign a unique ID to each `LamAbs`, `Delay`, and each `Case` branch.
Member Author

you might have structurally identical terms that are not in fact identical, which might require this kind of care. Are we assuming global uniqueness?

Hmm, I guess that is indeed possible. I'll make it do renaming before CSE.

@zliu41
Member Author

zliu41 commented Nov 20, 2023

travel by the edges, merge the accumulator hashmap with the hashmap of each of the nodes and when you reach a leaf of the AST compute which subexpressions occurring at any point in the current path to pull out and where to put them.

How exactly does this work? Say a common subexpression e occurs at 0.1, 0.1.2 and 0.1.3. When you travel from 0.1 to 0.1.2, you decide to pull out e. That's fine. What happens if you then travel from 0.1 to 0.1.3, and also want to pull out e?

I think what this PR does is possible to do better performance-wise.

My implementation can be slightly improved asymptotically: currently, for each expression, it is O(p^2) due to combinePaths, where p is the number of paths the expression occurs at. But it can be changed to: for each path, simply check all its ancestors, in which case it becomes O(p * depth). If I understand what you propose correctly, it is not asymptotically better than that.

1. Simplifications
-------------------------------------------------------------------------------

This is a simplified (i.e., not fully optimal) implementation of CSE. The two simplifications
Contributor

I guess I think we might want to transfer a bit more of that reasoning into the notes?

The implementation makes several passes on the given term. The first pass collects builtin
arity information as described above.

In the second pass, we assign a unique ID to each `LamAbs`, `Delay`, and each `Case` branch.
Contributor

And write something!!


The CSE candidates are then processed in descending order of their `termSize`s. For each CSE
candidate, we generate a fresh variable, create a LamAbs for it under its path, and substitute
it for all occurrences in the original term whose paths are descendants (or self) of
Contributor

Yes, that looks like a good example!

, (2+x) + (3+x)
, 4+x
]
) (1+(2+x))
Contributor

Okay, I just had it backwards. You're doing them in descending order of size 🤦

Then yes, you need to put each subsequent one outside the other ones. And putting the new ones outside should always be safe.

Comment on lines +84 to +87
The CSE candidates are then processed in descending order of their `termSize`s. For each CSE
candidate, we generate a fresh variable, create a LamAbs for it under its path, and substitute
it for all occurrences in the original term whose paths are descendants (or self) of
the candidate's path. This order is used because a bigger expression may contain a smaller subexpression.
Contributor

To be clear, when you say "perform substitution in the original term itself", you mean "the partially processed term with some candidates hoisted out"? Because you do want to hoist out the instance of (2+x) in the RHS of the newly created binding for the first CSE candidate, right?

@@ -226,6 +226,35 @@ multiApp = runQuote $ do
app = mkIterAppNoAnn lam [mkConstant @Integer () 1, mkConstant @Integer () 2, mkConstant @Integer () 3]
pure app

-- | This is the first example in Note [CSE].
cse1 :: Term Name PLC.DefaultUni PLC.DefaultFun ()
Contributor

I feel like we've discussed quite a number of examples so far - this should have more than two tests!

Contributor

I think we need to have a property test that cse term is alpha-beta-equal to term. I feel like it wouldn't be too hard to extend our alpha-equality checking algorithm to support beta-equality as well, but maybe I'm being foolish. In any case, it's not worth blocking this PR on property tests and I'd be happy to work on those if we think it's a good idea. @zliu41 should I create a ticket?

I do agree with Michael we should have more golden tests here though.
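The proposed property can be illustrated on a toy expression language (an assumption; not the Plutus AST and not the real `cse`): hoisting a repeated pure subexpression into a binding must not change the evaluated result.

```haskell
-- Toy expression language with let-bindings.
data Expr
  = Lit Int
  | Var String
  | Add Expr Expr
  | Let String Expr Expr
  deriving (Eq, Show)

eval :: [(String, Int)] -> Expr -> Int
eval _   (Lit n)     = n
eval env (Var x)     = maybe (error ("free: " ++ x)) id (lookup x env)
eval env (Add a b)   = eval env a + eval env b
eval env (Let x r b) = eval ((x, eval env r) : env) b

-- One hand-rolled "CSE" step on a concrete term: bind the shared
-- subterm once and refer to it twice.
shared, before, after :: Expr
shared = Add (Lit 2) (Lit 40)
before = Add shared shared
after  = Let "cse" shared (Add (Var "cse") (Var "cse"))
```

Here `eval [] before` and `eval [] after` agree; a property test would generalise this to generated terms, comparing `cse term` against `term` up to evaluation (or alpha-beta-equality).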

Member Author

I'll add the third example in the Note, as well as the pathological example.

@effectfully
Contributor

How exactly does this work? Say a common subexpression e occurs at 0.1, 0.1.2 and 0.1.3. When you travel from 0.1 to 0.1.2, you decide to pull out e. That's fine. What happens if you then travel from 0.1 to 0.1.3, and also want to pull out e?

When you reach a leaf you record that e is chosen to be pulled out and the same happens in the second case, then you merge all the decisions into a single map (or whatever) from subexpressions to where to insert new bindings for them and then in one fell swoop you insert the bindings and replace all subexpressions with their respective bindings in the original term. Maybe it's more passes than it could be, but I'd rather have more passes for testability and debugging purposes than lump all the logic into as few passes as possible.

But it can be changed to: for each path, simply check all its ancestors, in which case it becomes O(p * depth). If I understand what you propose correctly, it is not asymptotically better than that.

Yes, what I propose is one way to check only ancestors without iterating through all possible paths. I used a DAG for the model to be "tangible" (easier to think about, easier to explain), but if you do it directly, then that'll be great too.

Contributor

@effectfully effectfully left a comment

The code looks somewhat intimidating, but it's actually surprisingly straightforward. I've reviewed everything and it looks good to me, even though it's not optimal, as we've discussed at length here; maybe it's efficient enough and we don't care about optimizing the whole thing.

I may have another run to reconcile the description of the algorithm with the implementation, but that shouldn't block the PR.

addToMap ::
Term name uni fun (Path, ann) ->
HashMap (Term name uni fun ()) (HashMap Path (Term name uni fun (Path, ann), Int)) ->
HashMap (Term name uni fun ()) (HashMap Path (Term name uni fun (Path, ann), Int))
Contributor

Oh, so you never actually put Term and Path together as a single key. I got confused about that when reading the description of the algorithm. It's probably me, but maybe you can make it more clear there.


-- `isWorkFree` currently doesn't check whether a builtin application is saturated,
-- or whether a term is the (possibly repeated) forcing of a builtin (which should
-- be workfree), so we check it separately.
| isWorkFree t0 || not (isBuiltinSaturated t0) || isForcingBuiltin t0 = id
Contributor

Forcing isn't workfree! We have deferred unlifting, but not deferred interleaving (see this ticket). It does make sense to pull force someBuiltin out, since it does actually compute, and we seem to have given up on the deferred interleaving idea. I think Plutonomy had an optimization of this kind.

Contributor

^ this comment is an important one. Perhaps you can even remove the entire first pass?

Member Author

I initially did this and it made things worse, I think because, on the one hand, we avoid evaluating force someBuiltin multiple times, but on the other hand, we evaluate LamAbs and Apply more times, which is more expensive!

Contributor

we evaluate LamAbs and Apply more times, which is more expensive!

How come? Were you eta-expanding builtin applications? Or is it just the cost of creating a binding? Sounds strange to me.

Member Author

How come? Were you eta-expanding builtin applications? Or is it just the cost of creating a binding? Sounds strange to me.

No, it's because when we CSE something, it creates an additional LamAbs and Apply.

Member Author

@zliu41 zliu41 left a comment

Replied to a few new review comments.

1. Simplifications
-------------------------------------------------------------------------------

This is a simplified (i.e., not fully optimal) implementation of CSE. The two simplifications
Member Author

Ok, I'll expand the notes a bit

In the third pass, we calculate a count for each `(term, path)` pair, where `term` is a
non-workfree term, and `path` is its path. If the same term has two paths, and one is an
ancestor (i.e., prefix) of the other, we increment the count for the ancestor path in both
instances.
Member Author

Actually paths indeed don't need to be map keys... I'll update the code

Comment on lines +84 to +87
The CSE candidates are then processed in descending order of their `termSize`s. For each CSE
candidate, we generate a fresh variable, create a LamAbs for it under its path, and substitute
it for all occurrences in the original term whose paths are descendants (or self) of
the candidate's path. This order is used because a bigger expression may contain a smaller subexpression.
Member Author

you mean "the partially processed term with some candidates hoisted out"?

Yes.



@zliu41
Member Author

zliu41 commented Nov 22, 2023

but if you do it directly, then that'll be great too.

I didn't end up doing it, because (1) it isn't enough to check for ancestors, but we also need to check for descendants; (2) in some cases, such as the pathological example I added, the number of paths per expression (two) is much smaller than the depth of the tree, so path * depth is worse than path * path. And in reality, deep ASTs may not be uncommon.

then you merge all the decisions into a single map

I see. Time complexity wise, for each subexpression e that has p paths, in the worst case each path is in a different leaf node, so this is O(p * depth). Like I said above, this is not necessarily better than O(p * p).

({cpu: 9927795
Member Author

If only we could do something like this to the Marlowe validators!

@@ -74,3 +89,10 @@ simplifyTerm opts = simplifyNTimes (_soMaxSimplifierIterations opts)
caseOfCase' = case eqT @fun @DefaultFun of
Just Refl -> caseOfCase
Nothing -> id

cseStep :: Int -> Term name uni fun a -> m (Term name uni fun a)
cseStep _ = case (eqT @name @Name, eqT @uni @PLC.DefaultUni) of
Contributor

What about requiring HasUnique? That's usually what we do in some situations, and it's properly polymorphic instead of this abomination.

, MonadQuote m
, HasUnique name TermUnique
, Ord name
, Typeable name
Contributor

HasUnique?

addInteger
(addInteger
(addInteger
(addInteger (addInteger cse cse) cse)
Contributor

What is going on here? We're not CSE-ing single variables are we? That would definitely not be a win...

Contributor

We're not CSE-ing single variables are we?

I had it written in my notes to check specifically for that and then failed to actually do it; it's been a while since I felt this much like a complete moron...

Member Author

Hmm, this seems like a bug, since it applies CSE to the partially applied addInteger, which it isn't supposed to do.

Member Author

Actually not a bug - the partially applied addInteger is in fact not a builtin.

(\ds -> `$mInts` ds (\ds ds ds ds -> ds) (\void -> error)))
(int d))
(\ds -> `$mInts` ds (\ds ds ds ds -> ds) (\void -> error)))
(\scrut cont fail ->
Contributor

This is actually very nice: we CSE $mInts ds, i.e. the partial application.

@effectfully
Contributor

Oh yeah, my approval only holds as long as cse is changed to perform renaming before processing the term as per Michael's comment.

@zliu41 zliu41 merged commit 1a4b503 into master Dec 8, 2023
4 checks passed
@zliu41 zliu41 deleted the zliu41/cse branch December 8, 2023 22:55
v0d1ch pushed a commit to v0d1ch/plutus that referenced this pull request Dec 6, 2024