
feat: fork update #33

Open · wants to merge 16 commits into base: main
Conversation

tamirhemo
Copy link
Contributor

No description provided.

onurinanc and others added 16 commits February 26, 2024 23:44
* add prove_goldilocks_poseidon example in keccak-air

* cargo fmt
There seems to be a minor issue with prove_goldilocks_poseidon.rs.

As far as I can tell this should fix it.
* Adding a Halve Field

For essentially all fields (assuming the characteristic is not 2) we can halve elements without doing a multiplication. This should be both faster and easier to use, as we no longer need `F::two().inverse()` floating around.

* Added generic helper functions for halve.

Added two helper functions to field\src\helpers to compute the halve function for 32- and 64-bit fields. Ideally these should be merged, but I can't currently see how to do that.

Moved halve back to Field trait (instead of AbstractField trait).
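
The multiplication-free halving described above can be sketched as follows. This is an illustrative sketch, not the crate's actual implementation; the Mersenne31 modulus `P` is chosen purely as an example of an odd-characteristic prime field:

```rust
// Halving in F_p (p odd) without a field multiplication.
// If x is even, x/2 is just a shift; if x is odd, x + p is even and
// congruent to x mod p, so (x + p) >> 1 gives the right answer.
const P: u64 = (1 << 31) - 1; // Mersenne31 prime (illustrative choice)

fn halve(x: u64) -> u64 {
    debug_assert!(x < P);
    if x & 1 == 0 {
        x >> 1
    } else {
        (x + P) >> 1
    }
}
```

Doubling the result modulo p recovers the input, which is a convenient sanity check for any concrete modulus.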

* Adding Comments
* PCS domain rework - pcs and uni-stark done, actual pcs are not

small fixes

remove Point gat; open at Challenge

remove multi-stark

cargo fix

cargo fix

twoadic open

decompose

move domain to commit

forgot to add

cargo fix

hmm

i have a bug for degree > 3

fix recompose for higher rate

fix pcs bugs

rm comments

clippy, fmt

lints

remove crates from ci

iterate correctly

loop better

rm comments

lint, remove decomposes

rm tensor-pcs

TwoAdicMultiplicativeCoset

fix trivialpcs

* fix tests, rm comments

* fmt

* fix gf poseidon

* clippy

* rename Domain -> Space

* no_std

* fix testing
…#248)

* Benchmarking for FieldArray Exp

as title

* Improving Sub and Add in M31

After chatting with Jacqui I thought I'd try and make a couple of minor improvements to the base add/sub methods.

Sub seems to be ~20% faster, add is closer to 5% faster.
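
For context, branch-light Mersenne-31 add/sub might look roughly like the sketch below (an assumption for illustration, not this commit's actual optimized code), using the identity 2^31 ≡ 1 (mod p) so the carry out of the low 31 bits folds back in as +1:

```rust
const P: u32 = (1 << 31) - 1; // Mersenne31 prime

// Addition: a, b < p, so the u32 sum cannot overflow; fold bit 31 back in.
fn add(a: u32, b: u32) -> u32 {
    let s = a + b;
    let r = (s & P) + (s >> 31); // 2^31 ≡ 1 (mod p)
    if r >= P { r - P } else { r }
}

// Subtraction: add p back only when the raw subtraction borrows.
fn sub(a: u32, b: u32) -> u32 {
    let (d, borrow) = a.overflowing_sub(b);
    if borrow { d.wrapping_add(P) } else { d }
}
```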

* Avoiding a proliferation of bench files.

* Update Cargo.toml

* Re-adding bench for add/sub

* Combining Bench File

Made a new benchmark file to combine the benches for add, sub, inverse and 5th root.

Fixed some formatting issues.

Also replaced overflowing_add_unsigned by overflowing_add as it seems to be faster for some reason.

* clippy

* Cleaning Up

Removing the old commented out implementation of sum and replacing it with a link.

* Further Optimisations

Some black magic optimisations thanks to Jacqui.

Improves the compiled assembly output.

* Small MDS 16x16 matrix

Initial commit, many fixes left to do

* Added General Functions

Adding general functions to compute the convolution using a Karatsuba approach for all 2^n. Currently these are much much slower than the specialized functions (20x or more slower). Need to keep investigating why this is the case.

* Improved Karatsuba Algorithms

Improved the Karatsuba algorithms to perform a single step of the FFT algorithm before resorting to Karatsuba methods.

This improves constants by a reasonable margin and gets us to the point where the Karatsuba 16 convolution is now 25-30% faster than the n^2 implementation.

Additionally implemented this method for convolutions of size 32 making use of i128's. This gives a massive (> 50%) speed up over the current implementation as it allows for delayed reductions.
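
The size-halving step mentioned above (one FFT-style split before falling back to smaller convolutions) can be illustrated with this sketch over plain `i64` coefficients. The function names are hypothetical, and the naive subroutines stand in for the optimized ones; the split uses x^(2n) - 1 = (x^n - 1)(x^n + 1), so a cyclic convolution of size 2n reduces to one cyclic and one negacyclic convolution of size n:

```rust
// Naive size-n cyclic convolution: multiply mod x^n - 1.
fn cyclic_naive(u: &[i64], v: &[i64]) -> Vec<i64> {
    let n = u.len();
    let mut c = vec![0i64; n];
    for i in 0..n {
        for j in 0..n {
            c[(i + j) % n] += u[i] * v[j];
        }
    }
    c
}

// Naive size-n negacyclic convolution: multiply mod x^n + 1.
fn negacyclic_naive(u: &[i64], v: &[i64]) -> Vec<i64> {
    let n = u.len();
    let mut c = vec![0i64; n];
    for i in 0..n {
        for j in 0..n {
            if i + j < n { c[i + j] += u[i] * v[j]; }
            else { c[i + j - n] -= u[i] * v[j]; }
        }
    }
    c
}

// One CRT split: size-2n cyclic via size-n cyclic + size-n negacyclic.
fn cyclic_split(u: &[i64], v: &[i64]) -> Vec<i64> {
    let n = u.len() / 2;
    let su: Vec<i64> = (0..n).map(|i| u[i] + u[i + n]).collect(); // u mod x^n - 1
    let sv: Vec<i64> = (0..n).map(|i| v[i] + v[i + n]).collect();
    let du: Vec<i64> = (0..n).map(|i| u[i] - u[i + n]).collect(); // u mod x^n + 1
    let dv: Vec<i64> = (0..n).map(|i| v[i] - v[i + n]).collect();
    let cyc = cyclic_naive(&su, &sv);
    let neg = negacyclic_naive(&du, &dv);
    let mut c = vec![0i64; 2 * n];
    for i in 0..n {
        c[i] = (cyc[i] + neg[i]) / 2;     // both sums are always even
        c[i + n] = (cyc[i] - neg[i]) / 2;
    }
    c
}
```

Working over signed integers rather than field elements is what allows the delayed reductions mentioned above.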

* Added: 'from_wrapped_u128' to field specification.

Naively extended all the from_wrapped_u64 methods to from_wrapped_u128. These methods could likely be improved.
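
A naive `from_wrapped_u128` of the kind described might look like this sketch (the BabyBear prime is chosen purely for illustration; the real per-field implementations may differ):

```rust
const P: u64 = 0x7800_0001; // BabyBear prime, 2^31 - 2^27 + 1 (illustrative)

// Interpret an arbitrary u128 as a field element by reducing mod p.
// "Wrapped" means the input may be far outside [0, p).
fn from_wrapped_u128(x: u128) -> u64 {
    (x % P as u128) as u64
}
```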

* Added Comments

Added a lot of comments to all functions.

Introduced a new type SimpleInteger so that the functions in here will work with both i64 and i128 as well as potentially other types.

Cleaned up some of the code and moved things around somewhat.

* Improved Karatsuba Decomposition

The Even-Odd decomposition seems to be faster than the Left-Right decomposition.

Additionally extended the code to work with matrices of size 64x64.

There is still quite a bit of copy-and-pasted code, so it would be good to find a way to cut down on that. The code also seems to do a bunch of shuffling of data (e.g. similar to the bit reversals which turn up in FFTs). Ideally we would find a way to avoid these.

* Formatting

* Minor Updates

Moved code around somewhat to make it easier to follow.

Slightly improved the recombination code for convolutions to use fewer additions and fewer variables by passing in a mutable output vector.

* Formatting

cargo fmt...

* Updated Large Convolutions

Added code to speed up large convolutions, allowing us to work basically just with i64's.

Unfortunately this involved duplicating a bunch of functions. Will try to clean it up over the next couple of days.

* Improving reductions

Added specific code to do the reductions in the final step.

Minor formatting issues.

* Minor fixes

Cleaned up the code which converted back to a Field element in the last step.

Added comments and debug_assert.

* Setting up code for babybear

Starting work on a major overhaul to allow code to work with multiple fields.

* Completing Proof

Completed the proof that the suggested algorithm is valid.

* Code Refactoring

Continued Large refactoring of Code.

Introduced 2 new types, corresponding to the trait NonCanonicalPrimeField32 and started working on making the convolution code more polymorphic which should let us reduce how much code is currently there. (In particular we should be able to get rid of all the large_entries functions.)

More work to do but the fundamentals are coming together.

I'm also wondering if NonCanonicalPrimeField32 might be able to be used in FFTs as well to help implement delayed reductions. (But that is for a future project.)

* Finished Convolution Updates

It should now be possible to switch from convolutions in M31 to convolutions in BB essentially for free. Speed seems to be unaffected.

Was also able to simplify code to eliminate some repartitions.

Still need to update code for non power of 2 sizes but this should be an easy change.

* Ported BabyBear convolutions over

Massive speed up on the current implementations, particularly for 32 and 64 cases.

Also got the n = 12 case done.

* Formatting

* Added some dot products.

* Initial Attempt at Noncanonical FFTS

Initial try to see if we can easily implement delayed reduction for FFT's.

Had to shuffle around code to prevent a circular dependency chain.

* More Generic Functions

Implementing Hamish's idea to reduce the amount of code needed for Karatsuba Convolutions.

* Passing in Output

Minor change to have the output passed in. More tests to come on this.

Also fixed some formatting and deleted some now unneeded code.

* Adding debug checks and unsafe markers

As title

* Minor formatting and comments

* Rolling back DFT changes.

Removing the DFT changes. Might do something in a future pull request but not now.

* Formatting

Forgot to run Cargo fmt on the last upload... Also removing code I added to dft/naive.

* Removing changes to dft/traits.

This should be the last file I need to roll back.

* Use faster reduction in Goldilocks.

* Move field-specific MDS code to the fields' crates.

* Move field-specific MDS code to the respective field trait.

* Finish moving MDS code to field traits; miscellaneous tidying.

* Update bench imports after move.

* Comments and fiddling with memory loads.

* Faster/cleaner i64 and i128 reductions.

* Better reduction; implement size 24; more benchmarks.

* Refactor `Convolve::apply()`.

* Use small size 16 MDS matrix for Goldilocks.

* Update expected output for Goldilocks MDS 16 after changing matrix.

* Simplify reduction since the final values must be non-negative.

* Expand documentation.

* Update Rescue Prime expected outputs to use new size 12 MDS matrix.

* Cargo fmt

* Remove old non-canonical files.

* Remove left-over debugging info.

* Remove now unused linear_combination methods.

* Replace `mul` with `parity_dot` as main template method.

* Use Angus's faster partial reduction trick.

* Add `parity_dot` for BabyBear and associated reduction.

* Minor tidying.

* Encourage rather than force inlining for large convolutions.

* Add two functions missed in the merge.

* Clippy.

* Documentation.

* Implement `from_wrapped_u128` for NEON `PackedField`s

* Tidy comments following Angus's suggestions.

* Remove unused `from_wrapped_u128`.

* Update mds/src/karatsuba_convolution.rs

Co-authored-by: Jacqueline Nabaglo <[email protected]>

* Replacing Signed by Negacyclic

Signed Convolutions already have a name, Negacyclic!

This update is simply an in place replacement of every instance of signed by negacyclic.

* Minor Fix

This should make all the tests pass.

* Add debug assert and comment

The dot_product function will fail if N = 0 but this would be a little annoying to fix and should never come up.

* Make conv_n, negacyclic_conv_n private

Moving conv_n and negacyclic_conv_n out of the trait to make the functions private.

---------

Co-authored-by: AngusG <[email protected]>
Co-authored-by: Jakub Nabaglo <[email protected]>
Co-authored-by: Daniel Lubarov <[email protected]>
Co-authored-by: AngusG <[email protected]>
Co-authored-by: Jacqueline Nabaglo <[email protected]>
* Adding Sum Benchmark

Adding a benchmark to see how fast the current iterator summation functions are.

(Suspect this might be one of the bottlenecks)

* Code to test Summations

Simple testing code to look at delayed reduction in summations. As can be seen, delayed reduction is much faster for larger sizes (16 and up).

Thus we should be using this for larger sizes for Poseidon2.
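
The delayed-reduction idea being benchmarked here can be sketched as follows (illustrative only; the modulus and function names are assumptions, not the benchmark's actual code):

```rust
const P: u64 = 0x7800_0001; // BabyBear prime (illustrative)

// Baseline: reduce after every addition.
fn sum_eager(xs: &[u64]) -> u64 {
    xs.iter().fold(0, |acc, &x| (acc + x) % P)
}

// Delayed reduction: accumulate in the full u64 range, reduce once at the
// end. Valid while n * (p - 1) fits in a u64, i.e. up to ~2^33 terms here.
fn sum_delayed(xs: &[u64]) -> u64 {
    xs.iter().sum::<u64>() % P
}
```

For short sums the single `%` dominates either way; the win shows up once the per-element reductions are amortized away, matching the size-16-and-up observation above.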

* Removing MDS dependency from Poseidon 2

We never used MDS but needed features = ["min_const_gen"] in rand which was being gotten indirectly from loading in the MDS crate.

* Refactor Babybear

Moving all BabyBear specifics into the BabyBear crate.

This will allow us to access the internals without having to call to_monty or from_monty.

* First Algorithm Updates

Added delayed reduction for the summation and saved the multiplicative constants already in MONTY form.

~20% speed up for WIDTH 16
~25% speed up for WIDTH 24.

* Testing faster exp^7 methods

Seems like I can get a ~10% improvement. Possibly more can be squeezed out with some refinements to monty_square_reduce and monty_cube_reduce.

The numbers in the multi_mul benchmarks seem to be quite misleading and disagree a lot with the numbers in the poseidon2 benchmark.

* Speed Improvements

Slightly improved the monty_square_reduce algorithm. Using this to compute `x -> x^7` we get a `~25 - 30%` speed improvement over the current implementation.

This leads to an immediate `10-15%` extra speed up for Poseidon2 and should also speed up any other algorithms using the `x -> x^7` map.

It can also probably be used to improve the inverse and root_7 functions though this will be more work.
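
For background, the `monty_*` helpers above refer to Montgomery arithmetic. A generic 32-bit Montgomery reduction (REDC) looks roughly like this sketch; it is not the repo's implementation, and deriving the key constant by Newton iteration is just one standard choice:

```rust
const P: u64 = 0x7800_0001; // BabyBear prime (illustrative)

// -P^{-1} mod 2^32 via Newton iteration; each step doubles correct bits.
fn neg_p_inv() -> u32 {
    let p = P as u32;
    let mut inv: u32 = p; // correct to 3 bits since p ≡ 1 (mod 8)
    for _ in 0..5 {
        inv = inv.wrapping_mul(2u32.wrapping_sub(p.wrapping_mul(inv)));
    }
    inv.wrapping_neg()
}

// REDC: given t < p * 2^32, return t * 2^{-32} mod p.
fn monty_reduce(t: u64) -> u64 {
    let m = (t as u32).wrapping_mul(neg_p_inv()) as u64;
    let r = (t + m * P) >> 32; // low 32 bits of t + m*p are zero
    if r >= P { r - P } else { r }
}

// Map into Montgomery form: x -> x * 2^32 mod p.
fn to_monty(x: u64) -> u64 {
    (((x as u128) << 32) % P as u128) as u64
}
```

Multiplying two Montgomery-form values and applying `monty_reduce` yields the product in Montgomery form; a second `monty_reduce` maps back to the plain representation.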

* Fixing bench

Making the Exp 7 benchmark correctly compare the three different natural methods.

* Fixing Constants

Added a constant function to convert an array of u32's into an array of BabyBear elements in Monty form.

Needed as the ToMonty function can change on different architectures.

* Removing delayed reduction code

Going to deal with that in a different pull request.

* Cleaning up Sum Benchmarking Code

I've been informed that benchmarks in the single-nanosecond range are a bit BS, so I added a loop to push all timings into the microseconds.

* Refactored Goldilocks

Moved the Goldilocks specific code into the Goldilocks module.

* fmt and added comment

* Fixing Imports

* More small fixes

Hope these are the last. Just making sure all imports of DiffusionMatrixBabybear come from the right place.

* Using delayed summation for Goldilocks

Slightly faster for sizes >= 16.

Also fixed Clippy Error.

* Cargo fmt....

* Update baby-bear/src/baby_bear.rs

Co-authored-by: Hamish Ivey-Law <[email protected]>

* Update baby-bear/src/baby_bear.rs

Co-authored-by: Hamish Ivey-Law <[email protected]>

* Update baby-bear/src/baby_bear.rs

Co-authored-by: Hamish Ivey-Law <[email protected]>

* Update poseidon2/src/matrix.rs

Co-authored-by: Hamish Ivey-Law <[email protected]>

* Update poseidon2/src/matrix.rs

Co-authored-by: Hamish Ivey-Law <[email protected]>

* Resolving most requested changes.

Changed sum benchmarks to take in an input giving number of repetitions and length of the summation.

Added comments above sum_u64/sum_u128 specifying when they should be used (e.g. for what summation lengths they are faster). Updated sum_u128 to use reduce128 instead of a %.

Fixed comment in baby-bear/poseidon2.

Added

rand = { version = "0.8.5", features = ["min_const_gen"] }

to [dev-dependencies] for babybear and goldilocks.

Technically these aren't needed, as both field crates import rand and poseidon2, and poseidon2 imports min_const_gen, so min_const_gen ends up getting enabled in the field crates anyway. But this prevents strange errors from appearing if we at some point rework poseidon2 and remove min_const_gen from its imports.
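
For reference, a `reduce128`-style Goldilocks reduction can be sketched as below. This is hypothetical, not the crate's code: it splits the high word into 32-bit limbs and uses 2^64 ≡ 2^32 - 1 and 2^96 ≡ -1 (mod p), with the final fold done naively here where a real implementation would avoid the `%`:

```rust
const P: u64 = 0xFFFF_FFFF_0000_0001; // Goldilocks prime, 2^64 - 2^32 + 1

fn reduce128(x: u128) -> u64 {
    let lo = x as u64;
    let hi = (x >> 64) as u64;
    let hi_lo = hi & 0xFFFF_FFFF; // coefficient of 2^64
    let hi_hi = hi >> 32;         // coefficient of 2^96
    // x ≡ lo + hi_lo * (2^32 - 1) - hi_hi (mod p)
    let mut t = lo as u128 + (hi_lo as u128) * 0xFFFF_FFFF;
    t += (P as u128) - hi_hi as u128; // add p once to keep t non-negative
    (t % P as u128) as u64 // naive final fold for this sketch
}
```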

* Fixing the interaction with Packed Fields

Accidentally broke things when trying to specialize to BabyBear.

* Fixing code to work with Packed Fields

This should fix the issue of the code not compiling when AVX/NEON is enabled.

To fix this I had to roll back the summation changes I made, I'll push them in a separate PR to do all three fields at once.

* Minor Refactor

Added a constant method in goldilocks to convert a constant array into a constant array of goldilocks elements.

This makes the: matmul_internal method again identical between the two fields so we can store it in the Poseidon2 crate for now. Long term we likely want to specialize this further.

* Minor Typo Fixes

---------

Co-authored-by: Hamish Ivey-Law <[email protected]>
Co-authored-by: Hamish Ivey-Law <[email protected]>
* AVX-512 implementation of Mersenne-31

* Update AVX-512 feature name

* Only enable AVX512 nightly feature if we have AVX512

* Unused import

* Another unused import

* Bugfix

* Lints

* Apply suggestions from code review

Hamish typo fixes

Co-authored-by: Hamish Ivey-Law <[email protected]>

---------

Co-authored-by: Hamish Ivey-Law <[email protected]>
* BabyBear AVX-512

* Unused import

* Minor math notation

* Documentation

* Minor grammar
* Slight improvement to the length 8 MDS Vector

This vector should be faster than the old one as the difference vector `vec[x] - vec[x + 4]` is `(-1, -2, -1, -1)` so the size 4 negacyclic convolution will be faster.

See: https://godbolt.org/z/1oeK6rTsh

* Fixing Tests

Updating the vector means we need to update tests.
* p3-circle

* figured out split

* fix opening for nonstandard domain

* aaaaaa

* pack cfft

* hmm cfft lde slightly slower. why

* add comments to circle lde

* prove_m31_keccak

* cargo fix

* cargo fmt

* rename

* cleanup

* fix imports and tests

* clippy

* fmt

* clippy

* fix parallel build

* cleanup

* MatrixRowChunksMut

* more MatrixRowChunksMut

* fmt

* clippy

* fix Send bounds

* forgot one more generic

* oops accidental test

* address review comments, mostly docs

* move gemv_tr to matrix, rename columnwise_dot_product

* oops forgot add routines.rs

* rm unused imports

* fix uni_to_point, rename, add docs
8 participants