
feat: fork update #33

Open · wants to merge 16 commits into base: main
Conversation

tamirhemo
Copy link
Contributor

No description provided.

onurinanc and others added 16 commits February 26, 2024 23:44
* add prove_goldilocks_poseidon example in keccak-air

* cargo fmt
There seems to be a minor issue with prove_goldilocks_poseidon.rs.

As far as I can tell this should fix it.
* Adding a Halve Field

For essentially all fields (assuming the characteristic is not 2) we can halve elements without doing a multiplication. This should be both faster and easier to use, as we no longer need `F::two().inverse()` floating around.

* Added generic helper functions for halve.

Added two helper functions to field\src\helpers to compute the halve function for 32- and 64-bit fields. Ideally these should be merged, but I can't currently see how to do that.

Moved halve back to Field trait (instead of AbstractField trait).
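
The multiplication-free halving described above can be sketched as follows. This is an illustrative sketch, not the crate's actual implementation; the Mersenne31 modulus `P` is chosen purely as an example of an odd-characteristic prime field:

```rust
// Halving in F_p (p odd) without a field multiplication.
// If x is even, x/2 is just a shift; if x is odd, x + p is even and
// congruent to x mod p, so (x + p) >> 1 gives the right answer.
const P: u64 = (1 << 31) - 1; // Mersenne31 prime (illustrative choice)

fn halve(x: u64) -> u64 {
    debug_assert!(x < P);
    if x & 1 == 0 {
        x >> 1
    } else {
        (x + P) >> 1
    }
}
```

Doubling the result modulo p recovers the input, which is a convenient sanity check for any concrete modulus.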

* Adding Comments
* PCS domain rework - pcs and uni-stark done, actual pcs are not

small fixes

remove Point gat; open at Challenge

remove multi-stark

cargo fix

cargo fix

twoadic open

decompose

move domain to commit

forgot to add

cargo fix

hmm

i have a bug for degree > 3

fix recompose for higher rate

fix pcs bugs

rm comments

clippy, fmt

lints

remove crates from ci

iterate correctly

loop better

rm comments

lint, remove decomposes

rm tensor-pcs

TwoAdicMultiplicativeCoset

fix trivialpcs

* fix tests, rm comments

* fmt

* fix gf poseidon

* clippy

* rename Domain -> Space

* no_std

* fix testing
…#248)

* Benchmarking for FieldArray Exp

as title

* Improving Sub and Add in M31

After chatting with Jacqui I thought I'd try and make a couple of minor improvements to the base add/sub methods.

Sub seems to be ~20% faster, add is closer to 5% faster.
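
For context, branch-light Mersenne-31 add/sub might look roughly like the sketch below (an assumption for illustration, not this commit's actual optimized code), using the identity 2^31 ≡ 1 (mod p) so the carry out of the low 31 bits folds back in as +1:

```rust
const P: u32 = (1 << 31) - 1; // Mersenne31 prime

// Addition: a, b < p, so the u32 sum cannot overflow; fold bit 31 back in.
fn add(a: u32, b: u32) -> u32 {
    let s = a + b;
    let r = (s & P) + (s >> 31); // 2^31 ≡ 1 (mod p)
    if r >= P { r - P } else { r }
}

// Subtraction: add p back only when the raw subtraction borrows.
fn sub(a: u32, b: u32) -> u32 {
    let (d, borrow) = a.overflowing_sub(b);
    if borrow { d.wrapping_add(P) } else { d }
}
```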

* Avoiding a proliferation of bench files.

* Update Cargo.toml

* Re-adding bench for add/sub

* Combining Bench File

Made a new benchmark file to combine the benches for add, sub, inverse and 5th root.

Fixed some formatting issues.

Also replaced overflowing_add_unsigned by overflowing_add as it seems to be faster for some reason.

* clippy

* Cleaning Up

Removing the old commented out implementation of sum and replacing it with a link.

* Further Optimisations

Some black magic optimisations thanks to Jacqui.

Improves the compiled assembly output.

* Small MDS 16x16 matrix

Initial commit, many fixes left to do

* Added General Functions

Adding general functions to compute the convolution using a Karatsuba approach for all 2^n. Currently these are much much slower than the specialized functions (20x or more slower). Need to keep investigating why this is the case.

* Improved Karatsuba Algorithms

Improved the Karatsuba algorithms to perform a single step of the FFT algorithm before resorting to Karatsuba methods.

This improves constants by a reasonable margin and gets us to the point where the Karatsuba 16 convolution is now 25-30% faster than the n^2 implementation.

Additionally implemented this method for convolutions of size 32 making use of i128's. This gives a massive (> 50%) speed up over the current implementation as it allows for delayed reductions.
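
The size-halving step mentioned above (one FFT-style split before falling back to smaller convolutions) can be illustrated with this sketch over plain `i64` coefficients. The function names are hypothetical, and the naive subroutines stand in for the optimized ones; the split uses x^(2n) - 1 = (x^n - 1)(x^n + 1), so a cyclic convolution of size 2n reduces to one cyclic and one negacyclic convolution of size n:

```rust
// Naive size-n cyclic convolution: multiply mod x^n - 1.
fn cyclic_naive(u: &[i64], v: &[i64]) -> Vec<i64> {
    let n = u.len();
    let mut c = vec![0i64; n];
    for i in 0..n {
        for j in 0..n {
            c[(i + j) % n] += u[i] * v[j];
        }
    }
    c
}

// Naive size-n negacyclic convolution: multiply mod x^n + 1.
fn negacyclic_naive(u: &[i64], v: &[i64]) -> Vec<i64> {
    let n = u.len();
    let mut c = vec![0i64; n];
    for i in 0..n {
        for j in 0..n {
            if i + j < n { c[i + j] += u[i] * v[j]; }
            else { c[i + j - n] -= u[i] * v[j]; }
        }
    }
    c
}

// One CRT split: size-2n cyclic via size-n cyclic + size-n negacyclic.
fn cyclic_split(u: &[i64], v: &[i64]) -> Vec<i64> {
    let n = u.len() / 2;
    let su: Vec<i64> = (0..n).map(|i| u[i] + u[i + n]).collect(); // u mod x^n - 1
    let sv: Vec<i64> = (0..n).map(|i| v[i] + v[i + n]).collect();
    let du: Vec<i64> = (0..n).map(|i| u[i] - u[i + n]).collect(); // u mod x^n + 1
    let dv: Vec<i64> = (0..n).map(|i| v[i] - v[i + n]).collect();
    let cyc = cyclic_naive(&su, &sv);
    let neg = negacyclic_naive(&du, &dv);
    let mut c = vec![0i64; 2 * n];
    for i in 0..n {
        c[i] = (cyc[i] + neg[i]) / 2;     // both sums are always even
        c[i + n] = (cyc[i] - neg[i]) / 2;
    }
    c
}
```

Working over signed integers rather than field elements is what allows the delayed reductions mentioned above.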

* Added: 'from_wrapped_u128' to field specification.

Naively extended all the from_wrapped_u64 methods to from_wrapped_u128. These methods could likely be improved.
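
A naive `from_wrapped_u128` of the kind described might look like this sketch (the BabyBear prime is chosen purely for illustration; the real per-field implementations may differ):

```rust
const P: u64 = 0x7800_0001; // BabyBear prime, 2^31 - 2^27 + 1 (illustrative)

// Interpret an arbitrary u128 as a field element by reducing mod p.
// "Wrapped" means the input may be far outside [0, p).
fn from_wrapped_u128(x: u128) -> u64 {
    (x % P as u128) as u64
}
```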

* Added Comments

Added a lot of comments to all functions.

Introduced a new type SimpleInteger so that the functions in here will work with both i64 and i128 as well as potentially other types.

Cleaned up some of the code and moved things around somewhat.

* Improved Karatsuba Decomposition

The Even-Odd decomposition seems to be faster than the Left-Right decomposition.

Additionally extended the code to work with matrices of size 64x64.

There is still quite a bit of copy-and-pasted code, so it would be good to find a way to cut down on that. The code also seems to do a bunch of shuffling of data (e.g. similar to the bit reversals which turn up in FFTs). Ideally we would find a way to avoid these.

* Formatting

* Minor Updates

Moved code around somewhat to make it easier to follow.

Slightly improved the recombination code for convolutions to use fewer additions and fewer variables by passing in a mutable output vector.

* Formatting

cargo fmt...

* Updated Large Convolutions

Added code to speed up large convolutions, allowing us to work basically just with i64's.

Unfortunately this involved duplicating a bunch of functions. Will try to clean it up over the next couple of days.

* Improving reductions

Added specific code to do the reductions in the final step.

Minor formatting issues.

* Minor fixes

Cleaned up the code which converted back to a Field element in the last step.

Added comments and debug_assert.

* Setting up code for babybear

Starting work on a major overhaul to allow code to work with multiple fields.

* Completing Proof

Completed the proof that the suggested algorithm is valid.

* Code Refactoring

Continued Large refactoring of Code.

Introduced 2 new types, corresponding to the trait NonCanonicalPrimeField32 and started working on making the convolution code more polymorphic which should let us reduce how much code is currently there. (In particular we should be able to get rid of all the large_entries functions.)

More work to do but the fundamentals are coming together.

I'm also wondering if NonCanonicalPrimeField32 might be able to be used in FFTs as well to help implement delayed reductions. (But that is for a future project.)

* Finished Convolution Updates

It should now be possible to switch from convolutions in M31 to convolutions in BB essentially for free. Speed seems to be unaffected.

Was also able to simplify code to eliminate some repartitions.

Still need to update code for non power of 2 sizes but this should be an easy change.

* Ported BabyBear convolutions over

Massive speed up on the current implementations, particularly for 32 and 64 cases.

Also got the n = 12 case done.

* Formatting

* Added some dot products.

* Initial Attempt at Noncanonical FFTS

Initial try to see if we can easily implement delayed reduction for FFT's.

Had to shuffle around code to prevent a circular dependency chain.

* More Generic Functions

Implementing Hamish's idea to reduce the amount of code needed for Karatsuba Convolutions.

* Passing in Output

Minor change to have the output passed in. More tests to come on this.

Also fixed some formatting and deleted some now unneeded code.

* Adding debug checks and unsafe markers

As title

* Minor formatting and comments

* Rolling back DFT changes.

Removing the DFT changes. Might do something in a future pull request but not now.

* Formatting

Forgot to run Cargo fmt on the last upload... Also removing code I added to dft/naive.

* Removing changes to dft/traits.

This should be the last file I need to roll back.

* Use faster reduction in Goldilocks.

* Move field-specific MDS code to the fields' crates.

* Move field-specific MDS code to the respective field trait.

* Finish moving MDS code to field traits; miscellaneous tidying.

* Update bench imports after move.

* Comments and fiddling with memory loads.

* Faster/cleaner i64 and i128 reductions.

* Better reduction; implement size 24; more benchmarks.

* Refactor `Convolve::apply()`.

* Use small size 16 MDS matrix for Goldilocks.

* Update expected output for Goldilocks MDS 16 after changing matrix.

* Simplify reduction since the final values must be non-negative.

* Expand documentation.

* Update Rescue Prime expected outputs to use new size 12 MDS matrix.

* Cargo fmt

* Remove old non-canonical files.

* Remove left-over debugging info.

* Remove now unused linear_combination methods.

* Replace `mul` with `parity_dot` as main template method.

* Use Angus's faster partial reduction trick.

* Add `parity_dot` for BabyBear and associated reduction.

* Minor tidying.

* Encourage rather than force inlining for large convolutions.

* Add two functions missed in the merge.

* Clippy.

* Documentation.

* Implement `from_wrapped_u128` for NEON `PackedField`s

* Tidy comments following Angus's suggestions.

* Remove unused `from_wrapped_u128`.

* Update mds/src/karatsuba_convolution.rs

Co-authored-by: Jacqueline Nabaglo <[email protected]>

* Replacing Signed by Negacyclic

Signed Convolutions already have a name, Negacyclic!

This update is simply an in place replacement of every instance of signed by negacyclic.

* Minor Fix

This should make all the tests pass.

* Add debug assert and comment

The dot_product function will fail if N = 0 but this would be a little annoying to fix and should never come up.

* Make conv_n, negacyclic_conv_n private

Moving conv_n and negacyclic_conv_n out of the trait to make the functions private.

---------

Co-authored-by: AngusG <[email protected]>
Co-authored-by: Jakub Nabaglo <[email protected]>
Co-authored-by: Daniel Lubarov <[email protected]>
Co-authored-by: AngusG <[email protected]>
Co-authored-by: Jacqueline Nabaglo <[email protected]>
* Adding Sum Benchmark

Adding a benchmark to see how fast the current iterator summation functions are.

(Suspect this might be one of the bottlenecks)

* Code to test Summations

Simple testing code to look at delayed reduction in summations. As can be seen, delayed reduction is much faster for larger sizes (16 and up).

Thus we should be using this for larger sizes for Poseidon2.
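
The delayed-reduction idea being benchmarked here can be sketched as follows (illustrative only; the modulus and function names are assumptions, not the benchmark's actual code):

```rust
const P: u64 = 0x7800_0001; // BabyBear prime (illustrative)

// Baseline: reduce after every addition.
fn sum_eager(xs: &[u64]) -> u64 {
    xs.iter().fold(0, |acc, &x| (acc + x) % P)
}

// Delayed reduction: accumulate in the full u64 range, reduce once at the
// end. Valid while n * (p - 1) fits in a u64, i.e. up to ~2^33 terms here.
fn sum_delayed(xs: &[u64]) -> u64 {
    xs.iter().sum::<u64>() % P
}
```

For short sums the single `%` dominates either way; the win shows up once the per-element reductions are amortized away, matching the size-16-and-up observation above.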

* Removing MDS dependency from Poseidon 2

We never used MDS but needed features = ["min_const_gen"] in rand which was being gotten indirectly from loading in the MDS crate.

* Refactor Babybear

Moving all BabyBear specifics into the BabyBear crate.

This will allow us to access the internals without having to call to_monty or from_monty.

* First Algorithm Updates

Added delayed reduction for the summation and saved the multiplicative constants already in MONTY form.

~20% speed up for WIDTH 16
~25% speed up for WIDTH 24.

* Testing faster exp^7 methods

Seems like I can get a ~10% improvement. Possibly more can be squeezed out with some refinements to monty_square_reduce and monty_cube_reduce.

The numbers in the multi_mul benchmarks seem to be quite misleading and disagree a lot with the numbers in the poseidon2 benchmark.

* Speed Improvements

Slightly improved the monty_square_reduce algorithm. Using this to compute `x -> x^7` we get a `~25 - 30%` speed improvement over the current implementation.

This leads to an immediate `10-15%` extra speed up for Poseidon2 and should also speed up any other algorithms using the `x -> x^7` map.

It can also probably be used to improve the inverse and root_7 functions though this will be more work.
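
For background, the `monty_*` helpers above refer to Montgomery arithmetic. A generic 32-bit Montgomery reduction (REDC) looks roughly like this sketch; it is not the repo's implementation, and deriving the key constant by Newton iteration is just one standard choice:

```rust
const P: u64 = 0x7800_0001; // BabyBear prime (illustrative)

// -P^{-1} mod 2^32 via Newton iteration; each step doubles correct bits.
fn neg_p_inv() -> u32 {
    let p = P as u32;
    let mut inv: u32 = p; // correct to 3 bits since p ≡ 1 (mod 8)
    for _ in 0..5 {
        inv = inv.wrapping_mul(2u32.wrapping_sub(p.wrapping_mul(inv)));
    }
    inv.wrapping_neg()
}

// REDC: given t < p * 2^32, return t * 2^{-32} mod p.
fn monty_reduce(t: u64) -> u64 {
    let m = (t as u32).wrapping_mul(neg_p_inv()) as u64;
    let r = (t + m * P) >> 32; // low 32 bits of t + m*p are zero
    if r >= P { r - P } else { r }
}

// Map into Montgomery form: x -> x * 2^32 mod p.
fn to_monty(x: u64) -> u64 {
    (((x as u128) << 32) % P as u128) as u64
}
```

Multiplying two Montgomery-form values and applying `monty_reduce` yields the product in Montgomery form; a second `monty_reduce` maps back to the plain representation.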

* Fixing bench

Making the Exp 7 benchmark correctly compare the three different natural methods.

* Fixing Constants

Added a constant function to convert an array of u32's into an array of BabyBear elements in Monty form.

Needed as the ToMonty function can change on different architectures.

* Removing delayed reduction code

Going to deal with that in a different pull request.

* Cleaning up Sum Benchmarking Code

I've been informed that benchmarks in the single-nanosecond range are a bit BS, so I added a loop to push all timings into the microseconds.

* Refactored Goldilocks

Moved the Goldilocks specific code into the Goldilocks module.

* fmt and added comment

* Fixing Imports

* More small fixes

Hope these are the last. Just making sure all imports of DiffusionMatrixBabybear come from the right place.

* Using delayed summation for Goldilocks

Slightly faster for sizes >= 16.

Also fixed Clippy Error.

* Cargo fmt....

* Update baby-bear/src/baby_bear.rs

Co-authored-by: Hamish Ivey-Law <[email protected]>

* Update baby-bear/src/baby_bear.rs

Co-authored-by: Hamish Ivey-Law <[email protected]>

* Update baby-bear/src/baby_bear.rs

Co-authored-by: Hamish Ivey-Law <[email protected]>

* Update poseidon2/src/matrix.rs

Co-authored-by: Hamish Ivey-Law <[email protected]>

* Update poseidon2/src/matrix.rs

Co-authored-by: Hamish Ivey-Law <[email protected]>

* Resolving most requested changes.

Changed sum benchmarks to take in an input giving number of repetitions and length of the summation.

Added comments above sum_u64/sum_u128 specifying when they should be used (e.g. for what summation lengths they are faster). Updated sum_u128 to use reduce128 instead of a %.

Fixed comment in baby-bear/poseidon2.

Added

rand = { version = "0.8.5", features = ["min_const_gen"] }

to [dev-dependencies] for babybear and goldilocks.

Technically these aren't needed, as both field crates import rand and poseidon2, and poseidon2 imports min_const_gen, so min_const_gen ends up getting enabled in the field crates anyway. But this prevents strange errors from appearing if we at some point rework poseidon2 and remove min_const_gen from its imports.
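
For reference, a `reduce128`-style Goldilocks reduction can be sketched as below. This is hypothetical, not the crate's code: it splits the high word into 32-bit limbs and uses 2^64 ≡ 2^32 - 1 and 2^96 ≡ -1 (mod p), with the final fold done naively here where a real implementation would avoid the `%`:

```rust
const P: u64 = 0xFFFF_FFFF_0000_0001; // Goldilocks prime, 2^64 - 2^32 + 1

fn reduce128(x: u128) -> u64 {
    let lo = x as u64;
    let hi = (x >> 64) as u64;
    let hi_lo = hi & 0xFFFF_FFFF; // coefficient of 2^64
    let hi_hi = hi >> 32;         // coefficient of 2^96
    // x ≡ lo + hi_lo * (2^32 - 1) - hi_hi (mod p)
    let mut t = lo as u128 + (hi_lo as u128) * 0xFFFF_FFFF;
    t += (P as u128) - hi_hi as u128; // add p once to keep t non-negative
    (t % P as u128) as u64 // naive final fold for this sketch
}
```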

* Fixing the interaction with Packed Fields

Accidentally broke things when trying to specialize to BabyBear.

* Fixing code to work with Packed Fields

This should fix the issue of the code not compiling when AVX/NEON is enabled.

To fix this I had to roll back the summation changes I made, I'll push them in a separate PR to do all three fields at once.

* Minor Refactor

Added a constant method in goldilocks to convert a constant array into a constant array of goldilocks elements.

This makes the: matmul_internal method again identical between the two fields so we can store it in the Poseidon2 crate for now. Long term we likely want to specialize this further.

* Minor Typo Fixes

---------

Co-authored-by: Hamish Ivey-Law <[email protected]>
Co-authored-by: Hamish Ivey-Law <[email protected]>
* AVX-512 implementation of Mersenne-31

* Update AVX-512 feature name

* Only enable AVX512 nightly feature if we have AVX512

* Unused import

* Another unused import

* Bugfix

* Lints

* Apply suggestions from code review

Hamish typo fixes

Co-authored-by: Hamish Ivey-Law <[email protected]>

---------

Co-authored-by: Hamish Ivey-Law <[email protected]>
* BabyBear AVX-512

* Unused import

* Minor math notation

* Documentation

* Minor grammar
* Slight improvement to the length 8 MDS Vector

This vector should be faster than the old one as the difference vector `vec[x] - vec[x + 4]` is `(-1, -2, -1, -1)` so the size 4 negacyclic convolution will be faster.

See: https://godbolt.org/z/1oeK6rTsh

* Fixing Tests

Updating the vector means we need to update tests.
* p3-circle

* figured out split

* fix opening for nonstandard domain

* aaaaaa

* pack cfft

* hmm cfft lde slightly slower. why

* add comments to circle lde

* prove_m31_keccak

* cargo fix

* cargo fmt

* rename

* cleanup

* fix imports and tests

* clippy

* fmt

* clippy

* fix parallel build

* cleanup

* MatrixRowChunksMut

* more MatrixRowChunksMut

* fmt

* clippy

* fix Send bounds

* forgot one more generic

* oops accidental test

* address review comments, mostly docs

* move gemv_tr to matrix, rename columnwise_dot_product

* oops forgot add routines.rs

* rm unused imports

* fix uni_to_point, rename, add docs
8 participants