Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Load/Store, masked set and counting operations #2430

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

mazimkhan
Copy link

Introduces:

  • variants of load and store operations including masked variants (MaskedLoadU, LoadHigher, StoreTruncated)
  • Counting functions to find information about the data in each lane of a vector (MaskedLeadingZeroCountOrZero, AllOnes, AllZeros)
  • Masked vector instantiation operations (SetOr, SetOrZero).

"OrZero" operations will return zero where the mask is false whereas standard masking returns the corresponding lane of a passed vector.

All introduced operations are implemented in generic_ops-inl.h and in arm_sve-inl.h where there is a performance gain to be made. Testing is also performed for all new operations.

Copy link

google-cla bot commented Jan 6, 2025

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

@@ -429,6 +429,10 @@ for comparisons, for example `Lt` instead of `operator<`.
the result, with `t0` in the least-significant (lowest-indexed) lane of each
128-bit block and `tK` in the most-significant (highest-indexed) lane of
each 128-bit block: `{t0, t1, ..., tK}`
* <code>V **SetOr**(V no, M m, T a)</code>: returns N-lane vector with lane
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about renaming to MaskedSetOr() and MaskedSet()?

@@ -1029,6 +1037,12 @@ Per-lane variable shifts (slow if SSSE3/SSE4, or 16-bit, or Shr i64 on AVX2):
```HighestValue<MakeSigned<TFromV<V>>>()``` is returned in the
corresponding result lanes.

* <code>bool **AllOnes**(D, V v)</code>: returns whether all bits in `v[i]`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It could be understood that this checks if all lane values are 1.0. Would AllBits1 or AllBitsOne be more clear? Same below.

@@ -1508,6 +1522,9 @@ aligned memory at indices which are not a multiple of the vector length):

* <code>Vec&lt;D&gt; **LoadU**(D, const T* p)</code>: returns `p[i]`.

* <code>Vec&lt;D&gt; **MaskedLoadU**(D, M m, const T* p)</code>: returns `p[i]`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MaskedLoad already exists for this purpose, we can remove this :) Note that all the load ops except Load() itself do not require alignment.

@@ -1538,6 +1555,10 @@ aligned memory at indices which are not a multiple of the vector length):
lanes from `p` to the first (lowest-index) lanes of the result vector and
fills the remaining lanes with `no`. Like LoadN, this does not fault.

* <code> Vec&lt;D&gt; **LoadHigher**(D d, V v, T* p)</code>: Loads `Lanes(d)/2` lanes from
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's rename to *Upper for consistency.
This could be misunderstood to load the upper half of a vector starting at p, it seems instead to load directly from p. How about something like InsertIntoUpper, and perhaps v as the last argument?

* <code>void **StoreTruncated**(Vec&lt;DFrom&gt; v, DFrom d, To* HWY_RESTRICT
p)</code>: Truncates elements of `v` to type `To` and stores on `p`. It is
similar to performing `TruncateTo` followed by `StoreN` where
`max_lanes_to_store` is `Lanes(d)`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curious why we mention the StoreN here? It seems like natural store semantics that it stores an entire vector (in this case after truncating to smaller elements), so we could remove the mention of max_lanes and "does not modify".
Generally we define the D argument that comes before a pointer to describe what is being stored. So it would be D, not DFrom here.
Also, we have CompressStore which emphasizes that Compress happens first. Seems that TruncateStore would be a more consistent naming?

// ------------------------------ SetOr/SetOrZero

#define HWY_SVE_SET_OR(BASE, CHAR, BITS, HALF, NAME, OP) \
HWY_API HWY_SVE_V(BASE, BITS) NAME(HWY_SVE_V(BASE, BITS) inactive, \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's rename inactive to no per convention.

#endif
template <class D, typename T, class V = VFromD<D>(), HWY_IF_LANES_GT_D(D, 1)>
HWY_API V LoadHigher(D d, V a, T* p) {
const V b = LoadU(d, p);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Users may be surprised that there's a full-length load here, this might crash.
Alternative:

Half<D> dh;
const VFromD<decltype(dh)> b = LoadU(dh, p);
return Combine(d, b, LowerHalf(a));

template <class V>
HWY_API bool AllOnes(V a) {
DFromV<V> d;
return AllTrue(d, Eq(Not(a), Zero(d)));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Likely a bit more efficient: we can BitCast to unsigned, then compare to Set(du, hwy::HighestValue(du).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants