Implementation plans for `pallas.dynamic_slice` and `scatter_reduce` ops #25281

olivier-peltre · 2024-12-05T11:09:11Z

When I try to execute a Pallas kernel that calls `pl.dslice(start, size)` on an accelerator (GPU/TPU), I get a `NotImplementedError` (jaxlib == 0.4.34).

Since `pl.dslice` is already in the docs, are there any plans for a working lowering to land soon?

Use case

Are there any optimisation holes or caveats to be aware of with regard to `scatter_reduce` ops, or any suggestions on how to move forward?

I noticed that `scatter_add` scales very badly, and I never managed to get `indices_are_sorted=True` to make a significant difference (in previous attempts, I think the compiled jaxpr still showed `indices_are_sorted=False` even when I passed it as a keyword).
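For reference, here is a minimal sketch of the two pure-JAX baselines this comparison is about, assuming sorted indices along the last axis. The function names `scatter_add_at` and `scatter_add_segment` and the toy data are illustrative only, not taken from my actual code. Printing the jaxpr is one way to check whether the `indices_are_sorted` flag actually reaches the scatter primitive.

```python
import jax
import jax.numpy as jnp
import numpy as np


def scatter_add_at(acc, idx, values):
    # Scatter-add along the last axis; idx is assumed to be sorted.
    return acc.at[:, idx].add(values, indices_are_sorted=True)


def scatter_add_segment(acc, idx, values):
    # segment_sum reduces along the leading axis, hence the transposes.
    summed = jax.ops.segment_sum(
        values.T, idx, num_segments=acc.shape[-1], indices_are_sorted=True
    )
    return acc + summed.T


# Toy data (hypothetical): 6 input columns summed into 4 output columns.
idx = jnp.asarray([0, 0, 1, 3, 3, 3])
values = jnp.arange(12.0).reshape(2, 6)
acc = jnp.zeros((2, 4))

out_at = scatter_add_at(acc, idx, values)
out_seg = scatter_add_segment(acc, idx, values)

# Inspect the traced program to see which flags the scatter op carries.
print(jax.make_jaxpr(scatter_add_at)(acc, idx, values))
```

Both functions should agree numerically, so they can serve as a correctness check for any custom kernel before timing it.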

I will now try comparing with torch + rusty1s/pytorch_scatter to get an idea of the gains I could possibly hope for.

N.B. I'm looking for an efficient way to aggregate values based on a static index array, though I understand there are many constraints that may prevent very efficient dynamic scatter-reduce ops in XLA.
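One scatter-free alternative that exploits a static, sorted index array is to take differences of prefix sums at the slice boundaries. The sketch below is only an illustration under those assumptions: `make_sorted_scatter_add` is a hypothetical helper, the boundaries are computed once on the host with NumPy, and the subtraction can lose precision in floating point when slices are long.

```python
import numpy as np
import jax
import jax.numpy as jnp


def make_sorted_scatter_add(idx, d_out):
    """Build a jitted scatter-add for a static, sorted index array."""
    idx = np.asarray(idx)
    sizes = np.bincount(idx, minlength=d_out)
    # begin[j] : begin[j + 1] is the slice of inputs summed into output j.
    begin = np.concatenate([[0], np.cumsum(sizes)])

    @jax.jit
    def scatter_add(acc, values):
        zero = jnp.zeros(values.shape[:-1] + (1,), values.dtype)
        # csum[..., k] is the sum of the first k input columns.
        csum = jnp.concatenate([zero, jnp.cumsum(values, axis=-1)], axis=-1)
        # The static boundaries are closed over as compile-time constants.
        return acc + csum[..., begin[1:]] - csum[..., begin[:-1]]

    return scatter_add


# Toy data (hypothetical): 6 input columns summed into 4 output columns.
scatter_add_sorted = make_sorted_scatter_add([0, 0, 1, 3, 3, 3], d_out=4)
values = jnp.arange(12.0).reshape(2, 6)
out = scatter_add_sorted(jnp.zeros((2, 4)), values)
```

This trades the scatter for one cumulative sum and two gathers with constant indices. Whether it is actually faster than `scatter_add` would have to be measured.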

Minimal working example

import jax
import jax.numpy as jnp
from jax.experimental import pallas as pl

# Module-level constants (their values are not given in this post):
#   MAX_SIZE      upper bound on the width of any input slice
#   BLOCK_HEIGHT  number of rows per block; must match the 256 used in the
#                 block specs below


def scatter_add_kernel(
    acc_ref,
    begin_ref,
    values_ref,
    out_ref,
    *,
    max_size=MAX_SIZE,
):
    """Custom kernel for efficient `scatter_add` on the last dimension.

    Scatter indices are sorted, so the task simply consists of summing
    variable-width sub-blocks of the input array, where `begin_ref` holds the
    start offset of each slice. Although `begin` is a static, compile-time
    constant, we can't close over it to define the kernel. Hence `pl.dslice`
    seems like the only solution.

    Because blocks must have constant width and cannot overlap, we must
    feed each "program" with full rows, with `begin[pl.program_id(1)]` yielding
    the offset of the input slice.

    The hope is that chunking the data vertically allows for better scaling
    with batch size, as every block of rows can be processed in parallel.
    """
    j = pl.program_id(1)
    begin = begin_ref[...]
    # Dynamic slice: the start offset is only known at run time.
    values = pl.load(
        values_ref,
        (slice(None), pl.ds(begin[j], max_size)),
    )

    # Zero the output block before accumulating into it.
    # Note that `acc_ref` is currently unused: the output starts from zeros.
    @pl.when(j == 0)
    def _():
        out_ref[...] = jnp.zeros_like(out_ref)

    def add_value(k):
        out_ref[:, j] += values[:, k]
        return k + 1

    # Sum the `begin[j + 1] - begin[j]` columns of the slice into column j.
    jax.lax.while_loop(
        lambda k: k < begin[j + 1] - begin[j],
        add_value,
        0,
    )


def scatter_add(acc: jax.Array, idx: jax.Array, values: jax.Array) -> jax.Array:
    """Run the pallas `scatter_add_kernel`."""

    d_out = acc.shape[-1]

    # number of values summed into each output coordinate
    zeros_out = jnp.zeros(d_out, dtype=jnp.int32)
    ones_nnz = jnp.ones(idx.shape[0], dtype=jnp.int32)
    sizes_out = zeros_out.at[idx].add(ones_nnz)

    # beginning of each slice: d_out + 1 offsets, the last one being nnz
    begin_out = jnp.concatenate(
        [jnp.zeros(1, dtype=jnp.int32), jnp.cumsum(sizes_out)]
    )

    grid_height = (values.shape[0] + BLOCK_HEIGHT - 1) // BLOCK_HEIGHT

    d_out_ = pl.next_power_of_2(d_out)
    d_val_ = pl.next_power_of_2(values.shape[-1])
    # the block must also hold the final offset read as begin[j + 1]
    d_begin_ = pl.next_power_of_2(d_out + 1)

    return pl.pallas_call(
        scatter_add_kernel,
        out_shape=jax.ShapeDtypeStruct(acc.shape, acc.dtype),
        grid=(grid_height, d_out),
        in_specs=[
            # acc
            pl.BlockSpec((256, d_out_), lambda i, j: (i, 0)),
            # begin_out
            pl.BlockSpec((d_begin_,), lambda i, j: (0,)),
            # values: full rows of row block i, sliced inside the kernel
            pl.BlockSpec((256, d_val_), lambda i, j: (i, 0)),
        ],
        out_specs=pl.BlockSpec((256, d_out_), lambda i, j: (i, 0)),
        interpret=True,
    )(acc, begin_out, values)
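To make the intended semantics concrete, here is a plain NumPy sketch of what the kernel is meant to compute from the `begin` offsets: output column `j` is the sum of the input columns in `begin[j]:begin[j + 1]`. The name `scatter_add_reference` and the toy index array are illustrative only, and the accumulator is taken to be zero, as in the kernel above.

```python
import numpy as np


def scatter_add_reference(begin, values):
    """Sum variable-width column slices of `values` into output columns."""
    d_out = len(begin) - 1
    out = np.zeros(values.shape[:-1] + (d_out,), dtype=values.dtype)
    for j in range(d_out):
        # An empty slice (begin[j] == begin[j + 1]) contributes zero.
        out[..., j] = values[..., begin[j]:begin[j + 1]].sum(axis=-1)
    return out


# Toy data (hypothetical): sorted indices mapping 6 inputs to 4 outputs.
idx = np.array([0, 0, 1, 3, 3, 3])
sizes = np.bincount(idx, minlength=4)
begin = np.concatenate([[0], np.cumsum(sizes)])

values = np.arange(12.0).reshape(2, 6)
reference = scatter_add_reference(begin, values)
```

A slow reference like this is mainly useful for checking the Pallas kernel in `interpret=True` mode on small inputs.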
