Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC on soma_joinid: identifier or index? #164

Open
wants to merge 2 commits into
base: main
Choose a base branch
from
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
25 changes: 25 additions & 0 deletions rfcs/soma_joinid_id_or_index.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,25 @@
# `soma_joinid`: identifier or index? (May 2023)

The SOMA specification introduces `soma_joinid` like so:

> Every `SOMADataFrame` must contain a column called `soma_joinid`, of type `int64` and domain `[0, 2^63-1]`. The `soma_joinid` column contains a unique value for each row in the `SOMADataFrame`, and intended to act as a join key for other objects, such as `SOMASparseNDArray`.

Notice this defines `soma_joinid` merely as an *identifier*; not necessarily starting at zero or one, nor forming a contiguous range, nor even ascending with the data frame's row order. Those are properties associated with an *index*. Though the field is often populated like an index in practice, this need not be so. For example, we might create a `SOMAExperiment` by subsetting the variables of a larger existing one, preserving the original `soma_joinid` in the `var` data frame. Or, observables might be withdrawn from version to version of a `SOMAExperiment`, keeping the `soma_joinid` of the remaining entries in the `obs` data frame.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find the definition of index confusing - there is nothing about an index that requires it to be contiguous, starting at zero/one, etc.

I think you want to separate the concept of an "index" (aka key) from "offset" (in Pandas terminology, you could have a string index, but you always have an integer iloc (offset)).

Using index in this way confuses the DB organization with the in-memory representation.


**However**, beyond this primary definition, `soma_joinid` is also used as the row & column numbers of any `X` matrix in a `SOMAExperiment`. It clearly *is* an index in this secondary role, with the constraint that the row & column numbers correspond to the `soma_joinid` values in the `obs` and `var` dataframes.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
**However**, beyond this primary definition, `soma_joinid` is also used as the row & column numbers of any `X` matrix in a `SOMAExperiment`. It clearly *is* an index in this secondary role, with the constraint that the row & column numbers correspond to the `soma_joinid` values in the `obs` and `var` dataframes.
**However**, beyond this primary definition, `soma_joinid` is also used as the row and column numbers of any `X` matrix in a `SOMAExperiment`. It clearly *is* an index in this secondary role, with the constraint that the row & column numbers correspond to the `soma_joinid` values in the `obs` and `var` dataframes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If X is sparse then row and column numbers are not necessarily indices.

Only if it is dense do they necessarily become indices.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I disagree with this. For sparse arrays (dataframe, sparsendarray), soma_dim_N is not an index, in the way you define it above (i.e. a physical offset). It is an identifier, just like the soma_joinid. The only constraints on soma_joinid/soma_dim_N are that they are uint64. They can be in any order, be non-contiguous, and span any range.

Common usage will use a contiguous range, but the spec absolutely (and intentionally) doesn't require that.

The one nuance here -- if you have a sparse X in an Experiment, you will only have implicit zeros where there is a defined soma_joinid pair in the obs/var DataFrame. In other words, any soma_dim_N not in the sparse array does not exist in the X index domain unless the joinid is defined in the corresponding obs/var dataframe.

Dense is another matter, and still has unresolved design issues due to the fact that we allow "sparse" joinid domains in the obs/var dataframes.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bkmartinjr @johnkerl We need to find the precise terminology to convey the relationship between soma_joinid (not an index) with the row/column numbers of a matrix (obviously indexed sequences; whether the matrix is represented sparsely is an implementation detail, in my view). The original draft is too strong in equating them, admitted, but we need to convey that the treatment of soma_joinid as matrix row/column numbers in a certain context imposes certain constraints on its initial, broad definition -- that is probably the most important point of the whole essay.


This dual nature of `soma_joinid` is easy to miss, and has caused our team at least two significant design dilemmas:

## 1. Implementations with one-based vector/matrix indexing

Languages like R conventionally use one-based indexing for vectors and matrices. If a given `obs`/`var` item has `soma_joinid = 0`, it cannot be directly represented on a one-based `X` matrix dimension. If we get around that by populating the R implementation of an `X` matrix with `soma_joinid+1`, then the row & column numbers no longer correspond to the `soma_joinid` in the `obs`/`var` data frames -- making the "join" very error-prone.

We addressed this by [introducing a zero-based wrapper](https://github.com/single-cell-data/TileDB-SOMA/pull/1313) for the R matrix implementation, so that `W[i,j]` would access `M[i+1,j+1]` and these index values correspond to the data frame entries as expected. This requires clear signalling and documentation, for example naming the method `SOMASparseNDArray$read_sparse_matrix_zero_based()` and providing an `as.one.based(W)` accessor to the underlying R matrix.

[Alternatives discussed](https://github.com/single-cell-data/TileDB-SOMA/issues/1232) included excluding zero from the domain of `soma_joinid`, or even fully redefining `soma_joinid` as an index; but these would have been significant breaking changes to the SOMA specification after releasing version 1.0.

## 2. Dense array indexing

The definition of `soma_joinid` as an arbitrary integer identifier is problematic for constructing dense matrices indexed by this value, since the values needn't start at zero/one and may be much larger than the actual number of entries. [The role of dense matrices remains unresolved](https://github.com/single-cell-data/TileDB-SOMA/issues/1245) as of this writing, and sparse matrices have been suitable for our initial applications. To fully support dense representations in the future, we may need to introduce another column in the `obs`/`var` data frames that is explicitly the index (row number), and use that for dense indexing instead of `soma_joinid`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is also problematic for other reasons (see above). By not requiring the axis dataframes to define contiguous joinid domains, we implicitly make them incompatible with dense matrices