-
Notifications
You must be signed in to change notification settings - Fork 10
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
RFC on soma_joinid
: identifier or index?
#164
base: main
Are you sure you want to change the base?
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change | ||||
---|---|---|---|---|---|---|
@@ -0,0 +1,25 @@ | ||||||
# `soma_joinid`: identifier or index? (May 2023) | ||||||
|
||||||
The SOMA specification introduces `soma_joinid` like so: | ||||||
|
||||||
> Every `SOMADataFrame` must contain a column called `soma_joinid`, of type `int64` and domain `[0, 2^63-1]`. The `soma_joinid` column contains a unique value for each row in the `SOMADataFrame`, and intended to act as a join key for other objects, such as `SOMASparseNDArray`. | ||||||
|
||||||
Notice this defines `soma_joinid` merely as an *identifier*; not necessarily starting at zero or one, nor forming a contiguous range, nor even ascending with the data frame's row order. Those are properties associated with an *index*. Though the field is often populated like an index in practice, this need not be so. For example, we might create a `SOMAExperiment` by subsetting the variables of a larger existing one, preserving the original `soma_joinid` in the `var` data frame. Or, observables might be withdrawn from version to version of a `SOMAExperiment`, keeping the `soma_joinid` of the remaining entries in the `obs` data frame. | ||||||
|
||||||
**However**, beyond this primary definition, `soma_joinid` is also used as the row & column numbers of any `X` matrix in a `SOMAExperiment`. It clearly *is* an index in this secondary role, with the constraint that the row & column numbers correspond to the `soma_joinid` values in the `obs` and `var` dataframes. | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Suggested change
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. If Only if it is dense do they necessarily become indices. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I disagree with this. For sparse arrays (dataframe, sparsendarray), Common usage will use a contiguous range, but the spec absolutely (and intentionally) doesn't require that. The one nuance here -- if you have a sparse X in an Experiment, you will only have implicit zeros where there is a defined soma_joinid pair in the obs/var DataFrame. In other words, any soma_dim_N not in the sparse array does not exist in the X index domain unless the joinid is defined in the corresponding obs/var dataframe. Dense is another matter, and still has unresolved design issues due to the fact that we allow "sparse" joinid domains in the obs/var dataframes. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @bkmartinjr @johnkerl We need to find the precise terminology to convey the relationship between |
||||||
|
||||||
This dual nature of `soma_joinid` is easy to miss, and has caused our team at least two significant design dilemmas: | ||||||
|
||||||
## 1. Implementations with one-based vector/matrix indexing | ||||||
|
||||||
Languages like R conventionally use one-based indexing for vectors and matrices. If a given `obs`/`var` item has `soma_joinid = 0`, it cannot be directly represented on a one-based `X` matrix dimension. If we get around that by populating the R implementation of an `X` matrix with `soma_joinid+1`, then the row & column numbers no longer correspond to the `soma_joinid` in the `obs`/`var` data frames -- making the "join" very error-prone. | ||||||
|
||||||
We addressed this by [introducing a zero-based wrapper](https://github.com/single-cell-data/TileDB-SOMA/pull/1313) for the R matrix implementation, so that `W[i,j]` would access `M[i+1,j+1]` and these index values correspond to the data frame entries as expected. This requires clear signalling and documentation, for example naming the method `SOMASparseNDArray$read_sparse_matrix_zero_based()` and providing an `as.one.based(W)` accessor to the underlying R matrix. | ||||||
|
||||||
[Alternatives discussed](https://github.com/single-cell-data/TileDB-SOMA/issues/1232) included excluding zero from the domain of `soma_joinid`, or even fully redefining `soma_joinid` as an index; but these would have been significant breaking changes to the SOMA specification after releasing version 1.0. | ||||||
|
||||||
## 2. Dense array indexing | ||||||
|
||||||
The definition of `soma_joinid` as an arbitrary integer identifier is problematic for constructing dense matrices indexed by this value, since the values needn't start at zero/one and may be much larger than the actual number of entries. [The role of dense matrices remains unresolved](https://github.com/single-cell-data/TileDB-SOMA/issues/1245) as of this writing, and sparse matrices have been suitable for our initial applications. To fully support dense representations in the future, we may need to introduce another column in the `obs`/`var` data frames that is explicitly the index (row number), and use that for dense indexing instead of `soma_joinid`. | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. it is also problematic for other reasons (see above). By not requiring the axis dataframes to define contiguous joinid domains, we implicitly make them incompatible with dense matrices |
||||||
|
||||||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find the definition of index confusing - there is nothing about an index that requires it to be contiguous, starting at zero/one, etc.
I think you want to separate the concept of an "index" (aka key) from "offset" (in Pandas terminology, you could have a string index, but you always have an integer
iloc
(offset)).Using index in this way confuses the DB organization with the in-memory representation.