-
Notifications
You must be signed in to change notification settings - Fork 10
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
RFC on soma_joinid
: identifier or index?
#164
base: main
Are you sure you want to change the base?
Conversation
|
||
Notice this defines `soma_joinid` merely as an *identifier*; not necessarily starting at zero or one, nor forming a contiguous range, nor even ascending with the data frame's row order. Those are properties associated with an *index*. Though the field is often populated like an index in practice, this need not be so. For example, we might create a `SOMAExperiment` by subsetting the variables of a larger existing one, preserving the original `soma_joinid` in the `var` data frame. Or, observables might be withdrawn from version to version of a `SOMAExperiment`, keeping the `soma_joinid` of the remaining entries in the `obs` data frame. | ||
|
||
**However**, beyond this primary definition, `soma_joinid` is also used as the row & column numbers of any `X` matrix in a `SOMAExperiment`. It clearly *is* an index in this secondary role, with the constraint that the row & column numbers correspond to the `soma_joinid` values in the `obs` and `var` dataframes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
**However**, beyond this primary definition, `soma_joinid` is also used as the row & column numbers of any `X` matrix in a `SOMAExperiment`. It clearly *is* an index in this secondary role, with the constraint that the row & column numbers correspond to the `soma_joinid` values in the `obs` and `var` dataframes. | |
**However**, beyond this primary definition, `soma_joinid` is also used as the row and column numbers of any `X` matrix in a `SOMAExperiment`. It clearly *is* an index in this secondary role, with the constraint that the row & column numbers correspond to the `soma_joinid` values in the `obs` and `var` dataframes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If X
is sparse then row and column numbers are not necessarily indices.
Only if it is dense do they necessarily become indices.
|
||
> Every `SOMADataFrame` must contain a column called `soma_joinid`, of type `int64` and domain `[0, 2^63-1]`. The `soma_joinid` column contains a unique value for each row in the `SOMADataFrame`, and intended to act as a join key for other objects, such as `SOMASparseNDArray`. | ||
|
||
Notice this defines `soma_joinid` merely as an *identifier*; not necessarily starting at zero or one, nor forming a contiguous range, nor even ascending with the data frame's row order. Those are properties associated with an *index*. Though the field is often populated like an index in practice, this need not be so. For example, we might create a `SOMAExperiment` by subsetting the variables of a larger existing one, preserving the original `soma_joinid` in the `var` data frame. Or, observables might be withdrawn from version to version of a `SOMAExperiment`, keeping the `soma_joinid` of the remaining entries in the `obs` data frame. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find the definition of index confusing - there is nothing about an index that requires it to be contiguous, starting at zero/one, etc.
I think you want to separate the concept of an "index" (aka key) from "offset" (in Pandas terminology, you could have a string index, but you always have an integer iloc
(offset)).
Using index in this way confuses the DB organization with the in-memory representation.
|
||
Notice this defines `soma_joinid` merely as an *identifier*; not necessarily starting at zero or one, nor forming a contiguous range, nor even ascending with the data frame's row order. Those are properties associated with an *index*. Though the field is often populated like an index in practice, this need not be so. For example, we might create a `SOMAExperiment` by subsetting the variables of a larger existing one, preserving the original `soma_joinid` in the `var` data frame. Or, observables might be withdrawn from version to version of a `SOMAExperiment`, keeping the `soma_joinid` of the remaining entries in the `obs` data frame. | ||
|
||
**However**, beyond this primary definition, `soma_joinid` is also used as the row & column numbers of any `X` matrix in a `SOMAExperiment`. It clearly *is* an index in this secondary role, with the constraint that the row & column numbers correspond to the `soma_joinid` values in the `obs` and `var` dataframes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I disagree with this. For sparse arrays (dataframe, sparsendarray), soma_dim_N
is not an index, in the way you define it above (i.e. a physical offset). It is an identifier, just like the soma_joinid. The only constraints on soma_joinid/soma_dim_N are that they are uint64. They can be in any order, be non-contiguous, and span any range.
Common usage will use a contiguous range, but the spec absolutely (and intentionally) doesn't require that.
The one nuance here -- if you have a sparse X in an Experiment, you will only have implicit zeros where there is a defined soma_joinid pair in the obs/var DataFrame. In other words, any soma_dim_N not in the sparse array does not exist in the X index domain unless the joinid is defined in the corresponding obs/var dataframe.
Dense is another matter, and still has unresolved design issues due to the fact that we allow "sparse" joinid domains in the obs/var dataframes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bkmartinjr @johnkerl We need to find the precise terminology to convey the relationship between soma_joinid
(not an index) with the row/column numbers of a matrix (obviously indexed sequences; whether the matrix is represented sparsely is an implementation detail, in my view). The original draft is too strong in equating them, admitted, but we need to convey that the treatment of soma_joinid
as matrix row/column numbers in a certain context imposes certain constraints on its initial, broad definition -- that is probably the most important point of the whole essay.
|
||
## 2. Dense array indexing | ||
|
||
The definition of `soma_joinid` as an arbitrary integer identifier is problematic for constructing dense matrices indexed by this value, since the values needn't start at zero/one and may be much larger than the actual number of entries. [The role of dense matrices remains unresolved](https://github.com/single-cell-data/TileDB-SOMA/issues/1245) as of this writing, and sparse matrices have been suitable for our initial applications. To fully support dense representations in the future, we may need to introduce another column in the `obs`/`var` data frames that is explicitly the index (row number), and use that for dense indexing instead of `soma_joinid`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it is also problematic for other reasons (see above). By not requiring the axis dataframes to define contiguous joinid domains, we implicitly make them incompatible with dense matrices
ps. @mlin - re-reading my comments, this is complex enough topic that it may warrant a real-time conversation. I'm available if you think this would be useful. |
two non–content-related notes from me (these apply to both open RFCs right now so I am pasting this note in both places):
|
Preview link: https://github.com/single-cell-data/SOMA/blob/mlin/rfc-on-r-soma-joinid/rfcs/soma_joinid_id_or_index.md