-
Notifications
You must be signed in to change notification settings - Fork 10
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[spec] Specify object lifecycle and related operations #118
Conversation
This formalizes the object lifecycle and the various creation, opening, and closing methods we are already using in the tiledbsoma Python implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Caveat: I did a very fast review. But based upon that, it looks good to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, one minor language tightening suggested inline
Co-authored-by: Mike Lin <[email protected]>
abstract_specification.md
Outdated
|
||
SOMA objects are open for **exclusively** reading or writing. | ||
When a SOMA object is open for writing, only metadata, the schema, and information directly derived from the schema may be read. | ||
Additionally, for Collection objects, the members of the collection are also accessible when the collection is open in write mode. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
say either Collection
or collection but not Collection -- either it's a code reference or just common noun
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
abstract_specification.md
Outdated
writable_collection = soma_impl.open('file:///some/collection', 'w') | ||
dataframe = writable_collection['dataframe'] # OK | ||
dataframe.schema # OK | ||
dataframe.read(...) # This is NOT allowed and will cause an error. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd replace every 'This is NOT allowed' with more simply 'Not allowed' -- it's a little less shouty
I do get the idea of drawing attention to what they can't do -- but this entire section is about the allows/disallows. It needs no further highlighting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For clarity, our internal specs tend to always use the RFC2119 standard (i.e., consistently use NOT
), but given that this doc has not already adopted it, that is probably an aspirational change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
an author OUGHT TO use the caps consistently
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bkmartinjr TIL about RFC2119! :)
If we want to go that route, sure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there are very good reasons for it - which is why the IETF uses it (as does many other orgs like W3C). It is a future decision which should not hold up this PR -- but coincidentally the topic was it was just raised inside CZI (regarding this very doc being unclear).
Here is an example of one of our specs that strongly benefits from it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I filed #122 -- this is awesome info -- thanks @bkmartinjr !! :)
abstract_specification.md
Outdated
``` | ||
|
||
``` | ||
soma_collection_t soma_collection_create(char *uri, ...) { ... } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
did you mean soma_collection_open
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was going for using a different function in each language. Using a different type will make this more obvious.
abstract_specification.md
Outdated
writer.close() | ||
|
||
post_close = soma_impl.open('file:///write/here') | ||
post_close.read(...) # WILL see the data written by write_data. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
WILL -> will
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DONE
For example: | ||
|
||
``` | ||
root = soma_impl.open('path://netloc/root/collection', 'w') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I admire and applaud the explicit code bits -- speaking as an example of a non-super-abstract thinker (and, I claim, a good representative of a large proportion of the general reading public -- in contrast to the small number of amazing few people who are capable of thinking/thriving solely in the abstract, who often end up doing high-level design in their professional careers) I assert that this (pesudo)code is incredibly helpful.
That said, just to raise a flag for future work -- I know because this has bit me many times in the past -- these code-as-text bits don't auto-evolve as the actual code/spec evolve over time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should eventually move to an IDL-like definition if we want the spec to be durable. But as John says, that is future work, not immediately needed (IMHO)
abstract_specification.md
Outdated
|
||
Parameters: | ||
|
||
- key: The key to add the new element at. This cannot already be a key of the Collection. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Collection -> collection
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
abstract_specification.md
Outdated
Parameters: | ||
|
||
- key: The key to add the new element at. This cannot already be a key of the Collection. | ||
- uri: An optional parameter to specify an exact URI to create the collection at. If the URI is relative, the new entry will be added as a relative URI. If the URI is absolute, the new entry will be added as an absolute URI. If not specified, the collection will generate a new URI for the entry based on a sanitized version of the key. When possible, this should be a relative URI based on the key. If a collection already exists at the user-provided URI, the operation should fail. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to say what relative URI does. (In particular, when we print URIs using the tiledb-py
Group
class, they print the same either way -- it's hard to "tell by looking".)
Namely:
- Relative means if you recursively copy/move the storage, the parents will see the children in their new location.
- Absolute means if you recursively copy/move the storage, the parents will see the children in their old location.
This must be stated explicitly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added discussion of relative and absolute URIs.
abstract_specification.md
Outdated
| write | Write a subset of data to the SOMADataFrame | | ||
| Operation | Description | | ||
| ---------------------------------------- | -------------------------------------------------- | | ||
| static create(uri, ...) -> SOMADataFrame | Create a SOMADataFrame. | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto re end w/ period or not
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lots of changes coming in hot
abstract_specification.md
Outdated
|
||
SOMA objects are open for **exclusively** reading or writing. | ||
When a SOMA object is open for writing, only metadata, the schema, and information directly derived from the schema may be read. | ||
Additionally, for Collection objects, the members of the collection are also accessible when the collection is open in write mode. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
abstract_specification.md
Outdated
``` | ||
|
||
``` | ||
soma_collection_t soma_collection_create(char *uri, ...) { ... } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was going for using a different function in each language. Using a different type will make this more obvious.
abstract_specification.md
Outdated
writer.close() | ||
|
||
post_close = soma_impl.open('file:///write/here') | ||
post_close.read(...) # WILL see the data written by write_data. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DONE
abstract_specification.md
Outdated
post_close.read(...) # WILL see the data written by write_data. | ||
``` | ||
|
||
Additionally, opened SOMA object instances are not guaranteed to see changes made by other writers after the open completed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Went with a slightly different wording here.
abstract_specification.md
Outdated
|
||
Parameters: | ||
|
||
- key: The key to add the new element at. This cannot already be a key of the Collection. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
abstract_specification.md
Outdated
Parameters: | ||
|
||
- key: The key to add the new element at. This cannot already be a key of the Collection. | ||
- uri: An optional parameter to specify an exact URI to create the collection at. If the URI is relative, the new entry will be added as a relative URI. If the URI is absolute, the new entry will be added as an absolute URI. If not specified, the collection will generate a new URI for the entry based on a sanitized version of the key. When possible, this should be a relative URI based on the key. If a collection already exists at the user-provided URI, the operation should fail. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added discussion of relative and absolute URIs.
abstract_specification.md
Outdated
Create a new SOMACollection with the user-specified URI. | ||
| Operation | Description | | ||
| --------------------------------------- | ---------------------------------------------------------------------------------------- | | ||
| get(string key) | Get the object associated with the key | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added for tables that have been changed.
abstract_specification.md
Outdated
| write | Write a subset of data to the SOMADataFrame | | ||
| Operation | Description | | ||
| ---------------------------------------- | -------------------------------------------------- | | ||
| static create(uri, ...) -> SOMADataFrame | Create a SOMADataFrame. | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
@bkmartinjr @mlin actually I would appreciate if you all took another look particularly for the description of URI semantics that I added. @johnkerl I clicked you by accident but if you want to give it another ✅ be my guest |
🆕 stuff about URIs and enums
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One suggestion for a clarification around relative URI resolution in collections. Otherwise LGTM
096f546
to
0706697
Compare
@atolopko-czi @aaronwolen since this is a spec change we do want unanimous approvals on this so please take a look. Thanks! |
@thetorpedodog ✅ ✅ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is really great
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Really great to read the official details on open/close, read/write, etc.! A number of very minor questions and suggestions. Approving as is, since I found no blockers, but have a look at my comments before merging.
abstract_specification.md
Outdated
- `.collection-info` | ||
- `absolute`: `file:///soma-dataset/collection/abspath` | ||
- `external-absolute`: `file:///soma-dataset/more-data/other-data` | ||
- `relative`: `relpath` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[minor] This is a little confusing to me, since this example seems to be mixing a representation of file system directory layout with collection contents, if I understand correctly. It's understandable, but could possibly separate these two representations for clarity. "If we have this collection: ... it will be stored in the filesystem as follows: ..."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated to represent .collection-info
as a real file and present its contents differently from other paths.
|
||
| Operation | Description | | ||
| ------------------------------ | ------------------------------------------------------------------------------------------------------------ | | ||
| static open(string uri, ...) | Opens the SOMA object (of the type it is attached to). | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should (or does) the spec define the behavior for attempting to open an object that does not exist or cannot be read? I suppose that's implementation-specific, such as whether an exception should be thrown or nil/null/none is returned. But it would seem reasonable to specify at the spec level that a SOMAObject will not instantiated at all in such cases. Rather than getting an object that can then be inspected for its validity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And I see the create()
operation does mention the error case behavior, below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a note in the section below.
abstract_specification.md
Outdated
|
||
The `create` static method **creates** a new stored SOMA object, then returns that object opened in write mode. | ||
|
||
If an object already exists at the given location, this should throw an error. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since data is not necessarily flushed until close, do we need to specify semantics for multiple processes attempting to create an object at the same URI in parallel? Is the "write lock" guaranteed upon the return of create() or close()?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a note that the creation operation should be atomic.
This completes all writes (if applicable) and releases other resources associated with this SOMA object. | ||
Implementations of `close` must be idempotent. | ||
|
||
If the implementation permits it, a write operation should function atomically; i.e. any other reader should not see partially-completed reads _until_ the close operation is called. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could be its own section, since it involves both open and close. And I see open doesn't mention atomicity semantics. Both open and close could have "see atomicity semantics section, below", e.g.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think after this it is certainly worth looking at reorganizing the way we discuss atomicity and read/write semantics; what we have is pretty complete but is scattered across multiple places.
Parameters: | ||
|
||
- key: The key to add the new element at. This cannot already be a key of the collection. | ||
- uri: An optional parameter to specify a URI to create the new collection, which may be [relative or absolute](#collection-entry-uris). If the URI is relative, the new entry will be added with that relative URI. If the URI is absolute, the new entry will be added with that absolute URI. If a collection already exists at the user-provided URI, the operation should fail. If the user does not specify a URI, the collection will generate a new URI for the entry. When possible, this should be a relative URI based on a sanitized version of the key. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do this need to mention that relative URIs cannot be "multi-level"? I think that's what the spec mentions above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My feeling is that this is covered by the linked reference to how relative URIs work, and that adding something specifically here would make this already-long paragraph even longer.
`add_new_collection` has an extra parameter allowing control over the type of collection added: | ||
|
||
``` | ||
add_new_collection(string key, CollectionType type, string uri = "", PlatformConfig platform_config) -> SOMACollection |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Completely out-of-scope for this PR, but I'm now just wondering why we didn't consider having explicit methods for the various types of collections? add_new_experiment
, add_new_measurement
, add_new_collection
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Guessing what my motives were back then: all collections work the same in terms of adding, so this avoided having three very similar methods. It’s not unreasonable to have one for each collection type, though.
| get schema -> Arrow.Schema | Return data schema, in the form of an Arrow Schema. | | ||
| get index_column_names -> [string, ...] | Return index (dimension) column names. | | ||
| get count -> int | Return the number of rows in the SOMADataFrame. | | ||
| read | Read a subset of data from the SOMADataFrame. | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Read a subset of data iteratively from the SOMADataFrame
? Whether or not one is reading a subset, the iterative nature of reads seems like a core part of the design, even at the spec level, given the returns type for DataFrame.read (and SparseNDArray.read())
While most people bothering to read a spec like this will understand that a subset includes the whole set, do we want to rely upon reader knowing that? "Read all or a subset..."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I’m inclined to think we’re OK here. If this deserves clarification anywhere, it would be in the section that describes the read and write methods.
| get is_sparse -> True | Return the constant True. | | ||
| get nnz -> uint | Return the number stored values in the array, including explicit zeros. | | ||
| read | Read a slice of data from the SOMASparseNDArray. | | ||
| write | Write a slice of data to the SOMASparseNDArray. | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
important to use slice
here instead of subset
as we do above? If it's a technical term we explicitly want, ok.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a reasonable question, but since this is a whitespace-change–only line, I think this is better handled in a followup.
@@ -608,6 +948,7 @@ Array read operations can return results in a variety of formats. The `SOMABatch | |||
Summary: | |||
|
|||
``` | |||
open(string uri, ...) -> SOMA object # identify a SOMA object and open it |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
open(string uri, ...) -> SOMA object # identify a SOMA object and open it | |
open(string uri, ...) -> SOMAObject # identify a SOMA object and open it |
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What I’m trying to imply here (maybe unsuccessfully) is that the thing you get from open
is some SOMA object, and not necessarily an object whose exact type is SOMAObject
.
Co-authored-by: Andrew Tolopko <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is it! Thank you everybody for the great feedback; it has helped a lot.
@@ -608,6 +948,7 @@ Array read operations can return results in a variety of formats. The `SOMABatch | |||
Summary: | |||
|
|||
``` | |||
open(string uri, ...) -> SOMA object # identify a SOMA object and open it |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What I’m trying to imply here (maybe unsuccessfully) is that the thing you get from open
is some SOMA object, and not necessarily an object whose exact type is SOMAObject
.
| get is_sparse -> True | Return the constant True. | | ||
| get nnz -> uint | Return the number stored values in the array, including explicit zeros. | | ||
| read | Read a slice of data from the SOMASparseNDArray. | | ||
| write | Write a slice of data to the SOMASparseNDArray. | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a reasonable question, but since this is a whitespace-change–only line, I think this is better handled in a followup.
| get schema -> Arrow.Schema | Return data schema, in the form of an Arrow Schema. | | ||
| get index_column_names -> [string, ...] | Return index (dimension) column names. | | ||
| get count -> int | Return the number of rows in the SOMADataFrame. | | ||
| read | Read a subset of data from the SOMADataFrame. | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I’m inclined to think we’re OK here. If this deserves clarification anywhere, it would be in the section that describes the read and write methods.
`add_new_collection` has an extra parameter allowing control over the type of collection added: | ||
|
||
``` | ||
add_new_collection(string key, CollectionType type, string uri = "", PlatformConfig platform_config) -> SOMACollection |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Guessing what my motives were back then: all collections work the same in terms of adding, so this avoided having three very similar methods. It’s not unreasonable to have one for each collection type, though.
Parameters: | ||
|
||
- key: The key to add the new element at. This cannot already be a key of the collection. | ||
- uri: An optional parameter to specify a URI to create the new collection, which may be [relative or absolute](#collection-entry-uris). If the URI is relative, the new entry will be added with that relative URI. If the URI is absolute, the new entry will be added with that absolute URI. If a collection already exists at the user-provided URI, the operation should fail. If the user does not specify a URI, the collection will generate a new URI for the entry. When possible, this should be a relative URI based on a sanitized version of the key. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My feeling is that this is covered by the linked reference to how relative URIs work, and that adding something specifically here would make this already-long paragraph even longer.
|
||
| Operation | Description | | ||
| ------------------------------ | ------------------------------------------------------------------------------------------------------------ | | ||
| static open(string uri, ...) | Opens the SOMA object (of the type it is attached to). | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a note in the section below.
abstract_specification.md
Outdated
|
||
The `create` static method **creates** a new stored SOMA object, then returns that object opened in write mode. | ||
|
||
If an object already exists at the given location, this should throw an error. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a note that the creation operation should be atomic.
This completes all writes (if applicable) and releases other resources associated with this SOMA object. | ||
Implementations of `close` must be idempotent. | ||
|
||
If the implementation permits it, a write operation should function atomically; i.e. any other reader should not see partially-completed reads _until_ the close operation is called. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think after this it is certainly worth looking at reorganizing the way we discuss atomicity and read/write semantics; what we have is pretty complete but is scattered across multiple places.
|
||
A SOMA collection also manages the lifecycle of objects directly instantiated by it. | ||
Objects accessed via getting a Collection element, or objects created with one of the <code>add_new\_<var>object_type</var></code> methods are considered "owned" by the collection. | ||
All such objects will be automatically closed together by [the collection's close operation](#operation-close-collection-types). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is mentioned in the detailed description of add_new_whatever
below.
abstract_specification.md
Outdated
- `.collection-info` | ||
- `absolute`: `file:///soma-dataset/collection/abspath` | ||
- `external-absolute`: `file:///soma-dataset/more-data/other-data` | ||
- `relative`: `relpath` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated to represent .collection-info
as a real file and present its contents differently from other paths.
This formalizes the object lifecycle and the various creation, opening, and closing methods we are already using in the tiledbsoma Python implementation.
See #107.