Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[spec] Specify object lifecycle and related operations #118

Merged
merged 8 commits into from
Feb 14, 2023

Conversation

thetorpedodog
Copy link
Contributor

This formalizes the object lifecycle and the various creation, opening, and closing methods we are already using in the tiledbsoma Python implementation.

See #107.

This formalizes the object lifecycle and the various creation, opening,
and closing methods we are already using in the tiledbsoma Python
implementation.
bkmartinjr
bkmartinjr previously approved these changes Feb 9, 2023
Copy link
Member

@bkmartinjr bkmartinjr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caveat: I did a very fast review. But based upon that, it looks good to me.

mlin
mlin previously approved these changes Feb 9, 2023
Copy link
Member

@mlin mlin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, one minor language tightening suggested inline

abstract_specification.md Outdated Show resolved Hide resolved
@johnkerl johnkerl changed the title Specify object lifecycle and related operations. [python] Specify object lifecycle and related operations Feb 10, 2023

SOMA objects are open for **exclusively** reading or writing.
When a SOMA object is open for writing, only metadata, the schema, and information directly derived from the schema may be read.
Additionally, for Collection objects, the members of the collection are also accessible when the collection is open in write mode.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

say either Collection or collection but not Collection -- either it's a code reference or just common noun

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

writable_collection = soma_impl.open('file:///some/collection', 'w')
dataframe = writable_collection['dataframe'] # OK
dataframe.schema # OK
dataframe.read(...) # This is NOT allowed and will cause an error.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd replace every 'This is NOT allowed' with more simply 'Not allowed' -- it's a little less shouty

I do get the idea of drawing attention to what they can't do -- but this entire section is about the allows/disallows. It needs no further highlighting.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For clarity, our internal specs tend to always use the RFC2119 standard (i.e., consistently use NOT), but given that this doc has not already adopted it, that is probably an aspirational change.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

an author OUGHT TO use the caps consistently

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bkmartinjr TIL about RFC2119! :)

If we want to go that route, sure.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there are very good reasons for it - which is why the IETF uses it (as does many other orgs like W3C). It is a future decision which should not hold up this PR -- but coincidentally the topic was it was just raised inside CZI (regarding this very doc being unclear).

Here is an example of one of our specs that strongly benefits from it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I filed #122 -- this is awesome info -- thanks @bkmartinjr !! :)

```

```
soma_collection_t soma_collection_create(char *uri, ...) { ... }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

did you mean soma_collection_open here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was going for using a different function in each language. Using a different type will make this more obvious.

abstract_specification.md Show resolved Hide resolved
writer.close()

post_close = soma_impl.open('file:///write/here')
post_close.read(...) # WILL see the data written by write_data.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WILL -> will

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DONE

For example:

```
root = soma_impl.open('path://netloc/root/collection', 'w')
Copy link
Member

@johnkerl johnkerl Feb 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I admire and applaud the explicit code bits -- speaking as an example of a non-super-abstract thinker (and, I claim, a good representative of a large proportion of the general reading public -- in contrast to the small number of amazing few people who are capable of thinking/thriving solely in the abstract, who often end up doing high-level design in their professional careers) I assert that this (pesudo)code is incredibly helpful.

That said, just to raise a flag for future work -- I know because this has bit me many times in the past -- these code-as-text bits don't auto-evolve as the actual code/spec evolve over time.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should eventually move to an IDL-like definition if we want the spec to be durable. But as John says, that is future work, not immediately needed (IMHO)

abstract_specification.md Show resolved Hide resolved

Parameters:

- key: The key to add the new element at. This cannot already be a key of the Collection.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Collection -> collection

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Parameters:

- key: The key to add the new element at. This cannot already be a key of the Collection.
- uri: An optional parameter to specify an exact URI to create the collection at. If the URI is relative, the new entry will be added as a relative URI. If the URI is absolute, the new entry will be added as an absolute URI. If not specified, the collection will generate a new URI for the entry based on a sanitized version of the key. When possible, this should be a relative URI based on the key. If a collection already exists at the user-provided URI, the operation should fail.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to say what relative URI does. (In particular, when we print URIs using the tiledb-py Group class, they print the same either way -- it's hard to "tell by looking".)

Namely:

  • Relative means if you recursively copy/move the storage, the parents will see the children in their new location.
  • Absolute means if you recursively copy/move the storage, the parents will see the children in their old location.

This must be stated explicitly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added discussion of relative and absolute URIs.

| write | Write a subset of data to the SOMADataFrame |
| Operation | Description |
| ---------------------------------------- | -------------------------------------------------- |
| static create(uri, ...) -> SOMADataFrame | Create a SOMADataFrame. |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto re end w/ period or not

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@thetorpedodog thetorpedodog changed the title [python] Specify object lifecycle and related operations [spec] Specify object lifecycle and related operations. Feb 10, 2023
Copy link
Contributor Author

@thetorpedodog thetorpedodog left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lots of changes coming in hot


SOMA objects are open for **exclusively** reading or writing.
When a SOMA object is open for writing, only metadata, the schema, and information directly derived from the schema may be read.
Additionally, for Collection objects, the members of the collection are also accessible when the collection is open in write mode.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

```

```
soma_collection_t soma_collection_create(char *uri, ...) { ... }
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was going for using a different function in each language. Using a different type will make this more obvious.

abstract_specification.md Show resolved Hide resolved
writer.close()

post_close = soma_impl.open('file:///write/here')
post_close.read(...) # WILL see the data written by write_data.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DONE

post_close.read(...) # WILL see the data written by write_data.
```

Additionally, opened SOMA object instances are not guaranteed to see changes made by other writers after the open completed.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Went with a slightly different wording here.

abstract_specification.md Show resolved Hide resolved

Parameters:

- key: The key to add the new element at. This cannot already be a key of the Collection.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Parameters:

- key: The key to add the new element at. This cannot already be a key of the Collection.
- uri: An optional parameter to specify an exact URI to create the collection at. If the URI is relative, the new entry will be added as a relative URI. If the URI is absolute, the new entry will be added as an absolute URI. If not specified, the collection will generate a new URI for the entry based on a sanitized version of the key. When possible, this should be a relative URI based on the key. If a collection already exists at the user-provided URI, the operation should fail.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added discussion of relative and absolute URIs.

Create a new SOMACollection with the user-specified URI.
| Operation | Description |
| --------------------------------------- | ---------------------------------------------------------------------------------------- |
| get(string key) | Get the object associated with the key |
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added for tables that have been changed.

| write | Write a subset of data to the SOMADataFrame |
| Operation | Description |
| ---------------------------------------- | -------------------------------------------------- |
| static create(uri, ...) -> SOMADataFrame | Create a SOMADataFrame. |
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@thetorpedodog
Copy link
Contributor Author

@bkmartinjr @mlin actually I would appreciate if you all took another look particularly for the description of URI semantics that I added.

@johnkerl I clicked you by accident but if you want to give it another ✅ be my guest

@thetorpedodog thetorpedodog dismissed stale reviews from mlin and bkmartinjr February 10, 2023 18:14

🆕 stuff about URIs and enums

Copy link
Member

@bkmartinjr bkmartinjr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One suggestion for a clarification around relative URI resolution in collections. Otherwise LGTM

@johnkerl johnkerl changed the title [spec] Specify object lifecycle and related operations. [spec] Specify object lifecycle and related operations Feb 10, 2023
@thetorpedodog
Copy link
Contributor Author

@atolopko-czi @aaronwolen since this is a spec change we do want unanimous approvals on this so please take a look. Thanks!

@johnkerl
Copy link
Member

@thetorpedodog ✅ ✅

Copy link
Member

@aaronwolen aaronwolen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is really great

Copy link
Member

@atolopko-czi atolopko-czi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really great to read the official details on open/close, read/write, etc.! A number of very minor questions and suggestions. Approving as is, since I found no blockers, but have a look at my comments before merging.

- `.collection-info`
- `absolute`: `file:///soma-dataset/collection/abspath`
- `external-absolute`: `file:///soma-dataset/more-data/other-data`
- `relative`: `relpath`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[minor] This is a little confusing to me, since this example seems to be mixing a representation of file system directory layout with collection contents, if I understand correctly. It's understandable, but could possibly separate these two representations for clarity. "If we have this collection: ... it will be stored in the filesystem as follows: ..."

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated to represent .collection-info as a real file and present its contents differently from other paths.

abstract_specification.md Outdated Show resolved Hide resolved

| Operation | Description |
| ------------------------------ | ------------------------------------------------------------------------------------------------------------ |
| static open(string uri, ...) | Opens the SOMA object (of the type it is attached to). |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should (or does) the spec define the behavior for attempting to open an object that does not exist or cannot be read? I suppose that's implementation-specific, such as whether an exception should be thrown or nil/null/none is returned. But it would seem reasonable to specify at the spec level that a SOMAObject will not instantiated at all in such cases. Rather than getting an object that can then be inspected for its validity.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And I see the create() operation does mention the error case behavior, below.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a note in the section below.


The `create` static method **creates** a new stored SOMA object, then returns that object opened in write mode.

If an object already exists at the given location, this should throw an error.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since data is not necessarily flushed until close, do we need to specify semantics for multiple processes attempting to create an object at the same URI in parallel? Is the "write lock" guaranteed upon the return of create() or close()?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a note that the creation operation should be atomic.

This completes all writes (if applicable) and releases other resources associated with this SOMA object.
Implementations of `close` must be idempotent.

If the implementation permits it, a write operation should function atomically; i.e. any other reader should not see partially-completed reads _until_ the close operation is called.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could be its own section, since it involves both open and close. And I see open doesn't mention atomicity semantics. Both open and close could have "see atomicity semantics section, below", e.g.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think after this it is certainly worth looking at reorganizing the way we discuss atomicity and read/write semantics; what we have is pretty complete but is scattered across multiple places.

Parameters:

- key: The key to add the new element at. This cannot already be a key of the collection.
- uri: An optional parameter to specify a URI to create the new collection, which may be [relative or absolute](#collection-entry-uris). If the URI is relative, the new entry will be added with that relative URI. If the URI is absolute, the new entry will be added with that absolute URI. If a collection already exists at the user-provided URI, the operation should fail. If the user does not specify a URI, the collection will generate a new URI for the entry. When possible, this should be a relative URI based on a sanitized version of the key.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do this need to mention that relative URIs cannot be "multi-level"? I think that's what the spec mentions above.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My feeling is that this is covered by the linked reference to how relative URIs work, and that adding something specifically here would make this already-long paragraph even longer.

`add_new_collection` has an extra parameter allowing control over the type of collection added:

```
add_new_collection(string key, CollectionType type, string uri = "", PlatformConfig platform_config) -> SOMACollection
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Completely out-of-scope for this PR, but I'm now just wondering why we didn't consider having explicit methods for the various types of collections? add_new_experiment, add_new_measurement, add_new_collection.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Guessing what my motives were back then: all collections work the same in terms of adding, so this avoided having three very similar methods. It’s not unreasonable to have one for each collection type, though.

| get schema -> Arrow.Schema | Return data schema, in the form of an Arrow Schema. |
| get index_column_names -> [string, ...] | Return index (dimension) column names. |
| get count -> int | Return the number of rows in the SOMADataFrame. |
| read | Read a subset of data from the SOMADataFrame. |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Read a subset of data iteratively from the SOMADataFrame? Whether or not one is reading a subset, the iterative nature of reads seems like a core part of the design, even at the spec level, given the returns type for DataFrame.read (and SparseNDArray.read())

While most people bothering to read a spec like this will understand that a subset includes the whole set, do we want to rely upon reader knowing that? "Read all or a subset..."

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I’m inclined to think we’re OK here. If this deserves clarification anywhere, it would be in the section that describes the read and write methods.

| get is_sparse -> True | Return the constant True. |
| get nnz -> uint | Return the number stored values in the array, including explicit zeros. |
| read | Read a slice of data from the SOMASparseNDArray. |
| write | Write a slice of data to the SOMASparseNDArray. |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

important to use slice here instead of subset as we do above? If it's a technical term we explicitly want, ok.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a reasonable question, but since this is a whitespace-change–only line, I think this is better handled in a followup.

@@ -608,6 +948,7 @@ Array read operations can return results in a variety of formats. The `SOMABatch
Summary:

```
open(string uri, ...) -> SOMA object # identify a SOMA object and open it
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
open(string uri, ...) -> SOMA object # identify a SOMA object and open it
open(string uri, ...) -> SOMAObject # identify a SOMA object and open it

?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I’m trying to imply here (maybe unsuccessfully) is that the thing you get from open is some SOMA object, and not necessarily an object whose exact type is SOMAObject.

Copy link
Contributor Author

@thetorpedodog thetorpedodog left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is it! Thank you everybody for the great feedback; it has helped a lot.

@@ -608,6 +948,7 @@ Array read operations can return results in a variety of formats. The `SOMABatch
Summary:

```
open(string uri, ...) -> SOMA object # identify a SOMA object and open it
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I’m trying to imply here (maybe unsuccessfully) is that the thing you get from open is some SOMA object, and not necessarily an object whose exact type is SOMAObject.

| get is_sparse -> True | Return the constant True. |
| get nnz -> uint | Return the number stored values in the array, including explicit zeros. |
| read | Read a slice of data from the SOMASparseNDArray. |
| write | Write a slice of data to the SOMASparseNDArray. |
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a reasonable question, but since this is a whitespace-change–only line, I think this is better handled in a followup.

| get schema -> Arrow.Schema | Return data schema, in the form of an Arrow Schema. |
| get index_column_names -> [string, ...] | Return index (dimension) column names. |
| get count -> int | Return the number of rows in the SOMADataFrame. |
| read | Read a subset of data from the SOMADataFrame. |
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I’m inclined to think we’re OK here. If this deserves clarification anywhere, it would be in the section that describes the read and write methods.

`add_new_collection` has an extra parameter allowing control over the type of collection added:

```
add_new_collection(string key, CollectionType type, string uri = "", PlatformConfig platform_config) -> SOMACollection
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Guessing what my motives were back then: all collections work the same in terms of adding, so this avoided having three very similar methods. It’s not unreasonable to have one for each collection type, though.

Parameters:

- key: The key to add the new element at. This cannot already be a key of the collection.
- uri: An optional parameter to specify a URI to create the new collection, which may be [relative or absolute](#collection-entry-uris). If the URI is relative, the new entry will be added with that relative URI. If the URI is absolute, the new entry will be added with that absolute URI. If a collection already exists at the user-provided URI, the operation should fail. If the user does not specify a URI, the collection will generate a new URI for the entry. When possible, this should be a relative URI based on a sanitized version of the key.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My feeling is that this is covered by the linked reference to how relative URIs work, and that adding something specifically here would make this already-long paragraph even longer.


| Operation | Description |
| ------------------------------ | ------------------------------------------------------------------------------------------------------------ |
| static open(string uri, ...) | Opens the SOMA object (of the type it is attached to). |
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a note in the section below.


The `create` static method **creates** a new stored SOMA object, then returns that object opened in write mode.

If an object already exists at the given location, this should throw an error.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a note that the creation operation should be atomic.

This completes all writes (if applicable) and releases other resources associated with this SOMA object.
Implementations of `close` must be idempotent.

If the implementation permits it, a write operation should function atomically; i.e. any other reader should not see partially-completed reads _until_ the close operation is called.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think after this it is certainly worth looking at reorganizing the way we discuss atomicity and read/write semantics; what we have is pretty complete but is scattered across multiple places.


A SOMA collection also manages the lifecycle of objects directly instantiated by it.
Objects accessed via getting a Collection element, or objects created with one of the <code>add_new\_<var>object_type</var></code> methods are considered "owned" by the collection.
All such objects will be automatically closed together by [the collection's close operation](#operation-close-collection-types).
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is mentioned in the detailed description of add_new_whatever below.

- `.collection-info`
- `absolute`: `file:///soma-dataset/collection/abspath`
- `external-absolute`: `file:///soma-dataset/more-data/other-data`
- `relative`: `relpath`
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated to represent .collection-info as a real file and present its contents differently from other paths.

@thetorpedodog thetorpedodog merged commit 582f12e into main Feb 14, 2023
@thetorpedodog thetorpedodog deleted the object-lifecycle branch February 14, 2023 21:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants