Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[c++] Refactor metadata and create to respect timestamps #2180

Merged
merged 3 commits into from
Apr 9, 2024

Conversation

nguyenv
Copy link
Member

@nguyenv nguyenv commented Mar 1, 2024

Issue and/or context:

#2182

I am refactoring the create and metadata methods in C++ after I noticed issues while working on transitioning the Python SOMADataFrame write path to use clib.DataFrame.

Changes:

  • Store read-mode Array or Group that holds metadata values valid as a class memeber
    // Metadata values need to be accessible in write mode as well. When adding
    // or deleting values in the array, instead of closing to update to
    // metadata; then reopening to read the array; and again reopening to
    // restore the array back to write mode, we just store the modifications to
    // this cache
    std::map<std::string, MetadataValue> metadata_;

    // Array associated with metadata_. We need to keep this read-mode array
    // alive in order for the metadata value pointers in the cache to be
    // accessible
    std::shared_ptr<Array> meta_cache_arr_;
  • create methods now take in timestamps which indicate when the metadata values for soma_object_type and encoding_version should be written (this is consistent to how it is done in the TileDB-SOMA API) and when the write-mode SOMAObject should be opened
  • create methods now only open Array or Group once
  • Make soma_object_type, encoding_version, and the current encoding version (1) consts
  • Use keystroke saver TimestampRange type
  • Test that SOMAObjects are writing and reading metadata correctly with timestamps
  • Test that array returned by create can be written to

Notes for Reviewer:

These changes have been pulled out of a larger branch: https://github.com/single-cell-data/TileDB-SOMA/tree/viviannguyen/array-write-path

Copy link

codecov bot commented Mar 1, 2024

Codecov Report

Merging #2180 (ca62162) into main (e0e3fa4) will not change coverage.
The diff coverage is n/a.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #2180   +/-   ##
=======================================
  Coverage   90.61%   90.61%           
=======================================
  Files          37       37           
  Lines        3900     3900           
=======================================
  Hits         3534     3534           
  Misses        366      366           
Flag Coverage Δ
python 90.61% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

Components Coverage Δ
python_api 90.61% <ø> (ø)
libtiledbsoma ∅ <ø> (∅)

@nguyenv nguyenv force-pushed the viviannguyen/refactor-metadata branch 5 times, most recently from 056d4b9 to bea1599 Compare March 1, 2024 04:18
@johnkerl
Copy link
Member

johnkerl commented Mar 1, 2024

This is not associated with any open GitHub issue
@nguyenv for tracking purposes, please create one.

Copy link
Contributor

@eddelbuettel eddelbuettel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Looks good to me but is also rather comprehensive so may be prudent to gather another +1 before merging.

@nguyenv
Copy link
Member Author

nguyenv commented Mar 1, 2024

👍

Looks good to me but is also rather comprehensive so may be prudent to gather another +1 before merging.

Gotcha, will wait for another review.

And I'll try to break these PRs up even more next time 😅

Copy link
Collaborator

@jp-dark jp-dark left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR looks good to me with the caveat that there were a lot of changes, and I'm not entirely sure I understood the implications of everything. I do have two comments:

  1. Storing the metadata in memory without re-syncing after a write may lead to an in memory representation of the metadata that does not match the actual metadata on disk if there was another write. This is fine, so long as we are aware and okay with that. (This could be the case even if we did re-open the array, it's just more likely to get out of sync here.)
  2. I added a couple nits about not using or removing brackets around a single line if statement, but there were a lot of other places that the brackets were not used or removed. The convention in the core TileDB library is to always have brackets, but I didn't want to flag all of the instances here if we intentionally have a different convention in this library.

Comment on lines +725 to +729
// Metadata values need to be accessible in write mode as well. When adding
// or deleting values in the array, instead of closing to update to
// metadata; then reopening to read the array; and again reopening to
// restore the array back to write mode, we just store the modifications to
// this cache
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there are other writes that happen simultaneously, this metadata map might not reflect any actual representation that you could obtain (even with time traveling). Are we okay with this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would be a better way to cache the metadata so that we deal with simulatenous writes correctly?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-reading the data from disk is safer but slower, so it really just depends on what trade-offs we want to make.

libtiledbsoma/src/soma/soma_array.cc Outdated Show resolved Hide resolved
libtiledbsoma/src/soma/soma_array.cc Outdated Show resolved Hide resolved
libtiledbsoma/src/utils/common.h Show resolved Hide resolved
@johnkerl johnkerl changed the title [c++] Refactor metadata [c++] Refactor metadata and create to respect timestamps Mar 11, 2024
@johnkerl
Copy link
Member

johnkerl commented Apr 8, 2024

@nguyenv what else needs to happen to get this PR merged? Please let me know if/how I can help.

nguyenv added 3 commits April 8, 2024 22:43
* Store read-mode `Array` or `Group` that holds metadata values valid as
  a class memeber
* `create` methods take in timestamps which indicate when the metadata
  values for `soma_object_type` and `encoding_version` should be written
  and when the write-mode `SOMAObject` should be opened
* Make `soma_object_type` and `encoding_version` consts
* Use keystroke saver `TimestampRange`
* Refactor unit tests to reflect these changes
@nguyenv nguyenv force-pushed the viviannguyen/refactor-metadata branch from 2b355c7 to ca62162 Compare April 9, 2024 03:43
@nguyenv nguyenv merged commit 65bc19f into main Apr 9, 2024
23 checks passed
@nguyenv nguyenv deleted the viviannguyen/refactor-metadata branch April 9, 2024 04:49
@johnkerl
Copy link
Member

johnkerl commented Apr 9, 2024

Thanks @nguyenv !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[c++] Refactor metadata and create to respect timestamps
5 participants