Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[c++, r] Update and refactor nanoarrow #2188

Merged
merged 39 commits into from
Apr 3, 2024

Conversation

eddelbuettel
Copy link
Contributor

@eddelbuettel eddelbuettel commented Mar 1, 2024

Issue and/or context:

We rely on nanoarrow to take advantage of higher-level structures for the (otherwise 'naked' void * of the C API of Arrow. nanoarrow helps here, and has made good strides in crecent releases.

Changes:

We update the 'vendored' nanoarrow to release 0.4.0, and make extended use of it. This works well: simple examples work fine and show the desired memory behaviour. However, on larger examples we see interactions with (local) handling of Arrow objects in libtiledbsoma so the scope of the PR will have to extended to handle allocating, releases and setting up Arrow objects via nanoarrow.

Notes for Reviewer:

This PR is informational to run some CI; all local tests pass. However, at present an invasice change in libtiledbsoma makes this unsuitable for a merge at present meaning it is also not yet fully ready for a review.
The secret cabal powers of CI are currently smiling upon us (after a lot of help from several people, and a last minute bug fix) so the 'draft' label is now off. There are remaining 'drafty' things here (such as a move of nanoarrow.* to a directory below external/ but we can do so in a later PR.

Copy link

This pull request has been linked to Shortcut Story #42316: [r] Refactor to make use of nanoarrow.

Copy link

codecov bot commented Mar 1, 2024

Codecov Report

Merging #2188 (7bb42fb) into main (b34e727) will decrease coverage by 11.48%.
Report is 1 commits behind head on main.
The diff coverage is 7.30%.

Additional details and impacted files
@@             Coverage Diff             @@
##             main    #2188       +/-   ##
===========================================
- Coverage   77.97%   66.49%   -11.48%     
===========================================
  Files         140      142        +2     
  Lines       10855    12743     +1888     
  Branches      217      511      +294     
===========================================
+ Hits         8464     8474       +10     
- Misses       2303     4169     +1866     
- Partials       88      100       +12     
Flag Coverage Δ
libtiledbsoma 38.85% <0.17%> (-24.77%) ⬇️
python 90.59% <ø> (ø)
r 72.80% <65.21%> (-2.32%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Components Coverage Δ
python_api 90.59% <ø> (ø)
libtiledbsoma 19.75% <0.17%> (-28.45%) ⬇️

@eddelbuettel eddelbuettel force-pushed the de/sc-42316/update_and_refactor_nanoarrow branch 7 times, most recently from dc7819d to 4964ec1 Compare March 8, 2024 17:58
@eddelbuettel eddelbuettel force-pushed the de/sc-42316/update_and_refactor_nanoarrow branch from 649dc55 to ab3f6d0 Compare March 15, 2024 17:21
@nguyenv nguyenv force-pushed the de/sc-42316/update_and_refactor_nanoarrow branch from b341e0e to ab3f6d0 Compare March 15, 2024 21:18
@eddelbuettel eddelbuettel force-pushed the de/sc-42316/update_and_refactor_nanoarrow branch 2 times, most recently from a73b741 to 52a2f4a Compare March 20, 2024 02:56
@eddelbuettel eddelbuettel force-pushed the de/sc-42316/update_and_refactor_nanoarrow branch from 21110d1 to bed7757 Compare March 27, 2024 12:36
@eddelbuettel eddelbuettel changed the title [r] [WIP - Do Not Merge Yet] Update and refactor nanoarrow [r] Update and refactor nanoarrow Mar 27, 2024
@eddelbuettel eddelbuettel force-pushed the de/sc-42316/update_and_refactor_nanoarrow branch 2 times, most recently from 2a27207 to 5f7e0e0 Compare April 1, 2024 19:39
@eddelbuettel
Copy link
Contributor Author

The dual nanoarrows (from inside the R package, and inside libtiledbsoma) can likely be remedied. The location in tiledbsoma should be in externals/ anyway -- a follow-up PR seems apt.

@johnkerl
Copy link
Member

johnkerl commented Apr 2, 2024

The dual nanoarrows (from inside the R package, and inside libtiledbsoma) can likely be remedied. The location in tiledbsoma should be in externals/ anyway -- a follow-up PR seems apt.

I created #2362 to track this

@eddelbuettel
Copy link
Contributor Author

@johnkerl One triplet of nanoarrow files is now gone.

@eddelbuettel eddelbuettel changed the title [r] Update and refactor nanoarrow [c++, r] Update and refactor nanoarrow Apr 2, 2024
apis/r/src/rutilities.h Outdated Show resolved Hide resolved
apis/r/src/rutilities.h Outdated Show resolved Hide resolved
@eddelbuettel eddelbuettel force-pushed the de/sc-42316/update_and_refactor_nanoarrow branch from 81ee27a to 5450dfe Compare April 2, 2024 22:45
Copy link
Member

@johnkerl johnkerl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚢

@johnkerl
Copy link
Member

johnkerl commented Apr 2, 2024

Thanks @eddelbuettel !!

Just to set an expectation, this may not backport automatically to release-1.9 since apis/r/DESCRIPTION is touched and that differs between main and release-1.9 -- not to worry, I'll do a manual backport tomorrow morning if so

@@ -108,17 +154,20 @@ std::unique_ptr<ArrowSchema> ArrowAdapter::arrow_schema_from_tiledb_array(
auto nattr = tiledb_schema.attribute_num();

std::unique_ptr<ArrowSchema> arrow_schema = std::make_unique<ArrowSchema>();
arrow_schema->format = "+s";
arrow_schema->format = strdup("+s");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know I'm the one who added this but I'm not sure that it's necessary here. I may have overcorrected.

Copy link
Contributor Author

@eddelbuettel eddelbuettel Apr 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is very much needed. It is just stylistic to use strdup() which allocates. If you peek into what nanoarrow does it actually allocates explicitly with malloc (as do I over in tiledb-r)

ArrowErrorCode ArrowSchemaSetFormat(struct ArrowSchema* schema, const char* format) {
  if (schema->format != NULL) {
    ArrowFree((void*)schema->format);
  }

  if (format != NULL) {
    size_t format_size = strlen(format) + 1;
    schema->format = (const char*)ArrowMalloc(format_size);
    if (schema->format == NULL) {
      return ENOMEM;
    }

    memcpy((void*)schema->format, format, format_size);
  } else {
    schema->format = NULL;
  }

  return NANOARROW_OK;
}

This is different from what it also has 'view' inits for schema which just borrow the pointed to memory because the contract the viewer can assume the viewed object exists.

In short, your strdup() was fine. And after 30+ plus years of C it possibly taught me another K+R (at style) function so all good!

Also note that if you peek at the matching release function they free() so there is an assumption of allocated memory here.

return NANOARROW_TYPE_LARGE_BINARY;
else
throw TileDBSOMAError(fmt::format(
"ArrowAdapter: Unsupported TileDB datatype string: {} ", sv));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nanoarrow format strings here, not TileDB

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually Arrow format string (which is the same as Arrow). Will adjust.

@eddelbuettel eddelbuettel force-pushed the de/sc-42316/update_and_refactor_nanoarrow branch from a27a233 to 7bb42fb Compare April 2, 2024 23:44
@eddelbuettel eddelbuettel merged commit aa0adcb into main Apr 3, 2024
15 checks passed
@eddelbuettel eddelbuettel deleted the de/sc-42316/update_and_refactor_nanoarrow branch April 3, 2024 01:32
github-actions bot pushed a commit that referenced this pull request Apr 3, 2024
* Update nanoarrow vendored files to nanoarrow 0.4.0

* Low-level wiring of nanoarrow at sr_* level

* More lower-level wiring of nanoarrow

* WIP snapshot with nanoarrow wired into libtiledbsoma

* Ensure nullable is set correctly in either case

* Context wrapped in a special purpose struct should not finalize

* Simpler and faster r-ci.yaml

* Use nanoarrow 0.4.0 consistently

* Refined arrow_adapter

* Set increased timeout for download.file to survive GH flakyness

* Turn trace back of, do not include carrow in cli

* Do not include carrow.h in reindexer.cc

* WIP changes expanding type map, suppressing schema release

* [c++] Fix segfault issues

* Add additional necessary strdup

* No longer to protect one statement

* Support TILEDB_DATETIME_DAY aka Date as well

* Meh

* Meh with version 14.0.0 and not 14.0.6 because ... sure

* Remove initialization setters covered by nanoarrow use

* Ensure DATETIME columns get Arrow coltype reset

* Add more date and datetime support

* Additional conversion

* Post-rebase change

* Heeding time to the lord of linting is time well spent some say

* Heeding time to the lord of linting is time well spent some say

* Correct another delete to free

* Additional non-nullptr protection

* make format

* Additional test conditioner

* Correcting one buffer size selection

* make format

* Remove carrow.h and reference to it

* Cleanups

* Use nanoarrow.{c,hpp} via tiledbsoma/utils/

* Re-activate -Werror

* Chore

* High-productivity afternoon

* Correct an format string error message

---------

Co-authored-by: Vivian Nguyen <[email protected]>
johnkerl pushed a commit that referenced this pull request Apr 3, 2024
* Update nanoarrow vendored files to nanoarrow 0.4.0

* Low-level wiring of nanoarrow at sr_* level

* More lower-level wiring of nanoarrow

* WIP snapshot with nanoarrow wired into libtiledbsoma

* Ensure nullable is set correctly in either case

* Context wrapped in a special purpose struct should not finalize

* Simpler and faster r-ci.yaml

* Use nanoarrow 0.4.0 consistently

* Refined arrow_adapter

* Set increased timeout for download.file to survive GH flakyness

* Turn trace back of, do not include carrow in cli

* Do not include carrow.h in reindexer.cc

* WIP changes expanding type map, suppressing schema release

* [c++] Fix segfault issues

* Add additional necessary strdup

* No longer to protect one statement

* Support TILEDB_DATETIME_DAY aka Date as well

* Meh

* Meh with version 14.0.0 and not 14.0.6 because ... sure

* Remove initialization setters covered by nanoarrow use

* Ensure DATETIME columns get Arrow coltype reset

* Add more date and datetime support

* Additional conversion

* Post-rebase change

* Heeding time to the lord of linting is time well spent some say

* Heeding time to the lord of linting is time well spent some say

* Correct another delete to free

* Additional non-nullptr protection

* make format

* Additional test conditioner

* Correcting one buffer size selection

* make format

* Remove carrow.h and reference to it

* Cleanups

* Use nanoarrow.{c,hpp} via tiledbsoma/utils/

* Re-activate -Werror

* Chore

* High-productivity afternoon

* Correct an format string error message

---------

Co-authored-by: Dirk Eddelbuettel <[email protected]>
Co-authored-by: Vivian Nguyen <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants