Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: introduce feature flags to select major arrow versions #654

Draft
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

rtyler
Copy link
Member

@rtyler rtyler commented Jan 20, 2025

This change introduces arrow_53 and arrow_54 feature flags on kernel which are required when using default-engine or sync-engine. Fundamentally we must push users of the crate to select their arrow major version through flags since Cargo will include multiple major versions in the dependency tree which can cause ABI breakages when passing around symbols such as RecordBatch

See #640

@github-actions github-actions bot added the breaking-change Change that will require a version bump label Jan 20, 2025
@rtyler rtyler force-pushed the arrow-version-compat branch from 87f2330 to 212b392 Compare January 20, 2025 19:47
@rtyler rtyler requested review from nicklan and zachschuermann and removed request for nicklan January 20, 2025 19:58
@rtyler
Copy link
Member Author

rtyler commented Jan 20, 2025

@nicklan @zachschuermann A lot of the ❌ are due to the use of --all-features in the GitHub Actions invocations. Unfortunately this will need to be worked around since there are now mutually exclusive features present 😞

Comment on lines 50 to 60
# Arrow supported versions
## 53
# Used in default engine
arrow-buffer = { workspace = true, optional = true }
arrow-array = { workspace = true, optional = true, features = ["chrono-tz"] }
arrow-select = { workspace = true, optional = true }
arrow-arith = { workspace = true, optional = true }
arrow-cast = { workspace = true, optional = true }
arrow-json = { workspace = true, optional = true }
arrow-ord = { workspace = true, optional = true }
arrow-schema = { workspace = true, optional = true }
arrow_53 = { package = "arrow", version = "53", features = ["chrono-tz", "json", "prettyprint"], optional = true }
# Used in default and sync engine
parquet_53 = { package = "parquet", version = "53", features = ["async", "object_store"] , optional = true }
######
## 54
arrow_54 = { package = "arrow", version = "54", features = ["chrono-tz", "json", "prettyprint"], optional = true }
parquet_54 = { package = "parquet", version = "54", features = ["async", "object_store"] , optional = true }
######
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nicklan @zachschuermann A lot of the ❌ are due to the use of --all-features in the GitHub Actions invocations. Unfortunately this will need to be worked around since there are now mutually exclusive features present 😞

Crazy idea... what if we defined the features as lower bounds instead of equality matches? So if e.g. arrow_min_54 feature is requested, that forces the version to be at least arrow-54; selecting arrow_min_53 would not cause any problems in that case, because both conditions are satisfied by choosing arrow-54. Tho maybe it would serve our purposes better to define arrow_max_53 and arrow_max_54 instead (which, if both are requested, chooses arrow-53)? Then --all-features would still compile and if somebody only specified one of those it would have the desired effect?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a nice idea, but I don't think cargo can support it. That is, you do have to specify something like:

arrow_53 = { package = "arrow", version = "53", features = ["chrono-tz", "json", "prettyprint"], optional = true }
arrow_54 = { package = "arrow", version = "54", features = ["chrono-tz", "json", "prettyprint"], optional = true }

Somewhere to specify the optional dependencies. So now if we want a feature like arrow_max_54, we need to have that depend on.... something. What exactly it would depend on would depend on what other flags you had enabled, but afaik cargo does not support logic in the toml so you can't unify the flags and pick the right dependency.

Copy link
Collaborator

@scovich scovich Jan 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are two parts to this trick:

  • cargo.toml brings in arrow@53 as arrow_53 etc. This effectively bifurcates the arrow crate into multiple apparent crates, and would remain unchanged.
  • the extern crate trickery that decides which of the available versions to map back to arrow crate name. This is the part that might use the lower/upper bound concept.

But in retrospect, you're probably right... we can't use the original names to make decisions for the extern crate decl, and if we want e.g. arrow_max_54 to potentially point to arrow-53, then we'd have to bring both arrow-54 and arrow-53 as dependencies, which at last partly defeats the purpose of splitting them out.

Maybe it still works, tho? Technically both arrow versions would be dependencies, but only the one mapped in as an extern crate would actually get used by any kernel code?

(ugh, cargo... this shouldn't be so difficult!)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, we could do some logic to only pull in the "right" version as "arrow", but having the multiple deps in the tree feels... gross.

Agree, cargo shouldn't make this so hard!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@scovich unfortunately @nicklan is right, there is very little support here for cleverness 😄

acceptance/src/data.rs Outdated Show resolved Hide resolved
Comment on lines +89 to +90
pub mod arrow;
pub mod parquet;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we might be able to make this cleaner with some extern crate trickery:

Suggested change
pub mod arrow;
pub mod parquet;
#[cfg(feature = "arrow_53")]
macro_rules! declare_extern_arrow_and_parquet {
() => {
pub extern crate arrow_53 as arrow;
pub extern crate parquet_53 as parquet;
};
}
#[cfg(feature = "arrow_54")]
macro_rules! declare_extern_arrow_and_parquet {
() => {
pub extern crate arrow_54 as arrow;
pub extern crate parquet_54 as parquet;
};
}
declare_extern_arrow_and_parquet!();

At least, it seemed to have the desired effect when I tested a bit locally.

The macro is required because apparently extern crate is immune to conditional compilation and you get double-import compilation failures trying to do it directly... but a conditionally-defined macro that defines the extern crates works fine. 🤷

We probably need to pub export the macro so that the other kernel crates can use it.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh that seems like a fun trick - I guess could omit our arrow/parquet modules with this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@scovich While that might work, I would encourage us to not use macros for this usecase as I don't think it presents readability or performance benefits for a simple symbol re-export

Copy link
Collaborator

@zachschuermann zachschuermann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

everything looks reasonable :) I do wonder if we could use some of @scovich's tricks to avoid mutually-exclusive features? Though it does seem like this PR follows a somewhat typical approach of allowing major version selection of dependencies...

@@ -20,24 +20,9 @@ license = "Apache-2.0"
repository = "https://github.com/delta-io/delta-kernel-rs"
readme = "README.md"
rust-version = "1.80"
version = "0.6.1"
version = "0.7.0"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I usually do the updates to version all at once during a release

Suggested change
version = "0.7.0"
version = "0.6.1"

@@ -24,7 +24,7 @@ url = "2"
delta_kernel = { path = "../kernel", default-features = false, features = [
"developer-visibility",
] }
delta_kernel_ffi_macros = { path = "../ffi-proc-macros", version = "0.6.1" }
delta_kernel_ffi_macros = { path = "../ffi-proc-macros", version = "0.7.0" }
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
delta_kernel_ffi_macros = { path = "../ffi-proc-macros", version = "0.7.0" }
delta_kernel_ffi_macros = { path = "../ffi-proc-macros", version = "0.6.1" }

@@ -39,27 +39,29 @@ uuid = "1.10.0"
z85 = "3.0.5"

# bring in our derive macros
delta_kernel_derive = { path = "../derive-macros", version = "0.6.1" }
delta_kernel_derive = { path = "../derive-macros", version = "0.7.0" }
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
delta_kernel_derive = { path = "../derive-macros", version = "0.7.0" }
delta_kernel_derive = { path = "../derive-macros", version = "0.6.1" }

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

error: failed to select a version for the requirement `delta_kernel_derive = "^0.6.1"`
candidate versions found which didn't match: 0.7.0
location searched: /home/tyler/source/github/delta-io/delta-kernel-rs/derive-macros
required by package `delta_kernel v0.7.0 (/home/tyler/source/github/delta-io/delta-kernel-rs/kernel)`
```

I promise I didn't make this change willy nilly :smile:  The [multiple locations](https://doc.rust-lang.org/cargo/reference/specifying-dependencies.html#multiple-locations) support in Cargo requires that the version matches the local dependency in order for it to be used, and the local version is 0.7.0 because the version is defined in the workspace and inherited by `derive-macros/Cargo.toml`

Comment on lines +1 to +2
//! This module exists to help re-export the version of arrow used by default-gengine and other
//! parts of kernel that need arrow
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
//! This module exists to help re-export the version of arrow used by default-gengine and other
//! parts of kernel that need arrow
//! This module exists to re-export the version of arrow used by default-engine and other
//! parts of kernel that need arrow

Comment on lines +89 to +90
pub mod arrow;
pub mod parquet;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh that seems like a fun trick - I guess could omit our arrow/parquet modules with this?

Copy link
Collaborator

@nicklan nicklan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks! A couple of small things, and we should be sure ryan's idea doesn't work, but otherwise looks great.

//! This module exists to help re-export the version of arrow used by default-gengine and other
//! parts of kernel that need arrow

#[cfg(all(feature = "arrow_53", feature = "arrow_54"))]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oof, this #cfg is gonna get ugly as the number of supported arrow versions goes up. we can probably make it a little better by not using all in that case, and having multiple checks, but it's still not gonna be great. I don't think there's much better than that though sadly.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought the plan was to only support 2-3 versions max at a time? Tho even 3 already makes this pretty ugly... and agree we'd need to split it into multiple checks, e.g. all(v1, v2); all(v1, v3); all(v2, v3).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that's the plan, but if we upgrade and it doesn't need any changes in kernel there seems like there wouldn't be a need to remove an old version, so I was imagining we could go higher as long as arrow wasn't breaking things for us.

//! This module exists to help re-export the version of arrow used by default-gengine and other
//! parts of kernel that need arrow

#[cfg(all(feature = "arrow_53", feature = "arrow_54"))]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these should be the parquet flags:

Suggested change
#[cfg(all(feature = "arrow_53", feature = "arrow_54"))]
#[cfg(all(feature = "parquet_53", feature = "parquet_54"))]

same below.

But that said, do you think we ever want say arrow_53 and parquet_54? I thought they always co-versioned. Could we just have the arrow feature flags and pull the appropriate parquet crate?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point... would it even be safe to use arrow-54 and parquet-53 or vice-versa? Seems like we'd quickly run into ABI compat issues?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is definitely not safe to use differing versions of parquet and arrow crates together, we'd run into the same symbol conflicts that led to this work in the first place 😆

@rtyler rtyler force-pushed the arrow-version-compat branch 3 times, most recently from 88a8176 to 3e56891 Compare January 25, 2025 19:06
rtyler and others added 6 commits January 25, 2025 19:13
This change introduces arrow_53 and arrow_54 feature flags on kernel
which are _required_ when using default-engine or sync-engine.
Fundamentally we must push users of the crate to select their arrow
major version through flags since Cargo _will_ include multiple major
versions in the dependency tree which can cause ABI breakages when
passing around symbols such as `RecordBatch`

See delta-io#640

Signed-off-by: R. Tyler Croy <[email protected]>
Because the arrow_53 and arrow_54 flags are mutually exclusive, we can
no longer rely on the --all-features flag 😭

Signed-off-by: R. Tyler Croy <[email protected]>
Signed-off-by: R. Tyler Croy <[email protected]>
This now relies on the features explicitly supported by the kernel crate
rather than released arrow versions which may or may not be supported
@rtyler rtyler force-pushed the arrow-version-compat branch from 3e56891 to 64656c2 Compare January 25, 2025 19:13
Copy link

codecov bot commented Jan 25, 2025

Codecov Report

Attention: Patch coverage is 53.84615% with 12 lines in your changes missing coverage. Please review.

Project coverage is 84.42%. Comparing base (6751838) to head (64656c2).

Files with missing lines Patch % Lines
ffi/src/engine_data.rs 0.00% 5 Missing ⚠️
kernel/src/transaction.rs 69.23% 2 Missing and 2 partials ⚠️
acceptance/src/data.rs 0.00% 2 Missing ⚠️
kernel/src/engine/arrow_utils.rs 50.00% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #654      +/-   ##
==========================================
+ Coverage   84.08%   84.42%   +0.34%     
==========================================
  Files          76       74       -2     
  Lines       17526    17113     -413     
  Branches    17526    17113     -413     
==========================================
- Hits        14736    14447     -289     
+ Misses       2077     1960     -117     
+ Partials      713      706       -7     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaking-change Change that will require a version bump
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants