Validate datasets versions #4347

Merged
merged 17 commits into from
Nov 28, 2024

ElenaKhaustova
Contributor

@ElenaKhaustova ElenaKhaustova commented Nov 22, 2024

Description

Solves #4327

Merge before #4329

Development notes

Added the _validate_versions function to ensure that all datasets in a catalog adhere to one versioning scheme: each dataset may have a single load version, and all datasets in the catalog share one save version. The function updates the provided load versions with the versions specified on the individual datasets, and checks that all versioned datasets in the catalog share the same save version. If a dataset's save version conflicts with the one already used by the catalog, a VersionAlreadyExistsError is raised.
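
To make the rule above concrete, here is a minimal, self-contained sketch of that kind of check. It is not the code from this PR and does not import Kedro: `Version`, `validate_versions`, and the local `VersionAlreadyExistsError` are simplified stand-ins written for illustration, and the dataset objects are reduced to their version information.

```python
from __future__ import annotations

from typing import NamedTuple, Optional


class Version(NamedTuple):
    """Stand-in for a dataset's version info (assumption, not Kedro's class)."""

    load: Optional[str]
    save: Optional[str]


class VersionAlreadyExistsError(Exception):
    """Local stand-in for the error named in the description."""


def validate_versions(
    dataset_versions: dict[str, Optional[Version]],
    load_versions: dict[str, str],
    save_version: Optional[str],
) -> tuple[dict[str, str], Optional[str]]:
    """Return the updated load versions and the single catalog save version."""
    merged_load_versions = dict(load_versions)  # do not mutate the caller's dict
    current_save_version = save_version

    for name, version in dataset_versions.items():
        if version is None:  # unversioned dataset: nothing to validate
            continue
        if version.load:
            # A load version set on the dataset updates the provided mapping.
            merged_load_versions[name] = version.load
        if version.save:
            # The first save version seen becomes the catalog-wide one.
            current_save_version = current_save_version or version.save
            if current_save_version != version.save:
                raise VersionAlreadyExistsError(
                    f"Dataset '{name}' has save version '{version.save}', "
                    f"but the catalog already uses '{current_save_version}'."
                )
    return merged_load_versions, current_save_version


# Two versioned datasets sharing one save version, plus one unversioned dataset.
load_versions, save_version = validate_versions(
    {
        "cars": Version(load="2024-11-01", save="2024-11-22"),
        "boats": Version(load=None, save="2024-11-22"),
        "raw": None,
    },
    load_versions={},
    save_version=None,
)
```

In this sketch the first save version encountered wins, and any later dataset with a different save version triggers the error. Whether the real function resolves ties in exactly this order is an assumption.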

Validation is applied to both DataCatalog and KedroDataCatalog when a catalog is created or a dataset is added.

Developer Certificate of Origin

We need all contributions to comply with the Developer Certificate of Origin (DCO). All commits must be signed off by including a Signed-off-by line in the commit message. See our wiki for guidance.

If your PR is blocked due to unsigned commits, then you must follow the instructions under "Rebase the branch" on the GitHub Checks page for your PR. This will retroactively add the sign-off to all unsigned commits and allow the DCO check to pass.

Checklist

  • Read the contributing guidelines
  • Signed off each commit with a Developer Certificate of Origin (DCO)
  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added a description of this change in the RELEASE.md file
  • Added tests to cover my changes
  • Checked if this change will affect Kedro-Viz, and if so, communicated that with the Viz team

ElenaKhaustova and others added 14 commits November 21, 2024 17:11
Signed-off-by: Elena Khaustova <[email protected]>
@ElenaKhaustova ElenaKhaustova mentioned this pull request Nov 27, 2024
10 tasks
@ElenaKhaustova ElenaKhaustova marked this pull request as ready for review November 27, 2024 14:30
Member

@merelcht merelcht left a comment

LGTM, and all makes sense 👍

One thing I was wondering is if this could somehow break existing workflows that use the DataCatalog?

kedro/io/core.py: 3 review threads (outdated, resolved)
@ElenaKhaustova
Contributor Author

ElenaKhaustova commented Nov 27, 2024

> LGTM, and all makes sense 👍
>
> One thing I was wondering is if this could somehow break existing workflows that use the DataCatalog?

That's a good question. Technically, we have always required a DataCatalog with versioned datasets to have only one save version, but we never validated it, so the requirement could be bypassed. We therefore consider this change a fix, which is why we applied it to both the old and new catalogs. We also think this is not a common case, and if some workflows do break, it is because versioning was not used as intended.

The alternative is to apply these changes just to KedroDataCatalog.

@merelcht
Member

> That's a good question. Technically, we always required DataCatalog with versioned datasets to have only one save version. But we never validated it, and this requirement can be bypassed. So we consider this change as a fix, and that's why we applied it for both old and new catalogs. We also think this shouldn't be the common case, but even if some workflows break, that's because people used versioning not as expected.
>
> The alternative is to apply these changes just to KedroDataCatalog.

Yes, so technically speaking this is a breaking change for people who have been passing instantiated dataset objects with different save versions to the catalog constructor. That is a very rare thing to do, and I agree this actually improves the expected behaviour. So let's keep it like this; at least if someone does complain, we'll know what's going on.
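
For reference, the breaking case described above can be illustrated with a toy catalog. This is only a sketch under stated assumptions and does not use Kedro: `ToyCatalog`, `ToyVersionedDataset`, and the local `VersionAlreadyExistsError` are hypothetical stand-ins, and the version strings are made up.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


class VersionAlreadyExistsError(Exception):
    """Local stand-in for the error named in this PR."""


@dataclass
class ToyVersionedDataset:
    """Hypothetical dataset that only carries an explicit save version."""

    save_version: Optional[str] = None


class ToyCatalog:
    """Hypothetical catalog that enforces one save version for all datasets."""

    def __init__(
        self, datasets: Optional[dict[str, ToyVersionedDataset]] = None
    ) -> None:
        self.save_version: Optional[str] = None
        self.datasets: dict[str, ToyVersionedDataset] = {}
        for name, dataset in (datasets or {}).items():
            self.add(name, dataset)  # validation runs at construction time

    def add(self, name: str, dataset: ToyVersionedDataset) -> None:
        """Validate the dataset's save version, then register the dataset."""
        if dataset.save_version:
            if self.save_version is None:
                self.save_version = dataset.save_version
            elif self.save_version != dataset.save_version:
                raise VersionAlreadyExistsError(
                    f"Dataset '{name}' uses save version "
                    f"'{dataset.save_version}', but the catalog already uses "
                    f"'{self.save_version}'."
                )
        self.datasets[name] = dataset


# Instantiated datasets with different save versions are rejected up front.
try:
    ToyCatalog(
        {
            "model_input": ToyVersionedDataset("2024-11-27T10.00.00"),
            "model_output": ToyVersionedDataset("2024-11-28T09.30.00"),
        }
    )
except VersionAlreadyExistsError as err:
    print(f"rejected: {err}")
```

Datasets that share a save version, or that set none, still pass in this sketch, so only the mixed-version setup would be affected.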

Member

@idanov idanov left a comment

LGTM 👍

@ElenaKhaustova ElenaKhaustova merged commit 7b24af7 into main Nov 28, 2024
41 checks passed
@ElenaKhaustova ElenaKhaustova deleted the fix/4327-validate-datasets-versions branch November 28, 2024 11:11
3 participants