[DataCatalog]: Provide public methods to modify catalog #3930

Closed
ElenaKhaustova opened this issue Jun 5, 2024 · 2 comments
Labels
Issue: Feature Request New feature or improvement to existing feature

Comments

@ElenaKhaustova
Contributor

Description

Plugin developers and advanced users face limitations because there are no public methods for modifying the catalog datasets or for injecting dynamic behaviour or configuration parameters on the fly during pipeline execution. Although these limitations are intentional (the corresponding public APIs are deliberately not provided), users bypass them by using private APIs.

We propose to:

  1. Rethink the concept of keeping DataCatalog immutable.
  2. Explore the feasibility of providing a public API for modifying the catalog datasets and configuration parameters, enabling users to adapt the pipeline's behaviour in response to changing runtime requirements or environmental conditions.

Relates to #2728

Context

  • Users need the ability to view and modify information within the Data Catalog dynamically during pipeline execution. This includes injecting dynamic data or swapping dataset implementations to accommodate varying runtime requirements.

https://github.com/Galileo-Galilei/kedro-mlflow/blob/64b8e94e1dafa02d979e7753dab9b9dfd4d7341c/kedro_mlflow/framework/hooks/mlflow_hook.py#L145

  • Plugin developers are interested in checking the dataset's type and injecting dynamic behaviour based on that type. They want to determine whether a dataset belongs to a certain class or type and then modify its parameters or behaviour accordingly, such as configuring it based on their environment or integration needs.

https://github.com/getindata/kedro-azureml/blob/d5c2011c7ed7fdc03235bf2bd6701f1901d1139c/kedro_azureml/hooks.py#L20
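
The type-based pattern described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not code from either linked plugin: `LocalDataset`, `RemoteDataset`, `configure_remote_datasets`, and the `account_name` option are hypothetical names, and a plain dict stands in for the catalog's collection of datasets.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class LocalDataset:
    """Hypothetical dataset that reads from a local path."""
    filepath: str


@dataclass
class RemoteDataset:
    """Hypothetical dataset that needs environment-specific settings."""
    filepath: str
    storage_options: Dict[str, Any] = field(default_factory=dict)


def configure_remote_datasets(datasets: Dict[str, Any], account: str) -> List[str]:
    """Inject storage settings into every RemoteDataset and return the names touched."""
    configured = []
    for name, dataset in datasets.items():
        # Only datasets of the targeted type are modified; others are left alone.
        if isinstance(dataset, RemoteDataset):
            dataset.storage_options["account_name"] = account
            configured.append(name)
    return configured


# A plain dict stands in for the catalog's datasets (an assumption for this sketch).
datasets = {
    "raw": LocalDataset("data/01_raw/raw.csv"),
    "model_input": RemoteDataset("abfs://container/model_input.parquet"),
}
touched = configure_remote_datasets(datasets, account="dev-account")
```

In a real plugin this logic would typically live in a hook, and today it has to reach the datasets through private attributes, which is the problem this issue raises.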

@ElenaKhaustova ElenaKhaustova added the Issue: Feature Request New feature or improvement to existing feature label Jun 5, 2024
@astrojuanlu
Member

Adding a few more examples:

There's general agreement that we don't necessarily want to make all mutations of the catalog easy (like crazy injection of datasets in the middle of the lifecycle), but maybe there are more ways we can open up the collection of datasets just before the catalog is first instantiated for the rest of the run.

For interactive use, on the other hand, building the DataCatalog in an imperative way seems unnecessary, and there are other possibilities we can offer: #3612 (comment)

@ElenaKhaustova
Contributor Author

In the new catalog, KedroDataCatalog, we implemented a dict-like interface and removed _FrozenDatasets, along with the ability to access datasets as properties.

The new catalog is partially mutable: it supports a setter that allows adding new datasets or replacing existing ones.

We also decided with the team not to make the catalog fully mutable. The datasets property remains private so as not to encourage users to configure the catalog by modifying the datasets dictionary directly. For the same reason, KedroDataCatalog will not support all dictionary-specific methods, such as pop(), popitem(), or deletion by key (del).

It is also possible to modify existing datasets in place, since the get() method returns a reference to the dataset object, but we do not recommend this and encourage users to be careful. Such changes might affect the pipeline run and lead to unexpected results, because the framework itself neither tracks nor synchronizes them.
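
The caveat about in-place modification comes down to ordinary Python reference semantics. The sketch below uses a hypothetical `ToyDataset` class and a plain dict in place of the catalog, so it only illustrates the aliasing behaviour, not any Kedro internals.

```python
class ToyDataset:
    """Hypothetical dataset with a single mutable attribute."""

    def __init__(self, filepath: str) -> None:
        self.filepath = filepath


# A plain dict stands in for the catalog's internal mapping (an assumption).
store = {"cars": ToyDataset("data/cars.csv")}

dataset = store.get("cars")            # returns the stored object, not a copy
dataset.filepath = "data/other.csv"    # mutates the object the store still holds

same_object = dataset is store["cars"]
stored_path = store["cars"].filepath
```

Nothing in the container is notified of the change, so any code that later loads "cars" silently sees the new path.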

To see the full KedroDataCatalog API refer to #4175 and https://docs.kedro.org/en/stable/data/kedro_data_catalog.html.

Projects
Status: Done