-
Notifications
You must be signed in to change notification settings - Fork 915
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Easier CustomDataset Creation #1936
Comments
Hi @lancechua thank you for raising this! We have discussed this as a team and have had a good conversation regarding ways we can proceed. You will see that this issue has now been referenced in the #1778 super issue which is tracking all of the ways we plan to improve the Your last point:
Is the most interesting, a lot of the paint-points to mention have emerged organically. Versioning, fsspec etc. have all arrived in Kedro way later than the original The team are going to have a go at prototyping what this could look like, so you may be interested in subscribing to parent issue to contribute to these design decisions. The initial thinking reflected on the rather flat structure we have today where everything inherits from |
I feel like too much copy-pasting has been an issue for a really long time. 😅 I would be keen to break down the points of duplication (e.g. default argument handling, |
Collecting some thoughts from our latest Tech Design session about this topic: Recent developments
General thoughts
Current issues
Other links and ideas
|
A user that quite likely got confused when defining their own dataset, most likely because they aren't copy-pasting enough https://www.linen.dev/s/kedro/t/12646523/hi-folks-i-m-trying-to-create-a-custom-kedro-dataset-inherit#ac22d4cb-dd72-4072-905a-f9c5cb352205 |
I'm also just dropping data from some of our surveys here, the question is "What is the hardest part of working with the Kedro framework?" (Kedro - User Research Study)
|
Interesting use case: a user trying to perform data access with a highly customized This is not too different from my |
Channelling my inner @idanov this does harm reproducibility - personally I think it's a good thing, same for conditional nodes, another things we've held off doing for the same reason. |
This is a bit of a tangent, but I think it's worth addressing. The reality is that people who need "weird" things will do them, whether are supported by Kedro or not. This not only applies to datasets: I've seen people do very weird things with Kedro. Speaking of datasets, the options our users have are:
There are 2 stances we can take:
I'm firmly in camp (2). |
Completely aligned on kedro-org/kedro-plugins#2 I think it's related to the wider rigidity questions we have in the framework vs lib debate. FYI - how prefect do conditionals here looks neat |
Another user who wants to read a TSV file contained in a pandas does not support compressed archives with more than one file, so this requires an intermediate step. |
After implementing some new datasets that will hopefully be open sourced soon, I'm adding some more reflections here. I think that the difficult part lies in three specific points:
(1) was hinted by @lancechua's original comment at the top of this issue, and I agree that we should move towards an architecture that makes (2) and (3) are trickier though. What's interesting is that one can envision datasets that don't refer to specific files on disk and that are not versioned - and these ones are actually quite easy to implement. As such, adding the Another reason why this is hard is the naming itself: some things we name "datasets" in Kedro are not really a /ˈdā-tə-ˌset/ in the strict sense of the word, but more like "artifacts" or "connectors". I think there is an opportunity here to collect some of the "weird" use cases we already have and come up with better terminology. |
Can't remember who linked it in the issues here somewhere, but after reading the Python subclassing redux article, I also think composition is the better way to go by injecting
Also agree. The callables can be fully specified by From what I remember about the implementation of versioning, the filename becomes a directory, with timestamp sub-directories. A separate A use case that would be good, but potentially annoying to implement, is a
The The
The no caching case would just always dispatch to the original P.S. Haven't worked on kedro datasets for a while now, but hopefully the above still makes sense. |
Thanks a lot for getting back @lancechua! Yes, Hynek has influenced me a lot on this as well. Hope we can prioritize this in 2024. |
More evidence of different artifact types: Kubeflow https://www.kubeflow.org/docs/components/pipelines/v2/data-types/artifacts/#artifact-types (via @deepyaman) |
Another problem with the current design: when doing post-mortem debugging on datasets, one gets thrown to Lines 191 to 201 in 4d186ef
|
Now Python 3.7 is EOL is there any advantage to using Protocol types as an alternative/addition to the abstract classes? |
I proposed it in another issue and @antonymilne made some good remarks #2409 (comment) I could be wrong here and in any case I defer the implementation details to the engineering team, but I don't think this can be done without "breaking" the current catalog + abstractdataset model. This would essentially be a "catalog 2.0", which is exciting but also kind of scary. |
Progress towards making
Originally posted by @deepyaman in #3920 (comment) |
Also, interesting perspective on object stores vs filesystems: https://docs.rs/object_store/latest/object_store/#why-not-a-filesystem-interface
Originally shared by @MatthiasRoels in kedro-org/kedro-plugins#625 (comment) |
@datajoely @astrojuanlu @antonymilne I feel a lot less like trailblazer after coming across these comments again, but I've been doing a lot more work/thinking about datasets and the data catalog, and many of my thoughts are along these lines. I agree protocols make a lot of sense for datasets. This would get us 90% of the way there to enabling use of the data catalog without depending on Kedro, and for datasets to be implemented without necessarily needing to subclass anything. The biggest barrier is the wrapping of load and save functionality; I haven't found any way to wrap a method with stuff like error handling and path modification code like we do. However, through my recent work, I question more whether wrapping is even a good design (as I think some of you may have, too). For one, there are a lot of instances in the tests and dataset code where you access the How can we avoid this? Rather than modifying the This approach has numerous other benefits:
Of course, catalogs, too, will need to follow their own protocol, but that's a separate (perhaps simpler?) story. :) In any case, think all of this is definitely worth revisiting, given that Data Catalog 2.0 is actually a thing people are (doing more than) talking about now! |
Almost 1 year after I wrote #1936 (comment), and in line with what @deepyaman is saying, I think I can articulate better what the problem is: there's a double wrapping.
kedro/kedro/io/data_catalog.py Line 546 in 94ae223
and Lines 192 to 193 in 011e3a8
(pasting the pre-0.19.7 logic because the current one is not very easy to follow) I agree with @deepyaman that this is bad design. To me, it's clear that one of those wrappings is unnecessary. Datasets should implement a PEP 544 How to do this transition in an orderly fashion is another story. |
Also I came here to say two more things:
|
That’s exactly what I always do: I use credentials as env vars. however, it would still be nice to be able to inject them in a dataset (think db credentials for sql datasets) in some way. As for the other comments, I really like the ideas and I think it would dramatically simplify the setup. I would even go as far as stripping down catalog/datasets to the bare minimum:
This will certainly be a series of breaking changes but should hopefully simplify catalog setup and perhaps decouple catalog from kedro altogether. Perhaps a separate package with no dependency on kedro, but core kedro depends on the catalog package? |
I've read the (very interesting) article provided by @antonymilne above, and I am not absolutely sure of this sentence :
I guess that on the contrary of the bold sentence (Definitely agree with the rest of the sentence though), the article suggests to create an extra layer of abstraction with a class to split the Disclaimer: please ignore the terrible naming.
# public_dataset_protocol.py
# https://hynek.me/articles/python-subclassing-redux/#case-study
from typing import Any, Protocol, runtime_checkable
# belong to kedro, it is the core interface
# but we want people to be able to declare it without importing kedro
# The public interface is no longer AbstractDataset or Dataset
# runtime_checkable decorator is mandatory to be able to check with "isinstance"
@runtime_checkable
class ConnectorProtocol(Protocol):
def load(self) -> None: ...
def save(self, data: Any) -> None: ...
# internal_dataset_class.py
from logging import getLogger
from typing import Any, Dict
from dataset_protocol.public_dataset_protocol import ConnectorProtocol
# belong to kedro, it is the core interface
# but this is intended to be used internally
# all our internal wrapping with logging and other stuff should be invisible to the end user
class Dataset:
def __init__(self, connector: ConnectorProtocol) -> None:
# we must check on framework side that the argument is indeed a connector
# that's why we needed to declare the protocol "runtime_checkable"
if not isinstance(connector, ConnectorProtocol):
raise ValueError("connector must be a ConnectorProtocol")
self._connector = connector
self._logger = getLogger(__name__)
def load(self) -> None:
# this class wraps the public "load" and add all custom logic
self._logger.warning("Loading data...")
return self._connector.load()
def save(self, data: Any) -> None:
# this class wraps the public "load" and add all custom logic
self._logger.warning("Saving data...")
return self._connector.save(data)
# belong to kedro, it is the core interface
# but this is intended to be used internally
class DataCatalog:
def __init__(self, datasets: Dict[str, Dataset]) -> None:
self.datasets = datasets
def load(self, name: str) -> None:
# like @astrojuanlu I am not convinced that Datacalog needs to redefine load and save
# if one can call catalog.datasets[name].load() directly, but this is another topic
return self.datasets[name].load()
def save(self, name: str, data: Any) -> None:
return self.datasets[name].save(data)
# external_custom_connector.py
import pandas as pd
class GoodCustomConnector:
def __init__(self, filepath: str) -> None:
self._filepath = filepath
def load(self) -> pd.DataFrame:
return pd.read_csv(filepath_or_buffer=self._filepath)
def save(self, data: pd.DataFrame) -> None:
return data.to_csv(path_or_buf=self._filepath)
class BadCustomConnector:
# no save method declared!
def __init__(self, filepath: str) -> None:
self._filepath = filepath
def load(self) -> pd.DataFrame:
return pd.read_csv(filepath_or_buffer=self._filepath)
from dataset_protocol.external_custom_connector import (
BadCustomConnector,
GoodCustomConnector,
)
from dataset_protocol.internal_dataset_class import Dataset
good_ds = Dataset(
GoodCustomConnector(
filepath=r"data\01_raw\companies.csv"
)
)
good_ds.load() # Loading data...
bad_ds = Dataset(
BadCustomConnector(
filepath=r"data\01_raw\companies.csv"
)
) # ValueError: connector must be a ConnectorProtocol -> The error message is quite bad, we likely want something more insightful The experience from the motorbikes:
type: pandas.CSVDataset # except that it is a ConnectorProtocol, not a Dataset. The associated dataset will be created internally at parsing time
filepath: s3://your_bucket/data/02_intermediate/company/motorbikes.csv
credentials: dev_s3 This would enable much simpler composability , because I could create a connector that wraps connector (like MlflowAbstractDataset, or like @deepyaman load and save wrapping), and still have the associated Dataset properly working. |
I guess my immediate question is what does |
class MlflowArtifactConnector:
def __init__(self, connector: Connector) -> None:
self.connector = connector
def save(self, data: pd.DataFrame) -> None:
self.connector.save(data)
mlflow.log_artifact(self.connector._filepath) Which is very hard to do currently because we need to fake inheritance. |
@Galileo-Galilei: this stil doesn’t answer @astrojuanlu’s question. With a complete redesign of datasets and catalog, why would not need a third series of classes? In your examples, it looks like it is sufficient to have the |
To better describe my proposal:
class DataCatalog2:
def load(self, name: str, version: str | None = None) -> Any:
"""Loads a registered data set."""
...
try:
result = dataset.load()
except Exception as exc:
message = (
f"Failed while loading data from data set {str(self)}.\n{str(exc)}"
)
raise DatasetError(message) from exc
class DatasetProtocol(Protocol):
# metadata: dict[str, t.Any] # No prescribed properties, maybe only metadata?
def load(self) -> t.Any: ...
def save(self, data: t.Any) -> None: ...
...
@dataclass
class MyPolarsDataset: # Structural subtyping means no need for inheritance
filepath: str
metadata: dict[str, t.Any]
def load(self) -> pl.DataFrame:
...
return df Am I missing something? |
xref #3995 (comment) |
Description
IO read write functions typically follow this signature:
Creating custom datasets should ideally be as easy as supplying
load
/save
function(s) that follow this function signature.Context
kedro
supports an extensive range of datasets, but it is not exhaustive. Popular libraries used by relatively niche communities likexarray
andarviz
aren't currently supported.Beyond these examples, unofficially adding support for more obscure datasets would be easier.
Initially, I was looking to implement something like this and asked in the Discord chat if this pattern made sense.
Then, @datajoely suggested I open a feature request.
Possible Implementation
We can consider a
Dataset
class factory. MaybeGenericDataset
with a class method.create_custom_dataset(name: str, load: callable, save: callable)
.Usage would look something like
xarrayNetCDF = GenericDataset("xarrayNetCDF", xarray.open_dataset, lambda x, path, **kwargs: x.to_netcdf(path, **kwargs))
.Entries can be added to the data catalog yaml just as with any other custom dataset implementation.
Possible Alternatives
LambdaDataset
is very similar but theload
, andsave
are hard coded in the implementation, and cannot be parameterized in the data catalog, as far as I'm awareAbstractDataset
is an option, but this feature request seeks to reduce boilerplate when defining new datasetsThe text was updated successfully, but these errors were encountered: