From 241fb59e049c316d3b72b5e4eb853fb786a56221 Mon Sep 17 00:00:00 2001 From: Elena Khaustova Date: Tue, 22 Oct 2024 11:27:20 +0100 Subject: [PATCH 01/23] Added setting default catalog section Signed-off-by: Elena Khaustova --- docs/source/data/index.md | 12 +++++++++++- docs/source/data/kedro_data_catalog.md | 18 ++++++++++++++++++ 2 files changed, 29 insertions(+), 1 deletion(-) create mode 100644 docs/source/data/kedro_data_catalog.md diff --git a/docs/source/data/index.md b/docs/source/data/index.md index 18edfc1ab9..d9dcf265aa 100644 --- a/docs/source/data/index.md +++ b/docs/source/data/index.md @@ -1,5 +1,5 @@ -# The Kedro Data Catalog +# Data Catalog In a Kedro project, the Data Catalog is a registry of all data sources available for use by the project. The catalog is stored in a YAML file (`catalog.yml`) that maps the names of node inputs and outputs as keys in the `DataCatalog` class. @@ -46,3 +46,13 @@ This section on handing data with Kedro concludes with an advanced use case, ill how_to_create_a_custom_dataset ``` + +From Kedro 0.19.0 you can use an experimental feature - `KedroDataCatalog` instead of `DataCatalog`. + +Currently, it repeats `DataCatalog` functionality and fully compatible with Kedro `run` with a few API enhancements: + * Removed `_FrozenDatasets` and access datasets as properties; + * `KedroDataCatalog` supports dict-like interface to get/set datasets and iterate through them. + +A separate page of [Kedro Data Catalog](./kedro_data_catalog.md) shows `KedroDataCatalog` usage examples, and it's new API but all the information provided for the `DataCatalog` is relevant as well. + +It is an experimental feature and is under active development. Though all the new catalog features will be released for `KedroDataCatalog` and soon it will fully replace `DataCatalog`. So we encourage you to try it out, but it is possible we'll introduce breaking changes to this class, so be mindful of that. 
Let us know if you have any feedback about the `KedroDataCatalog` or ideas for new features. diff --git a/docs/source/data/kedro_data_catalog.md b/docs/source/data/kedro_data_catalog.md new file mode 100644 index 0000000000..bf3f379f51 --- /dev/null +++ b/docs/source/data/kedro_data_catalog.md @@ -0,0 +1,18 @@ +# Kedro Data Catalog + +Since `KedroDataCatalog` repeats `DataCatalog` functionality we will only highlight new API and provide usage examples at this page. +The rest examples provided for `DataCatalog` previously are relevant for `KedroDataCatalog`, thus we recommend first make yourself familiar with them and then start using new catalog. + +## How to make `KedroDataCatalog` default catalog for Kedro `run` + +To make `KedroDataCatalog` default catalog for Kedro `run` and other CLI commands modify your `settings.py` as follows: + +```python +from kedro.io import KedroDataCatalog + +DATA_CATALOG_CLASS = KedroDataCatalog +``` + +Then run your Kedro project as usual. + +For more details about `settings.py` see [Project settings page](../kedro_project_setup/settings.md) From c38f3c0496244b53e1744f8bc89c59b83bedaa2d Mon Sep 17 00:00:00 2001 From: Elena Khaustova Date: Tue, 22 Oct 2024 12:41:55 +0100 Subject: [PATCH 02/23] Draft page for KedroDataCatalog Signed-off-by: Elena Khaustova --- docs/source/data/kedro_data_catalog.md | 81 ++++++++++++++++++++++++++ 1 file changed, 81 insertions(+) diff --git a/docs/source/data/kedro_data_catalog.md b/docs/source/data/kedro_data_catalog.md index bf3f379f51..ab283b66ec 100644 --- a/docs/source/data/kedro_data_catalog.md +++ b/docs/source/data/kedro_data_catalog.md @@ -16,3 +16,84 @@ DATA_CATALOG_CLASS = KedroDataCatalog Then run your Kedro project as usual. 
For more details about `settings.py` see [Project settings page](../kedro_project_setup/settings.md) + +## How to access dataset in the catalog + +```python +reviews_ds = catalog["reviews"] +reviews_ds = catalog.get("reviews", default=default_ds) +``` + +## How add dataset to the catalog + +New API allows adding both datasets and raw data as follows: + +```python +from kedro_datasets.pandas import CSVDataset + + +bikes_ds = CSVDataset(filepath="../data/01_raw/bikes.csv") +catalog["bikes"] = bikes_ds # Set dataset +catalog["cars"] = ["Ferrari", "Audi"] # Set raw data +``` + +Under the hood raw data is added as `MemoryDataset` to the catalog. + +## How to iterate trough datasets in the catalog + +`KedroDataCatalog` allows iterating through keys - dataset names, values - datasets and items - tuples `(dataset_name, dataset)`. The default iteration is happening through keys as ir is done for dictionary. + +```python + +for ds_name in catalog: # __iter__ + pass + +for ds_name in catalog.keys(): # keys() + pass + +for ds in catalog.values(): # values() + pass + +for ds_name, ds in catalog.items(): # items() + pass +``` + +## How to get the number of datasets in the catalog + +```python +ds_count = len(catalog) +``` + +## How to print catalog and dataset + +To print catalog or dataset programmatically: + +```python +print(catalog) + +print(catalog["reviews"]) +``` + +To print catalog or dataset in the interactive environment (IPython, JupyterLab and other Jupyter clients): + +```bash +catalog + +catalog["reviews"] +``` + +## How to access dataset patterns + +Patterns resolution logic is now encapsulated to the `config_resolver` which is available as a catalog's property: + +```python +config_resolver = catalog.config_resolver +ds_config = catalog.config_resolver.resolve_pattern(ds_name) # resolve dataset patterns +patterns = catalog.config_resolver.list_patterns() # list al patterns available in the catalog +``` + +```{note} +`KedroDataCatalog` do not support all 
dict-specific functionality like pop(), popitem(), del by key. +``` + +For the full set of supported methods please refer to the [KedroDataCatalog source code](https://github.com/kedro-org/kedro/blob/main/kedro/io/kedro_data_catalog.py) From f7016f55b50ce1efd769031c02dedbac8768229d Mon Sep 17 00:00:00 2001 From: Elena Khaustova Date: Tue, 22 Oct 2024 13:06:15 +0100 Subject: [PATCH 03/23] Updated index page Signed-off-by: Elena Khaustova --- docs/source/data/index.md | 18 ++++++++++++------ 1 file changed, 12 insertions(+), 6 deletions(-) diff --git a/docs/source/data/index.md b/docs/source/data/index.md index d9dcf265aa..511d325841 100644 --- a/docs/source/data/index.md +++ b/docs/source/data/index.md @@ -47,12 +47,18 @@ This section on handing data with Kedro concludes with an advanced use case, ill how_to_create_a_custom_dataset ``` -From Kedro 0.19.0 you can use an experimental feature - `KedroDataCatalog` instead of `DataCatalog`. +## KedroDataCatalog (Experimental Feature) -Currently, it repeats `DataCatalog` functionality and fully compatible with Kedro `run` with a few API enhancements: - * Removed `_FrozenDatasets` and access datasets as properties; - * `KedroDataCatalog` supports dict-like interface to get/set datasets and iterate through them. +As of Kedro 0.19.0, you can explore a new experimental feature — the `KedroDataCatalog`, an enhanced alternative to `DataCatalog`. -A separate page of [Kedro Data Catalog](./kedro_data_catalog.md) shows `KedroDataCatalog` usage examples, and it's new API but all the information provided for the `DataCatalog` is relevant as well. +At present, `KedroDataCatalog` replicates the functionality of `DataCatalog` and is fully compatible with the Kedro `run` command. It introduces several API improvements: +* Simplified Dataset Access: `_FrozenDatasets` has been removed. +* Enhanced Dict-Like Interface: You can now use a dictionary-like syntax to retrieve, set, and iterate over datasets. 
-It is an experimental feature and is under active development. Though all the new catalog features will be released for `KedroDataCatalog` and soon it will fully replace `DataCatalog`. So we encourage you to try it out, but it is possible we'll introduce breaking changes to this class, so be mindful of that. Let us know if you have any feedback about the `KedroDataCatalog` or ideas for new features. +For more details and examples of how to use `KedroDataCatalog`, see the [Kedro Data Catalog page](./kedro_data_catalog.md). The documentation for `DataCatalog` remains relevant as `KedroDataCatalog` retains its core functionality with some enhancements. + +```{note} +`KedroDataCatalog` is under active development and may undergo breaking changes in future releases. While we encourage you to try it out, please be aware of potential modifications as we continue to improve it. Additionally, all upcoming catalog-related features will be introduced through `KedroDataCatalog` before it replaces `DataCatalog`. +``` + +We value your feedback — let us know if you have any thoughts or suggestions regarding `KedroDataCatalog` or potential new features. From a3442d5489207bc016cb0da323e0fbbd3a5e9d61 Mon Sep 17 00:00:00 2001 From: Elena Khaustova Date: Tue, 22 Oct 2024 13:24:49 +0100 Subject: [PATCH 04/23] Updated Kedro Data Catalog page Signed-off-by: Elena Khaustova --- docs/source/data/kedro_data_catalog.md | 46 ++++++++++++++------------ 1 file changed, 24 insertions(+), 22 deletions(-) diff --git a/docs/source/data/kedro_data_catalog.md b/docs/source/data/kedro_data_catalog.md index ab283b66ec..0c59b9f2a5 100644 --- a/docs/source/data/kedro_data_catalog.md +++ b/docs/source/data/kedro_data_catalog.md @@ -1,11 +1,9 @@ # Kedro Data Catalog - -Since `KedroDataCatalog` repeats `DataCatalog` functionality we will only highlight new API and provide usage examples at this page. 
-The rest examples provided for `DataCatalog` previously are relevant for `KedroDataCatalog`, thus we recommend first make yourself familiar with them and then start using new catalog. +`KedroDataCatalog` retains the core functionality of `DataCatalog`, with a few API enhancements. This page highlights the new features and provides usage examples. For a comprehensive understanding, we recommend reviewing the existing `DataCatalog` documentation before exploring the additional functionality of `KedroDataCatalog`. ## How to make `KedroDataCatalog` default catalog for Kedro `run` -To make `KedroDataCatalog` default catalog for Kedro `run` and other CLI commands modify your `settings.py` as follows: +To set `KedroDataCatalog` as the default catalog for the `kedro run` command and other CLI commands, update your `settings.py` as follows: ```python from kedro.io import KedroDataCatalog @@ -13,12 +11,14 @@ from kedro.io import KedroDataCatalog DATA_CATALOG_CLASS = KedroDataCatalog ``` -Then run your Kedro project as usual. +Once this change is made, you can run your Kedro project as usual. -For more details about `settings.py` see [Project settings page](../kedro_project_setup/settings.md) +For more information on `settings.py`, refer to the [Project settings documentation](../kedro_project_setup/settings.md). 
## How to access dataset in the catalog +You can retrieve a dataset from the catalog using either the dictionary-like syntax or the `get` method: + ```python reviews_ds = catalog["reviews"] reviews_ds = catalog.get("reviews", default=default_ds) @@ -26,47 +26,49 @@ reviews_ds = catalog.get("reviews", default=default_ds) ## How add dataset to the catalog -New API allows adding both datasets and raw data as follows: +The new API allows you to add datasets as well as raw data directly to the catalog: ```python from kedro_datasets.pandas import CSVDataset bikes_ds = CSVDataset(filepath="../data/01_raw/bikes.csv") -catalog["bikes"] = bikes_ds # Set dataset -catalog["cars"] = ["Ferrari", "Audi"] # Set raw data +catalog["bikes"] = bikes_ds # Adding a dataset +catalog["cars"] = ["Ferrari", "Audi"] # Adding raw data ``` -Under the hood raw data is added as `MemoryDataset` to the catalog. +When you add raw data, it is automatically wrapped in a `MemoryDataset` under the hood. ## How to iterate trough datasets in the catalog -`KedroDataCatalog` allows iterating through keys - dataset names, values - datasets and items - tuples `(dataset_name, dataset)`. The default iteration is happening through keys as ir is done for dictionary. +`KedroDataCatalog` supports iteration over dataset names (keys), datasets (values), and both (items). 
Iteration defaults to dataset names, similar to standard Python dictionaries: ```python -for ds_name in catalog: # __iter__ +for ds_name in catalog: # __iter__ defaults to keys pass -for ds_name in catalog.keys(): # keys() +for ds_name in catalog.keys(): # Iterate over dataset names pass -for ds in catalog.values(): # values() +for ds in catalog.values(): # Iterate over datasets pass -for ds_name, ds in catalog.items(): # items() +for ds_name, ds in catalog.items(): # Iterate over (name, dataset) tuples pass ``` ## How to get the number of datasets in the catalog +You can get the number of datasets in the catalog using the `len()` function: + ```python ds_count = len(catalog) ``` ## How to print catalog and dataset -To print catalog or dataset programmatically: +To print the catalog or an individual dataset programmatically, use the `print()` function: ```python print(catalog) @@ -74,7 +76,7 @@ print(catalog) print(catalog["reviews"]) ``` -To print catalog or dataset in the interactive environment (IPython, JupyterLab and other Jupyter clients): +In an interactive environment like IPython or JupyterLab, simply entering the variable will display it: ```bash catalog @@ -84,16 +86,16 @@ catalog["reviews"] ## How to access dataset patterns -Patterns resolution logic is now encapsulated to the `config_resolver` which is available as a catalog's property: +The pattern resolution logic in `KedroDataCatalog` is handled by the `config_resolver`, which can be accessed as a property of the catalog: ```python config_resolver = catalog.config_resolver -ds_config = catalog.config_resolver.resolve_pattern(ds_name) # resolve dataset patterns -patterns = catalog.config_resolver.list_patterns() # list al patterns available in the catalog +ds_config = catalog.config_resolver.resolve_pattern(ds_name) # Resolving a dataset pattern +patterns = catalog.config_resolver.list_patterns() # Listing all available patterns ``` ```{note} -`KedroDataCatalog` do not support all dict-specific 
functionality like pop(), popitem(), del by key. +`KedroDataCatalog` does not support all dictionary-specific methods, such as `pop()`, `popitem()`, or deletion by key (`del`). ``` -For the full set of supported methods please refer to the [KedroDataCatalog source code](https://github.com/kedro-org/kedro/blob/main/kedro/io/kedro_data_catalog.py) +For a full list of supported methods, refer to the [KedroDataCatalog source code](https://github.com/kedro-org/kedro/blob/main/kedro/io/kedro_data_catalog.py). From 33e314a93703592efa7c9f4d9050753a1919973c Mon Sep 17 00:00:00 2001 From: Elena Khaustova Date: Tue, 22 Oct 2024 13:37:23 +0100 Subject: [PATCH 05/23] Added kedro_data_catalog to toctree Signed-off-by: Elena Khaustova --- docs/source/data/index.md | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/docs/source/data/index.md b/docs/source/data/index.md index 511d325841..a2234f0f83 100644 --- a/docs/source/data/index.md +++ b/docs/source/data/index.md @@ -55,7 +55,13 @@ At present, `KedroDataCatalog` replicates the functionality of `DataCatalog` and * Simplified Dataset Access: `_FrozenDatasets` has been removed. * Enhanced Dict-Like Interface: You can now use a dictionary-like syntax to retrieve, set, and iterate over datasets. -For more details and examples of how to use `KedroDataCatalog`, see the [Kedro Data Catalog page](./kedro_data_catalog.md). The documentation for `DataCatalog` remains relevant as `KedroDataCatalog` retains its core functionality with some enhancements. +For more details and examples of how to use `KedroDataCatalog`, see the Kedro Data Catalog page. The documentation for `DataCatalog` remains relevant as `KedroDataCatalog` retains its core functionality with some enhancements. + +```{toctree} +:maxdepth: 1 + +kedro_data_catalog +``` ```{note} `KedroDataCatalog` is under active development and may undergo breaking changes in future releases. 
While we encourage you to try it out, please be aware of potential modifications as we continue to improve it. Additionally, all upcoming catalog-related features will be introduced through `KedroDataCatalog` before it replaces `DataCatalog`. From 44d0207eefba0613b5dd74cc4c377975d17be49e Mon Sep 17 00:00:00 2001 From: Elena Khaustova Date: Tue, 22 Oct 2024 15:26:21 +0100 Subject: [PATCH 06/23] Updated docstrings Signed-off-by: Elena Khaustova --- kedro/io/kedro_data_catalog.py | 166 +++++++++++++++++++++++++++++++-- 1 file changed, 157 insertions(+), 9 deletions(-) diff --git a/kedro/io/kedro_data_catalog.py b/kedro/io/kedro_data_catalog.py index c3d216abcd..f36bcc7d16 100644 --- a/kedro/io/kedro_data_catalog.py +++ b/kedro/io/kedro_data_catalog.py @@ -64,10 +64,12 @@ def __init__( Example: :: - >>> # settings.py - >>> from kedro.io import KedroDataCatalog + >>> from kedro_datasets.pandas import CSVDataset >>> - >>> DATA_CATALOG_CLASS = KedroDataCatalog + >>> cars = CSVDataset(filepath="cars.csv", + >>> load_args=None, + >>> save_args={"index": False}) + >>> catalog = KedroDataCatalog(datasets={'cars': cars}) """ self._config_resolver = config_resolver or CatalogConfigResolver() self._datasets = datasets or {} @@ -102,34 +104,83 @@ def __repr__(self) -> str: return repr(self._datasets) def __contains__(self, dataset_name: str) -> bool: - """Check if an item is in the catalog as a materialised dataset or pattern""" + """Check if an item is in the catalog as a materialised dataset or pattern.""" return ( dataset_name in self._datasets or self._config_resolver.match_pattern(dataset_name) is not None ) def __eq__(self, other) -> bool: # type: ignore[no-untyped-def] + """Compares two catalogs based on materialised datasets' and datasets' patterns.""" return (self._datasets, self._config_resolver.list_patterns()) == ( other._datasets, other.config_resolver.list_patterns(), ) def keys(self) -> List[str]: # noqa: UP006 + """List all dataset names registered in the 
catalog.""" return list(self.__iter__()) def values(self) -> List[AbstractDataset]: # noqa: UP006 + """List all datasets registered in the catalog.""" return [self._datasets[key] for key in self] def items(self) -> List[tuple[str, AbstractDataset]]: # noqa: UP006 + """List all dataset names and datasets registered in the catalog.""" return [(key, self._datasets[key]) for key in self] def __iter__(self) -> Iterator[str]: yield from self._datasets.keys() def __getitem__(self, ds_name: str) -> AbstractDataset: + """Get a dataset by name from an internal collection of datasets. + + If a dataset is not in the collection but matches any pattern + it is instantiated and added to the collection first, then returned. + + Args: + ds_name: A dataset name. + + Returns: + An instance of AbstractDataset. + + Raises: + DatasetNotFoundError: When a dataset with the given name + is not in the collection and do not match patterns. + """ return self.get_dataset(ds_name) def __setitem__(self, key: str, value: Any) -> None: + """Add dataset to the ``KedroDataCatalog`` using key as a datsets name and the data provided through the value. + + Values can either be raw data or Kedro datasets - instances of classes that inherit from ``AbstractDataset``. + If raw data is provided, it will be automatically wrapped in a ``MemoryDataset`` before being added to the ``KedroDataCatalog``. + + Args: + key: A dataset name. + value: Raw data or instance of classes that inherit from ``AbstractDataset``. 
+ + Example: + :: + + >>> from kedro_datasets.pandas import CSVDataset + >>> import pandas as pd + >>> + >>> df = pd.DataFrame({"col1": [1, 2], + >>> "col2": [4, 5], + >>> "col3": [5, 6]}) + >>> + >>> catalog = KedroDataCatalog() + >>> catalog["data_df"] = df + >>> + >>> assert catalog.load("data_df").equals(df) + >>> + >>> csv_dataset = CSVDataset(filepath="test.csv") + >>> csv_dataset.save(df) + >>> catalog["data_csv_dataset"] = csv_dataset + >>> + >>> assert catalog.load("data_csv_dataset").equals(df) + """ if key in self._datasets: self._logger.warning("Replacing dataset '%s'", key) if isinstance(value, AbstractDataset): @@ -144,7 +195,19 @@ def __len__(self) -> int: def get( self, key: str, default: AbstractDataset | None = None ) -> AbstractDataset | None: - """Get a dataset by name from an internal collection of datasets.""" + """Get a dataset by name from an internal collection of datasets. + + If a dataset is not in the collection but matches any pattern + it is instantiated and added to the collection first, then returned. + + Args: + key: A dataset name. + default: Optional argument for default dataset to return in case + requested dataset not in the catalog. + + Returns: + An instance of AbstractDataset. + """ if key not in self._datasets: ds_config = self._config_resolver.resolve_pattern(key) if ds_config: @@ -172,6 +235,69 @@ def from_config( """Create a ``KedroDataCatalog`` instance from configuration. This is a factory method used to provide developers with a way to instantiate ``KedroDataCatalog`` with configuration parsed from configuration files. + + Args: + catalog: A dictionary whose keys are the dataset names and + the values are dictionaries with the constructor arguments + for classes implementing ``AbstractDataset``. The dataset + class to be loaded is specified with the key ``type`` and their + fully qualified class name. All ``kedro.io`` dataset can be + specified by their class name only, i.e. their module name + can be omitted. 
+ credentials: A dictionary containing credentials for different + datasets. Use the ``credentials`` key in a ``AbstractDataset`` + to refer to the appropriate credentials as shown in the example + below. + load_versions: A mapping between dataset names and versions + to load. Has no effect on datasets without enabled versioning. + save_version: Version string to be used for ``save`` operations + by all datasets with enabled versioning. It must: a) be a + case-insensitive string that conforms with operating system + filename limitations, b) always return the latest version when + sorted in lexicographical order. + + Returns: + An instantiated ``DataCatalog`` containing all specified + datasets, created and ready to use. + + Raises: + DatasetNotFoundError: When `load_versions` refers to a dataset that doesn't + exist in the catalog. + + Example: + :: + + >>> config = { + >>> "cars": { + >>> "type": "pandas.CSVDataset", + >>> "filepath": "cars.csv", + >>> "save_args": { + >>> "index": False + >>> } + >>> }, + >>> "boats": { + >>> "type": "pandas.CSVDataset", + >>> "filepath": "s3://aws-bucket-name/boats.csv", + >>> "credentials": "boats_credentials", + >>> "save_args": { + >>> "index": False + >>> } + >>> } + >>> } + >>> + >>> credentials = { + >>> "boats_credentials": { + >>> "client_kwargs": { + >>> "aws_access_key_id": "", + >>> "aws_secret_access_key": "" + >>> } + >>> } + >>> } + >>> + >>> catalog = KedroDataCatalog.from_config(config, credentials) + >>> + >>> df = catalog.load("cars") + >>> catalog.save("boats", df) """ catalog = catalog or {} config_resolver = CatalogConfigResolver(catalog, credentials) @@ -284,10 +410,32 @@ def list( self, regex_search: str | None = None, regex_flags: int | re.RegexFlag = 0 ) -> List[str]: # noqa: UP006 # TODO: rename depending on the solution for https://github.com/kedro-org/kedro/issues/3917 - """ - List of all dataset names registered in the catalog. 
- This can be filtered by providing an optional regular expression - which will only return matching keys. + # TODO: make regex_search mandatory argument as we have catalog.keys() for listing all the datasets. + """List of all dataset names registered in the catalog. + + This can be filtered by providing an optional regular expression which will only return matching keys. + + Args: + regex_search: An optional regular expression which can be provided + to limit the datasets returned by a particular pattern. + regex_flags: An optional combination of regex flags. + Returns: + A list of dataset names available which match the `regex_search` criteria (if provided). + All dataset names are returned by default. + + Raises: + SyntaxError: When an invalid regex filter is provided. + + Example: + :: + + >>> catalog = KedroDataCatalog() + >>> # get datasets where the substring 'raw' is present + >>> raw_data = catalog.list(regex_search='raw') + >>> # get datasets which start with 'prm' or 'feat' + >>> feat_eng_data = catalog.list(regex_search='^(prm|feat)') + >>> # get datasets which end with 'time_series' + >>> models = catalog.list(regex_search='.+time_series$') """ if regex_search is None: return self.keys() From 50f236dc5467d38c70e691908bf8f64edb32d076 Mon Sep 17 00:00:00 2001 From: Elena Khaustova Date: Tue, 22 Oct 2024 15:40:11 +0100 Subject: [PATCH 07/23] Updated setter and list docstrings Signed-off-by: Elena Khaustova --- kedro/io/kedro_data_catalog.py | 32 +++++++++++++++++--------------- 1 file changed, 17 insertions(+), 15 deletions(-) diff --git a/kedro/io/kedro_data_catalog.py b/kedro/io/kedro_data_catalog.py index f36bcc7d16..9d2103a40f 100644 --- a/kedro/io/kedro_data_catalog.py +++ b/kedro/io/kedro_data_catalog.py @@ -151,14 +151,16 @@ def __getitem__(self, ds_name: str) -> AbstractDataset: return self.get_dataset(ds_name) def __setitem__(self, key: str, value: Any) -> None: - """Add dataset to the ``KedroDataCatalog`` using key as a datsets name and the data 
provided through the value. + """Add dataset to the ``KedroDataCatalog`` using the given key as a datsets name + and the provided data as the value. - Values can either be raw data or Kedro datasets - instances of classes that inherit from ``AbstractDataset``. - If raw data is provided, it will be automatically wrapped in a ``MemoryDataset`` before being added to the ``KedroDataCatalog``. + The value can either be raw data or a Kedro dataset (i.e., an instance of a class + inheriting from ``AbstractDataset``). If raw data is provided, it will be automatically + wrapped in a ``MemoryDataset`` before being added to the catalog. Args: - key: A dataset name. - value: Raw data or instance of classes that inherit from ``AbstractDataset``. + key: Name of the dataset. + value: Raw data or an instance of a class inheriting from ``AbstractDataset``. Example: :: @@ -171,13 +173,13 @@ def __setitem__(self, key: str, value: Any) -> None: >>> "col3": [5, 6]}) >>> >>> catalog = KedroDataCatalog() - >>> catalog["data_df"] = df + >>> catalog["data_df"] = df # Add raw data as a dataset >>> >>> assert catalog.load("data_df").equals(df) >>> >>> csv_dataset = CSVDataset(filepath="test.csv") >>> csv_dataset.save(df) - >>> catalog["data_csv_dataset"] = csv_dataset + >>> catalog["data_csv_dataset"] = csv_dataset # Add a dataset instance >>> >>> assert catalog.load("data_csv_dataset").equals(df) """ @@ -411,20 +413,20 @@ def list( ) -> List[str]: # noqa: UP006 # TODO: rename depending on the solution for https://github.com/kedro-org/kedro/issues/3917 # TODO: make regex_search mandatory argument as we have catalog.keys() for listing all the datasets. - """List of all dataset names registered in the catalog. + """List all dataset names registered in the catalog, optionally filtered by a regex pattern. - This can be filtered by providing an optional regular expression which will only return matching keys. 
+ If a regex pattern is provided, only dataset names matching the pattern will be returned. + This method supports optional regex flags for customization Args: - regex_search: An optional regular expression which can be provided - to limit the datasets returned by a particular pattern. - regex_flags: An optional combination of regex flags. + regex_search: Optional regular expression to filter dataset names. + regex_flags: Optional regex flags. Returns: - A list of dataset names available which match the `regex_search` criteria (if provided). - All dataset names are returned by default. + A list of dataset names that match the `regex_search` criteria. If no pattern is + provided, all dataset names are returned. Raises: - SyntaxError: When an invalid regex filter is provided. + SyntaxError: If the provided regex pattern is invalid. Example: :: From 527bc3fa6eaf7ffc67d924d5acc399dfed3a9aeb Mon Sep 17 00:00:00 2001 From: Elena Khaustova Date: Tue, 22 Oct 2024 15:55:09 +0100 Subject: [PATCH 08/23] Improved wordings Signed-off-by: Elena Khaustova --- docs/source/data/index.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/source/data/index.md b/docs/source/data/index.md index a2234f0f83..7195fb0cbd 100644 --- a/docs/source/data/index.md +++ b/docs/source/data/index.md @@ -52,8 +52,8 @@ how_to_create_a_custom_dataset As of Kedro 0.19.0, you can explore a new experimental feature — the `KedroDataCatalog`, an enhanced alternative to `DataCatalog`. At present, `KedroDataCatalog` replicates the functionality of `DataCatalog` and is fully compatible with the Kedro `run` command. It introduces several API improvements: -* Simplified Dataset Access: `_FrozenDatasets` has been removed. -* Enhanced Dict-Like Interface: You can now use a dictionary-like syntax to retrieve, set, and iterate over datasets. +* Simplified dataset access: `_FrozenDatasets` has been removed. 
+* Added dict-like interface: You can now use a dictionary-like syntax to retrieve, set, and iterate over datasets.
 
 For more details and examples of how to use `KedroDataCatalog`, see the Kedro Data Catalog page. The documentation for `DataCatalog` remains relevant as `KedroDataCatalog` retains its core functionality with some enhancements.
 

From 3675539f7e1fcc7fd5e689d2f98a29dba9babb89 Mon Sep 17 00:00:00 2001
From: Elena Khaustova
Date: Tue, 22 Oct 2024 16:16:13 +0100
Subject: [PATCH 09/23] Removed odd new line

Signed-off-by: Elena Khaustova
---
 docs/source/data/kedro_data_catalog.md | 1 -
 1 file changed, 1 deletion(-)

diff --git a/docs/source/data/kedro_data_catalog.md b/docs/source/data/kedro_data_catalog.md
index 0c59b9f2a5..e12a2a6bfe 100644
--- a/docs/source/data/kedro_data_catalog.md
+++ b/docs/source/data/kedro_data_catalog.md
@@ -31,7 +31,6 @@ The new API allows you to add datasets as well as raw data directly to the catal
 ```python
 from kedro_datasets.pandas import CSVDataset
 
-
 bikes_ds = CSVDataset(filepath="../data/01_raw/bikes.csv")
 catalog["bikes"] = bikes_ds # Adding a dataset
 catalog["cars"] = ["Ferrari", "Audi"] # Adding raw data

From 14913a0cf1b700fc9c232f1862bc84382959396c Mon Sep 17 00:00:00 2001
From: Elena Khaustova
Date: Wed, 23 Oct 2024 14:00:54 +0100
Subject: [PATCH 10/23] Point Kedro version

Signed-off-by: Elena Khaustova
---
 docs/source/data/index.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/data/index.md b/docs/source/data/index.md
index 7195fb0cbd..c8ab0e49d1 100644
--- a/docs/source/data/index.md
+++ b/docs/source/data/index.md
@@ -49,7 +49,7 @@ how_to_create_a_custom_dataset
 
 ## KedroDataCatalog (Experimental Feature)
 
-As of Kedro 0.19.0, you can explore a new experimental feature — the `KedroDataCatalog`, an enhanced alternative to `DataCatalog`.
+As of Kedro 0.19.9, you can explore a new experimental feature — the `KedroDataCatalog`, an enhanced alternative to `DataCatalog`.
 
 At present, `KedroDataCatalog` replicates the functionality of `DataCatalog` and is fully compatible with the Kedro `run` command. It introduces several API improvements:
 * Simplified dataset access: `_FrozenDatasets` has been removed.

From 417c7b71c41cd64ff808a96e1696aed4706d2c7f Mon Sep 17 00:00:00 2001
From: Elena Khaustova
Date: Wed, 23 Oct 2024 16:10:45 +0100
Subject: [PATCH 11/23] Added a note on how to access datasets after _FrozenDatasets class was removed

Signed-off-by: Elena Khaustova
---
 docs/source/data/index.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/data/index.md b/docs/source/data/index.md
index c8ab0e49d1..f1d9eee897 100644
--- a/docs/source/data/index.md
+++ b/docs/source/data/index.md
@@ -52,7 +52,7 @@ how_to_create_a_custom_dataset
 As of Kedro 0.19.9, you can explore a new experimental feature — the `KedroDataCatalog`, an enhanced alternative to `DataCatalog`.
 
 At present, `KedroDataCatalog` replicates the functionality of `DataCatalog` and is fully compatible with the Kedro `run` command. It introduces several API improvements:
-* Simplified dataset access: `_FrozenDatasets` has been removed.
+* Simplified dataset access: `_FrozenDatasets` has been replaced with public `get` method to retrieve datasets.
 * Added dict-like interface: You can now use a dictionary-like syntax to retrieve, set, and iterate over datasets.
 
 For more details and examples of how to use `KedroDataCatalog`, see the Kedro Data Catalog page. The documentation for `DataCatalog` remains relevant as `KedroDataCatalog` retains its core functionality with some enhancements.
From 40c539b5cceb43e60e81a952fb5a8b9a38dea241 Mon Sep 17 00:00:00 2001 From: Elena Khaustova Date: Wed, 23 Oct 2024 16:22:22 +0100 Subject: [PATCH 12/23] Added a link to the old documentation Signed-off-by: Elena Khaustova --- docs/source/data/index.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/data/index.md b/docs/source/data/index.md index f1d9eee897..bd3fda09ea 100644 --- a/docs/source/data/index.md +++ b/docs/source/data/index.md @@ -55,7 +55,7 @@ At present, `KedroDataCatalog` replicates the functionality of `DataCatalog` and * Simplified dataset access: `_FrozenDatasets` has been replaced with public `get` method to retrieve datasets. * Added dict-like interface: You can now use a dictionary-like syntax to retrieve, set, and iterate over datasets. -For more details and examples of how to use `KedroDataCatalog`, see the Kedro Data Catalog page. The documentation for `DataCatalog` remains relevant as `KedroDataCatalog` retains its core functionality with some enhancements. +For more details and examples of how to use `KedroDataCatalog`, see the Kedro Data Catalog page. The [documentation](../data_catalog.html) for `DataCatalog` remains relevant as `KedroDataCatalog` retains its core functionality with some enhancements. ```{toctree} :maxdepth: 1 From d9c0b4b3edfb685e9fad9a58bc2aa16b6bb615db Mon Sep 17 00:00:00 2001 From: Elena Khaustova Date: Wed, 23 Oct 2024 16:28:51 +0100 Subject: [PATCH 13/23] Added link to the Slack channel Signed-off-by: Elena Khaustova --- docs/source/data/index.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/data/index.md b/docs/source/data/index.md index bd3fda09ea..ccc1fa7302 100644 --- a/docs/source/data/index.md +++ b/docs/source/data/index.md @@ -67,4 +67,4 @@ kedro_data_catalog `KedroDataCatalog` is under active development and may undergo breaking changes in future releases. 
While we encourage you to try it out, please be aware of potential modifications as we continue to improve it. Additionally, all upcoming catalog-related features will be introduced through `KedroDataCatalog` before it replaces `DataCatalog`. ``` -We value your feedback — let us know if you have any thoughts or suggestions regarding `KedroDataCatalog` or potential new features. +We value your feedback — let us know if you have any thoughts or suggestions regarding `KedroDataCatalog` or potential new features via our [Slack channel](kedro-org.slack.com). From 9d153b1c841b3a917d18bb13d6e3883ec5b28af4 Mon Sep 17 00:00:00 2001 From: Elena Khaustova Date: Wed, 23 Oct 2024 16:31:13 +0100 Subject: [PATCH 14/23] Fixed typos Signed-off-by: Elena Khaustova --- docs/source/data/kedro_data_catalog.md | 8 ++++---- kedro/io/kedro_data_catalog.py | 2 +- 2 files changed, 5 insertions(+), 5 deletions(-) diff --git a/docs/source/data/kedro_data_catalog.md b/docs/source/data/kedro_data_catalog.md index e12a2a6bfe..d01fbc34e4 100644 --- a/docs/source/data/kedro_data_catalog.md +++ b/docs/source/data/kedro_data_catalog.md @@ -1,7 +1,7 @@ # Kedro Data Catalog `KedroDataCatalog` retains the core functionality of `DataCatalog`, with a few API enhancements. This page highlights the new features and provides usage examples. For a comprehensive understanding, we recommend reviewing the existing `DataCatalog` documentation before exploring the additional functionality of `KedroDataCatalog`. -## How to make `KedroDataCatalog` default catalog for Kedro `run` +## How to make `KedroDataCatalog` the default catalog for Kedro `run` To set `KedroDataCatalog` as the default catalog for the `kedro run` command and other CLI commands, update your `settings.py` as follows: @@ -15,7 +15,7 @@ Once this change is made, you can run your Kedro project as usual. For more information on `settings.py`, refer to the [Project settings documentation](../kedro_project_setup/settings.md). 
-## How to access dataset in the catalog +## How to access datasets in the catalog You can retrieve a dataset from the catalog using either the dictionary-like syntax or the `get` method: @@ -24,7 +24,7 @@ reviews_ds = catalog["reviews"] reviews_ds = catalog.get("reviews", default=default_ds) ``` -## How add dataset to the catalog +## How add datasets to the catalog The new API allows you to add datasets as well as raw data directly to the catalog: @@ -65,7 +65,7 @@ You can get the number of datasets in the catalog using the `len()` function: ds_count = len(catalog) ``` -## How to print catalog and dataset +## How to print the full catalog and individual datasets To print the catalog or an individual dataset programmatically, use the `print()` function: diff --git a/kedro/io/kedro_data_catalog.py b/kedro/io/kedro_data_catalog.py index 9d2103a40f..4a5c9aeb8e 100644 --- a/kedro/io/kedro_data_catalog.py +++ b/kedro/io/kedro_data_catalog.py @@ -173,7 +173,7 @@ def __setitem__(self, key: str, value: Any) -> None: >>> "col3": [5, 6]}) >>> >>> catalog = KedroDataCatalog() - >>> catalog["data_df"] = df # Add raw data as a dataset + >>> catalog["data_df"] = df # Add raw data as a MemoryDataset >>> >>> assert catalog.load("data_df").equals(df) >>> From 164d16e55133133451743b634b0f6dbe039dbabd Mon Sep 17 00:00:00 2001 From: Elena Khaustova Date: Wed, 23 Oct 2024 16:47:02 +0100 Subject: [PATCH 15/23] Added top links for how-to items Signed-off-by: Elena Khaustova --- docs/source/data/kedro_data_catalog.md | 13 +++++++++++-- 1 file changed, 11 insertions(+), 2 deletions(-) diff --git a/docs/source/data/kedro_data_catalog.md b/docs/source/data/kedro_data_catalog.md index d01fbc34e4..8eead122db 100644 --- a/docs/source/data/kedro_data_catalog.md +++ b/docs/source/data/kedro_data_catalog.md @@ -1,5 +1,14 @@ # Kedro Data Catalog -`KedroDataCatalog` retains the core functionality of `DataCatalog`, with a few API enhancements. 
This page highlights the new features and provides usage examples. For a comprehensive understanding, we recommend reviewing the existing `DataCatalog` documentation before exploring the additional functionality of `KedroDataCatalog`. +`KedroDataCatalog` retains the core functionality of `DataCatalog`, with a few API enhancements. For a comprehensive understanding, we recommend reviewing the existing `DataCatalog` documentation before exploring the additional functionality of `KedroDataCatalog`. + +This page highlights the new features and provides usage examples: +* [How to make KedroDataCatalog the default catalog for Kedro run](#how-to-make-kedrodatacatalog-the-default-catalog-for-kedro-run) +* [How to access datasets in the catalog](#how-to-access-datasets-in-the-catalog) +* [How to add datasets to the catalog](#how-to-add-datasets-to-the-catalog) +* [How to iterate through datasets in the catalog](#how-to-iterate-trough-datasets-in-the-catalog) +* [How to get the number of datasets in the catalog](#how-to-get-the-number-of-datasets-in-the-catalog) +* [How to print the full catalog and individual datasets](#how-to-print-the-full-catalog-and-individual-datasets) +* [How to access dataset patterns](#how-to-access-dataset-patterns) ## How to make `KedroDataCatalog` the default catalog for Kedro `run` @@ -24,7 +33,7 @@ reviews_ds = catalog["reviews"] reviews_ds = catalog.get("reviews", default=default_ds) ``` -## How add datasets to the catalog +## How to add datasets to the catalog The new API allows you to add datasets as well as raw data directly to the catalog: From 981e710453da188fca741bb7a890d56a3e73d5c4 Mon Sep 17 00:00:00 2001 From: Elena Khaustova Date: Wed, 23 Oct 2024 16:47:34 +0100 Subject: [PATCH 16/23] Fixed page reference Signed-off-by: Elena Khaustova --- docs/source/data/index.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/data/index.md b/docs/source/data/index.md index ccc1fa7302..1baaa04146 100644 ---
a/docs/source/data/index.md +++ b/docs/source/data/index.md @@ -55,7 +55,7 @@ At present, `KedroDataCatalog` replicates the functionality of `DataCatalog` and * Simplified dataset access: `_FrozenDatasets` has been replaced with public `get` method to retrieve datasets. * Added dict-like interface: You can now use a dictionary-like syntax to retrieve, set, and iterate over datasets. -For more details and examples of how to use `KedroDataCatalog`, see the Kedro Data Catalog page. The [documentation](../data_catalog.html) for `DataCatalog` remains relevant as `KedroDataCatalog` retains its core functionality with some enhancements. +For more details and examples of how to use `KedroDataCatalog`, see the Kedro Data Catalog page. The [documentation](../data_catalog.md) for `DataCatalog` remains relevant as `KedroDataCatalog` retains its core functionality with some enhancements. ```{toctree} :maxdepth: 1 From 9b8a61a572b701c170d4d69f551ff2adbd5e3e6a Mon Sep 17 00:00:00 2001 From: Elena Khaustova Date: Wed, 23 Oct 2024 16:55:33 +0100 Subject: [PATCH 17/23] Fixed page reference Signed-off-by: Elena Khaustova --- docs/source/data/index.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/data/index.md b/docs/source/data/index.md index 1baaa04146..3e8887d769 100644 --- a/docs/source/data/index.md +++ b/docs/source/data/index.md @@ -55,7 +55,7 @@ At present, `KedroDataCatalog` replicates the functionality of `DataCatalog` and * Simplified dataset access: `_FrozenDatasets` has been replaced with public `get` method to retrieve datasets. * Added dict-like interface: You can now use a dictionary-like syntax to retrieve, set, and iterate over datasets. -For more details and examples of how to use `KedroDataCatalog`, see the Kedro Data Catalog page. The [documentation](../data_catalog.md) for `DataCatalog` remains relevant as `KedroDataCatalog` retains its core functionality with some enhancements. 
+For more details and examples of how to use `KedroDataCatalog`, see the Kedro Data Catalog page. The [documentation](./data_catalog.md) for `DataCatalog` remains relevant as `KedroDataCatalog` retains its core functionality with some enhancements. ```{toctree} :maxdepth: 1 From 9870aeddbf8e877e8eac71826c0771eeca6e286b Mon Sep 17 00:00:00 2001 From: Elena Khaustova Date: Wed, 23 Oct 2024 17:01:32 +0100 Subject: [PATCH 18/23] Updated reference to slack Signed-off-by: Elena Khaustova --- docs/source/data/index.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/data/index.md b/docs/source/data/index.md index 3e8887d769..20db895eee 100644 --- a/docs/source/data/index.md +++ b/docs/source/data/index.md @@ -67,4 +67,4 @@ kedro_data_catalog `KedroDataCatalog` is under active development and may undergo breaking changes in future releases. While we encourage you to try it out, please be aware of potential modifications as we continue to improve it. Additionally, all upcoming catalog-related features will be introduced through `KedroDataCatalog` before it replaces `DataCatalog`. ``` -We value your feedback — let us know if you have any thoughts or suggestions regarding `KedroDataCatalog` or potential new features via our [Slack channel](kedro-org.slack.com). +We value your feedback — let us know if you have any thoughts or suggestions regarding `KedroDataCatalog` or potential new features via our [Slack channel](https://app.slack.com/client/T03QY1HBT52/C03RKP2LW64). 
From bf3eb68689c3f301ce86fbe833dcfa210f2b0532 Mon Sep 17 00:00:00 2001 From: Elena Khaustova Date: Wed, 23 Oct 2024 17:29:20 +0100 Subject: [PATCH 19/23] Updates slack link Signed-off-by: Elena Khaustova --- docs/source/data/index.md | 2 +- docs/source/data/kedro_data_catalog.md | 1 - 2 files changed, 1 insertion(+), 2 deletions(-) diff --git a/docs/source/data/index.md b/docs/source/data/index.md index 20db895eee..a6f08a91b2 100644 --- a/docs/source/data/index.md +++ b/docs/source/data/index.md @@ -67,4 +67,4 @@ kedro_data_catalog `KedroDataCatalog` is under active development and may undergo breaking changes in future releases. While we encourage you to try it out, please be aware of potential modifications as we continue to improve it. Additionally, all upcoming catalog-related features will be introduced through `KedroDataCatalog` before it replaces `DataCatalog`. ``` -We value your feedback — let us know if you have any thoughts or suggestions regarding `KedroDataCatalog` or potential new features via our [Slack channel](https://app.slack.com/client/T03QY1HBT52/C03RKP2LW64). +We value your feedback — let us know if you have any thoughts or suggestions regarding `KedroDataCatalog` or potential new features via our [Slack channel](https://kedro-org.slack.com). diff --git a/docs/source/data/kedro_data_catalog.md b/docs/source/data/kedro_data_catalog.md index 8eead122db..db170f5638 100644 --- a/docs/source/data/kedro_data_catalog.md +++ b/docs/source/data/kedro_data_catalog.md @@ -52,7 +52,6 @@ When you add raw data, it is automatically wrapped in a `MemoryDataset` under th `KedroDataCatalog` supports iteration over dataset names (keys), datasets (values), and both (items). 
Iteration defaults to dataset names, similar to standard Python dictionaries: ```python - for ds_name in catalog: # __iter__ defaults to keys pass From 941caaf7271b9ea4dc72589411c8d03edad757b2 Mon Sep 17 00:00:00 2001 From: Elena Khaustova Date: Thu, 24 Oct 2024 13:05:38 +0100 Subject: [PATCH 20/23] Quoted KedroDataCatalog in the title Signed-off-by: Elena Khaustova --- docs/source/data/index.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/data/index.md b/docs/source/data/index.md index a6f08a91b2..8866d99c08 100644 --- a/docs/source/data/index.md +++ b/docs/source/data/index.md @@ -47,7 +47,7 @@ This section on handing data with Kedro concludes with an advanced use case, ill how_to_create_a_custom_dataset ``` -## KedroDataCatalog (Experimental Feature) +## `KedroDataCatalog` (experimental feature) As of Kedro 0.19.9, you can explore a new experimental feature — the `KedroDataCatalog`, an enhanced alternative to `DataCatalog`. From 06db04a9df7e879de7e83b86849614e7d97e8ff4 Mon Sep 17 00:00:00 2001 From: Elena Khaustova Date: Thu, 24 Oct 2024 13:08:53 +0100 Subject: [PATCH 21/23] Fixed typos Signed-off-by: Elena Khaustova --- kedro/io/kedro_data_catalog.py | 15 ++++++++------- 1 file changed, 8 insertions(+), 7 deletions(-) diff --git a/kedro/io/kedro_data_catalog.py b/kedro/io/kedro_data_catalog.py index 4a5c9aeb8e..27dcf1a765 100644 --- a/kedro/io/kedro_data_catalog.py +++ b/kedro/io/kedro_data_catalog.py @@ -69,7 +69,7 @@ def __init__( >>> cars = CSVDataset(filepath="cars.csv", >>> load_args=None, >>> save_args={"index": False}) - >>> catalog = KedroDataCatalog(datasets={'cars': cars}) + >>> catalog = KedroDataCatalog(datasets={"cars": cars}) """ self._config_resolver = config_resolver or CatalogConfigResolver() self._datasets = datasets or {} @@ -111,7 +111,7 @@ def __contains__(self, dataset_name: str) -> bool: ) def __eq__(self, other) -> bool: # type: ignore[no-untyped-def] - """Compares two catalogs based on materialised 
datasets' and datasets' patterns.""" + """Compares two catalogs based on materialised datasets and datasets patterns.""" return (self._datasets, self._config_resolver.list_patterns()) == ( other._datasets, other.config_resolver.list_patterns(), @@ -146,7 +146,7 @@ def __getitem__(self, ds_name: str) -> AbstractDataset: Raises: DatasetNotFoundError: When a dataset with the given name - is not in the collection and do not match patterns. + is not in the collection and does not match patterns. """ return self.get_dataset(ds_name) @@ -259,7 +259,7 @@ class to be loaded is specified with the key ``type`` and their sorted in lexicographical order. Returns: - An instantiated ``DataCatalog`` containing all specified + An instantiated ``KedroDataCatalog`` containing all specified datasets, created and ready to use. Raises: @@ -475,12 +475,13 @@ def save(self, name: str, data: Any) -> None: >>> import pandas as pd >>> + >>> from kedro.io import KedroDataCatalog >>> from kedro_datasets.pandas import CSVDataset >>> >>> cars = CSVDataset(filepath="cars.csv", >>> load_args=None, >>> save_args={"index": False}) - >>> catalog = DataCatalog(datasets={'cars': cars}) + >>> catalog = KedroDataCatalog(datasets={'cars': cars}) >>> >>> df = pd.DataFrame({'col1': [1, 2], >>> 'col2': [4, 5], @@ -518,13 +519,13 @@ def load(self, name: str, version: str | None = None) -> Any: Example: :: - >>> from kedro.io import DataCatalog + >>> from kedro.io import KedroDataCatalog >>> from kedro_datasets.pandas import CSVDataset >>> >>> cars = CSVDataset(filepath="cars.csv", >>> load_args=None, >>> save_args={"index": False}) - >>> catalog = DataCatalog(datasets={'cars': cars}) + >>> catalog = KedroDataCatalog(datasets={'cars': cars}) >>> >>> df = catalog.load("cars") """ From a5affbf36f9c90e30fd6a2d0276c760dc7020062 Mon Sep 17 00:00:00 2001 From: Elena Khaustova Date: Thu, 24 Oct 2024 14:14:12 +0100 Subject: [PATCH 22/23] Added example of print output Signed-off-by: Elena Khaustova --- 
docs/source/data/kedro_data_catalog.md | 16 +++++----------- 1 file changed, 5 insertions(+), 11 deletions(-) diff --git a/docs/source/data/kedro_data_catalog.md b/docs/source/data/kedro_data_catalog.md index db170f5638..e5ac5f110c 100644 --- a/docs/source/data/kedro_data_catalog.md +++ b/docs/source/data/kedro_data_catalog.md @@ -75,20 +75,14 @@ ds_count = len(catalog) ## How to print the full catalog and individual datasets -To print the catalog or an individual dataset programmatically, use the `print()` function: - -```python -print(catalog) - -print(catalog["reviews"]) -``` - -In an interactive environment like IPython or JupyterLab, simply entering the variable will display it: +To print the catalog or an individual dataset programmatically, use the `print()` function. In an interactive environment like IPython or JupyterLab, simply enter the variable: ```bash -catalog +In [1]: catalog +Out[1]: {'shuttles': kedro_datasets.pandas.excel_dataset.ExcelDataset(filepath=PurePosixPath('/data/01_raw/shuttles.xlsx'), protocol='file', load_args={'engine': 'openpyxl'}, save_args={'index': False}, writer_args={'engine': 'openpyxl'}), 'preprocessed_companies': kedro_datasets.pandas.parquet_dataset.ParquetDataset(filepath=PurePosixPath('/data/02_intermediate/preprocessed_companies.pq'), protocol='file', load_args={}, save_args={}), 'params:model_options.test_size': kedro.io.memory_dataset.MemoryDataset(data=''), 'params:model_options.features': kedro.io.memory_dataset.MemoryDataset(data='')} -catalog["reviews"] +In [2]: catalog["shuttles"] +Out[2]: kedro_datasets.pandas.excel_dataset.ExcelDataset(filepath=PurePosixPath('/data/01_raw/shuttles.xlsx'), protocol='file', load_args={'engine': 'openpyxl'}, save_args={'index': False}, writer_args={'engine': 'openpyxl'}) ``` ## How to access dataset patterns From a370536a8d0e03d83c24c7e7c6f9006ec25cdde5 Mon Sep 17 00:00:00 2001 From: Elena Khaustova Date: Thu, 24 Oct 2024 17:20:29 +0100 Subject: [PATCH 23/23] Applied suggested
changes Signed-off-by: Elena Khaustova --- docs/source/data/index.md | 6 ++++-- docs/source/data/kedro_data_catalog.md | 2 +- 2 files changed, 5 insertions(+), 3 deletions(-) diff --git a/docs/source/data/index.md b/docs/source/data/index.md index 8866d99c08..5efc0e5b6f 100644 --- a/docs/source/data/index.md +++ b/docs/source/data/index.md @@ -52,10 +52,10 @@ how_to_create_a_custom_dataset As of Kedro 0.19.9, you can explore a new experimental feature — the `KedroDataCatalog`, an enhanced alternative to `DataCatalog`. At present, `KedroDataCatalog` replicates the functionality of `DataCatalog` and is fully compatible with the Kedro `run` command. It introduces several API improvements: -* Simplified dataset access: `_FrozenDatasets` has been replaced with public `get` method to retrieve datasets. +* Simplified dataset access: `_FrozenDatasets` has been replaced with a public `get` method to retrieve datasets. * Added dict-like interface: You can now use a dictionary-like syntax to retrieve, set, and iterate over datasets. -For more details and examples of how to use `KedroDataCatalog`, see the Kedro Data Catalog page. The [documentation](./data_catalog.md) for `DataCatalog` remains relevant as `KedroDataCatalog` retains its core functionality with some enhancements. +For more details and examples of how to use `KedroDataCatalog`, see the Kedro Data Catalog page. ```{toctree} :maxdepth: 1 @@ -63,6 +63,8 @@ For more details and examples of how to use `KedroDataCatalog`, see the Kedro Da kedro_data_catalog ``` +The [documentation](./data_catalog.md) for `DataCatalog` remains relevant as `KedroDataCatalog` retains its core functionality with some enhancements. + ```{note} `KedroDataCatalog` is under active development and may undergo breaking changes in future releases. While we encourage you to try it out, please be aware of potential modifications as we continue to improve it. 
Additionally, all upcoming catalog-related features will be introduced through `KedroDataCatalog` before it replaces `DataCatalog`. ``` diff --git a/docs/source/data/kedro_data_catalog.md b/docs/source/data/kedro_data_catalog.md index e5ac5f110c..1c10ffebfd 100644 --- a/docs/source/data/kedro_data_catalog.md +++ b/docs/source/data/kedro_data_catalog.md @@ -1,5 +1,5 @@ # Kedro Data Catalog -`KedroDataCatalog` retains the core functionality of `DataCatalog`, with a few API enhancements. For a comprehensive understanding, we recommend reviewing the existing `DataCatalog` documentation before exploring the additional functionality of `KedroDataCatalog`. +`KedroDataCatalog` retains the core functionality of `DataCatalog`, with a few API enhancements. For a comprehensive understanding, we recommend reviewing the existing `DataCatalog` [documentation](./data_catalog.md) before exploring the additional functionality of `KedroDataCatalog`. This page highlights the new features and provides usage examples: * [How to make KedroDataCatalog the default catalog for Kedro run](#how-to-make-kedrodatacatalog-the-default-catalog-for-kedro-run)
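Reviewer note: the dict-like behaviour these patches document (item access, `get` with a default, iteration over names, `len()`, and raw data being wrapped in an in-memory dataset) can be illustrated without installing Kedro. The sketch below is a toy model only. `ToyCatalog` and `ToyMemoryDataset` are hypothetical names invented for this note and are not Kedro classes, and the real `KedroDataCatalog` may differ in detail (for example, it also resolves dataset patterns).

```python
from typing import Any, Iterator


class ToyMemoryDataset:
    """Hypothetical stand-in for an in-memory dataset (not Kedro's MemoryDataset)."""

    def __init__(self, data: Any) -> None:
        self._data = data

    def load(self) -> Any:
        # Return the wrapped object unchanged.
        return self._data


class ToyCatalog:
    """Hypothetical dict-like catalog, assumed for illustration only."""

    def __init__(self) -> None:
        self._datasets = {}

    def __setitem__(self, name: str, value: Any) -> None:
        # Raw data is wrapped in a dataset; dataset objects are stored as-is.
        if not isinstance(value, ToyMemoryDataset):
            value = ToyMemoryDataset(value)
        self._datasets[name] = value

    def __getitem__(self, name: str) -> ToyMemoryDataset:
        return self._datasets[name]

    def get(self, name: str, default: Any = None) -> Any:
        # Like dict.get: fall back to the default instead of raising.
        return self._datasets.get(name, default)

    def __iter__(self) -> Iterator[str]:
        # Iteration defaults to dataset names, as with a plain dict.
        return iter(self._datasets)

    def __len__(self) -> int:
        return len(self._datasets)


catalog = ToyCatalog()
catalog["cars"] = ["Ferrari", "Audi"]           # raw data, wrapped on assignment
catalog["bikes"] = ToyMemoryDataset(["Trek"])   # dataset object, stored unchanged
names = list(catalog)
```

With the real class, the same calls would presumably be made on a `KedroDataCatalog` instance, as in the documentation examples added by this patch series.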