Add docs for KedroDataCatalog (#4249)
* Added setting default catalog section

Signed-off-by: Elena Khaustova <[email protected]>

* Draft page for KedroDataCatalog

Signed-off-by: Elena Khaustova <[email protected]>

* Updated index page

Signed-off-by: Elena Khaustova <[email protected]>

* Updated Kedro Data Catalog page

Signed-off-by: Elena Khaustova <[email protected]>

* Added kedro_data_catalog to toctree

Signed-off-by: Elena Khaustova <[email protected]>

* Updated docstrings

Signed-off-by: Elena Khaustova <[email protected]>

* Updated setter and list docstrings

Signed-off-by: Elena Khaustova <[email protected]>

* Improved wordings

Signed-off-by: Elena Khaustova <[email protected]>

* Removed odd new line

Signed-off-by: Elena Khaustova <[email protected]>

* Point Kedro version

Signed-off-by: Elena Khaustova <[email protected]>

* Added a note on how to access datasets after _FrozenDatasets class was removed

Signed-off-by: Elena Khaustova <[email protected]>

* Added a link to the old documentation

Signed-off-by: Elena Khaustova <[email protected]>

* Added link to the Slack channel

Signed-off-by: Elena Khaustova <[email protected]>

* Fixed typos

Signed-off-by: Elena Khaustova <[email protected]>

* Added top links for how-to items

Signed-off-by: Elena Khaustova <[email protected]>

* Fixed page reference

Signed-off-by: Elena Khaustova <[email protected]>

* Fixed page reference

Signed-off-by: Elena Khaustova <[email protected]>

* Updated reference to slack

Signed-off-by: Elena Khaustova <[email protected]>

* Updates slack link

Signed-off-by: Elena Khaustova <[email protected]>

* Quoted KedroDataCatalog in the title

Signed-off-by: Elena Khaustova <[email protected]>

* Fixed typos

Signed-off-by: Elena Khaustova <[email protected]>

* Added example of print output

Signed-off-by: Elena Khaustova <[email protected]>

* Applied suggested changes

Signed-off-by: Elena Khaustova <[email protected]>

---------

Signed-off-by: Elena Khaustova <[email protected]>
ElenaKhaustova authored Oct 25, 2024
1 parent cbde71f commit 9ec1796
Showing 3 changed files with 290 additions and 13 deletions.
26 changes: 25 additions & 1 deletion docs/source/data/index.md
@@ -1,5 +1,5 @@

# The Kedro Data Catalog
# Data Catalog

In a Kedro project, the Data Catalog is a registry of all data sources available for use by the project. The catalog is stored in a YAML file (`catalog.yml`) that maps the names of node inputs and outputs as keys in the `DataCatalog` class.

@@ -46,3 +46,27 @@ This section on handling data with Kedro concludes with an advanced use case, ill
how_to_create_a_custom_dataset
```

## `KedroDataCatalog` (experimental feature)

As of Kedro 0.19.9, you can explore a new experimental feature — the `KedroDataCatalog`, an enhanced alternative to `DataCatalog`.

At present, `KedroDataCatalog` replicates the functionality of `DataCatalog` and is fully compatible with the Kedro `run` command. It introduces several API improvements:
* Simplified dataset access: `_FrozenDatasets` has been replaced with a public `get` method to retrieve datasets.
* Added dict-like interface: You can now use a dictionary-like syntax to retrieve, set, and iterate over datasets, as the short sketch below shows.
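
A minimal sketch of the new interface (it assumes `catalog` is a `KedroDataCatalog` instance with a registered `"reviews"` dataset):

```python
reviews_ds = catalog["reviews"]        # Retrieve a dataset with subscription syntax
reviews_ds = catalog.get("reviews")    # Public get() replaces the old _FrozenDatasets access
catalog["cars"] = ["Ferrari", "Audi"]  # Raw data is wrapped in a MemoryDataset
ds_names = list(catalog)               # Iteration yields dataset names
```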

For more details and examples of how to use `KedroDataCatalog`, see the Kedro Data Catalog page.

```{toctree}
:maxdepth: 1

kedro_data_catalog
```

The [documentation](./data_catalog.md) for `DataCatalog` remains relevant as `KedroDataCatalog` retains its core functionality with some enhancements.

```{note}
`KedroDataCatalog` is under active development and may undergo breaking changes in future releases. While we encourage you to try it out, please be aware of potential modifications as we continue to improve it. Additionally, all upcoming catalog-related features will be introduced through `KedroDataCatalog` before it replaces `DataCatalog`.
```

We value your feedback — let us know if you have any thoughts or suggestions regarding `KedroDataCatalog` or potential new features via our [Slack channel](https://kedro-org.slack.com).
102 changes: 102 additions & 0 deletions docs/source/data/kedro_data_catalog.md
@@ -0,0 +1,102 @@
# Kedro Data Catalog
`KedroDataCatalog` retains the core functionality of `DataCatalog`, with a few API enhancements. For a comprehensive understanding, we recommend reviewing the existing `DataCatalog` [documentation](./data_catalog.md) before exploring the additional functionality of `KedroDataCatalog`.

This page highlights the new features and provides usage examples:
* [How to make KedroDataCatalog the default catalog for Kedro run](#how-to-make-kedrodatacatalog-the-default-catalog-for-kedro-run)
* [How to access datasets in the catalog](#how-to-access-datasets-in-the-catalog)
* [How to add datasets to the catalog](#how-to-add-datasets-to-the-catalog)
* [How to iterate through datasets in the catalog](#how-to-iterate-through-datasets-in-the-catalog)
* [How to get the number of datasets in the catalog](#how-to-get-the-number-of-datasets-in-the-catalog)
* [How to print the full catalog and individual datasets](#how-to-print-the-full-catalog-and-individual-datasets)
* [How to access dataset patterns](#how-to-access-dataset-patterns)

## How to make `KedroDataCatalog` the default catalog for Kedro `run`

To set `KedroDataCatalog` as the default catalog for the `kedro run` command and other CLI commands, update your `settings.py` as follows:

```python
from kedro.io import KedroDataCatalog

DATA_CATALOG_CLASS = KedroDataCatalog
```

Once this change is made, you can run your Kedro project as usual.
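
With that setting in place, an ordinary run picks up `KedroDataCatalog`, for example:

```bash
kedro run
```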

For more information on `settings.py`, refer to the [Project settings documentation](../kedro_project_setup/settings.md).

## How to access datasets in the catalog

You can retrieve a dataset from the catalog using either the dictionary-like syntax or the `get` method:

```python
reviews_ds = catalog["reviews"]
reviews_ds = catalog.get("reviews", default=default_ds)
```
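
The two access styles differ when a dataset is neither registered nor matched by a dataset factory pattern: subscription raises an error, while `get` returns the default you supply (or `None`). A minimal sketch, assuming an empty catalog with no patterns; `default_ds` is just an illustrative fallback, and the imports shown assume Kedro 0.19.9:

```python
from kedro.io import DatasetNotFoundError, KedroDataCatalog, MemoryDataset

catalog = KedroDataCatalog()

try:
    reviews_ds = catalog["reviews"]  # Not registered and no matching pattern, so this raises
except DatasetNotFoundError:
    reviews_ds = None

default_ds = MemoryDataset(data=[])  # Illustrative fallback dataset
reviews_ds = catalog.get("reviews", default=default_ds)  # Returns default_ds instead of raising
```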

## How to add datasets to the catalog

The new API allows you to add datasets as well as raw data directly to the catalog:

```python
from kedro_datasets.pandas import CSVDataset

bikes_ds = CSVDataset(filepath="../data/01_raw/bikes.csv")
catalog["bikes"] = bikes_ds # Adding a dataset
catalog["cars"] = ["Ferrari", "Audi"] # Adding raw data
```

When you add raw data, it is automatically wrapped in a `MemoryDataset` under the hood.
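
A quick check of that behaviour, assuming the `catalog["cars"]` assignment above:

```python
from kedro.io import MemoryDataset

assert isinstance(catalog["cars"], MemoryDataset)  # Raw data was wrapped automatically
assert catalog.load("cars") == ["Ferrari", "Audi"]
```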

## How to iterate through datasets in the catalog

`KedroDataCatalog` supports iteration over dataset names (keys), datasets (values), and both (items). Iteration defaults to dataset names, similar to standard Python dictionaries:

```python
for ds_name in catalog:  # __iter__ defaults to keys
    pass

for ds_name in catalog.keys():  # Iterate over dataset names
    pass

for ds in catalog.values():  # Iterate over datasets
    pass

for ds_name, ds in catalog.items():  # Iterate over (name, dataset) tuples
    pass
```

## How to get the number of datasets in the catalog

You can get the number of datasets in the catalog using the `len()` function:

```python
ds_count = len(catalog)
```
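
Membership checks work alongside `len()`, since the catalog reports a name as present if it is registered or matches a dataset factory pattern. A small sketch, assuming the datasets added earlier:

```python
assert "bikes" in catalog            # Registered dataset
assert "missing_ds" not in catalog   # Neither registered nor matching a pattern
ds_count = len(catalog)
```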

## How to print the full catalog and individual datasets

To print the catalog or an individual dataset programmatically, use the `print()` function. In an interactive environment such as IPython or JupyterLab, you can simply enter the variable:

```bash
In [1]: catalog
Out[1]: {'shuttles': kedro_datasets.pandas.excel_dataset.ExcelDataset(filepath=PurePosixPath('/data/01_raw/shuttles.xlsx'), protocol='file', load_args={'engine': 'openpyxl'}, save_args={'index': False}, writer_args={'engine': 'openpyxl'}), 'preprocessed_companies': kedro_datasets.pandas.parquet_dataset.ParquetDataset(filepath=PurePosixPath('/data/02_intermediate/preprocessed_companies.pq'), protocol='file', load_args={}, save_args={}), 'params:model_options.test_size': kedro.io.memory_dataset.MemoryDataset(data='<float>'), 'params:model_options.features': kedro.io.memory_dataset.MemoryDataset(data='<list>')}

In [2]: catalog["shuttles"]
Out[2]: kedro_datasets.pandas.excel_dataset.ExcelDataset(filepath=PurePosixPath('/data/01_raw/shuttles.xlsx'), protocol='file', load_args={'engine': 'openpyxl'}, save_args={'index': False}, writer_args={'engine': 'openpyxl'})
```
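
The same output can be produced in a plain script with `print()`:

```python
print(catalog)              # Print the full catalog
print(catalog["shuttles"])  # Print an individual dataset
```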

## How to access dataset patterns

The pattern resolution logic in `KedroDataCatalog` is handled by the `config_resolver`, which can be accessed as a property of the catalog:

```python
config_resolver = catalog.config_resolver
ds_config = catalog.config_resolver.resolve_pattern(ds_name) # Resolving a dataset pattern
patterns = catalog.config_resolver.list_patterns() # Listing all available patterns
```

```{note}
`KedroDataCatalog` does not support all dictionary-specific methods, such as `pop()`, `popitem()`, or deletion by key (`del`).
```

For a full list of supported methods, refer to the [KedroDataCatalog source code](https://github.com/kedro-org/kedro/blob/main/kedro/io/kedro_data_catalog.py).
175 changes: 163 additions & 12 deletions kedro/io/kedro_data_catalog.py
@@ -64,10 +64,12 @@ def __init__(
Example:
::
>>> # settings.py
>>> from kedro.io import KedroDataCatalog
>>> from kedro_datasets.pandas import CSVDataset
>>>
>>> DATA_CATALOG_CLASS = KedroDataCatalog
>>> cars = CSVDataset(filepath="cars.csv",
>>> load_args=None,
>>> save_args={"index": False})
>>> catalog = KedroDataCatalog(datasets={"cars": cars})
"""
self._config_resolver = config_resolver or CatalogConfigResolver()
self._datasets = datasets or {}
@@ -102,34 +104,85 @@ def __repr__(self) -> str:
return repr(self._datasets)

def __contains__(self, dataset_name: str) -> bool:
"""Check if an item is in the catalog as a materialised dataset or pattern"""
"""Check if an item is in the catalog as a materialised dataset or pattern."""
return (
dataset_name in self._datasets
or self._config_resolver.match_pattern(dataset_name) is not None
)

def __eq__(self, other) -> bool: # type: ignore[no-untyped-def]
"""Compares two catalogs based on materialised datasets and datasets patterns."""
return (self._datasets, self._config_resolver.list_patterns()) == (
other._datasets,
other.config_resolver.list_patterns(),
)

def keys(self) -> List[str]: # noqa: UP006
"""List all dataset names registered in the catalog."""
return list(self.__iter__())

def values(self) -> List[AbstractDataset]: # noqa: UP006
"""List all datasets registered in the catalog."""
return [self._datasets[key] for key in self]

def items(self) -> List[tuple[str, AbstractDataset]]: # noqa: UP006
"""List all dataset names and datasets registered in the catalog."""
return [(key, self._datasets[key]) for key in self]

def __iter__(self) -> Iterator[str]:
yield from self._datasets.keys()

def __getitem__(self, ds_name: str) -> AbstractDataset:
"""Get a dataset by name from an internal collection of datasets.
If a dataset is not in the collection but matches any pattern,
it is instantiated and added to the collection first, then returned.
Args:
ds_name: A dataset name.
Returns:
An instance of AbstractDataset.
Raises:
DatasetNotFoundError: When a dataset with the given name
is not in the collection and does not match patterns.
"""
return self.get_dataset(ds_name)

def __setitem__(self, key: str, value: Any) -> None:
"""Add dataset to the ``KedroDataCatalog`` using the given key as a datsets name
and the provided data as the value.
The value can either be raw data or a Kedro dataset (i.e., an instance of a class
inheriting from ``AbstractDataset``). If raw data is provided, it will be automatically
wrapped in a ``MemoryDataset`` before being added to the catalog.
Args:
key: Name of the dataset.
value: Raw data or an instance of a class inheriting from ``AbstractDataset``.
Example:
::
>>> from kedro_datasets.pandas import CSVDataset
>>> import pandas as pd
>>>
>>> df = pd.DataFrame({"col1": [1, 2],
>>> "col2": [4, 5],
>>> "col3": [5, 6]})
>>>
>>> catalog = KedroDataCatalog()
>>> catalog["data_df"] = df # Add raw data as a MemoryDataset
>>>
>>> assert catalog.load("data_df").equals(df)
>>>
>>> csv_dataset = CSVDataset(filepath="test.csv")
>>> csv_dataset.save(df)
>>> catalog["data_csv_dataset"] = csv_dataset # Add a dataset instance
>>>
>>> assert catalog.load("data_csv_dataset").equals(df)
"""
if key in self._datasets:
self._logger.warning("Replacing dataset '%s'", key)
if isinstance(value, AbstractDataset):
@@ -144,7 +197,19 @@ def __len__(self) -> int:
def get(
self, key: str, default: AbstractDataset | None = None
) -> AbstractDataset | None:
"""Get a dataset by name from an internal collection of datasets."""
"""Get a dataset by name from an internal collection of datasets.
If a dataset is not in the collection but matches any pattern,
it is instantiated and added to the collection first, then returned.
Args:
key: A dataset name.
default: Optional argument for default dataset to return in case
requested dataset not in the catalog.
Returns:
An instance of AbstractDataset.
"""
if key not in self._datasets:
ds_config = self._config_resolver.resolve_pattern(key)
if ds_config:
@@ -172,6 +237,69 @@ def from_config(
"""Create a ``KedroDataCatalog`` instance from configuration. This is a
factory method used to provide developers with a way to instantiate
``KedroDataCatalog`` with configuration parsed from configuration files.
Args:
catalog: A dictionary whose keys are the dataset names and
the values are dictionaries with the constructor arguments
for classes implementing ``AbstractDataset``. The dataset
class to be loaded is specified with the key ``type`` and their
fully qualified class name. All ``kedro.io`` datasets can be
specified by their class name only, i.e. their module name
can be omitted.
credentials: A dictionary containing credentials for different
datasets. Use the ``credentials`` key in an ``AbstractDataset``
to refer to the appropriate credentials as shown in the example
below.
load_versions: A mapping between dataset names and versions
to load. Has no effect on datasets without enabled versioning.
save_version: Version string to be used for ``save`` operations
by all datasets with enabled versioning. It must: a) be a
case-insensitive string that conforms with operating system
filename limitations, b) always return the latest version when
sorted in lexicographical order.
Returns:
An instantiated ``KedroDataCatalog`` containing all specified
datasets, created and ready to use.
Raises:
DatasetNotFoundError: When `load_versions` refers to a dataset that doesn't
exist in the catalog.
Example:
::
>>> config = {
>>> "cars": {
>>> "type": "pandas.CSVDataset",
>>> "filepath": "cars.csv",
>>> "save_args": {
>>> "index": False
>>> }
>>> },
>>> "boats": {
>>> "type": "pandas.CSVDataset",
>>> "filepath": "s3://aws-bucket-name/boats.csv",
>>> "credentials": "boats_credentials",
>>> "save_args": {
>>> "index": False
>>> }
>>> }
>>> }
>>>
>>> credentials = {
>>> "boats_credentials": {
>>> "client_kwargs": {
>>> "aws_access_key_id": "<your key id>",
>>> "aws_secret_access_key": "<your secret>"
>>> }
>>> }
>>> }
>>>
>>> catalog = KedroDataCatalog.from_config(config, credentials)
>>>
>>> df = catalog.load("cars")
>>> catalog.save("boats", df)
"""
catalog = catalog or {}
config_resolver = CatalogConfigResolver(catalog, credentials)
@@ -284,10 +412,32 @@ def list(
self, regex_search: str | None = None, regex_flags: int | re.RegexFlag = 0
) -> List[str]: # noqa: UP006
# TODO: rename depending on the solution for https://github.com/kedro-org/kedro/issues/3917
"""
List of all dataset names registered in the catalog.
This can be filtered by providing an optional regular expression
which will only return matching keys.
# TODO: make regex_search mandatory argument as we have catalog.keys() for listing all the datasets.
"""List all dataset names registered in the catalog, optionally filtered by a regex pattern.
If a regex pattern is provided, only dataset names matching the pattern will be returned.
This method supports optional regex flags for customization.
Args:
regex_search: Optional regular expression to filter dataset names.
regex_flags: Optional regex flags.
Returns:
A list of dataset names that match the `regex_search` criteria. If no pattern is
provided, all dataset names are returned.
Raises:
SyntaxError: If the provided regex pattern is invalid.
Example:
::
>>> catalog = KedroDataCatalog()
>>> # get datasets where the substring 'raw' is present
>>> raw_data = catalog.list(regex_search='raw')
>>> # get datasets which start with 'prm' or 'feat'
>>> feat_eng_data = catalog.list(regex_search='^(prm|feat)')
>>> # get datasets which end with 'time_series'
>>> models = catalog.list(regex_search='.+time_series$')
"""
if regex_search is None:
return self.keys()
@@ -325,12 +475,13 @@ def save(self, name: str, data: Any) -> None:
>>> import pandas as pd
>>>
>>> from kedro.io import KedroDataCatalog
>>> from kedro_datasets.pandas import CSVDataset
>>>
>>> cars = CSVDataset(filepath="cars.csv",
>>> load_args=None,
>>> save_args={"index": False})
>>> catalog = DataCatalog(datasets={'cars': cars})
>>> catalog = KedroDataCatalog(datasets={'cars': cars})
>>>
>>> df = pd.DataFrame({'col1': [1, 2],
>>> 'col2': [4, 5],
@@ -368,13 +519,13 @@ def load(self, name: str, version: str | None = None) -> Any:
Example:
::
>>> from kedro.io import DataCatalog
>>> from kedro.io import KedroDataCatalog
>>> from kedro_datasets.pandas import CSVDataset
>>>
>>> cars = CSVDataset(filepath="cars.csv",
>>> load_args=None,
>>> save_args={"index": False})
>>> catalog = DataCatalog(datasets={'cars': cars})
>>> catalog = KedroDataCatalog(datasets={'cars': cars})
>>>
>>> df = catalog.load("cars")
"""
