Add docs for KedroDataCatalog #4249

Merged on Oct 25, 2024 (26 commits). Changes shown from 21 commits.
241fb59
Added setting default catalog section
ElenaKhaustova Oct 22, 2024
c38f3c0
Draft page for KedroDataCatalog
ElenaKhaustova Oct 22, 2024
f7016f5
Updated index page
ElenaKhaustova Oct 22, 2024
a3442d5
Updated Kedro Data Catalog page
ElenaKhaustova Oct 22, 2024
33e314a
Added kedro_data_catalog to toctree
ElenaKhaustova Oct 22, 2024
44d0207
Updated docstrings
ElenaKhaustova Oct 22, 2024
50f236d
Updated setter and list docstrings
ElenaKhaustova Oct 22, 2024
ab57346
Merge branch 'main' into docs/4237-kedro-data-catalog
ElenaKhaustova Oct 22, 2024
527bc3f
Improved wordings
ElenaKhaustova Oct 22, 2024
3675539
Removed odd new line
ElenaKhaustova Oct 22, 2024
14913a0
Point Kedro version
ElenaKhaustova Oct 23, 2024
417c7b7
Added a note on how to access datasets after _FrozenDatasets class wa…
ElenaKhaustova Oct 23, 2024
40c539b
Added a link to the old documentation
ElenaKhaustova Oct 23, 2024
d9c0b4b
Added link to the Slack channel
ElenaKhaustova Oct 23, 2024
9d153b1
Fixed typos
ElenaKhaustova Oct 23, 2024
ac83a8c
Merge branch 'main' into docs/4237-kedro-data-catalog
ElenaKhaustova Oct 23, 2024
164d16e
Added top links for how-to items
ElenaKhaustova Oct 23, 2024
981e710
Fixed page reference
ElenaKhaustova Oct 23, 2024
9b8a61a
Fixed page reference
ElenaKhaustova Oct 23, 2024
9870aed
Updated reference to slack
ElenaKhaustova Oct 23, 2024
bf3eb68
Updates slack link
ElenaKhaustova Oct 23, 2024
941caaf
Quoted KedroDataCatalog in the title
ElenaKhaustova Oct 24, 2024
06db04a
Fixed typos
ElenaKhaustova Oct 24, 2024
a5affbf
Added example of print output
ElenaKhaustova Oct 24, 2024
a370536
Applied suggested changes
ElenaKhaustova Oct 24, 2024
af58fae
Merge branch 'main' into docs/4237-kedro-data-catalog
ElenaKhaustova Oct 25, 2024
24 changes: 23 additions & 1 deletion docs/source/data/index.md
@@ -1,5 +1,5 @@

# The Kedro Data Catalog
# Data Catalog

In a Kedro project, the Data Catalog is a registry of all data sources available for use by the project. The catalog is stored in a YAML file (`catalog.yml`) that maps the names of node inputs and outputs as keys in the `DataCatalog` class.

@@ -46,3 +46,25 @@

how_to_create_a_custom_dataset
```

## KedroDataCatalog (Experimental Feature)

Check warning on line 50 in docs/source/data/index.md (GitHub Actions / vale): [Kedro.ukspelling] In general, use UK English spelling instead of 'KedroDataCatalog'.

Check warning on line 50 in docs/source/data/index.md (GitHub Actions / vale): [Kedro.headings] 'KedroDataCatalog (Experimental Feature)' should use sentence-style capitalization.

As of Kedro 0.19.9, you can explore a new experimental feature — the `KedroDataCatalog`, an enhanced alternative to `DataCatalog`.

At present, `KedroDataCatalog` replicates the functionality of `DataCatalog` and is fully compatible with the Kedro `run` command. It introduces several API improvements:
* Simplified dataset access: `_FrozenDatasets` has been replaced with a public `get` method to retrieve datasets.
* Added dict-like interface: You can now use a dictionary-like syntax to retrieve, set, and iterate over datasets.

For more details and examples of how to use `KedroDataCatalog`, see the Kedro Data Catalog page. The [documentation](./data_catalog.md) for `DataCatalog` remains relevant as `KedroDataCatalog` retains its core functionality with some enhancements.

```{toctree}
:maxdepth: 1

kedro_data_catalog
```

```{note}
`KedroDataCatalog` is under active development and may undergo breaking changes in future releases. While we encourage you to try it out, please be aware of potential modifications as we continue to improve it. Additionally, all upcoming catalog-related features will be introduced through `KedroDataCatalog` before it replaces `DataCatalog`.
```

We value your feedback — let us know if you have any thoughts or suggestions regarding `KedroDataCatalog` or potential new features via our [Slack channel](https://kedro-org.slack.com).

Check warning on line 70 in docs/source/data/index.md (GitHub Actions / vale): [Kedro.toowordy] 'regarding' is too wordy

Check warning on line 70 in docs/source/data/index.md (GitHub Actions / vale): [Kedro.words] Use 'with' or 'through' instead of 'via'.
108 changes: 108 additions & 0 deletions docs/source/data/kedro_data_catalog.md
@@ -0,0 +1,108 @@
# Kedro Data Catalog
`KedroDataCatalog` retains the core functionality of `DataCatalog`, with a few API enhancements. For a comprehensive understanding, we recommend reviewing the existing `DataCatalog` documentation before exploring the additional functionality of `KedroDataCatalog`.

Check warning on line 2 in docs/source/data/kedro_data_catalog.md (GitHub Actions / vale): [Kedro.weaselwords] 'few' is a weasel word!

This page highlights the new features and provides usage examples:
* [How to make KedroDataCatalog the default catalog for Kedro run](#how-to-make-kedrodatacatalog-the-default-catalog-for-kedro-run)
* [How to access datasets in the catalog](#how-to-access-datasets-in-the-catalog)
* [How to add datasets to the catalog](#how-to-add-datasets-to-the-catalog)
* [How to iterate through datasets in the catalog](#how-to-iterate-through-datasets-in-the-catalog)
* [How to get the number of datasets in the catalog](#how-to-get-the-number-of-datasets-in-the-catalog)
* [How to print the full catalog and individual datasets](#how-to-print-the-full-catalog-and-individual-datasets)
* [How to access dataset patterns](#how-to-access-dataset-patterns)

Check warning on line 5 in docs/source/data/kedro_data_catalog.md (GitHub Actions / vale): [Kedro.ukspelling] In general, use UK English spelling instead of 'KedroDataCatalog'.

## How to make `KedroDataCatalog` the default catalog for Kedro `run`

To set `KedroDataCatalog` as the default catalog for the `kedro run` command and other CLI commands, update your `settings.py` as follows:

```python
from kedro.io import KedroDataCatalog

DATA_CATALOG_CLASS = KedroDataCatalog
```

Once this change is made, you can run your Kedro project as usual.

For more information on `settings.py`, refer to the [Project settings documentation](../kedro_project_setup/settings.md).

Check warning on line 25 in docs/source/data/kedro_data_catalog.md (GitHub Actions / vale): [Kedro.words] Use 'see', 'read', or 'follow' instead of 'refer to'.

## How to access datasets in the catalog

You can retrieve a dataset from the catalog using either the dictionary-like syntax or the `get` method:

```python
reviews_ds = catalog["reviews"]
reviews_ds = catalog.get("reviews", default=default_ds)
```

## How to add datasets to the catalog

The new API allows you to add datasets as well as raw data directly to the catalog:

```python
from kedro_datasets.pandas import CSVDataset

bikes_ds = CSVDataset(filepath="../data/01_raw/bikes.csv")
catalog["bikes"] = bikes_ds  # Adding a dataset
catalog["cars"] = ["Ferrari", "Audi"]  # Adding raw data
```

Review comment (Contributor):

Unrelated to docs. I find this not very intuitive. Is there symmetry between the getter and the setter? For example:

    catalog["dataset"] = some_dataset  # a real dataset class
    catalog["a_list"] = [1, 2, 3]  # a list

    catalog["dataset"]  # returns a dataset
    catalog["a_list"]  # Does this return a dataset or a list?

Reply (ElenaKhaustova, Contributor Author, Oct 24, 2024):

The setter acts similarly to `add_feed_dict()`, so it allows you to add either datasets or raw data. There's a note below explaining that:

> When you add raw data, it is automatically wrapped in a `MemoryDataset` under the hood.

When you add raw data, it is automatically wrapped in a `MemoryDataset` under the hood.
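
The symmetry question raised in review can be illustrated with a small stand-in model. The sketch below does not use Kedro: `ToyDataset`, `ToyMemoryDataset`, and `ToyCatalog` are hypothetical names that only mimic the wrapping behaviour described above. Under that assumption, item access always returns a dataset object, and the raw data comes back through `load()`.

```python
class ToyDataset:
    """Hypothetical stand-in for a dataset base class (not Kedro's AbstractDataset)."""

    def load(self):
        raise NotImplementedError


class ToyMemoryDataset(ToyDataset):
    """Hypothetical stand-in that keeps raw data in memory."""

    def __init__(self, data):
        self._data = data

    def load(self):
        return self._data


class ToyCatalog:
    """Hypothetical dict-like registry illustrating the wrap-on-set behaviour."""

    def __init__(self):
        self._datasets = {}

    def __setitem__(self, key, value):
        # Datasets are stored as they are; raw data is wrapped first.
        if not isinstance(value, ToyDataset):
            value = ToyMemoryDataset(value)
        self._datasets[key] = value

    def __getitem__(self, key):
        return self._datasets[key]


catalog = ToyCatalog()
catalog["cars"] = ["Ferrari", "Audi"]

cars_ds = catalog["cars"]  # a dataset object, not the list
cars = cars_ds.load()      # the original list
```

With a real `KedroDataCatalog`, the counterpart of the last line would be `catalog.load("cars")`, as used in the `__setitem__` docstring of this PR.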

Check warning on line 48 in docs/source/data/kedro_data_catalog.md (GitHub Actions / vale): [Kedro.words] Use '' instead of 'under the hood'.

## How to iterate through datasets in the catalog

`KedroDataCatalog` supports iteration over dataset names (keys), datasets (values), and both (items). Iteration defaults to dataset names, similar to standard Python dictionaries:

Check warning on line 52 in docs/source/data/kedro_data_catalog.md (GitHub Actions / vale): [Kedro.toowordy] 'similar to' is too wordy

```python
for ds_name in catalog:  # __iter__ defaults to keys
    pass

for ds_name in catalog.keys():  # Iterate over dataset names
    pass

for ds in catalog.values():  # Iterate over datasets
    pass

for ds_name, ds in catalog.items():  # Iterate over (name, dataset) tuples
    pass
```
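
Judging by the signatures in this PR, `keys()`, `values()`, and `items()` return lists rather than dictionary views. The stand-in below (`ToyCatalog` is hypothetical, not Kedro's class, and the string values are placeholders for datasets) sketches how the three methods can be built on `__iter__`, and what that implies for callers: a returned list is a snapshot that does not follow later changes.

```python
from typing import Any, Iterator


class ToyCatalog:
    """Hypothetical registry mirroring the iteration methods shown in this PR."""

    def __init__(self, datasets: dict) -> None:
        self._datasets = dict(datasets)

    def __iter__(self) -> Iterator[str]:
        yield from self._datasets.keys()

    def __len__(self) -> int:
        return len(self._datasets)

    def keys(self) -> list:
        return list(self.__iter__())

    def values(self) -> list:
        return [self._datasets[key] for key in self]

    def items(self) -> list:
        return [(key, self._datasets[key]) for key in self]


catalog = ToyCatalog({"reviews": "reviews_ds", "bikes": "bikes_ds"})

names = catalog.keys()                 # a list snapshot, not a live view
catalog._datasets["cars"] = "cars_ds"  # simulate a dataset registered later

names_after = list(catalog)            # plain iteration yields names
```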

## How to get the number of datasets in the catalog

You can get the number of datasets in the catalog using the `len()` function:

```python
ds_count = len(catalog)
```

## How to print the full catalog and individual datasets

To print the catalog or an individual dataset programmatically, use the `print()` function:

```python
print(catalog)

print(catalog["reviews"])
```

In an interactive environment like IPython or JupyterLab, simply entering the variable will display it:

Check warning on line 86 in docs/source/data/kedro_data_catalog.md (GitHub Actions / vale): [Kedro.weaselwords] 'simply' is a weasel word!

Check warning on line 86 in docs/source/data/kedro_data_catalog.md (GitHub Actions / vale): [Kedro.words] Use '' instead of 'simply'.

```bash
catalog

catalog["reviews"]
```

Review comment (Contributor):

I don't find this very useful, since it is mostly about teaching users that printing is not needed in a terminal. The most common use case is a notebook, but I don't think we need to explain how this works in IPython, notebooks, and so on.

I would rather see how the output of `catalog` and `catalog["reviews"]` differs (with an example), maybe in this format:

    In [1]: %%capture my_print_output
       ...: print('test')
       ...:

    In [2]: my_print_output
    Out[2]: <IPython.utils.capture.CapturedIO at 0x7f2efa2c12d0>

Reply (ElenaKhaustova, Contributor Author, Oct 24, 2024):

I've added an example of the output instead.


## How to access dataset patterns

The pattern resolution logic in `KedroDataCatalog` is handled by the `config_resolver`, which can be accessed as a property of the catalog:

```python
config_resolver = catalog.config_resolver
ds_config = catalog.config_resolver.resolve_pattern(ds_name)  # Resolving a dataset pattern
patterns = catalog.config_resolver.list_patterns()  # Listing all available patterns
```

Review comment on lines +90 to +95 (Contributor):

Not strictly related to docs: do we expect users to interact with the config resolver directly? I expected it to be more of a refactoring than a user-facing component (like `KedroContext`).

Would it be bad if we wrapped it under `KedroDataCatalog`? For example:

    class KedroDataCatalog:
        ...
        def resolve_pattern(self, ds_name):
            return self.config_resolver.resolve_pattern(ds_name)

Reply (Contributor Author):

We do not expect users to interact with it much, though they can. It's mostly needed for catalog CLI commands and maybe some other advanced usage. But we do not want to extend the catalog API with these public methods, as they would then have to be part of the Protocol.
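
To give a feel for what resolving a dataset pattern involves, here is a simplified, self-contained sketch. It is not Kedro's `CatalogConfigResolver`: `toy_match` and `toy_resolve` are hypothetical helpers, the `{name}_csv` pattern and file path are invented for illustration, and the matching rule (each placeholder matches one or more characters) is an assumption.

```python
import re
from typing import Optional


def toy_match(pattern: str, ds_name: str) -> Optional[dict]:
    """Return placeholder values if ds_name fits the pattern, else None."""
    regex = ""
    # re.split with a capturing group alternates literal text and placeholder names.
    for i, part in enumerate(re.split(r"\{(\w+)\}", pattern)):
        regex += f"(?P<{part}>.+)" if i % 2 else re.escape(part)
    found = re.fullmatch(regex, ds_name)
    return found.groupdict() if found else None


def toy_resolve(patterns: dict, ds_name: str) -> Optional[dict]:
    """Fill the first matching pattern's config with the captured values."""
    for pattern, config in patterns.items():
        values = toy_match(pattern, ds_name)
        if values is not None:
            return {
                key: val.format(**values) if isinstance(val, str) else val
                for key, val in config.items()
            }
    return None


patterns = {
    "{name}_csv": {
        "type": "pandas.CSVDataset",
        "filepath": "data/01_raw/{name}.csv",
    }
}

ds_config = toy_resolve(patterns, "reviews_csv")     # placeholders filled in
no_match = toy_resolve(patterns, "reviews_parquet")  # None: nothing matches
```

The real resolver is reached through `catalog.config_resolver`, as shown above; this sketch only illustrates the general idea of turning a name plus a pattern into a concrete dataset configuration.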

```{note}
`KedroDataCatalog` does not support all dictionary-specific methods, such as `pop()`, `popitem()`, or deletion by key (`del`).
```

Review comment (Contributor):

Why is `del` not supported? Is it an implementation issue? What's the correct way to remove a dataset from the catalog?

Reply (Contributor Author):

It's not an issue; it's intentional, as we don't want users to remove datasets manually. Of course, one can still remove a dataset from the private `_datasets` dictionary, but we do not provide an API for that.

For a full list of supported methods, refer to the [KedroDataCatalog source code](https://github.com/kedro-org/kedro/blob/main/kedro/io/kedro_data_catalog.py).

Check warning on line 108 in docs/source/data/kedro_data_catalog.md (GitHub Actions / vale): [Kedro.words] Use 'see', 'read', or 'follow' instead of 'refer to'.

Check warning on line 108 in docs/source/data/kedro_data_catalog.md (GitHub Actions / vale): [Kedro.ukspelling] In general, use UK English spelling instead of 'KedroDataCatalog'.
168 changes: 159 additions & 9 deletions kedro/io/kedro_data_catalog.py
@@ -64,10 +64,12 @@ def __init__(

Example:
::
>>> # settings.py
>>> from kedro.io import KedroDataCatalog
>>> from kedro_datasets.pandas import CSVDataset
>>>
>>> DATA_CATALOG_CLASS = KedroDataCatalog
>>> cars = CSVDataset(filepath="cars.csv",
>>> load_args=None,
>>> save_args={"index": False})
>>> catalog = KedroDataCatalog(datasets={'cars': cars})
"""
self._config_resolver = config_resolver or CatalogConfigResolver()
self._datasets = datasets or {}
@@ -102,34 +104,85 @@ def __repr__(self) -> str:
return repr(self._datasets)

def __contains__(self, dataset_name: str) -> bool:
"""Check if an item is in the catalog as a materialised dataset or pattern"""
"""Check if an item is in the catalog as a materialised dataset or pattern."""
return (
dataset_name in self._datasets
or self._config_resolver.match_pattern(dataset_name) is not None
)

def __eq__(self, other) -> bool: # type: ignore[no-untyped-def]
"""Compares two catalogs based on materialised datasets and dataset patterns."""
return (self._datasets, self._config_resolver.list_patterns()) == (
other._datasets,
other.config_resolver.list_patterns(),
)

def keys(self) -> List[str]: # noqa: UP006
"""List all dataset names registered in the catalog."""
return list(self.__iter__())

def values(self) -> List[AbstractDataset]: # noqa: UP006
"""List all datasets registered in the catalog."""
return [self._datasets[key] for key in self]

def items(self) -> List[tuple[str, AbstractDataset]]: # noqa: UP006
"""List all dataset names and datasets registered in the catalog."""
return [(key, self._datasets[key]) for key in self]

def __iter__(self) -> Iterator[str]:
yield from self._datasets.keys()

def __getitem__(self, ds_name: str) -> AbstractDataset:
"""Get a dataset by name from an internal collection of datasets.

If a dataset is not in the collection but matches any pattern
it is instantiated and added to the collection first, then returned.

Args:
ds_name: A dataset name.

Returns:
An instance of AbstractDataset.

Raises:
DatasetNotFoundError: When a dataset with the given name
is not in the collection and does not match any pattern.
"""
return self.get_dataset(ds_name)

def __setitem__(self, key: str, value: Any) -> None:
"""Add dataset to the ``KedroDataCatalog`` using the given key as the dataset's name
and the provided data as the value.

The value can either be raw data or a Kedro dataset (i.e., an instance of a class
inheriting from ``AbstractDataset``). If raw data is provided, it will be automatically
wrapped in a ``MemoryDataset`` before being added to the catalog.

Args:
key: Name of the dataset.
value: Raw data or an instance of a class inheriting from ``AbstractDataset``.

Example:
::

>>> from kedro_datasets.pandas import CSVDataset
>>> import pandas as pd
>>>
>>> df = pd.DataFrame({"col1": [1, 2],
>>> "col2": [4, 5],
>>> "col3": [5, 6]})
>>>
>>> catalog = KedroDataCatalog()
>>> catalog["data_df"] = df # Add raw data as a MemoryDataset
>>>
>>> assert catalog.load("data_df").equals(df)
>>>
>>> csv_dataset = CSVDataset(filepath="test.csv")
>>> csv_dataset.save(df)
>>> catalog["data_csv_dataset"] = csv_dataset # Add a dataset instance
>>>
>>> assert catalog.load("data_csv_dataset").equals(df)
"""
if key in self._datasets:
self._logger.warning("Replacing dataset '%s'", key)
if isinstance(value, AbstractDataset):
@@ -144,7 +197,19 @@ def __len__(self) -> int:
def get(
self, key: str, default: AbstractDataset | None = None
) -> AbstractDataset | None:
"""Get a dataset by name from an internal collection of datasets."""
"""Get a dataset by name from an internal collection of datasets.

If a dataset is not in the collection but matches any pattern
it is instantiated and added to the collection first, then returned.

Args:
key: A dataset name.
default: Optional default dataset to return in case
the requested dataset is not in the catalog.

Returns:
An instance of AbstractDataset.
"""
if key not in self._datasets:
ds_config = self._config_resolver.resolve_pattern(key)
if ds_config:
@@ -172,6 +237,69 @@ def from_config(
"""Create a ``KedroDataCatalog`` instance from configuration. This is a
factory method used to provide developers with a way to instantiate
``KedroDataCatalog`` with configuration parsed from configuration files.

Args:
catalog: A dictionary whose keys are the dataset names and
the values are dictionaries with the constructor arguments
for classes implementing ``AbstractDataset``. The dataset
class to be loaded is specified with the key ``type`` and its
fully qualified class name. All ``kedro.io`` datasets can be
specified by their class name only, i.e. their module name
can be omitted.
credentials: A dictionary containing credentials for different
datasets. Use the ``credentials`` key in an ``AbstractDataset``
to refer to the appropriate credentials as shown in the example
below.
load_versions: A mapping between dataset names and versions
to load. Has no effect on datasets without enabled versioning.
save_version: Version string to be used for ``save`` operations
by all datasets with enabled versioning. It must: a) be a
case-insensitive string that conforms with operating system
filename limitations, b) always return the latest version when
sorted in lexicographical order.

Returns:
An instantiated ``KedroDataCatalog`` containing all specified
datasets, created and ready to use.

Raises:
DatasetNotFoundError: When `load_versions` refers to a dataset that doesn't
exist in the catalog.

Example:
::

>>> config = {
>>> "cars": {
>>> "type": "pandas.CSVDataset",
>>> "filepath": "cars.csv",
>>> "save_args": {
>>> "index": False
>>> }
>>> },
>>> "boats": {
>>> "type": "pandas.CSVDataset",
>>> "filepath": "s3://aws-bucket-name/boats.csv",
>>> "credentials": "boats_credentials",
>>> "save_args": {
>>> "index": False
>>> }
>>> }
>>> }
>>>
>>> credentials = {
>>> "boats_credentials": {
>>> "client_kwargs": {
>>> "aws_access_key_id": "<your key id>",
>>> "aws_secret_access_key": "<your secret>"
>>> }
>>> }
>>> }
>>>
>>> catalog = KedroDataCatalog.from_config(config, credentials)
>>>
>>> df = catalog.load("cars")
>>> catalog.save("boats", df)
"""
catalog = catalog or {}
config_resolver = CatalogConfigResolver(catalog, credentials)
@@ -284,10 +412,32 @@ def list(
self, regex_search: str | None = None, regex_flags: int | re.RegexFlag = 0
) -> List[str]: # noqa: UP006
# TODO: rename depending on the solution for https://github.com/kedro-org/kedro/issues/3917
"""
List of all dataset names registered in the catalog.
This can be filtered by providing an optional regular expression
which will only return matching keys.
# TODO: make regex_search mandatory argument as we have catalog.keys() for listing all the datasets.
"""List all dataset names registered in the catalog, optionally filtered by a regex pattern.

If a regex pattern is provided, only dataset names matching the pattern will be returned.
This method supports optional regex flags for customization.

Args:
regex_search: Optional regular expression to filter dataset names.
regex_flags: Optional regex flags.
Returns:
A list of dataset names that match the `regex_search` criteria. If no pattern is
provided, all dataset names are returned.

Raises:
SyntaxError: If the provided regex pattern is invalid.

Example:
::

>>> catalog = KedroDataCatalog()
>>> # get datasets where the substring 'raw' is present
>>> raw_data = catalog.list(regex_search='raw')
>>> # get datasets which start with 'prm' or 'feat'
>>> feat_eng_data = catalog.list(regex_search='^(prm|feat)')
>>> # get datasets which end with 'time_series'
>>> models = catalog.list(regex_search='.+time_series$')
"""
if regex_search is None:
return self.keys()
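
The filtering that the `list()` docstring describes can be approximated outside Kedro with the standard `re` module. In the sketch below, `filter_names` is a hypothetical helper, not part of the Kedro API, and the sample names are invented. Using `search()` rather than `match()` is an assumption based on the docstring's substring example.

```python
import re
from typing import Optional


def filter_names(names, regex_search: Optional[str] = None, regex_flags=0):
    """Return the names matching regex_search, or all names when it is None."""
    if regex_search is None:
        return list(names)
    try:
        pattern = re.compile(regex_search, flags=regex_flags)
    except re.error as exc:
        # Follows the documented behaviour: an invalid pattern surfaces as SyntaxError.
        raise SyntaxError(f"Invalid regular expression: '{regex_search}'") from exc
    # search() matches anywhere in the name; anchors such as ^ and $ narrow it.
    return [name for name in names if pattern.search(name)]


names = ["raw_cars", "prm_cars", "feat_speed", "sales_time_series"]

raw_data = filter_names(names, "raw")              # substring 'raw'
feat_eng_data = filter_names(names, "^(prm|feat)")  # starts with 'prm' or 'feat'
models = filter_names(names, ".+time_series$")     # ends with 'time_series'
upper = filter_names(names, "RAW", re.IGNORECASE)  # flags change matching rules
```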