Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add docs for KedroDataCatalog #4249

Merged
merged 26 commits into from
Oct 25, 2024
Merged

Conversation

ElenaKhaustova
Copy link
Contributor

@ElenaKhaustova ElenaKhaustova commented Oct 22, 2024

Description

Solves #4237

Development notes

Developer Certificate of Origin

We need all contributions to comply with the Developer Certificate of Origin (DCO). All commits must be signed off by including a Signed-off-by line in the commit message. See our wiki for guidance.

If your PR is blocked due to unsigned commits, then you must follow the instructions under "Rebase the branch" on the GitHub Checks page for your PR. This will retroactively add the sign-off to all unsigned commits and allow the DCO check to pass.

Checklist

  • Read the contributing guidelines
  • Signed off each commit with a Developer Certificate of Origin (DCO)
  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added a description of this change in the RELEASE.md file
  • Added tests to cover my changes
  • Checked if this change will affect Kedro-Viz, and if so, communicated that with the Viz team

@ElenaKhaustova ElenaKhaustova marked this pull request as ready for review October 22, 2024 15:22
docs/source/data/index.md Outdated Show resolved Hide resolved
Copy link
Member

@merelcht merelcht left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking good! I added some small nit comments. I'll do another full review later!

docs/source/data/index.md Outdated Show resolved Hide resolved
docs/source/data/index.md Outdated Show resolved Hide resolved
docs/source/data/index.md Outdated Show resolved Hide resolved
docs/source/data/kedro_data_catalog.md Outdated Show resolved Hide resolved
docs/source/data/kedro_data_catalog.md Outdated Show resolved Hide resolved
docs/source/data/kedro_data_catalog.md Outdated Show resolved Hide resolved
docs/source/data/kedro_data_catalog.md Outdated Show resolved Hide resolved
docs/source/data/kedro_data_catalog.md Outdated Show resolved Hide resolved
kedro/io/kedro_data_catalog.py Show resolved Hide resolved
kedro/io/kedro_data_catalog.py Outdated Show resolved Hide resolved
Signed-off-by: Elena Khaustova <[email protected]>
Copy link
Contributor

@ankatiyar ankatiyar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small suggestions but looks good to me!

docs/source/data/index.md Show resolved Hide resolved
kedro/io/kedro_data_catalog.py Outdated Show resolved Hide resolved
kedro/io/kedro_data_catalog.py Outdated Show resolved Hide resolved
docs/source/data/index.md Show resolved Hide resolved

bikes_ds = CSVDataset(filepath="../data/01_raw/bikes.csv")
catalog["bikes"] = bikes_ds # Adding a dataset
catalog["cars"] = ["Ferrari", "Audi"] # Adding raw data
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unrelated to docs. I find this not very intuitive, is there a symmetry between the get/setter? i.e.

catalog["dataset] = some_dataset # a real dataset class
catalog["a_list"] = [1,2,3] # a list

catalog["dataset] # return a dataset
catalog["a_list"] # Does this  return a dataset or list?

Copy link
Contributor Author

@ElenaKhaustova ElenaKhaustova Oct 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The setter acts similar to the add_feed_dict(), so it allows you to add either datasets or raw data. There's a note below explaining that.

When you add raw data, it is automatically wrapped in a `MemoryDataset` under the hood.

Comment on lines 78 to 92
To print the catalog or an individual dataset programmatically, use the `print()` function:

```python
print(catalog)

print(catalog["reviews"])
```

In an interactive environment like IPython or JupyterLab, simply entering the variable will display it:

```bash
catalog

catalog["reviews"]
```
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find this is not very useful since this is more about educating users printing is not needed in a terminal? The most common use case is Notebook, but I don't think we need to explains how this work in IPython/Notebook etc.

I would like to see what catalog and catalog["reviews"] print differently instead (with an example), maybe something with this format:

In [1]: %%capture my_print_output
    ...: print('test')
    ...:

In [2]: my_print_output
Out[2]: <IPython.utils.capture.CapturedIO at 0x7f2efa2c12d0>

Copy link
Contributor Author

@ElenaKhaustova ElenaKhaustova Oct 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added an example of the output instead

Comment on lines +96 to +101
The pattern resolution logic in `KedroDataCatalog` is handled by the `config_resolver`, which can be accessed as a property of the catalog:

```python
config_resolver = catalog.config_resolver
ds_config = catalog.config_resolver.resolve_pattern(ds_name) # Resolving a dataset pattern
patterns = catalog.config_resolver.list_patterns() # Listing all available patterns
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not strictly related to docs, do we expect Config Resolver as something that user will interact directly? I expect it's more of a refactoring and not a very user facing component (like KedroContext).

Would it be bad if we wrap it under KedroDataCatalog? i.e.

class KedroDataCatalog:
   ...
   def resolve_pattern(self, ds_name):
       return self.config_resolver(ds_name)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do not expect users to interact with it much, though they can. It's mostly needed for catalog CLI commands and maybe some other advanced usage. But we do not want to extend catalog API with these public methods as then it will have to be part of the Protocol.

```

```{note}
`KedroDataCatalog` does not support all dictionary-specific methods, such as `pop()`, `popitem()`, or deletion by key (`del`).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is del not supported, is it an implementation issue? What's the correct way to remove a dataset from catalog?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not an issue, it's done intentionally as we don't want datasets to be removed by users manually. Of course, one can still remove it from the private _datasets dictionary but we do not provide an API for that.

kedro/io/kedro_data_catalog.py Outdated Show resolved Hide resolved
kedro/io/kedro_data_catalog.py Outdated Show resolved Hide resolved
Signed-off-by: Elena Khaustova <[email protected]>
Signed-off-by: Elena Khaustova <[email protected]>
Copy link
Member

@merelcht merelcht left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some minor comments, but otherwise this looks great! 🔥

docs/source/data/index.md Show resolved Hide resolved
docs/source/data/index.md Outdated Show resolved Hide resolved
docs/source/data/kedro_data_catalog.md Outdated Show resolved Hide resolved
@ElenaKhaustova ElenaKhaustova requested a review from noklam October 25, 2024 10:55
@ElenaKhaustova ElenaKhaustova merged commit 9ec1796 into main Oct 25, 2024
34 checks passed
@ElenaKhaustova ElenaKhaustova deleted the docs/4237-kedro-data-catalog branch October 25, 2024 13:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants