-
Notifications
You must be signed in to change notification settings - Fork 915
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add docs for KedroDataCatalog
#4249
Conversation
Signed-off-by: Elena Khaustova <[email protected]>
Signed-off-by: Elena Khaustova <[email protected]>
Signed-off-by: Elena Khaustova <[email protected]>
Signed-off-by: Elena Khaustova <[email protected]>
Signed-off-by: Elena Khaustova <[email protected]>
Signed-off-by: Elena Khaustova <[email protected]>
Signed-off-by: Elena Khaustova <[email protected]>
Signed-off-by: Elena Khaustova <[email protected]>
Signed-off-by: Elena Khaustova <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is looking good! I added some small nit comments. I'll do another full review later!
Signed-off-by: Elena Khaustova <[email protected]>
…s removed Signed-off-by: Elena Khaustova <[email protected]>
Signed-off-by: Elena Khaustova <[email protected]>
Signed-off-by: Elena Khaustova <[email protected]>
Signed-off-by: Elena Khaustova <[email protected]>
Signed-off-by: Elena Khaustova <[email protected]>
Signed-off-by: Elena Khaustova <[email protected]>
Signed-off-by: Elena Khaustova <[email protected]>
Signed-off-by: Elena Khaustova <[email protected]>
Signed-off-by: Elena Khaustova <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Small suggestions but looks good to me!
|
||
bikes_ds = CSVDataset(filepath="../data/01_raw/bikes.csv") | ||
catalog["bikes"] = bikes_ds # Adding a dataset | ||
catalog["cars"] = ["Ferrari", "Audi"] # Adding raw data |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unrelated to docs. I find this not very intuitive, is there a symmetry between the get/setter? i.e.
catalog["dataset] = some_dataset # a real dataset class
catalog["a_list"] = [1,2,3] # a list
catalog["dataset] # return a dataset
catalog["a_list"] # Does this return a dataset or list?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The setter acts similar to the add_feed_dict()
, so it allows you to add either datasets or raw data. There's a note below explaining that.
When you add raw data, it is automatically wrapped in a `MemoryDataset` under the hood.
To print the catalog or an individual dataset programmatically, use the `print()` function: | ||
|
||
```python | ||
print(catalog) | ||
|
||
print(catalog["reviews"]) | ||
``` | ||
|
||
In an interactive environment like IPython or JupyterLab, simply entering the variable will display it: | ||
|
||
```bash | ||
catalog | ||
|
||
catalog["reviews"] | ||
``` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find this is not very useful since this is more about educating users printing is not needed in a terminal? The most common use case is Notebook, but I don't think we need to explains how this work in IPython/Notebook etc.
I would like to see what catalog
and catalog["reviews"]
print differently instead (with an example), maybe something with this format:
In [1]: %%capture my_print_output
...: print('test')
...:
In [2]: my_print_output
Out[2]: <IPython.utils.capture.CapturedIO at 0x7f2efa2c12d0>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added an example of the output instead
The pattern resolution logic in `KedroDataCatalog` is handled by the `config_resolver`, which can be accessed as a property of the catalog: | ||
|
||
```python | ||
config_resolver = catalog.config_resolver | ||
ds_config = catalog.config_resolver.resolve_pattern(ds_name) # Resolving a dataset pattern | ||
patterns = catalog.config_resolver.list_patterns() # Listing all available patterns |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not strictly related to docs, do we expect Config Resolver as something that user will interact directly? I expect it's more of a refactoring and not a very user facing component (like KedroContext
).
Would it be bad if we wrap it under KedroDataCatalog
? i.e.
class KedroDataCatalog:
...
def resolve_pattern(self, ds_name):
return self.config_resolver(ds_name)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We do not expect users to interact with it much, though they can. It's mostly needed for catalog CLI commands and maybe some other advanced usage. But we do not want to extend catalog API with these public methods as then it will have to be part of the Protocol
.
``` | ||
|
||
```{note} | ||
`KedroDataCatalog` does not support all dictionary-specific methods, such as `pop()`, `popitem()`, or deletion by key (`del`). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is del
not supported, is it an implementation issue? What's the correct way to remove a dataset from catalog?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not an issue, it's done intentionally as we don't want datasets to be removed by users manually. Of course, one can still remove it from the private _datasets
dictionary but we do not provide an API for that.
Signed-off-by: Elena Khaustova <[email protected]>
Signed-off-by: Elena Khaustova <[email protected]>
Signed-off-by: Elena Khaustova <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some minor comments, but otherwise this looks great! 🔥
Signed-off-by: Elena Khaustova <[email protected]>
Description
Solves #4237
Development notes
KedroDaraCatalog
docstrings;DataCatalog
index page to include information aboutKedroDaraCatalog
- https://kedro--4249.org.readthedocs.build/en/4249/data/index.htmlKedroDaraCatalog
usage examples - https://kedro--4249.org.readthedocs.build/en/4249/data/kedro_data_catalog.htmlDeveloper Certificate of Origin
We need all contributions to comply with the Developer Certificate of Origin (DCO). All commits must be signed off by including a
Signed-off-by
line in the commit message. See our wiki for guidance.If your PR is blocked due to unsigned commits, then you must follow the instructions under "Rebase the branch" on the GitHub Checks page for your PR. This will retroactively add the sign-off to all unsigned commits and allow the DCO check to pass.
Checklist
RELEASE.md
file