Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DataCatalog]: Pretty printing #3913

Closed
ElenaKhaustova opened this issue Jun 3, 2024 · 11 comments
Closed

[DataCatalog]: Pretty printing #3913

ElenaKhaustova opened this issue Jun 3, 2024 · 11 comments
Assignees
Labels
Issue: Feature Request New feature or improvement to existing feature Type: Parent Issue

Comments

@ElenaKhaustova
Copy link
Contributor

ElenaKhaustova commented Jun 3, 2024

Description

Compiling the catalog at runtime hinders users' ability to assess its structure and contents effectively. They express the need for an improved visual representation of the catalog when printing.

We propose:

  1. Explore the feasibility of developing a dedicated function to compile the catalog.
  2. Implement a "pretty printing" - implement catalog.print() / catalog.__repr__ function specifically tailored to improve the visual representation of the catalog when printed or displayed.

Relates to #1721

Context

There is no a particular requirement from user side on how output should look like. They mention that they expect to have something better than the following to help them understand what's in the catalog when debugging:

"I have Kedro jupyter notebook opening. I have a catalog object. Maybe we could have a nicer representation when you do that. You know how many data sets are available, things like that potentially."

"pretty printing (print(catalog) should give something understandable than a long, a minima catalog.list() and maybe details on the dataset)"

Current catalog printing output

Image

Catalog datasets printing output

Image

@ElenaKhaustova ElenaKhaustova added the Issue: Feature Request New feature or improvement to existing feature label Jun 3, 2024
@astrojuanlu
Copy link
Member

Possibly a couple of bottlenecks for meaningful progress here are

  1. We should open up the DataCatalog abstraction, and instead of making it hide the datasets, just present it as a collection of those. Previous discussions point to this design decision, see for example Re-design io.core and io.data_catalog #1778 (comment)
  2. We should be able to pretty print individual datasets, otherwise we won't be able to do much apart from listing dataset names (although it must also be said that a list of datasets is probably better than a memory address)

@astrojuanlu
Copy link
Member

Notice also that, despite being somewhat limited in IPython, on Jupyter we can go crazy. See what Dask and Xarray do:

https://tutorial.dask.org/00_overview.html

image

https://tutorial.xarray.dev/overview/xarray-in-45-min.html

image

@ElenaKhaustova
Copy link
Contributor Author

This can be solved if #3932 is done, but the last is a much more complex task, and it's unclear whether/when it will be done.

@merelcht
Copy link
Member

This feels like a duplicate of a #1721 rather than just being related. Can we merge these tickets in a way and create something more actionable as sub task?

Maybe we can just make repr and str do the same for now? I don't see the need to change str, as useless as it might be...

That sounds like the most obvious task we could start with? cc: @noklam @astrojuanlu

@astrojuanlu
Copy link
Member

I think that comment from #1721 referred to AbstractDataset.__str__. But yes, a better AbstractDataset.__repr__ is needed to show a fancy DataCatalog.__repr__ I believe.

Agreed with having either this or #1721 be the main parent issue, closing the other, and opening sub-issues for specific classes, namely

  1. AbstractDataset
  2. DataCatalog (essentially this issue)
  3. OmegaConfigLoader (or its base class)

@noklam
Copy link
Contributor

noklam commented Jun 11, 2024

TL;DR I agree we should keep one of the two tickets only.

This can be solved if #3932 is done, but the last is a much more complex task, and it's unclear whether/when it will be done.

I am not sure #3932 is the full solution to this. The use case of (de)serialisation of DataCatalog is quite different from the interactively use case, and it's likely we don't want to show one big blob of JSON-like structure in a terminal.

I think I am holding my DataCatalog is a collection of datasets argument, the job of DataCatalog is to delegate or collect the implementation from AbstractDataset (we could have a base method in case it is not implemented). I would assume people don't really want to see the metadata of DataCatalog, it's just that they need to access datasets information through DataCatalog

image

I have the use case of interacting with big Kedro project in mind here. I agree we should keep one of the two tickets only. This ticket is not necessarily contradict with #1721, it could be part of the solution to this use case. i.e. print(catalog) with some basic info, catalog.summary() (like how pandas.DataFrame works), or even catalog.summary(<pipeline>)?

Top level information that I would love to know:

  • Number of datasets
  • Names of this datasets (catalog.list), incomplete if dataset factory pattern exists. There is a separate issue discuss this.

If I need further details of the datasets, i will then access it through a specific API (it shouldn't print all datasets details when I do print(catalog), it's too much information). Potentially, it could show one of these following information

@astrojuanlu
Copy link
Member

I think I am holding my "DataCatalog is a collection of datasets" argument,

Sorry @noklam I didn't quite get if you're in favour or against, could you clarify?

@noklam
Copy link
Contributor

noklam commented Jun 11, 2024

@astrojuanlu TL;DR, I think we should close one of the ticket. I don't want to get into the solution yet since I think the scope is not clear enough yet. (Something need to be done for DataCatalog, some are down to the Dataset level as you mentioned.

Compiling the catalog at runtime hinders users' ability to assess its structure and contents effectively. They express the need for an improved visual representation of the catalog when printing. (from this issue description)

The former should be handled separately, while "printing" in my mind is specific to interactive workflow (a.k.a Notebook 95% the time)

I think there are 2 approaches:

  1. We try to customise the printing (i.e. catalog.print, catalog.repr, catalog.str etc)
  2. We show the "raw" constructor as is (as if DataCatalog is a dataclass)

image

This is actually not too bad, but unfortunately framework and component is competing here. For example catalog._default_pattern is empty, while framework default is a memory dataset. The reason for this is that Framework actually inject the pattern through Runner, but this is very confusing to users that doesn't understand the details. So I lean more to 1.

@deepyaman
Copy link
Member

We can extend the built-in pprint.PrettyPrinter to support AbstractDataset and DataCatalog. __repr__ can also be made to leverage the pretty printer.

@astrojuanlu In notebooks, you can do quite a bit by providing an HTML representation. Scikit-learn also does this.

If this is not super urgent, I would be happy to prototype something, as I have some recent experience working with the pretty printer (and, because doing so is quite non-trivial, if you've never done it before).

@ElenaKhaustova
Copy link
Contributor Author

ElenaKhaustova commented Jul 2, 2024

A summary of what we suggest:

@ElenaKhaustova
Copy link
Contributor Author

Related issue: kedro-org/kedro-plugins#769

@github-project-automation github-project-automation bot moved this from In Progress to Done in Kedro Framework Jul 30, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Issue: Feature Request New feature or improvement to existing feature Type: Parent Issue
Projects
Archived in project
Development

No branches or pull requests

5 participants