[DataCatalog]: Add functionality to search datasets in the catalog #3917

Open
ElenaKhaustova opened this issue Jun 3, 2024 · 11 comments
Labels
Component: IO (Issue/PR addresses data loading/saving/versioning and validation, the DataCatalog and DataSets)
Issue: Feature Request (New feature or improvement to existing feature)

Comments

@ElenaKhaustova
Contributor

ElenaKhaustova commented Jun 3, 2024

Description

Users struggle to find datasets within the catalog, particularly when dealing with a large number of datasets. They express the need for search features to facilitate dataset discovery.

Context

"As a user in my list object, I can filter by name but I can't filter by what. So it would be good to be able to say give me all the sql datasets and then the names of the tables that are attached."

Comment from @astrojuanlu: Kedro Viz has an item on its roadmap to include a table view of all the metadata, which could help with this.

Possible Implementation

Integrate search functionality into the catalog, enabling users to search for datasets based on keywords, patterns and by kind. Include support for regex search to accommodate users with advanced search requirements.
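
As a rough illustration of the name-based part of this proposal, here is a minimal sketch that treats the catalog as a plain collection of dataset names. `search_names` and the example names are hypothetical and not part of Kedro's API; the sketch only shows how keyword and regex search could share one entry point.

```python
from __future__ import annotations

import re
from typing import Iterable


def search_names(
    names: Iterable[str],
    keyword: str | None = None,
    regex: str | None = None,
) -> list[str]:
    """Return names that contain `keyword` and match `regex` (both optional)."""
    pattern = re.compile(regex) if regex else None
    result = []
    for name in names:
        # Keyword search is a case-insensitive substring check.
        if keyword is not None and keyword.lower() not in name.lower():
            continue
        # Regex search uses re.search, so the pattern may match anywhere.
        if pattern is not None and not pattern.search(name):
            continue
        result.append(name)
    return result


# Hypothetical dataset names, for illustration only.
names = ["companies", "processed_companies", "reviews", "model_input_table"]
by_keyword = search_names(names, keyword="compan")
by_regex = search_names(names, regex=r"^processed_")
```

Combining both arguments narrows the result, since a name has to pass every filter that is supplied.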

@ElenaKhaustova ElenaKhaustova added the Issue: Feature Request New feature or improvement to existing feature label Jun 3, 2024
@datajoely
Contributor

datajoely commented Jun 3, 2024

Also search by kind - if I wanted to find all Parquet files today I'd have to get very creative. Retrieving paths associated with those would be super complicated.
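
To make the "search by kind, then get the paths" idea concrete, here is a small sketch. The dataset classes, the `filepath` attribute, and `paths_by_kind` are hypothetical stand-ins so the example runs on its own; real Kedro datasets may expose their paths differently.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class ParquetDataset:  # hypothetical stand-in, not the kedro-datasets class
    filepath: str


@dataclass
class CSVDataset:  # hypothetical stand-in
    filepath: str


def paths_by_kind(datasets: dict[str, object], kind: str) -> dict[str, str]:
    """Map dataset name -> file path for datasets whose class name contains `kind`."""
    found = {}
    for name, dataset in datasets.items():
        if kind.lower() in type(dataset).__name__.lower():
            # Not every dataset has a file path (e.g. SQL tables), so fall back to "".
            found[name] = getattr(dataset, "filepath", "")
    return found


# Hypothetical catalog contents, for illustration only.
datasets = {
    "companies": CSVDataset("data/01_raw/companies.csv"),
    "model_input_table": ParquetDataset("data/03_primary/model_input_table.parquet"),
}
parquet_paths = paths_by_kind(datasets, "parquet")
```

Matching on the class name is only one possible definition of "kind"; matching on the configured `type` string would be another.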

@astrojuanlu
Member

@stephkaiser do you remember if we already opened an issue or discussion about the "metadata table view"?

(cc @rashidakanchwala for when you're back)

@astrojuanlu astrojuanlu added the Component: IO Issue/PR addresses data loading/saving/versioning and validation, the DataCatalog and DataSets label Jun 3, 2024
@yury-fedotov
Contributor

yury-fedotov commented Jun 3, 2024

... enabling users to search for datasets based on keywords or patterns...

Isn't that what catalog.list() already does?

IIRC, if you do e.g. catalog.load("compani"), it raises an error along the lines of: did you mean one of ["companies", "processed_companies"]?
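
For reference, this kind of "did you mean" hint can be produced with fuzzy matching from the standard library. The sketch below uses `difflib.get_close_matches`; whether Kedro uses exactly this mechanism and cutoff is an assumption, and `suggest` is a hypothetical helper.

```python
import difflib

# Hypothetical dataset names, for illustration only.
dataset_names = ["companies", "processed_companies", "reviews"]


def suggest(name, known_names, cutoff=0.6):
    """Return up to three known names that are similar to `name`."""
    return difflib.get_close_matches(name, known_names, n=3, cutoff=cutoff)


suggestions = suggest("compani", dataset_names)
```

Note that fuzzy matching works on overall string similarity, so which names come back depends on the cutoff; it is not the same as a substring or regex search.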

@astrojuanlu
Member

Related #3312

@stephkaiser

@astrojuanlu we currently don't have an issue for this. I believe it was an idea that came up while we were discussing kedro-org/kedro-viz#1635

@astrojuanlu
Member

Notice that catalog.list() supports RegEx, see #3924

@astrojuanlu
Member

But

Integrate search functionality into the catalog, enabling users to search for datasets based on keywords, patterns and by kind. Include support for regex search to accommodate users with advanced search requirements.

this is a bit more advanced, I'd say

@merelcht
Member

When you say "search datasets in the catalog", what workflow are you talking about? Is this inside a notebook, on the CLI, directly in the IDE or on Kedro-Viz? Each of these user flows might have a different preferred solution.

@ElenaKhaustova
Contributor Author

After the discussion at backlog grooming, we've decided to:

  1. Revisit the docs on filtering with regex, to check whether they're clear enough (this feature exists, but users were not able to find it);
  2. Revisit our filtering approach and decide whether we want to extend it, for example with a "search by kind" method or something else (?)

@Galileo-Galilei
Member

Galileo-Galilei commented Oct 22, 2024

(Just dropping a comment now for when this is ready to be tackled properly)

When implementing the dict-like interface in #4218, several possibilities were proposed to filter on the values with a regex, listed here in the order they appeared in the PR:

  1. ❌ Enable filtering via KedroDataCatalog.keys(regex=...). The original proposal suggested the same interface for values and items, but this was considered confusing because for those two methods the filter would still apply to the keys.
  2. ❌ Add a KedroDataCatalog.filter(regex=...) method.
  3. ✅ Keep the current KedroDataCatalog.list() method for regex filtering.

Option 1 was considered interesting overall, but several of us expressed concerns that it is not consistent with the standard dict interface, which would hurt discoverability.

Option 2 tended to be the leading choice, but option 3 came back a couple of minutes before merging, after @idanov's comment that it may be confusing because it's not clear what we are filtering on. I think we did not think enough about the implications, and I want to reconsider it before the official release of KedroDataCatalog. I think the currently implemented option 3 is actually the worst of all three, for several reasons:

  • it has existed for a long time, and we've had a lot of evidence from the user interviews that people (including myself 😉) were not aware of it, so there's a clear signal we need to change something
  • with the new implementation, we have list(catalog) and catalog.list(), which do almost (but not exactly) the same thing. I think this will increase confusion about the method, because I'd bet we will document the two in different places, which hurts discoverability.
  • I personally find .list() less explicit than filter as a name for filtering / searching
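
To make the comparison easier to follow, here is a toy dict-like class showing options 2 and 3 side by side. `ToyCatalog` is purely illustrative and is not KedroDataCatalog: the method names and the `regex_search` parameter name are assumptions, and in this toy list(catalog) and catalog.list() return identical results, so it does not model the small difference mentioned above.

```python
import re
from collections.abc import Mapping


class ToyCatalog(Mapping):
    """Hypothetical dict-like catalog keyed by dataset name."""

    def __init__(self, datasets):
        self._datasets = dict(datasets)

    def __getitem__(self, name):
        return self._datasets[name]

    def __iter__(self):
        return iter(self._datasets)

    def __len__(self):
        return len(self._datasets)

    def list(self, regex_search=None):
        """Option 3: list names, optionally filtered by a regex."""
        if regex_search is None:
            return [name for name in self._datasets]
        return self.filter(regex_search)

    def filter(self, regex):
        """Option 2: an explicitly named filter on dataset names."""
        pattern = re.compile(regex)
        return [name for name in self._datasets if pattern.search(name)]


catalog = ToyCatalog({"companies": object(), "reviews": object()})
names_builtin = list(catalog)               # built-in list() iterates the keys
names_method = catalog.list()               # method with a near-identical spelling
filtered_method = catalog.list("^comp")     # option 3
filtered_explicit = catalog.filter("^comp") # option 2
```

Reading the last four lines aloud shows the naming concern: the first two are easy to mix up, while the `filter` call states its intent.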

I think it's worth rediscussing / voting on this specific method, and not considering #3931 done at this point.

PS: Overall this new implementation is 🔥, I just want to speak now instead of "forever holding my peace until a new major version"

@ElenaKhaustova
Contributor Author

Thanks for the summary, I do agree with you 🙂

For now, we agreed to think about the renaming when we work on this ticket. We wanted the name to be aligned with the filtering approach we end up suggesting, so we decided to keep the old name in #4218.

7 participants