Design DataCatalog2.0
#3995
Comments
One push here, and it's already addressed somewhat in your excellent write-up @merelcht, is the fact that we should look to delegate / integrate as much as we can. The data catalog was the first part of Kedro ever built; it was a leap forward for us in 2017, but the industry has matured so much in that time. We should always provide an accessible, novice user UX... but I think now is the time for interoperability.
|
|
Love the board @ElenaKhaustova there is an argument we should have all of those in PUML on the docs too 🤔 bit like #4013 |
Summary of Integration Proposal: Kedro DataCatalog with Unity Data Catalog
How We Plan to Integrate
Unity Catalog Integration Options:
Local Workflow After Integration:
Remote Workflow After Integration:
Recommendation to Start with Databricks Python SDK via Databricks Notebooks
After evaluating the Unity Catalog (open source) and Unity Catalog (Databricks) and their APIs, we recommend starting the integration using the Databricks Python SDK via Databricks notebooks.
Reasons for Recommendation:
Challenges with Integration
Integration via Platform SDK and Databricks Notebook
General concerns regarding integration
|
Providing some context for "What is UnityCatalog", as I personally find their docs very confusing. I think the main differentiator is a stronger enterprise focus on governance/access control.
That's a large part of the RESTful API described above: it stores metadata, and connectors consume the metadata and act almost like a dblink.
Main feature for access management.
I haven't seen too much about this
Metastore + UI as a shop for data source.
A bit similar to what |
With that in mind, my questions are:
With that in mind, does it make sense to focus on a subset/Databricks-native workflow (Spark/Delta/pandas workflow)? I also wonder: should we put all the focus on UnityCatalog/other catalogs, or should it be more around the API change for a better interactive use case (i.e. using |
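To make the "metastore stores metadata, connectors consume it" idea from this comment concrete, here is a toy sketch. It is not the Unity Catalog API: `TableInfo`, `Metastore`, `describe_read` and the storage path are all hypothetical names invented for the example. The only thing borrowed from Unity Catalog is the three-level `catalog.schema.table` naming.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableInfo:
    """Hypothetical metadata record: where the data lives and in what format."""

    storage_location: str
    data_format: str


class Metastore:
    """Hypothetical in-memory metastore keyed by 'catalog.schema.table' names."""

    def __init__(self) -> None:
        self._tables: dict[str, TableInfo] = {}

    def register(self, full_name: str, info: TableInfo) -> None:
        # Enforce the three-level namespace before storing the metadata.
        if len(full_name.split(".")) != 3:
            raise ValueError("expected a 'catalog.schema.table' name")
        self._tables[full_name] = info

    def get_table(self, full_name: str) -> TableInfo:
        return self._tables[full_name]


def describe_read(metastore: Metastore, full_name: str) -> str:
    """Hypothetical connector step: resolve metadata, then decide how to read."""
    info = metastore.get_table(full_name)
    return f"read {info.data_format} from {info.storage_location}"


metastore = Metastore()
metastore.register(
    "main.sales.orders",
    TableInfo(storage_location="s3://example-bucket/sales/orders", data_format="delta"),
)
plan = describe_read(metastore, "main.sales.orders")
```

The point of the sketch is only the split of responsibilities: the metastore never touches the data, and the connector never hard-codes a location.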
For me, the value is not particularly in integrating with The Our main focus remains on improving Kedro's DataCatalog based on insights from user research and interviews. |
Thanks a lot for the extensive research on the Unity Catalog @ElenaKhaustova 🙏🏼 My only point is that we should not make anything that's specific to Databricks Unity Catalog (since it's a commercial system) and it's a bit too early to understand how the different dataframe libraries and compute engines will interact with such metastores unitycatalog/unitycatalog#208 (reply in thread). At least, now that Polaris has just been open sourced apache/polaris#2, we know that the Apache Iceberg REST API "won", so if anything we should take that REST API as the reference. More questions from my side:
|
Thank you, @astrojuanlu! I fully agree with your points about not tying ourselves to specific catalogs. The truth is that we are still determining whether we want to integrate with UnityCatalog/Polaris or something else. The answer might change depending on how they develop in the near future. That's why we suggest focusing on improving Kedro's DataCatalog, a solution shaped by insights from user research and interviews, and treating the integration part as research with a PoC as a target result. In order to work on those two goals in parallel, we plan to start with moving shared logic to the Answering other questions:
|
I just want to say I love the direction this is going ❤️ , great work folks |
We picked the following tickets: #3925, #3926, #3916 and #3931 as a starting point for the implementation of The following PRs include the drafts of
Mentioned PRs include a draft of the following:
Some explanations behind the decisions made:
After a brief discussion of changes made with @idanov and these features: #3935 and #3932 we would like to focus on the following topics:
|
The following PR #4084 includes updates required to use both We also tried an approach when moving the changes proposed incrementally to the existing There's also an approach where we use the Based on the above, the further suggested approach is:
Other things to take into consideration:
|
Some reflections after the tech design and thoughts on @deepyaman concerns and suggestions from here:
First of all, thank you for this summary - we appreciate that people are interested in this workstream and care about the results.
|
As you mention, "There is sufficient user interest to justify making At the very least, would like to see the ability to create and use a |
Some follow-ups after the discussion with @astrojuanlu, @merelcht and @deepyaman:
We don't think the abstraction itself is a blocker for making kedro/kedro/io/data_catalog.py Line 18 in 6c7a1cc
There is also a different opinion on the idea of splitting kedro into the smaller set of libs here: #3659 (comment) To sum up, we would like to keep this topic out of the discussion for now as a decision about the abstraction doesn't directly relate to the problem and it can be made later.
Given points 1, 2 and 3 we are going to:
|
The following ticket and PRs address this point from the above discussion:
Further steps suggested:
|
After discussing the above with @merelcht and @idanov, it was decided to split the above work into a set of incremental changes, modifying the existing catalog class or extending the functionality by introducing an abstraction without breaking changes where possible. Then, plan the set of breaking changes and discuss them separately. |
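One way to picture "introducing an abstraction without breaking changes" is sketched below. This is an illustration under stated assumptions, not Kedro's code: `AbstractCatalog`, `ExistingCatalog`, `keys()` and `names()` are hypothetical names. The existing class keeps its public methods, gains a base class, and an older method survives as a deprecated alias.

```python
import warnings
from abc import ABC, abstractmethod
from typing import Any


class AbstractCatalog(ABC):
    """Hypothetical abstraction extracted from the existing catalog."""

    @abstractmethod
    def load(self, name: str) -> Any: ...

    @abstractmethod
    def save(self, name: str, data: Any) -> None: ...


class ExistingCatalog(AbstractCatalog):
    """Stand-in for the current catalog: load/save behave exactly as before."""

    def __init__(self) -> None:
        self._store: dict[str, Any] = {}

    def load(self, name: str) -> Any:
        return self._store[name]

    def save(self, name: str, data: Any) -> None:
        self._store[name] = data

    def keys(self) -> list[str]:
        """New, additive method introduced in an incremental step."""
        return sorted(self._store)

    def names(self) -> list[str]:
        """Older method kept as a deprecated alias so callers do not break."""
        warnings.warn(
            "names() is deprecated, use keys()", DeprecationWarning, stacklevel=2
        )
        return self.keys()


catalog = ExistingCatalog()
catalog.save("a", 1)
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    old_result = catalog.names()  # still works, but emits a warning
```

The actual removal of `names()` would then be one of the breaking changes that get planned and discussed separately.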
Some motivations behind the decision:
Long story short: we prefer a longer path with incremental changes of the existing catalog to bumping new catalog |
Since I believe the https://github.com/astrojuanlu/kedro-catalog hoping that they serve as inspiration. This is a prototype hacked in a rush so it's not meant to be a full replacement of the current
The codebase is lean and makes heavy use of Of course, it's tiny because it leaves lots of things out of the table. It critically does not support:
|
So I guess my real question is: Are we confident that the incremental strategy allows us to tackle all these user pain points in a timely fashion, while also meeting backwards compatibility including features that we aren't sure we want to keep around? |
I think the quote you shared is clearly hinting towards an answer - we already have a complex system, so unless we want to dismantle the whole complex functionality that Kedro offers, we'd be better off with incremental changes. Unless you are suggesting to redesign the whole of Kedro and go with Kedro 2.0, but I'd rather try to reach 1.0 first 😅 Nevertheless, the sketched out solution you've created definitely serves as nice inspiration and highlights some of the ideas already in circulation, namely employing protocols and dataclasses, which we should definitely drift towards. We should bear in mind that a lot can be achieved in non-breaking changes with a bit of creativity. In fact, the path might actually end up being much shorter if we go the non-breaking road, it might just involve more frequent smaller steps rather than a big jump, which will inevitably end up being followed by patch fixes, bug fixes and corner cases that we hadn't foreseen. |
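As a rough illustration of the "protocols and dataclasses" direction mentioned above, here is a self-contained sketch. It does not use Kedro's actual API or the linked prototype; `Dataset`, `MemoryDataset` and `Catalog` are hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class Dataset(Protocol):
    """Hypothetical structural interface: anything with load() and save()."""

    def load(self) -> Any: ...

    def save(self, data: Any) -> None: ...


@dataclass
class MemoryDataset:
    """Hypothetical in-memory dataset; satisfies Dataset without inheriting from it."""

    _data: Any = None

    def load(self) -> Any:
        if self._data is None:
            raise ValueError("no data has been saved yet")
        return self._data

    def save(self, data: Any) -> None:
        self._data = data


@dataclass
class Catalog:
    """Hypothetical catalog: a thin, typed mapping from names to datasets."""

    datasets: dict[str, Dataset] = field(default_factory=dict)

    def load(self, name: str) -> Any:
        return self.datasets[name].load()

    def save(self, name: str, data: Any) -> None:
        self.datasets[name].save(data)


catalog = Catalog({"numbers": MemoryDataset()})
catalog.save("numbers", [1, 2, 3])
loaded = catalog.load("numbers")
```

Because `Dataset` is a protocol, third-party classes can plug into the catalog by shape alone, with no base class to import.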
The short answer to this: yes. The long answer: the incremental approach does not change the implementation or the user pain points it will tackle, only how we will deliver it. The current POC PRs tackle a lot all at once, which makes them hard to review and test properly. This will ultimately mean a delay in shipping and lower confidence that it works as expected. So like @idanov says, this iterative approach will likely end up being shorter and allow us to deliver improvements bit by bit. @ElenaKhaustova and I had another chat and the concrete next steps are:
|
Thank you, @astrojuanlu, for sharing your ideas and vision on the target for |
Further plan for
|
Solved |
Before moving forward with the above it makes sense to handle the following issues:
|
Description
The current DataCatalog in Kedro has served its purpose well but has limitations and areas for improvement identified through user research: #3934
As a result of the DataCatalog user research interviews, we have created a list of tickets and split them into 3 categories:
Addressing issues from 2. and 3. requires significant changes and the introduction of new features and concepts that go beyond the scope of incremental updates.
The objective is to design a new, robust, and modular DataCatalog2.0 (a better name is welcome) that incorporates feedback from the community, follows best practices, and integrates new features seamlessly.
While redesigning, we plan for a smooth migration from the current DataCatalog to DataCatalog2.0, minimizing disruption for existing users.
Context
Suggested prioritisation and tickets opened: #3934 (comment)
Related topics
https://github.com/kedro-org/kedro-starters/tree/main/standalone-datacatalog
#2901
#2741
Next steps
DataCatalog architecture.
Unity Catalog, Polaris, dlthub, other?) address similar tasks and challenges.
DataCatalog2.0.