-
Notifications
You must be signed in to change notification settings - Fork 915
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Create a docs page about adding code beyond starter files #3852
Create a docs page about adding code beyond starter files #3852
Conversation
Signed-off-by: Yury Fedotov <[email protected]>
Signed-off-by: Yury Fedotov <[email protected]>
Signed-off-by: Yury Fedotov <[email protected]>
@noklam, @astrojuanlu, could you please have a look? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you so much for this contribution @yury-fedotov! This is already looking great ⭐
I left some comments, mostly small nits.
Comment on customizations Co-authored-by: Merel Theisen <[email protected]> Signed-off-by: Yury Fedotov <[email protected]>
For example in splitting Co-authored-by: Merel Theisen <[email protected]> Signed-off-by: Yury Fedotov <[email protected]>
Make provided examples title not bold Co-authored-by: Merel Theisen <[email protected]> Signed-off-by: Yury Fedotov <[email protected]>
And as where appropriate Co-authored-by: Merel Theisen <[email protected]> Signed-off-by: Yury Fedotov <[email protected]>
Organise spelling Co-authored-by: Merel Theisen <[email protected]> Signed-off-by: Yury Fedotov <[email protected]>
Proper wording for pip install note Co-authored-by: Merel Theisen <[email protected]> Signed-off-by: Yury Fedotov <[email protected]>
Implies to consists replacement Co-authored-by: Merel Theisen <[email protected]> Signed-off-by: Yury Fedotov <[email protected]>
Signed-off-by: Yury Fedotov <[email protected]> # Conflicts: # RELEASE.md
Signed-off-by: Yury Fedotov <[email protected]>
Hey @merelcht - thanks for comments again, I just finished implementing them. QQ also about the |
Yes, it's a soft check. I usually treat it as a guidance on grammar/word usage. It does also check for spelling mistakes, which is helpful. But you definitely don't need to address it all. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First of all, thank you for the contribution!
It would be great if @stichbury can help review the style of the docs a little bit, otherwise have you seen https://github.com/kedro-org/kedro/wiki/Kedro-documentation-style-guide already?
I like a lot of the recommendation here, but I hesitate to make it an official recommendation without the team discussing a bit more. Should we extract the content and make it a guest blog post instead? @astrojuanlu
read and collaborate on as your codebase grows. | ||
Those files also sometimes make new users think that Kedro requires code | ||
to be located only in those starter files, which is not true. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the clarification, but I found this sounds like coming from an user rather than the official docs.
#2512 (comment), we have an answer for this and the issue is still opened. Would it be better to actually write the documentation and link here instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you elaborate more on what exact change are you proposing here?
To reference this GH issue right in the code_beyond_starter_files.md
?
Or to make content of comments on that issue part of this new section of docs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm... I am not the best person to ask for English but I'll try my best🤓
While those may be sufficient for a small project, they quickly become large, hard to read and collaborate on as your codebase grows. Those files also sometimes make new users think that Kedro requires code to be located only in those starter files, which is not true.
I'd rephrase it to something like "When project become large, it may be beneficial to adopt a different structure or give pipeline files a more specific names." It is by convention (and find_pipeline
use that convention), but not mandatory to name files pipeline.py
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@noklam Your examples is specifically around pipeline files, while I wanted to convey 2 things here:
- As project evolves and
nodes.py
/pipeline.py
files grow, they become challenging to manage. - Good news is that you can rename and restructure them.
Do you disagree with those? I think the first one is just a fact based on how git diff works - if you have one big file, you're more likely to have merge conflicts, etc. And the second one, I wanted to cover not only the pipeline, but also node files there.
``` | ||
|
||
This being the only constraint means that you can, for example: | ||
* Add `utils.py` file to a pipeline folder and import utilities used by multiple |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We usually avoid utils.py
as much as possible as they are the bin for everything. It's ironic because kedro
do have utils
module that are left from years ago. It hasn't been growing though as we believe it's better to have explicit module.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@noklam how would you then call a .py
file that is not nodes.py
or pipeline.py
for the purpose of this example? I was thinking of dataframe_utils.py
, but didn't like it because it adds to the impression that Kedro is only useful in data processing projects, which isn't true.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is many thing that I don't like about utils
(just me, not a team consensus). The purpose is ill-defined and often become the place that people dump code to without thinking.
Even if we go with utils. I will strip the _utils
suffix, it feels redundant to have pandas_utils.py
under utils.py
. Then in the code I will probably do from <pacakge> import utils
. When I need to use it, I will use utils.pandas.func
to make it clear that this is a util namespace but not pandas
.
visualitization_utils.py
could just be visualisation
module itself.
(all above are subjective)
I think it will be great to first introduce the principle of share module, what are the factors to consider. Then you can show this example. https://kedro-org.slack.com/archives/C03RKP2LW64/p1716912123397259
@datajoely do you have thought about this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Comments re: naming utils modules and importing them - addressed ✔️
- Re: introducing the principles, the hyperlink doesn't work for me, leads just to the
questions
channel.
This being the only constraint means that you can, for example: | ||
* Add `utils.py` file to a pipeline folder and import utilities used by multiple | ||
functions in `nodes.py` from there. | ||
* [Share modules between pipelines](#sharing-modules-between-pipelines). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would be really beneficial if we have something to showcase, maybe there are some projects in awesome-kedro
that we can link to?
Even just the tree
structure of a project would be great.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The example is good, but I think we can link https://github.com/kedro-org/awesome-kedro/blob/master/README.md#example-projects to direct people for more examples. It's also helpful to see real code
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I manually went through all Kedro projects listed there, I don't think there's any one that implements shared things between pipeline in a way that would be a good example to follow here. The closest one is this one: https://github.com/pablovdcf/TFM_HADO_Cares/tree/main/hado/src/hado
* Instead of having a single `pipeline.py` in your pipeline folder, split it, for example, | ||
into `historical_pipeline.py` and `inference_pipeline.py`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like we usually go for the namespace/modular structure more:
- pipelines
- historical
- inference
- pipeline.py
@astrojuanlu thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that historical
vs inference
sound like candidates for leveraging same pipeline just different namespace. To avoid the confusion, I changed the example to this:
If you have multiple large `Pipeline` objects defined in a single `pipeline.py`,
split them into separate `.py` files. For example, in `data_processing` pipeline
you may want to have `cleaning_pipeline.py` and `merging_pipeline.py`.
Is that better?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sounds good!
into `historical_pipeline.py` and `inference_pipeline.py`. | ||
* Instead of registering many pipelines in `register_pipelines()` function one by one, | ||
create a few `tp.Dict[str, Pipeline]` objects in different places of the project | ||
and then make `register_pipelines()` return a union of those. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what about find_pipelines()
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added:
While Kedro features a [`find_pipelines()` functionality for autodiscovery of pipelines](../nodes_and_pipelines/pipeline_registry.md#pipeline-autodiscovery),
for large projects you may want a finer control and register pipelines manually.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My questions are:
- is
historical_pipeline.py
andinference_pipelines.py
better thanpipelines/historical/pipeline.py
(the modular pipeline structure) that Kedro usually promotes?
I think this is a viable alternative, but if this is in docs instead of a blog. I'll probably change the narrative to: There is an alternative to register pipeline manually, explaining the pipeline.py
is just a convention and for find_pipeline
works automatically. User still have the option to register manually if desired.
@astrojuanlu thought?
I am happy to review/edit this if you decide to retain as docs, but please let me know what you decide about this:
LMK if you want to have a blog post instead, which sounds like a great idea, and then you could extract from the post if you wanted a longer-term piece of content that you maintain officially as docs. I wouldn't be able to craft the post without some notice and a ticket, but would definitely support the idea and help with later stages to edit and post, if you need it. Just ping me again when you decide how to go forward! |
Hi @stichbury, we've discussed this in backlog grooming and decided we'd like to get it in as docs. I really appreciate any help you can give getting this in shape 🙏 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot for this contribution @yury-fedotov ! I left a couple of comments of outstanding things I saw.
I agree it would be nice to have this as a "How-To" guide (pretty much aligned with the https://diataxis.fr/ taxonomy and other parts of our docs). I would have it maybe under "Advanced usage", somewhere, although just for continuity, not really a fan of the "Advanced" name there.
For sure! I'm not sure if I'll be able to get to this in this sprint, but I'll add to my list; otherwise it'll be next week. To save holding you up, probably best to get it into the state where you want me to do a final review/update and let me know it is ready. |
Thanks for the comments team! I'll get to them next week and come back. |
Signed-off-by: Yury Fedotov <[email protected]> # Conflicts: # RELEASE.md
Signed-off-by: Yury Fedotov <[email protected]>
Thanks for review @datajoely ! I agree with what you said. I think the current content reflects that I had a different user problem in mind while writing this, and that caused the content to diverge from what you would've expected for the problem definition you mentioned. Here's the difference. You definition (copy pasting)
What I was trying to cover
Proposed next stepsI'm totally happy to adjust the content as much as needed, including ground up rewrite if it would make it more useful. But would appreciate a coordinated maintainers opinion on what should this content be. @datajoely , @merelcht , @astrojuanlu , @noklam do you mind discussing internally (or I am happy to join too) how does the core team see the resolution of 3418 and let me know? |
Signed-off-by: Merel Theisen <[email protected]>
Signed-off-by: Merel Theisen <[email protected]>
Lol, I wanted to resolve merge conflict but hit very weird pre-commit issue, see here: #4004 I closed it though as not sure why that happens, maybe specific to my computer. Resolved through GitHub UI since it was a one line thing in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, @yury-fedotov! I really like this addition to the docs, and the Advanced usage
section is the perfect spot for it. I appreciate you sharing these best practices for using Kedro, just as you did during the namespaces docs revision. It's important we continue to agree on these practices and share them with our users.
Hey team, I've addressed recent comments. Could you have another look please? I also went through the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @yury-fedotov , first of all thanks for your patience. It has taken me a while to give this a proper review because, although there's nothing wrong in what you've written, we still are not fully happy with how it fits with the rest of the documentation.
We highly appreciate this and your other contributions 🙏🏼
Roughly speaking, the prose you've written describes 4 things:
- Where does Kedro look for code to be located. This is an essential concept actually, and it should have a dedicated page, for example under the "Kedro projects" header, where we currently have other explanations (using the Diátaxis terminology). In fact, this is closely related to Document workflow to incrementally create a minimal Kedro project from scratch #2512, which we never addressed. Therefore, my recommendation would be to open a separate PR just for that document ↪
- Alternative ways of organising pipelines. I think that's more or less covered in https://docs.kedro.org/en/stable/nodes_and_pipelines/pipeline_registry.html already, and I find it weird to have another set of recommendations or possible patterns here. Therefore my recommendation is to drop this ♻️
- Sharing modules between pipelines. It's very difficult for me to judge how appropriate this is for the Kedro documentation, because as much as we want to upskill data scientists, it reads like a shallow explanation of how Python packages and imports work. In addition, and this has been mentioned by others as well, it's unclear if we actually want to recommend a "utils" module, given that it's a bit of an antipattern ([1], [2]). This doesn't mean it's not useful - it means that we don't have a good place for this. There are many other fundamental Python concepts we don't explain. So my recommendation is to drop this as well ♻️
- Kedro project in a monorepo setup. I think this gets into murky territory. I dumped some thoughts on monorepos here Thoughts on monorepos (dependency isolation) #4147 I think the word "monorepo" is very overloaded, there are different use cases, and Python tooling to manage them is quite immature. I'm not opposed to having recommendations about this though - but I think it should go to a "how to organize big Kedro projects", and a more thorough breakdown of the different things people might want. So my recommendation would be to open a new ticket and a separate PR for it ➕
Shall we close this and instead schedule a session to work out Kedro's view on this topic? |
In my review comment #3852 (review) I recommended a way forward for the 4 distinct blocks I observed in this PR. @datajoely Do you think we can do something else to clarify next steps? |
No I think it's great - but I'd love to do a session where think through those topics, possible co-design with power users like @yury-fedotov . It's hard to separate philosophy from opinion / tooling preference (which ages badly) Even something like |
@datajoely thanks! I'm happy to participate in this session :) |
Description
Per this discussion in Slack with @noklam, I took a task of creating a guide on how to add code beyond starter files in Kedro.
P.S. It's my first contribution to the project, so I'd be happy to iterate as much as needed until it satisfies core team's needs. And appreciate any feedback.
Development notes
Developer Certificate of Origin
We need all contributions to comply with the Developer Certificate of Origin (DCO). All commits must be signed off by including a
Signed-off-by
line in the commit message. See our wiki for guidance.If your PR is blocked due to unsigned commits, then you must follow the instructions under "Rebase the branch" on the GitHub Checks page for your PR. This will retroactively add the sign-off to all unsigned commits and allow the DCO check to pass.
Checklist
RELEASE.md
file