Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create a docs page about adding code beyond starter files #3852

Closed

Conversation

yury-fedotov
Copy link
Contributor

@yury-fedotov yury-fedotov commented May 6, 2024

Description

Part of progress towards: #3418

Per this discussion in Slack with @noklam, I took a task of creating a guide on how to add code beyond starter files in Kedro.

P.S. It's my first contribution to the project, so I'd be happy to iterate as much as needed until it satisfies core team's needs. And appreciate any feedback.

Development notes

  • Built docs locally to test that hyperlinks and formatting work as expected.

Developer Certificate of Origin

We need all contributions to comply with the Developer Certificate of Origin (DCO). All commits must be signed off by including a Signed-off-by line in the commit message. See our wiki for guidance.

If your PR is blocked due to unsigned commits, then you must follow the instructions under "Rebase the branch" on the GitHub Checks page for your PR. This will retroactively add the sign-off to all unsigned commits and allow the DCO check to pass.

Checklist

  • Read the contributing guidelines
  • Signed off each commit with a Developer Certificate of Origin (DCO)
  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added a description of this change in the RELEASE.md file
  • Added tests to cover my changes
  • Checked if this change will affect Kedro-Viz, and if so, communicated that with the Viz team

@yury-fedotov yury-fedotov marked this pull request as ready for review May 6, 2024 06:13
@yury-fedotov
Copy link
Contributor Author

@yury-fedotov
Copy link
Contributor Author

@noklam, @astrojuanlu, could you please have a look?

Copy link
Member

@merelcht merelcht left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you so much for this contribution @yury-fedotov! This is already looking great ⭐
I left some comments, mostly small nits.

yury-fedotov and others added 9 commits May 16, 2024 19:31
Comment on customizations

Co-authored-by: Merel Theisen <[email protected]>
Signed-off-by: Yury Fedotov <[email protected]>
For example in splitting

Co-authored-by: Merel Theisen <[email protected]>
Signed-off-by: Yury Fedotov <[email protected]>
Make provided examples title not bold

Co-authored-by: Merel Theisen <[email protected]>
Signed-off-by: Yury Fedotov <[email protected]>
And as where appropriate

Co-authored-by: Merel Theisen <[email protected]>
Signed-off-by: Yury Fedotov <[email protected]>
Organise spelling

Co-authored-by: Merel Theisen <[email protected]>
Signed-off-by: Yury Fedotov <[email protected]>
Proper wording for pip install note

Co-authored-by: Merel Theisen <[email protected]>
Signed-off-by: Yury Fedotov <[email protected]>
Implies to consists replacement

Co-authored-by: Merel Theisen <[email protected]>
Signed-off-by: Yury Fedotov <[email protected]>
Signed-off-by: Yury Fedotov <[email protected]>

# Conflicts:
#	RELEASE.md
@yury-fedotov
Copy link
Contributor Author

Hey @merelcht - thanks for comments again, I just finished implementing them.

QQ also about the vale usage in CI. I noticed that it produces some warnings on this PR, but apparently on many other existing MD files. So it's like a soft CI check?

@merelcht
Copy link
Member

QQ also about the vale usage in CI. I noticed that it produces some warnings on this PR, but apparently on many other existing MD files. So it's like a soft CI check?

Yes, it's a soft check. I usually treat it as a guidance on grammar/word usage. It does also check for spelling mistakes, which is helpful. But you definitely don't need to address it all.

Copy link
Contributor

@noklam noklam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First of all, thank you for the contribution!

It would be great if @stichbury can help review the style of the docs a little bit, otherwise have you seen https://github.com/kedro-org/kedro/wiki/Kedro-documentation-style-guide already?

I like a lot of the recommendation here, but I hesitate to make it an official recommendation without the team discussing a bit more. Should we extract the content and make it a guest blog post instead? @astrojuanlu

Comment on lines 8 to 10
read and collaborate on as your codebase grows.
Those files also sometimes make new users think that Kedro requires code
to be located only in those starter files, which is not true.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the clarification, but I found this sounds like coming from an user rather than the official docs.

#2512 (comment), we have an answer for this and the issue is still opened. Would it be better to actually write the documentation and link here instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you elaborate more on what exact change are you proposing here?
To reference this GH issue right in the code_beyond_starter_files.md?
Or to make content of comments on that issue part of this new section of docs?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm... I am not the best person to ask for English but I'll try my best🤓

While those may be sufficient for a small project, they quickly become large, hard to read and collaborate on as your codebase grows. Those files also sometimes make new users think that Kedro requires code to be located only in those starter files, which is not true.

I'd rephrase it to something like "When project become large, it may be beneficial to adopt a different structure or give pipeline files a more specific names." It is by convention (and find_pipeline use that convention), but not mandatory to name files pipeline.py

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@noklam Your examples is specifically around pipeline files, while I wanted to convey 2 things here:

  • As project evolves and nodes.py / pipeline.py files grow, they become challenging to manage.
  • Good news is that you can rename and restructure them.

Do you disagree with those? I think the first one is just a fact based on how git diff works - if you have one big file, you're more likely to have merge conflicts, etc. And the second one, I wanted to cover not only the pipeline, but also node files there.

```

This being the only constraint means that you can, for example:
* Add `utils.py` file to a pipeline folder and import utilities used by multiple
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We usually avoid utils.py as much as possible as they are the bin for everything. It's ironic because kedro do have utils module that are left from years ago. It hasn't been growing though as we believe it's better to have explicit module.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@noklam how would you then call a .py file that is not nodes.py or pipeline.py for the purpose of this example? I was thinking of dataframe_utils.py, but didn't like it because it adds to the impression that Kedro is only useful in data processing projects, which isn't true.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is many thing that I don't like about utils (just me, not a team consensus). The purpose is ill-defined and often become the place that people dump code to without thinking.

Even if we go with utils. I will strip the _utils suffix, it feels redundant to have pandas_utils.py under utils.py. Then in the code I will probably do from <pacakge> import utils. When I need to use it, I will use utils.pandas.func to make it clear that this is a util namespace but not pandas.

visualitization_utils.py could just be visualisation module itself.
(all above are subjective)

I think it will be great to first introduce the principle of share module, what are the factors to consider. Then you can show this example. https://kedro-org.slack.com/archives/C03RKP2LW64/p1716912123397259
@datajoely do you have thought about this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Comments re: naming utils modules and importing them - addressed ✔️
  • Re: introducing the principles, the hyperlink doesn't work for me, leads just to the questions channel.

This being the only constraint means that you can, for example:
* Add `utils.py` file to a pipeline folder and import utilities used by multiple
functions in `nodes.py` from there.
* [Share modules between pipelines](#sharing-modules-between-pipelines).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would be really beneficial if we have something to showcase, maybe there are some projects in awesome-kedro that we can link to?

Even just the tree structure of a project would be great.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't that exactly what I'm adding there?

Screenshot 2024-05-25 at 12 47 31 PM

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The example is good, but I think we can link https://github.com/kedro-org/awesome-kedro/blob/master/README.md#example-projects to direct people for more examples. It's also helpful to see real code

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I manually went through all Kedro projects listed there, I don't think there's any one that implements shared things between pipeline in a way that would be a good example to follow here. The closest one is this one: https://github.com/pablovdcf/TFM_HADO_Cares/tree/main/hado/src/hado

Comment on lines 43 to 44
* Instead of having a single `pipeline.py` in your pipeline folder, split it, for example,
into `historical_pipeline.py` and `inference_pipeline.py`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like we usually go for the namespace/modular structure more:

  • pipelines
    • historical
    • inference
      • pipeline.py

@astrojuanlu thoughts?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that historical vs inference sound like candidates for leveraging same pipeline just different namespace. To avoid the confusion, I changed the example to this:

If you have multiple large `Pipeline` objects defined in a single `pipeline.py`,
split them into separate `.py` files. For example, in `data_processing` pipeline
you may want to have `cleaning_pipeline.py` and `merging_pipeline.py`.

Is that better?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds good!

into `historical_pipeline.py` and `inference_pipeline.py`.
* Instead of registering many pipelines in `register_pipelines()` function one by one,
create a few `tp.Dict[str, Pipeline]` objects in different places of the project
and then make `register_pipelines()` return a union of those.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what about find_pipelines()?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added:

While Kedro features a [`find_pipelines()` functionality for autodiscovery of pipelines](../nodes_and_pipelines/pipeline_registry.md#pipeline-autodiscovery),
for large projects you may want a finer control and register pipelines manually.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My questions are:

  • is historical_pipeline.py and inference_pipelines.py better than pipelines/historical/pipeline.py (the modular pipeline structure) that Kedro usually promotes?

I think this is a viable alternative, but if this is in docs instead of a blog. I'll probably change the narrative to: There is an alternative to register pipeline manually, explaining the pipeline.py is just a convention and for find_pipeline works automatically. User still have the option to register manually if desired.

@astrojuanlu thought?

@stichbury
Copy link
Contributor

I am happy to review/edit this if you decide to retain as docs, but please let me know what you decide about this:

I like a lot of the recommendation here, but I hesitate to make it an official recommendation without the team discussing a bit more. Should we extract the content and make it a guest blog post instead? @astrojuanlu

LMK if you want to have a blog post instead, which sounds like a great idea, and then you could extract from the post if you wanted a longer-term piece of content that you maintain officially as docs. I wouldn't be able to craft the post without some notice and a ticket, but would definitely support the idea and help with later stages to edit and post, if you need it.

Just ping me again when you decide how to go forward!

@merelcht
Copy link
Member

Hi @stichbury, we've discussed this in backlog grooming and decided we'd like to get it in as docs. I really appreciate any help you can give getting this in shape 🙏

Copy link
Member

@astrojuanlu astrojuanlu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for this contribution @yury-fedotov ! I left a couple of comments of outstanding things I saw.

I agree it would be nice to have this as a "How-To" guide (pretty much aligned with the https://diataxis.fr/ taxonomy and other parts of our docs). I would have it maybe under "Advanced usage", somewhere, although just for continuity, not really a fan of the "Advanced" name there.

@stichbury
Copy link
Contributor

Hi @stichbury, we've discussed this in backlog grooming and decided we'd like to get it in as docs. I really appreciate any help you can give getting this in shape 🙏

For sure! I'm not sure if I'll be able to get to this in this sprint, but I'll add to my list; otherwise it'll be next week. To save holding you up, probably best to get it into the state where you want me to do a final review/update and let me know it is ready.

@yury-fedotov
Copy link
Contributor Author

Thanks for the comments team! I'll get to them next week and come back.

@yury-fedotov
Copy link
Contributor Author

Thanks for review @datajoely !

I agree with what you said. I think the current content reflects that I had a different user problem in mind while writing this, and that caused the content to diverge from what you would've expected for the problem definition you mentioned.

Here's the difference.

You definition (copy pasting)

  • Business logic grows in complexity and should live and be tested outside of your core Kedro project.
  • You should also be aspiring to achieve the general principle of loose coupling, high cohesion
  • Python paths and packaging in general is a pain for many novice programmers.

Those problems are undoubtedly relevant IMO, I agree.

What I was trying to cover

  • Beginner users are sometimes unsure where to put the code as the project grows. Can it only be in nodes.py and pipeline.py? Would it work if I create a pipeline directory manually instead of kedro pipeline create? We indeed know answers to those, but the guide was trying to give those to people who are very new to Kedro.
  • It can also be challenging to grasp how would Kedro work if project root != git repo root.

It is actually an open question if problems I refer to above are relevant, but as I understood from 3418, they are.

Proposed next steps

I'm totally happy to adjust the content as much as needed, including ground up rewrite if it would make it more useful. But would appreciate a coordinated maintainers opinion on what should this content be.

@datajoely , @merelcht , @astrojuanlu , @noklam do you mind discussing internally (or I am happy to join too) how does the core team see the resolution of 3418 and let me know?

@merelcht
Copy link
Member

merelcht commented Jul 9, 2024

While this doesn't entirely close #3418, I do think it's a valuable addition to our docs and I would very much like this to go in. @noklam Could you have another look and comment on what you think still needs to change?

@merelcht merelcht requested review from noklam and removed request for yetudada July 9, 2024 09:42
@yury-fedotov
Copy link
Contributor Author

yury-fedotov commented Jul 11, 2024

Lol, I wanted to resolve merge conflict but hit very weird pre-commit issue, see here: #4004

I closed it though as not sure why that happens, maybe specific to my computer.

Resolved through GitHub UI since it was a one line thing in RELEASE.md

Copy link
Member

@DimedS DimedS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, @yury-fedotov! I really like this addition to the docs, and the Advanced usage section is the perfect spot for it. I appreciate you sharing these best practices for using Kedro, just as you did during the namespaces docs revision. It's important we continue to agree on these practices and share them with our users.

@yury-fedotov
Copy link
Contributor Author

Hey team, I've addressed recent comments. Could you have another look please?

I also went through the Language Linter for Kedro Docs error logs, and not sure how to proceed, since it does propose a few changes to this file, but it does that for most of the docs files in general. I understood it's used as a soft check?

Copy link
Member

@astrojuanlu astrojuanlu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @yury-fedotov , first of all thanks for your patience. It has taken me a while to give this a proper review because, although there's nothing wrong in what you've written, we still are not fully happy with how it fits with the rest of the documentation.

We highly appreciate this and your other contributions 🙏🏼

Roughly speaking, the prose you've written describes 4 things:

  • Where does Kedro look for code to be located. This is an essential concept actually, and it should have a dedicated page, for example under the "Kedro projects" header, where we currently have other explanations (using the Diátaxis terminology). In fact, this is closely related to Document workflow to incrementally create a minimal Kedro project from scratch #2512, which we never addressed. Therefore, my recommendation would be to open a separate PR just for that document ↪
  • Alternative ways of organising pipelines. I think that's more or less covered in https://docs.kedro.org/en/stable/nodes_and_pipelines/pipeline_registry.html already, and I find it weird to have another set of recommendations or possible patterns here. Therefore my recommendation is to drop this ♻️
  • Sharing modules between pipelines. It's very difficult for me to judge how appropriate this is for the Kedro documentation, because as much as we want to upskill data scientists, it reads like a shallow explanation of how Python packages and imports work. In addition, and this has been mentioned by others as well, it's unclear if we actually want to recommend a "utils" module, given that it's a bit of an antipattern ([1], [2]). This doesn't mean it's not useful - it means that we don't have a good place for this. There are many other fundamental Python concepts we don't explain. So my recommendation is to drop this as well ♻️
  • Kedro project in a monorepo setup. I think this gets into murky territory. I dumped some thoughts on monorepos here Thoughts on monorepos (dependency isolation) #4147 I think the word "monorepo" is very overloaded, there are different use cases, and Python tooling to manage them is quite immature. I'm not opposed to having recommendations about this though - but I think it should go to a "how to organize big Kedro projects", and a more thorough breakdown of the different things people might want. So my recommendation would be to open a new ticket and a separate PR for it ➕

@astrojuanlu astrojuanlu removed their assignment Sep 12, 2024
@datajoely
Copy link
Contributor

Shall we close this and instead schedule a session to work out Kedro's view on this topic?

@astrojuanlu
Copy link
Member

In my review comment #3852 (review) I recommended a way forward for the 4 distinct blocks I observed in this PR.

@datajoely Do you think we can do something else to clarify next steps?

@datajoely
Copy link
Contributor

No I think it's great - but I'd love to do a session where think through those topics, possible co-design with power users like @yury-fedotov . It's hard to separate philosophy from opinion / tooling preference (which ages badly)

Even something like uv may look like black (which was novel in 2017) does today in 5 years....

@yury-fedotov
Copy link
Contributor Author

@datajoely thanks! I'm happy to participate in this session :)
For now, let me close this PR, and then when I get to it in the future, open new ones for separate items, as @astrojuanlu proposed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants