Create a docs page about adding code beyond starter files #3852

yury-fedotov · 2024-05-06T06:10:31Z

Description

Part of progress towards: #3418

Per this discussion in Slack with @noklam, I took a task of creating a guide on how to add code beyond starter files in Kedro.

P.S. It's my first contribution to the project, so I'd be happy to iterate as much as needed until it satisfies core team's needs. And appreciate any feedback.

Development notes

Built docs locally to test that hyperlinks and formatting work as expected.

Developer Certificate of Origin

We need all contributions to comply with the Developer Certificate of Origin (DCO). All commits must be signed off by including a Signed-off-by line in the commit message. See our wiki for guidance.

If your PR is blocked due to unsigned commits, then you must follow the instructions under "Rebase the branch" on the GitHub Checks page for your PR. This will retroactively add the sign-off to all unsigned commits and allow the DCO check to pass.

Checklist

Read the contributing guidelines
Signed off each commit with a Developer Certificate of Origin (DCO)
Opened this PR as a 'Draft Pull Request' if it is work-in-progress
Updated the documentation to reflect the code changes
Added a description of this change in the RELEASE.md file
Added tests to cover my changes
Checked if this change will affect Kedro-Viz, and if so, communicated that with the Viz team

Signed-off-by: Yury Fedotov <[email protected]>

yury-fedotov · 2024-05-13T21:04:09Z

Rendered version: https://kedro--3852.org.readthedocs.build/en/3852/kedro_project_setup/code_beyond_starter_files.html

yury-fedotov · 2024-05-13T21:05:25Z

@noklam, @astrojuanlu, could you please have a look?

merelcht

Thank you so much for this contribution @yury-fedotov! This is already looking great ⭐
I left some comments, mostly small nits.

docs/source/kedro_project_setup/code_beyond_starter_files.md

Comment on customizations Co-authored-by: Merel Theisen <[email protected]> Signed-off-by: Yury Fedotov <[email protected]>

For example in splitting Co-authored-by: Merel Theisen <[email protected]> Signed-off-by: Yury Fedotov <[email protected]>

Make provided examples title not bold Co-authored-by: Merel Theisen <[email protected]> Signed-off-by: Yury Fedotov <[email protected]>

And as where appropriate Co-authored-by: Merel Theisen <[email protected]> Signed-off-by: Yury Fedotov <[email protected]>

Organise spelling Co-authored-by: Merel Theisen <[email protected]> Signed-off-by: Yury Fedotov <[email protected]>

Proper wording for pip install note Co-authored-by: Merel Theisen <[email protected]> Signed-off-by: Yury Fedotov <[email protected]>

Implies to consists replacement Co-authored-by: Merel Theisen <[email protected]> Signed-off-by: Yury Fedotov <[email protected]>

Signed-off-by: Yury Fedotov <[email protected]> # Conflicts: # RELEASE.md

Signed-off-by: Yury Fedotov <[email protected]>

yury-fedotov · 2024-05-17T01:23:57Z

Hey @merelcht - thanks for comments again, I just finished implementing them.

QQ also about the vale usage in CI. I noticed that it produces some warnings on this PR, but apparently on many other existing MD files. So it's like a soft CI check?

merelcht · 2024-05-17T07:53:35Z

QQ also about the vale usage in CI. I noticed that it produces some warnings on this PR, but apparently on many other existing MD files. So it's like a soft CI check?

Yes, it's a soft check. I usually treat it as a guidance on grammar/word usage. It does also check for spelling mistakes, which is helpful. But you definitely don't need to address it all.

noklam

First of all, thank you for the contribution!

It would be great if @stichbury can help review the style of the docs a little bit, otherwise have you seen https://github.com/kedro-org/kedro/wiki/Kedro-documentation-style-guide already?

I like a lot of the recommendation here, but I hesitate to make it an official recommendation without the team discussing a bit more. Should we extract the content and make it a guest blog post instead? @astrojuanlu

noklam · 2024-05-20T10:17:35Z

docs/source/kedro_project_setup/code_beyond_starter_files.md

+read and collaborate on as your codebase grows.
+Those files also sometimes make new users think that Kedro requires code
+to be located only in those starter files, which is not true.


I like the clarification, but I found this sounds like coming from an user rather than the official docs.

#2512 (comment), we have an answer for this and the issue is still opened. Would it be better to actually write the documentation and link here instead?

Can you elaborate more on what exact change are you proposing here?
To reference this GH issue right in the code_beyond_starter_files.md?
Or to make content of comments on that issue part of this new section of docs?

hmm... I am not the best person to ask for English but I'll try my best🤓

While those may be sufficient for a small project, they quickly become large, hard to read and collaborate on as your codebase grows. Those files also sometimes make new users think that Kedro requires code to be located only in those starter files, which is not true.

I'd rephrase it to something like "When project become large, it may be beneficial to adopt a different structure or give pipeline files a more specific names." It is by convention (and find_pipeline use that convention), but not mandatory to name files pipeline.py

@noklam Your examples is specifically around pipeline files, while I wanted to convey 2 things here:

As project evolves and nodes.py / pipeline.py files grow, they become challenging to manage.

Good news is that you can rename and restructure them.

Do you disagree with those? I think the first one is just a fact based on how git diff works - if you have one big file, you're more likely to have merge conflicts, etc. And the second one, I wanted to cover not only the pipeline, but also node files there.

docs/source/kedro_project_setup/code_beyond_starter_files.md

noklam · 2024-05-20T10:22:35Z

docs/source/kedro_project_setup/code_beyond_starter_files.md

+```
+
+This being the only constraint means that you can, for example:
+* Add `utils.py` file to a pipeline folder and import utilities used by multiple


We usually avoid utils.py as much as possible as they are the bin for everything. It's ironic because kedro do have utils module that are left from years ago. It hasn't been growing though as we believe it's better to have explicit module.

@noklam how would you then call a .py file that is not nodes.py or pipeline.py for the purpose of this example? I was thinking of dataframe_utils.py, but didn't like it because it adds to the impression that Kedro is only useful in data processing projects, which isn't true.

There is many thing that I don't like about utils (just me, not a team consensus). The purpose is ill-defined and often become the place that people dump code to without thinking.

Even if we go with utils. I will strip the _utils suffix, it feels redundant to have pandas_utils.py under utils.py. Then in the code I will probably do from <pacakge> import utils. When I need to use it, I will use utils.pandas.func to make it clear that this is a util namespace but not pandas.

visualitization_utils.py could just be visualisation module itself.
(all above are subjective)

I think it will be great to first introduce the principle of share module, what are the factors to consider. Then you can show this example. https://kedro-org.slack.com/archives/C03RKP2LW64/p1716912123397259
@datajoely do you have thought about this?

Comments re: naming utils modules and importing them - addressed ✔️

Re: introducing the principles, the hyperlink doesn't work for me, leads just to the questions channel.

noklam · 2024-05-20T10:23:33Z

docs/source/kedro_project_setup/code_beyond_starter_files.md

+This being the only constraint means that you can, for example:
+* Add `utils.py` file to a pipeline folder and import utilities used by multiple
+  functions in `nodes.py` from there.
+* [Share modules between pipelines](#sharing-modules-between-pipelines).


This would be really beneficial if we have something to showcase, maybe there are some projects in awesome-kedro that we can link to?

Even just the tree structure of a project would be great.

Isn't that exactly what I'm adding there?

The example is good, but I think we can link https://github.com/kedro-org/awesome-kedro/blob/master/README.md#example-projects to direct people for more examples. It's also helpful to see real code

I manually went through all Kedro projects listed there, I don't think there's any one that implements shared things between pipeline in a way that would be a good example to follow here. The closest one is this one: https://github.com/pablovdcf/TFM_HADO_Cares/tree/main/hado/src/hado

docs/source/kedro_project_setup/code_beyond_starter_files.md

noklam · 2024-05-20T10:28:29Z

docs/source/kedro_project_setup/code_beyond_starter_files.md

+* Instead of having a single `pipeline.py` in your pipeline folder, split it, for example,
+  into `historical_pipeline.py` and `inference_pipeline.py`.


I feel like we usually go for the namespace/modular structure more:

pipelines

historical

inference

pipeline.py

@astrojuanlu thoughts?

I agree that historical vs inference sound like candidates for leveraging same pipeline just different namespace. To avoid the confusion, I changed the example to this:

If you have multiple large `Pipeline` objects defined in a single `pipeline.py`, split them into separate `.py` files. For example, in `data_processing` pipeline you may want to have `cleaning_pipeline.py` and `merging_pipeline.py`.

Is that better?

sounds good!

docs/source/kedro_project_setup/code_beyond_starter_files.md

noklam · 2024-05-20T10:29:40Z

docs/source/kedro_project_setup/code_beyond_starter_files.md

+  into `historical_pipeline.py` and `inference_pipeline.py`.
+* Instead of registering many pipelines in `register_pipelines()` function one by one,
+  create a few `tp.Dict[str, Pipeline]` objects in different places of the project
+  and then make `register_pipelines()` return a union of those.


what about find_pipelines()?

Added:

While Kedro features a [`find_pipelines()` functionality for autodiscovery of pipelines](../nodes_and_pipelines/pipeline_registry.md#pipeline-autodiscovery), for large projects you may want a finer control and register pipelines manually.

My questions are:

is historical_pipeline.py and inference_pipelines.py better than pipelines/historical/pipeline.py (the modular pipeline structure) that Kedro usually promotes?

I think this is a viable alternative, but if this is in docs instead of a blog. I'll probably change the narrative to: There is an alternative to register pipeline manually, explaining the pipeline.py is just a convention and for find_pipeline works automatically. User still have the option to register manually if desired.

@astrojuanlu thought?

stichbury · 2024-05-20T12:24:27Z

I am happy to review/edit this if you decide to retain as docs, but please let me know what you decide about this:

I like a lot of the recommendation here, but I hesitate to make it an official recommendation without the team discussing a bit more. Should we extract the content and make it a guest blog post instead? @astrojuanlu

LMK if you want to have a blog post instead, which sounds like a great idea, and then you could extract from the post if you wanted a longer-term piece of content that you maintain officially as docs. I wouldn't be able to craft the post without some notice and a ticket, but would definitely support the idea and help with later stages to edit and post, if you need it.

Just ping me again when you decide how to go forward!

merelcht · 2024-05-20T14:01:02Z

Hi @stichbury, we've discussed this in backlog grooming and decided we'd like to get it in as docs. I really appreciate any help you can give getting this in shape 🙏

astrojuanlu

Thanks a lot for this contribution @yury-fedotov ! I left a couple of comments of outstanding things I saw.

I agree it would be nice to have this as a "How-To" guide (pretty much aligned with the https://diataxis.fr/ taxonomy and other parts of our docs). I would have it maybe under "Advanced usage", somewhere, although just for continuity, not really a fan of the "Advanced" name there.

docs/source/kedro_project_setup/code_beyond_starter_files.md

stichbury · 2024-05-20T16:31:55Z

Hi @stichbury, we've discussed this in backlog grooming and decided we'd like to get it in as docs. I really appreciate any help you can give getting this in shape 🙏

For sure! I'm not sure if I'll be able to get to this in this sprint, but I'll add to my list; otherwise it'll be next week. To save holding you up, probably best to get it into the state where you want me to do a final review/update and let me know it is ready.

yury-fedotov · 2024-05-24T00:16:59Z

Thanks for the comments team! I'll get to them next week and come back.

Signed-off-by: Yury Fedotov <[email protected]> # Conflicts: # RELEASE.md

Signed-off-by: Yury Fedotov <[email protected]>

yury-fedotov · 2024-06-03T16:18:23Z

Thanks for review @datajoely !

I agree with what you said. I think the current content reflects that I had a different user problem in mind while writing this, and that caused the content to diverge from what you would've expected for the problem definition you mentioned.

Here's the difference.

You definition (copy pasting)

Business logic grows in complexity and should live and be tested outside of your core Kedro project.
You should also be aspiring to achieve the general principle of loose coupling, high cohesion
Python paths and packaging in general is a pain for many novice programmers.

Those problems are undoubtedly relevant IMO, I agree.

What I was trying to cover

Beginner users are sometimes unsure where to put the code as the project grows. Can it only be in nodes.py and pipeline.py? Would it work if I create a pipeline directory manually instead of kedro pipeline create? We indeed know answers to those, but the guide was trying to give those to people who are very new to Kedro.
It can also be challenging to grasp how would Kedro work if project root != git repo root.

It is actually an open question if problems I refer to above are relevant, but as I understood from 3418, they are.

Proposed next steps

I'm totally happy to adjust the content as much as needed, including ground up rewrite if it would make it more useful. But would appreciate a coordinated maintainers opinion on what should this content be.

@datajoely , @merelcht , @astrojuanlu , @noklam do you mind discussing internally (or I am happy to join too) how does the core team see the resolution of 3418 and let me know?

docs/source/kedro_project_setup/code_beyond_starter_files.md

Signed-off-by: Merel Theisen <[email protected]>

merelcht · 2024-07-09T09:42:36Z

While this doesn't entirely close #3418, I do think it's a valuable addition to our docs and I would very much like this to go in. @noklam Could you have another look and comment on what you think still needs to change?

yury-fedotov · 2024-07-11T03:10:32Z

Lol, I wanted to resolve merge conflict but hit very weird pre-commit issue, see here: #4004

I closed it though as not sure why that happens, maybe specific to my computer.

Resolved through GitHub UI since it was a one line thing in RELEASE.md

Signed-off-by: Yury Fedotov <[email protected]>

docs/source/kedro_project_setup/code_beyond_starter_files.md

DimedS

Thanks, @yury-fedotov! I really like this addition to the docs, and the Advanced usage section is the perfect spot for it. I appreciate you sharing these best practices for using Kedro, just as you did during the namespaces docs revision. It's important we continue to agree on these practices and share them with our users.

Signed-off-by: Yury Fedotov <[email protected]>

yury-fedotov · 2024-09-07T03:29:36Z

Hey team, I've addressed recent comments. Could you have another look please?

I also went through the Language Linter for Kedro Docs error logs, and not sure how to proceed, since it does propose a few changes to this file, but it does that for most of the docs files in general. I understood it's used as a soft check?

astrojuanlu

Hi @yury-fedotov , first of all thanks for your patience. It has taken me a while to give this a proper review because, although there's nothing wrong in what you've written, we still are not fully happy with how it fits with the rest of the documentation.

We highly appreciate this and your other contributions 🙏🏼

Roughly speaking, the prose you've written describes 4 things:

Where does Kedro look for code to be located. This is an essential concept actually, and it should have a dedicated page, for example under the "Kedro projects" header, where we currently have other explanations (using the Diátaxis terminology). In fact, this is closely related to Document workflow to incrementally create a minimal Kedro project from scratch #2512, which we never addressed. Therefore, my recommendation would be to open a separate PR just for that document ↪
Alternative ways of organising pipelines. I think that's more or less covered in https://docs.kedro.org/en/stable/nodes_and_pipelines/pipeline_registry.html already, and I find it weird to have another set of recommendations or possible patterns here. Therefore my recommendation is to drop this ♻️
Sharing modules between pipelines. It's very difficult for me to judge how appropriate this is for the Kedro documentation, because as much as we want to upskill data scientists, it reads like a shallow explanation of how Python packages and imports work. In addition, and this has been mentioned by others as well, it's unclear if we actually want to recommend a "utils" module, given that it's a bit of an antipattern ([1], [2]). This doesn't mean it's not useful - it means that we don't have a good place for this. There are many other fundamental Python concepts we don't explain. So my recommendation is to drop this as well ♻️
Kedro project in a monorepo setup. I think this gets into murky territory. I dumped some thoughts on monorepos here Thoughts on monorepos (dependency isolation) #4147 I think the word "monorepo" is very overloaded, there are different use cases, and Python tooling to manage them is quite immature. I'm not opposed to having recommendations about this though - but I think it should go to a "how to organize big Kedro projects", and a more thorough breakdown of the different things people might want. So my recommendation would be to open a new ticket and a separate PR for it ➕

datajoely · 2024-10-21T08:56:53Z

Shall we close this and instead schedule a session to work out Kedro's view on this topic?

astrojuanlu · 2024-10-22T11:53:02Z

In my review comment #3852 (review) I recommended a way forward for the 4 distinct blocks I observed in this PR.

@datajoely Do you think we can do something else to clarify next steps?

datajoely · 2024-10-22T12:10:50Z

No I think it's great - but I'd love to do a session where think through those topics, possible co-design with power users like @yury-fedotov . It's hard to separate philosophy from opinion / tooling preference (which ages badly)

Even something like uv may look like black (which was novel in 2017) does today in 5 years....

yury-fedotov · 2024-10-22T18:47:03Z

@datajoely thanks! I'm happy to participate in this session :)
For now, let me close this PR, and then when I get to it in the future, open new ones for separate items, as @astrojuanlu proposed.

yury-fedotov added 2 commits May 6, 2024 07:50

Add a tutorial on code beyond starter files

5f4330f

Signed-off-by: Yury Fedotov <[email protected]>

Mention change in RELEASE.md

036cbe0

Signed-off-by: Yury Fedotov <[email protected]>

yury-fedotov marked this pull request as ready for review May 6, 2024 06:13

yury-fedotov requested review from yetudada, astrojuanlu and merelcht as code owners May 6, 2024 06:13

yury-fedotov and others added 2 commits May 6, 2024 09:37

Address Vale comments on UK endings

02f0274

Signed-off-by: Yury Fedotov <[email protected]>

Merge branch 'main' into docs/code-beyond-starter-files

5875e2d

merelcht reviewed May 14, 2024

View reviewed changes

yury-fedotov and others added 9 commits May 16, 2024 19:31

Update docs/source/kedro_project_setup/code_beyond_starter_files.md

4d7a35a

Comment on customizations Co-authored-by: Merel Theisen <[email protected]> Signed-off-by: Yury Fedotov <[email protected]>

Update docs/source/kedro_project_setup/code_beyond_starter_files.md

ca00b72

For example in splitting Co-authored-by: Merel Theisen <[email protected]> Signed-off-by: Yury Fedotov <[email protected]>

Update docs/source/kedro_project_setup/code_beyond_starter_files.md

ff01643

Make provided examples title not bold Co-authored-by: Merel Theisen <[email protected]> Signed-off-by: Yury Fedotov <[email protected]>

Update docs/source/kedro_project_setup/code_beyond_starter_files.md

b858406

And as where appropriate Co-authored-by: Merel Theisen <[email protected]> Signed-off-by: Yury Fedotov <[email protected]>

Update docs/source/kedro_project_setup/code_beyond_starter_files.md

144783b

Organise spelling Co-authored-by: Merel Theisen <[email protected]> Signed-off-by: Yury Fedotov <[email protected]>

Update docs/source/kedro_project_setup/code_beyond_starter_files.md

cf3a65f

Proper wording for pip install note Co-authored-by: Merel Theisen <[email protected]> Signed-off-by: Yury Fedotov <[email protected]>

Update docs/source/kedro_project_setup/code_beyond_starter_files.md

729dc1a

Implies to consists replacement Co-authored-by: Merel Theisen <[email protected]> Signed-off-by: Yury Fedotov <[email protected]>

Merge branch 'refs/heads/main' into docs/code-beyond-starter-files

2d22bbf

Signed-off-by: Yury Fedotov <[email protected]> # Conflicts: # RELEASE.md

Link deepdives to a list of examples

a58ff32

Signed-off-by: Yury Fedotov <[email protected]>

noklam reviewed May 20, 2024

View reviewed changes

astrojuanlu reviewed May 20, 2024

View reviewed changes

docs/source/kedro_project_setup/code_beyond_starter_files.md Outdated Show resolved Hide resolved

docs/source/kedro_project_setup/code_beyond_starter_files.md Outdated Show resolved Hide resolved

yury-fedotov added 2 commits May 25, 2024 12:29

Merge branch 'refs/heads/main' into docs/code-beyond-starter-files

2ed9639

Signed-off-by: Yury Fedotov <[email protected]> # Conflicts: # RELEASE.md

Revert weird MLflow release note edit

325489d

Signed-off-by: Yury Fedotov <[email protected]>

merelcht mentioned this pull request Jul 8, 2024

Close all (or as many as possible) open PRs! #3996

Closed

9 tasks

merelcht reviewed Jul 9, 2024

View reviewed changes

merelcht added 2 commits July 9, 2024 10:36

Apply suggestions from code review

6ee9ad2

Signed-off-by: Merel Theisen <[email protected]>

Merge branch 'main' into docs/code-beyond-starter-files

8e811cf

Signed-off-by: Merel Theisen <[email protected]>

merelcht requested review from noklam and removed request for yetudada July 9, 2024 09:42

yury-fedotov added 2 commits July 10, 2024 23:11

Merge branch 'main' into docs/code-beyond-starter-files

fd96088

Signed-off-by: Yury Fedotov <[email protected]>

Merge branch 'main' into docs/code-beyond-starter-files

4310cf6

merelcht assigned astrojuanlu, noklam and merelcht Jul 22, 2024

yury-fedotov and others added 2 commits July 22, 2024 18:38

Merge branch 'main' into docs/code-beyond-starter-files

0e3dcf7

Merge branch 'main' into docs/code-beyond-starter-files

fe19a45

noklam reviewed Aug 22, 2024

View reviewed changes

docs/source/kedro_project_setup/code_beyond_starter_files.md Show resolved Hide resolved

DimedS approved these changes Aug 22, 2024

View reviewed changes

yury-fedotov and others added 3 commits August 27, 2024 22:42

Merge branch 'main' into docs/code-beyond-starter-files

c20e741

Merge branch 'kedro-org:main' into docs/code-beyond-starter-files

a508fe0

Implement Nok's comments re: utility functions

52ed71e

Signed-off-by: Yury Fedotov <[email protected]>

astrojuanlu reviewed Sep 12, 2024

View reviewed changes

astrojuanlu removed their assignment Sep 12, 2024

yury-fedotov closed this Oct 22, 2024

astrojuanlu mentioned this pull request Nov 8, 2024

Describe how to use custom code with Kedro #3418

Open

		* Instead of having a single `pipeline.py` in your pipeline folder, split it, for example,
		into `historical_pipeline.py` and `inference_pipeline.py`.

Create a docs page about adding code beyond starter files #3852

Create a docs page about adding code beyond starter files #3852

Conversation

yury-fedotov commented May 6, 2024 • edited Loading

Description

Development notes

Developer Certificate of Origin

Checklist

yury-fedotov commented May 13, 2024

yury-fedotov commented May 13, 2024

merelcht left a comment

Choose a reason for hiding this comment

yury-fedotov commented May 17, 2024

merelcht commented May 17, 2024

noklam left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

stichbury commented May 20, 2024

merelcht commented May 20, 2024

astrojuanlu left a comment

Choose a reason for hiding this comment

stichbury commented May 20, 2024

yury-fedotov commented May 24, 2024

yury-fedotov commented Jun 3, 2024

You definition (copy pasting)

What I was trying to cover

Proposed next steps

merelcht commented Jul 9, 2024

yury-fedotov commented Jul 11, 2024 • edited Loading

DimedS left a comment

Choose a reason for hiding this comment

yury-fedotov commented Sep 7, 2024

astrojuanlu left a comment

Choose a reason for hiding this comment

datajoely commented Oct 21, 2024

astrojuanlu commented Oct 22, 2024

datajoely commented Oct 22, 2024

yury-fedotov commented Oct 22, 2024

yury-fedotov commented May 6, 2024 •

edited

Loading

yury-fedotov commented Jul 11, 2024 •

edited

Loading