Poetry Support for Kedro Projects #1722

Closed
MrDataPsycho opened this issue Jul 22, 2022 · 35 comments
Labels
Community Issue/PR opened by the open-source community Issue: Feature Request New feature or improvement to existing feature

Comments

@MrDataPsycho

MrDataPsycho commented Jul 22, 2022

Description

The way Kedro initiates a new project and creates the folder structure does not go well with Poetry. Usually I would create a Poetry environment before doing anything else and then install all my required packages one by one. After I created a Poetry environment and added the kedro package, the pyproject.toml looks as follows:

poetry new --src KedroPoetry

[tool.poetry]
name = "KedroPoetry"
version = "0.1.0"
description = ""
authors = ["Your Name <[email protected]>"]

[tool.poetry.dependencies]
python = "^3.8"
kedro = {version = "~0.18.2", python = ">=3.8,<3.11"}

[tool.poetry.dev-dependencies]
pytest = "^7.1"

... (more lines)

Let's run the demo pytest to see if everything works.

poetry run pytest .

This goes well:

KedroPoetry|⇒ poetry run pytest .
platform darwin -- Python 3.8.12, pytest-7.1.2, pluggy-1.0.0
rootdir: /Users/ALAMSHC/PythonProjects/KedroPoetry
collected 1 item                                                                                                       

tests/test_kedropoetry.py .                                                                                      [100%]

Now it's time to add a Kedro project: kedro new

The command completely ignored the current pyproject.toml file. And although there is a src folder, it did not add the project inside it; instead it created a new directory at the root, outside of src. Now there is no Kedro setup section in the root pyproject.toml, so the kedro CLI will complain about a broken setup.

Context

As Poetry provides one of the modern approaches to packaging Python projects, it would be good to have direct support in Kedro for a Poetry-like project structure, or at least a hackable way out.

Possible Implementation

There could be a new CLI flag to initialise a Kedro project when there is already a pyproject.toml file and a project set up for Poetry.

@MrDataPsycho MrDataPsycho added the Issue: Feature Request New feature or improvement to existing feature label Jul 22, 2022
@MrDataPsycho MrDataPsycho changed the title Poetry Support for Kedro Project and Kedro Viz Poetry Support for Kedro Projects Jul 22, 2022
@noklam
Contributor

noklam commented Jul 22, 2022

Thanks for raising the issue. It would be great if you could provide some kind of tree output to show the folder structure. I haven't used Poetry myself. Even better if you can create a demo GitHub repository so I can clone it and play around with it.

In general, a "Kedro project" is itself the top-level directory. Do you currently have a workaround?

Let's say your new project is called new_project

  1. kedro new # STDIN = new_project
  2. copy everything out from the directory 1 level-up (I imagine you have to manually merge the pyproject.toml as well)

@MrDataPsycho
Author

Sure, on my way. Will share a git repo soon.

@MrDataPsycho
Author

MrDataPsycho commented Jul 22, 2022

@noklam please find the demo project in the following git repo. The code generation steps and some expected ideas are given in the README file in the repository.

@deepyaman
Member

The way Kedro initiates a new project and creates the folder structure does not go well with Poetry. Usually I would create a Poetry environment before doing anything else and then install all my required packages one by one.

@DataPsycho poetry new creates a Poetry project template, whereas kedro new creates a Kedro project template (which, by default, is pip-based). However, Kedro also provides the ability to use other templates, through Kedro starters. I think it would make sense to create a Poetry starter for Kedro, if you want kedro new to play optimally with Poetry. You wouldn't use the poetry new command in that case, but you'd get a Poetry-compatible project.

@noklam
Contributor

noklam commented Jul 25, 2022

First of all, thank you @DataPsycho for writing a very detailed README which is easy to follow. I think @deepyaman's approach is preferable.

(kedropoetry-9Q6y5a-v-py3.8) datapsycho@dataops:~/.../KedroPoetry$ tree . -L 1
.
├── poetry.lock
├── pyproject.toml
├── README.md
├── sample-project
├── src
└── tests

With the structure that you provided, basically you need to copy everything inside sample-project up into the project root. I think you will have clashes on README.md and pyproject.toml.

Here are the potential alternatives I am thinking about:

Starting from a fresh project with kedro new --starter=poetry

It should just work, no extra file deletion needed

Starting from an existing poetry project

kedro new --starter=poetry  # assume project named sample-project
kedro run  # Should just work out of the box

Then you will need to copy the files out of sample-project into the project root.

You will still have clashes on these files, since I think it's not straightforward to auto-resolve/merge them:

❌ README.md
❌ pyproject.toml  # library dependencies etc.
✅ There will be no extra setup.py to delete
✅ There will be no requirements.txt to delete

So you will save 2 delete operations with this workflow, but you still have to deal with resolving the dependencies. It would be easier to just start with a Poetry-compatible project from the start. Thoughts?

@MrDataPsycho
Author

Hi, there are more files to move around.

(kedropoetry-9Q6y5a-v-py3.8) datapsycho@dataops:~/.../KedroPoetry$ tree . -L 1
.
├── poetry.lock
├── pyproject.toml
├── README.md
├── sample-project
├── src
└── tests

After I get that structure, I have to move the following:

  • Copy the contents of sample-project > pyproject.toml into the root pyproject.toml
  • Copy conf, data, notebooks, docs, logs into the project root
  • Copy sample-project > src > sample_project into src > sample_project
  • Copy sample-project > src > tests into src > tests
  • Install all the required packages and delete the sample-project directory
    Now the Kedro CLI and Poetry are in harmony.

I always have to start with Poetry first. Using Poetry, I add kedro as a package to the project's virtual environment, and then I am able to use kedro. The reverse is not what a Poetry user would do: create a venv, install poetry in it and activate it, then create a new project with kedro, go inside the project and initialise Poetry in the repo, which will create another new Poetry venv. Now the old venv has no use.
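For anyone who would rather script the move above than do it by hand, here is a minimal sketch in Python. It is not part of Kedro or Poetry: `lift_kedro_scaffold` is a hypothetical helper name, and it only handles top-level entries. Anything that already exists in the project root (README.md, pyproject.toml and src in the layout above) is left in place and reported, so it still has to be merged manually.

```python
import shutil
from pathlib import Path
from typing import List


def lift_kedro_scaffold(scaffold: Path, root: Path) -> List[str]:
    """Move top-level entries of `scaffold` into `root`.

    Entries whose name already exists in `root` are NOT overwritten;
    their names are returned so they can be merged by hand.
    """
    conflicts = []
    for entry in sorted(scaffold.iterdir()):
        target = root / entry.name
        if target.exists():
            # e.g. README.md, pyproject.toml, src: needs a manual merge
            conflicts.append(entry.name)
            continue
        shutil.move(str(entry), str(target))
    return conflicts


# Hypothetical usage, run from the Poetry project root:
# leftover = lift_kedro_scaffold(Path("sample-project"), Path("."))
# print("merge these by hand:", leftover)
```

The helper deliberately does not delete sample-project afterwards, so nothing is lost if a merge goes wrong.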

@noklam
Contributor

noklam commented Jul 26, 2022

@DataPsycho Is that any different from just going inside sample-project, selecting and cutting everything, and pasting it 1 level up?

I don't quite understand why it necessarily creates another Poetry environment. I may test it out tomorrow.

@MrDataPsycho
Author

MrDataPsycho commented Jul 27, 2022

It's fine to do the copy-pasting. But Kedro is a package with a CLI, whereas Poetry is an environment and package management system. How do I use Kedro from the start without Poetry or Pipenv?

  • Create a virtualenv
  • Install Kedro with pip
  • Init a project with Kedro
  • cd into the project and install the packages in src/requirements.txt

If I install Kedro in the base python image:

  • Using the base Kedro, init a project
  • Create a virtual environment for the project
  • Install all the packages from src/requirements.txt which kedro generates

But now I am locked to that Kedro version and cannot move between versions. I have to create all my projects with the same Kedro version. So this is a no-go for me.

If I want to use Poetry:

  • I have to create a Poetry project with poetry new --src myproject
  • cd into the project and add kedro as a package dependency with poetry add kedro. Poetry will create a virtualenv for that particular project while installing Kedro. Pipenv will actually do the same.
  • Now I have to initialise the Kedro project, then copy-paste and restructure things manually as explained above

To be able to use Kedro, I first have to create a virtual environment and install Kedro in it. But Poetry is responsible for creating a virtual environment and adding Kedro to it. So if I want to use Kedro to initialise a project, I need a virtualenv with Kedro; but after initialising the project, when I cd into it and run poetry init, Poetry will want to create another environment, and the previous virtual environment that was used to initialise the Kedro project will have no use. That way I have to create 2 environments and delete the first one if I want to use Poetry.

@noklam
Contributor

noklam commented Jul 27, 2022

@DataPsycho I agree this is not the smoothest experience.

I just want to mention that your kedro version doesn't necessarily tie to your new project version. By default if you have 0.18.1, it will generate a 0.18.1 Kedro template, but you can override that default if necessary. So this may be a workaround if you need to create new Kedro projects frequently.

--checkout TEXT An optional tag, branch or commit to checkout in the
starter repository.

@MrDataPsycho
Author

Ok. Then we can close the feature request, I guess. Thanks for your support and the time you have spent. @deepyaman's idea was great. I will see if I have time to create a new starter for a Poetry-like project structure. For now we can close it. I will close it by tomorrow if you have nothing to add. Thanks

@noklam
Contributor

noklam commented Jul 27, 2022

@DataPsycho
This example that we have in our tests is a good starting point.
https://github.com/kedro-org/kedro/blob/main/features/steps/test_plugin/plugin.py

You can find more info on how to extend it with kedro new --starter=custom_starter in this link.
https://kedro.readthedocs.io/en/0.18.2/extend_kedro/plugins.html?highlight=kedrostarterspec#extend-starter-aliases

@MrDataPsycho
Author

A new starter might be added for poetry/PIPENV.

@antonymilne
Contributor

antonymilne commented Aug 1, 2022

I'm reopening this because I think it's a very good topic and I'd be interested in hearing from other users about it 🙂 It's been mentioned several times before by different people but we've never had thoughts collected together in one place, so let's start doing that here! In the past we've also wondered whether we should switch to using poetry. Currently we support a pip-compile workflow but we're planning to remove that in favour of just a plain requirements.txt file. Given #1724, it might be time to re-assess what system we use exactly.

Some previous related issues (there's probably others too):
#398
#391

From these and other conversations I know the following users have independently shown interest in kedro + poetry. There's also been interest within QB, though I'm not sure exactly who. So I definitely think there's some significant interest in this. @datajoely do you know anyone else here?
@fkromer @danhje @Kastakin @Larkinnjm1 @shaunc

@antonymilne antonymilne reopened this Aug 1, 2022
@antonymilne antonymilne added the Community Issue/PR opened by the open-source community label Aug 1, 2022
@datajoely
Contributor

Carlos Bareto, but I don't know his GitHub handle

@arnaldog12
Contributor

TBH, I like the idea of adding support for Poetry in Kedro projects. I think the main advantages of Poetry are:

  • it's a widely used package manager (more than pip-compile at least)
  • it eliminates the need for setup.py
  • it provides a better way to manage project/dev requirements.

@brendalf

brendalf commented Aug 3, 2022

I agree with @arnaldog12. Also, I integrated my current project with Poetry. If you want, I can share that as a poetry starter.

@MrDataPsycho
Author

Much appreciated. Happy to share whatever I learned while trying to develop the starter template.

@deepyaman
Member

One note for posterity on using Poetry with Kedro projects--there was a fix that's especially relevant to Kedro projects added in Poetry 1.2.0b3. Before this, you need to make sure to define any extras like pandas.csvdataset in all lowercase.

@cupdike

cupdike commented Aug 23, 2022

So if I'm new to poetry and new to kedro, and I've installed poetry 1.2.0rc1, what's the best way to proceed at this point?

@MrDataPsycho
Author

So if I'm new to poetry and new to kedro, and I've installed poetry 1.2.0rc1, what's the best way to proceed at this point?

If there is still no better way, follow the steps I described above in this thread to make Kedro compatible with Poetry.

@eliorc

eliorc commented Oct 18, 2022

I've read both this and the closed issue, and I haven't found any mention of the relationship between kedro run and poetry run. You might assume that the user always executes poetry shell, but in truth the correct way to execute things within a poetry environment is using poetry run.
So assuming I have a starter which is both kedro and poetry compliant, do we expect to use poetry run kedro run...?

@schoennenbeck

I've read both this and the closed issue, and I haven't found any mention of the relationship between kedro run and poetry run. You might assume that the user always executes poetry shell, but in truth the correct way to execute things within a poetry environment is using poetry run. So assuming I have a starter which is both kedro and poetry compliant, do we expect to use poetry run kedro run...?

kedro provides an entrypoint for your project in the __main__.py file. So if you add the following to your pyproject.toml

[tool.poetry.scripts]
my_project = "my_project.__main__:main"

you can run poetry run my_project -p ... which feels pretty natural.

@astrojuanlu
Member

For folks subscribed to this old issue: we're (1) modernizing the way Kedro projects are structured, to make them look more similar to normal Python libraries https://github.com/kedro-org/kedro/milestone/36 and (2) looking into ways to initialize a Kedro project in an existing directory #2512.

Our idea though is to favor PEP 621 compliant pyproject.toml files, which are not yet supported by Poetry python-poetry/poetry#3332 so it will still take us some time to get there. The good news is that we would be very close to actual support, and maybe by that time Poetry will be soft-compatible with PEP 621 already.

@astrojuanlu
Member

Today I found a project that uses Poetry + Kedro: https://github.com/madziejm/project-fontr

People subscribed to this issue, could you have a look and let us know what else can we do to better support this use case? Otherwise I'm voting to close the issue.

@ac-willeke

ac-willeke commented Sep 15, 2023

Hi! Great that you are working on it and hopefully poetry moves to PEP 621 soon :)

For people wondering how you can use a conda-poetry-kedro setup for now, I use it as follows:

  1. set up conda and install poetry and kedro
    conda create --name myenv
    conda activate myenv
    conda install -c conda-forge poetry
    pip install kedro

  2. create new kedro project
    kedro new

  3. init poetry env (within the activated conda env)
    poetry init

  4. install the "kedro" dependencies from src\requirements.txt in the conda env

	poetry add "black~=22.0"
	poetry add "flake8>=3.7.9,<5.0"
	poetry add "ipython>=7.31.1, <8.0; python_version < '3.8'"
	poetry add "ipython~=8.10; python_version >= '3.8'"
	poetry add "isort~=5.0"
	poetry add "jupyter~=1.0"
	poetry add "jupyterlab~=3.0"
	poetry add "kedro~=0.18.13"
	poetry add "kedro-datasets[pandas.CSVDataSet, pandas.ExcelDataSet, pandas.ParquetDataSet]~=1.0"
	poetry add "kedro-telemetry~=0.2.0"
	poetry add "kedro-viz~=6.0"
	poetry add "nbstripout~=0.4"
	poetry add "pytest-cov~=3.0"
	poetry add "pytest-mock>=1.7.1, <2.0"
	poetry add "pytest~=7.2"
	poetry add "scikit-learn~=1.0"

I use poetry add to ensure that all dependencies are stored in the pyproject.toml, but you could also install them directly using poetry run pip install -r src/requirements.txt. In that case they are not registered in the pyproject.toml but are installed in your env.

  5. Delete redundant files src\pyproject.toml and src\requirements.txt.

  6. Run kedro in conda-poetry with kedro run

You should end up with a pyproject.toml looking like this (see .txt), which you can then use in the future to init your poetry env directly using poetry install --no-root.
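Typing the poetry add lines by hand gets tedious, so here is a rough sketch of how they could be generated from the requirements file instead. This is an assumption-laden helper, not something Kedro or Poetry ships: `poetry_add_commands` is a made-up name, it only skips blank lines, comment lines and pip option lines such as `-r other.txt`, and it drops exact duplicate requirements.

```python
import shlex
from typing import Iterable, List


def poetry_add_commands(requirement_lines: Iterable[str]) -> List[str]:
    """Turn requirements.txt lines into `poetry add ...` shell commands."""
    commands = []
    seen = set()
    for raw in requirement_lines:
        line = raw.strip()
        # Skip blanks, comments and pip options (-r, -e, --index-url, ...)
        if not line or line.startswith("#") or line.startswith("-"):
            continue
        if line in seen:
            continue
        seen.add(line)
        # shlex.quote keeps specifiers like "~=" and markers safe for the shell
        commands.append("poetry add " + shlex.quote(line))
    return commands


# Hypothetical usage:
# with open("src/requirements.txt") as fh:
#     print("\n".join(poetry_add_commands(fh)))
```

The output is only printed, not executed, so you can review the commands before running them.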

pyproject.txt

@hamiddimyati

Hi @ac-willeke

Thanks for the tips. But doesn't using conda along with poetry seem redundant? Both are used for virtual environment and package management.

@Krabsenm

Hi @ac-willeke

Thanks for the tips. But doesn't using conda along with poetry seem redundant? Both are used for virtual environment and package management.

Many use conda to pin the Python version within the virtual env; another option is pyenv.

@ac-willeke

Hi!

Yes, I agree conda/poetry is redundant. I used to combine the two in projects with libraries that are not easily installed using poetry. For example, python bindings for gdal (dependent on C++) are not that easy to install if you don't have admin rights. So then I would start my project with conda, install gdal, install all other packages using poetry (as I like the clean structure of poetry).

But I recently moved to gdal images from docker, so then you can use solely poetry as a package manager :)

So maybe my example above with the conda/poetry env was not the best, sorry for that!

@astrojuanlu
Member

astrojuanlu commented Dec 4, 2023

Did some experiments today and I confirm Kedro supports Poetry. Or Poetry supports Kedro, depending on how you want to look at it.

Starting point:

❯ tree
.
├── README.md
├── pyproject.toml
└── src
    └── test_poetry
        └── __init__.py

3 directories, 3 files
❯ cat pyproject.toml 
[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"

[tool.poetry]
name = "test-poetry"
version = "0.1.0"
description = ""
authors = ["Juan Luis Cano Rodríguez <[email protected]>"]
readme = "README.md"

[tool.poetry.dependencies]
python = "^3.10"

Then added the necessary files (for example using https://github.com/astrojuanlu/kedro-init):

❯ kedro-init .
[00:05:14] Looking for existing package directories                                                                            cli.py:25
[00:05:20] Initialising config directories                                                                                     cli.py:25
           Creating modules                                                                                                    cli.py:25
           🔶 Kedro project successfully initialised!                                                                          cli.py:26
❯ tree
.
├── README.md
├── conf
│   ├── base
│   └── local
├── pyproject.toml
└── src
    └── test_poetry
        ├── __init__.py
        ├── pipeline_registry.py
        └── settings.py

6 directories, 5 files
❯ git diff
diff --git a/pyproject.toml b/pyproject.toml
index 26ac21c..cdcbbd4 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -11,3 +11,8 @@ readme = "README.md"
 
 [tool.poetry.dependencies]
 python = "^3.10"
+
+[tool.kedro]
+project_name = "test-poetry"
+package_name = "test_poetry"
+kedro_init_version = "0.18.14"
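The diff above only adds a [tool.kedro] table, so that part can also be done without kedro-init. Below is a minimal sketch, assuming the three keys shown in the diff are all that is needed in pyproject.toml. `ensure_kedro_section` is a hypothetical helper, not part of Kedro or kedro-init, and it works on plain text rather than parsing TOML. It does not create conf/, settings.py or pipeline_registry.py.

```python
import re


def ensure_kedro_section(
    pyproject_text: str,
    project_name: str,
    package_name: str,
    kedro_init_version: str,
) -> str:
    """Append a [tool.kedro] table to pyproject.toml text if it is missing."""
    # Leave the file untouched when the table header is already present
    if re.search(r"^\[tool\.kedro\]\s*$", pyproject_text, flags=re.MULTILINE):
        return pyproject_text
    section = (
        "[tool.kedro]\n"
        f'project_name = "{project_name}"\n'
        f'package_name = "{package_name}"\n'
        f'kedro_init_version = "{kedro_init_version}"\n'
    )
    return pyproject_text.rstrip("\n") + "\n\n" + section


# Hypothetical usage:
# from pathlib import Path
# path = Path("pyproject.toml")
# path.write_text(ensure_kedro_section(path.read_text(), "test-poetry", "test_poetry", "0.18.14"))
```

Calling it twice is harmless, since the second call returns the text unchanged.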

Now everything works:

❯ kedro registry list
- __default__

❯ kedro pipeline create data_processing
Using pipeline template at: '/private/tmp/test-poetry/.venv/lib/python3.10/site-packages/kedro/templates/pipeline'
Creating the pipeline 'data_processing': OK
  Location: '/private/tmp/test-poetry/src/test_poetry/pipelines/data_processing'
Creating '/private/tmp/test-poetry/src/tests/pipelines/data_processing/__init__.py': OK
Creating '/private/tmp/test-poetry/src/tests/pipelines/data_processing/test_pipeline.py': OK
Creating '/private/tmp/test-poetry/conf/base/parameters_data_processing.yml': OK

Pipeline 'data_processing' was successfully created.
❯ tree | grep -v '\.pyc$'
.
├── README.md
├── conf
│   ├── base
│   │   └── parameters_data_processing.yml
│   └── local
├── pyproject.toml
└── src
    ├── test_poetry
    │   ├── __init__.py
    │   ├── __pycache__
    │   ├── pipeline_registry.py
    │   ├── pipelines
    │   │   └── data_processing
    │   │       ├── __init__.py
    │   │       ├── nodes.py
    │   │       └── pipeline.py
    │   └── settings.py
    └── tests
        └── pipelines
            └── data_processing
                ├── __init__.py
                └── test_pipeline.py

I don't think there's anything else we'll do for now. kedro new will likely keep using setuptools for the time being. Now that Kedro projects are mostly Python libraries, people can initialise them any way they want (poetry init, poetry new, pdm init, flit init), add some extra files and configs, and work normally.

I'm closing this issue, feel free to keep commenting if you disagree.

@GuiMarthe

Hey @astrojuanlu , I wasn't able to pip install the kedro-init package. Did this make it into a release within kedro?

@astrojuanlu
Member

Hello @GuiMarthe , kedro-init is an experiment I made. It is not officially maintained and I didn't publish it to PyPI, but you can install it from GitHub: https://github.com/astrojuanlu/kedro-init. It's a tool you only need once per project.

If there's traction and interest I will consider publishing it to PyPI. Voice your interest here or by opening an issue on https://github.com/astrojuanlu/kedro-init/issues

@ourownstory

@astrojuanlu Thank you for Kedro, and for your explanations on what is needed to use Poetry!

As I did want to use a Poetry managed virtual environment, and prefer not to depend on an experimental repo, I experimented myself and found an acceptable workflow, that I documented (for myself) in this little Guide to use Kedro with Poetry. I hope it may be helpful to others too. Please let me know if I missed anything and feel free to use/share this if you deem it useful.

Also, please let me know if you decide to properly support Poetry initialization.

@astrojuanlu
Member

Thanks for sharing @ourownstory ! Your writeup reminded me that kedro-init doesn't account for the new project tools of Kedro 0.19. I might give it another pass and publish a 0.1 version to PyPI :)

@ourownstory

Great, thank you! I hope this may lead to the eventual integration to the main package similar to OP's suggestion?

@astrojuanlu
Member

For that, let's continue the conversation in #681
