Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Option to Include Configuration (conf) in the Kedro Package for Source Code? #4316

Open
MinuraPunchihewa opened this issue Nov 10, 2024 · 9 comments
Labels
Issue: Feature Request New feature or improvement to existing feature

Comments

@MinuraPunchihewa
Copy link
Contributor

Description

At the moment, when packaging a Kedro project using kedro package, two artifacts are created; a .whl file containing the source code and a tar.gz file consisting of the configuration in conf directory.

I understand that it is (in general) a best practice, however, from a user's point of view, these are two artifacts to be maintained. In my CI/CD pipelines as well as in the mechanism for running the code, I need to account for them, which makes the process a little more complicated.

Moreover, when executing the code in a distributed environment like Databricks, performing file system operations such as unzipping the tar.gz file is not exactly straightforward. For instance, the file paths that would be used when dealing with an interactive cluster and a job cluster change slightly.

It would be great if it would be possible to include the conf folder as part of the .whl file (at the user's request). This would make the maintenance, installation and usage of the artifacts easier.

Context

As I mentioned, this could go a long way in improving the usage of package Kedro projects, especially when running in distributed computing environments.

Possible Implementation

I suggest adding a flag to the kedro package command to allow users to choose whether to include the conf folder in the .whl file. The default mechanism can be to avoid doing this.

Possible Alternatives

A possible alternative might be to improve the documentation to explain how Kedro projects can be packaged and run in different systems (with examples), however, this might not be very extensible given the large number of different options that are available for running pipelines.

@MinuraPunchihewa MinuraPunchihewa added the Issue: Feature Request New feature or improvement to existing feature label Nov 10, 2024
@merelcht merelcht added the Community Issue/PR opened by the open-source community label Nov 10, 2024
@datajoely
Copy link
Contributor

So we've resisted this for years since it violates the 12factor app the addition of --conf-source is our official solution to this, but users still hit this bit of friction especially before they have proper CI/CD to do this neatly.

I'd be fully in favour of adding an explicit flag to package conf... but throw a warning on why we believe it to be bad practice and not fit for production.

@MinuraPunchihewa
Copy link
Contributor Author

So we've resisted this for years since it violates the 12factor app the addition of --conf-source is our official solution to this, but users still hit this bit of friction especially before they have proper CI/CD to do this neatly.

I'd be fully in favour of adding an explicit flag to package conf... but throw a warning on why we believe it to be bad practice and not fit for production.

Hmm.. Yes, agreed. I think this might actually be the best solution to the problem? It would resolve the issues I am facing in a distributed environment as well.

I am happy to open a PR for both, but I just wanted to confirm if we won't to add an option to include the configuration in the package?

@ankatiyar
Copy link
Contributor

Moreover, when executing the code in a distributed environment like Databricks, performing file system operations such as unzipping the tar.gz file is not exactly straightforward. For instance, the file paths that would be used when dealing with an interactive cluster and a job cluster change slightly.

@MinuraPunchihewa, --conf-source works with the .tar.gz file (since Kedro 0.18.7 I believe) so there's no need to unzip the compressed file. Does that feature alleviate some of these concerns?

Adding remote cloud options to --conf-source (#3982) has been discussed by the team and ready to be implemented if you wanted to open a PR for this, please feel free! However, the option to include configuration in .whl file might warrant some discussion with the team. I will put this issue in our backlog and we'll add more info to this ticket on what we decide!

Thanks 💯

@MinuraPunchihewa
Copy link
Contributor Author

Contributor

Thanks, @ankatiyar. I was not aware that the .tar.gz file could be directly passed to --conf-source. I will try doing this in my Databricks environment and check if it works.

Yes, sure. I will open a PR for passing in remote configurations.

Got it. Let me know what you decide.

@noklam
Copy link
Contributor

noklam commented Nov 12, 2024

  1. Adding remote support for conf_source make perfect sense to me.
  2. Packaging conf into the wheel. The problem of packagin conf is that it's not possible to change the configuration in code. The alternative is, move conf into src and change CONF_SOURCE inside settings.py, and this is supported already and require no change. I don't see a massive benefit of adding a flag to package conf into wheel.

@datajoely
Copy link
Contributor

@MinuraPunchihewa a possible reference implementation can be found here where we do something similar for micropackaging (even if that feature is currently deprecated!)

def _unpack_sdist(location: str, destination: Path, fs_args: str | None) -> None:

@astrojuanlu
Copy link
Member

A few thoughts aside from #3982:

(1) About bundling configuration with the source code, I know that some users do this already. Managing the two separately is cumbersome.

For example, deploying on Airflow (and I bet in most other platforms) requires an extra step because of this https://docs.kedro.org/en/stable/deployment/airflow.html

(2) Some users just don't see the point of kedro package at all, we noticed that during the deployment research #4325

(3) We've long known that not all our configuration is created equal, see this discussion by @Galileo-Galilei #770 In particular, dataset types are intimately coupled with business logic (because they define the in-memory representation of the data) and arguably they are not "in the same league" as, say, model parameters.

(4) I think our interpretation of the 12factor app is quite idiosyncratic. From the text, it reads

This is a violation of twelve-factor, which requires strict separation of config from code.

But the subtitle of section III is

Store config in the environment

And further down it says

The twelve-factor app stores config in environment variables (often shortened to env vars or env).

The reluctance of Kedro against environment variables is well known, we almost only allow it for credentials by default #2623 (and that's not even enough for many use cases, see #1621, long discussion at #3811).


This is just a quick summary but long story short I think our approach to packaging and bundling needs a refresh.

@datajoely
Copy link
Contributor

This is just a quick summary but long story short I think our approach to packaging and bundling needs a refresh.

I'd add the word pragmatic somewhere in there, but I agree wholeheartedly

@Galileo-Galilei
Copy link
Member

Galileo-Galilei commented Nov 13, 2024

I will add more thoughts on what I think we should do regarding this topic in #770 as promised to @astrojuanlu.

That said (and as @noklam said), notice that it is already very easy to implement in kedro>=0.19 (and kedro>=0.18.? if you enable OmegaConfigLoader) with the following procedure:

  1. Create an empty src/conf_app folder
  2. Copy the content of conf/base in src/conf_app
  3. Delete conf/base folder
  4. Update settings.py as follow:
# settings.py
from pathlib import Path

CONFIG_LOADER_ARGS = {
    "base_env": (Path(__file__).parents[1] / "conf_app").as_posix(),
    "default_run_env": "local",
}

Good news is, this does work even if you override partially CONF_SOURCE passed through the CLI (and this is likely a happy accident 🤯).

You can get much more detail and explanations in this demo repository.

PS: I haven't tried with kedro package but it should be packaged with the src folder (else just add include_data=True in the setup.py, see #1607)

@merelcht merelcht removed the Community Issue/PR opened by the open-source community label Dec 10, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Issue: Feature Request New feature or improvement to existing feature
Projects
Status: No status
Development

No branches or pull requests

7 participants