Ability to set a custom runner globally #3255
Comments
Can you explain what this custom runner does? Do you mean setting the default runner in settings.py? It would be great if you could share how you would like it to work.
We are using Spark, Delta tables, and Azure Blob storage. I have been using Kedro for a while and am now experimenting with passing the dataset itself into nodes instead of the loaded data, so that I can customize the behavior inside the nodes. Basically, instead of calling `dataset.load()` and passing the result to the nodes, I pass the dataset object and perform actions inside the nodes. I also have custom versions of the SparkDataset and DeltaDataset.

The fact that most "save" actions are done on the Delta table and that "read/load" actions are done on the SparkDataset actually works in our favor in this case. The pipeline that manages data for dataset A gets DeltaDataset A as an input and outputs SparkDataset A. The pipelines/nodes that consume the data just use SparkDataset A as an input. Note that sometimes SparkDataset A is not processed, but nodes that depend on it can still be run; in that case, the data should be loaded from blob storage. SparkDataset is a read-only dataset (with caching) in our case. We also have custom behavior that instantiates an empty dataset if the table does not exist and a Spark schema is specified. This way, it is possible to run nodes that depend on data that was never processed and still produce meaningful outputs.

Note that our usage of Kedro is advanced. We basically have multiple "integrations" that get data from APIs or files and land that data in the blob store. This data is then processed in a Kedro project (one project per integration) using Spark to produce cleaned outputs. These outputs are then leveraged in other Kedro projects.
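The "pass the dataset object, not the loaded data" pattern described above can be sketched in plain Python. `CachedSparkDataset` and `summarise` below are illustrative stand-ins invented for this sketch, not the actual custom classes from the comment; the schema fallback mimics the described behavior of returning an empty dataset when the table does not exist.

```python
# Hypothetical sketch of passing a dataset object into a node instead of
# pre-loaded data. Names are illustrative, not real Kedro/Spark classes.

class CachedSparkDataset:
    """Read-only dataset that caches the first load and falls back to an
    empty table when the underlying storage has no data yet."""

    def __init__(self, loader, schema=None):
        self._loader = loader      # callable that reads from storage
        self._schema = schema      # optional column list for the empty case
        self._cache = None

    def load(self):
        if self._cache is None:
            try:
                self._cache = self._loader()
            except FileNotFoundError:
                if self._schema is None:
                    raise
                # Table was never produced: build an empty "table" shaped
                # by the schema so downstream nodes can still run.
                self._cache = {col: [] for col in self._schema}
        return self._cache


def summarise(dataset):
    """A node that receives the dataset object itself and decides when
    (and whether) to load."""
    data = dataset.load()
    return {col: len(values) for col, values in data.items()}
```

The node stays in control of loading, so it can skip the load entirely, load lazily, or react to missing data, which is exactly what a stock runner that eagerly calls `load()` cannot offer.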
We have the same issue, which we currently tackle by instantiating our custom runner in a custom CLI that we inject through the plugin framework. In settings.py we set the runner class and potentially the runner args. I don't know if there is another way of doing this, but it is indeed a real need for some advanced usage. Another issue that prevents using some custom runners natively in Kedro is that the Kedro CLI doesn't support runner class arguments.
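The settings-based approach described above might look something like the fragment below. These keys are hypothetical: Kedro core does not read `RUNNER`/`RUNNER_ARGS` from settings.py today (that is what this issue is requesting); in this workaround they would be read by the custom CLI plugin.

```python
# settings.py -- hypothetical keys, read only by a custom CLI plugin,
# not by Kedro core. CustomDatasetRunner is an illustrative name.
from my_project.runners import CustomDatasetRunner

RUNNER = CustomDatasetRunner
RUNNER_ARGS = {"is_async": False}
```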
Thanks for this. I haven't played with … Having to set settings using …
For the entry points:

```python
entry_points={
    "kedro.project_commands": ["cli = <your_plugin_package>.<optional_module>:<cli_name>"],
}
```
The original request of having a global setting for the runner is reasonable to me, but I am a bit confused by the discussion of a programmatic way to configure the project. Assuming you can set the runner globally, isn't it just a single line of configuration in settings.py? The default way to share code among projects is building a plugin. Kedro exposes certain entry points and hooks that you can use; the above is a good example.
To continue the original request, I think we need to expose the runner setting.
This is not really a problem for your runner case. At the moment you can either choose to have a plugin that exposes its own …
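If a global runner setting were exposed, turning the configured value into a runner object at run time could be sketched with a small stdlib helper. `load_runner` is an illustrative name invented here (Kedro's own CLI resolves the `--runner` string with a similar dotted-path lookup); a stdlib class stands in for a real runner class in the example.

```python
import importlib


def load_runner(class_path, **kwargs):
    """Resolve a dotted 'pkg.module.ClassName' string to an instance.

    A minimal stand-in for how a globally configured runner could be
    instantiated, with keyword arguments as the runner args.
    """
    module_path, _, class_name = class_path.rpartition(".")
    if not module_path:
        raise ValueError(f"expected a dotted path, got {class_path!r}")
    cls = getattr(importlib.import_module(module_path), class_name)
    return cls(**kwargs)


# Using a stdlib class as a stand-in for a runner class:
runner = load_runner("collections.OrderedDict")
```

Storing the class path as a string (rather than importing the class in settings.py) keeps the setting serializable and avoids import cycles at configuration time.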
Description
I have built a custom runner and configuration loader. It would be useful to be able to specify the default runner to avoid having to set it on every run.
Context
We have multiple Kedro projects that share code and rely on a custom runner that customizes the dataset-loading behavior for nodes. Our pipelines simply won't work with any other runner.