Ability to set a custom runner globally #3255
Comments
Can you explain what this custom runner does? Do you mean setting the default runner in settings.py? It would be great if you could share how you would like it to work.
We are using Spark, Delta tables, and Azure Blob storage. I have been using Kedro for a while and am now experimenting with passing the dataset itself into nodes instead of the loaded data, so that I can customize the behavior inside the nodes. Basically, instead of calling `dataset.load()` and passing the result to the nodes, I pass the dataset object and perform actions inside the nodes. I also have custom versions of the SparkDataset and DeltaDataset.

The fact that most "save" actions are done on the Delta table and that "read/load" actions are done on the SparkDataset actually works in our favor in this case. The pipeline that manages data for dataset A gets DeltaDataset A as an input and outputs SparkDataset A. The pipelines/nodes that consume the data just use SparkDataset A as an input. Note that sometimes SparkDataset A is not processed, but nodes that depend on it can still be run; in that case, the data should be loaded from blob storage. SparkDataset is a read-only dataset (with caching) in our case. We also have custom behavior that instantiates an empty dataset if the table does not exist and a Spark schema is specified. This way, it is possible to run nodes that depend on data that was never processed and still produce meaningful outputs.

Note that our usage of Kedro is advanced. We basically have multiple "integrations" that get data from APIs or files and land that data in the blob store. This data is then processed in a Kedro project (one project per integration) using Spark to produce cleaned outputs. These outputs are then leveraged in other Kedro projects.
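The "pass the dataset object, not the loaded data" pattern described above can be sketched in plain Python. `CachedSparkDataset` and `summarise` below are illustrative stand-ins invented for this sketch, not the actual custom classes from the comment; the schema fallback mimics the described behavior of returning an empty dataset when the table does not exist.

```python
# Hypothetical sketch of passing a dataset object into a node instead of
# pre-loaded data. Names are illustrative, not real Kedro/Spark classes.

class CachedSparkDataset:
    """Read-only dataset that caches the first load and falls back to an
    empty table when the underlying storage has no data yet."""

    def __init__(self, loader, schema=None):
        self._loader = loader      # callable that reads from storage
        self._schema = schema      # optional column list for the empty case
        self._cache = None

    def load(self):
        if self._cache is None:
            try:
                self._cache = self._loader()
            except FileNotFoundError:
                if self._schema is None:
                    raise
                # Table was never produced: build an empty "table" shaped
                # by the schema so downstream nodes can still run.
                self._cache = {col: [] for col in self._schema}
        return self._cache


def summarise(dataset):
    """A node that receives the dataset object itself and decides when
    (and whether) to load."""
    data = dataset.load()
    return {col: len(values) for col, values in data.items()}
```

The node stays in control of loading, so it can skip the load entirely, load lazily, or react to missing data, which is exactly what a stock runner that eagerly calls `load()` cannot offer.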
We have the same issue, which we currently tackle by instantiating our custom runner in a custom CLI that we inject through the plugin framework. In settings.py we set the runner class and potentially the runner args. I don't know if there is another way of doing this, but it is indeed a real need for some advanced usage. Another issue that prevents using some custom runners natively in Kedro is that the Kedro CLI doesn't support runner class arguments.
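The settings-based approach described above might look something like the fragment below. These keys are hypothetical: Kedro core does not read `RUNNER`/`RUNNER_ARGS` from settings.py today (that is what this issue is requesting); in this workaround they would be read by the custom CLI plugin.

```python
# settings.py -- hypothetical keys, read only by a custom CLI plugin,
# not by Kedro core. CustomDatasetRunner is an illustrative name.
from my_project.runners import CustomDatasetRunner

RUNNER = CustomDatasetRunner
RUNNER_ARGS = {"is_async": False}
```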
Thanks for this. I haven't played with … Having to set settings using …
For the entry points:

```python
entry_points={
    "kedro.project_commands": ["cli = <your_plugin_package>.<optional_module>:<cli_name>"],
}
```
The original request of having a global setting for the runner is reasonable to me, but I am a bit confused by the discussion of a programmatic way to configure the project. Assuming you can set the runner globally, isn't it just a single line of configuration in settings.py? The default way to share code among projects is building a plugin. Kedro exposes certain entry points and hooks that you can use; the above is a good example.
To continue the original request, I think we need to expose the runner setting.
This is not really a problem for your runner case. At the moment you can either choose to have a plugin that exposes its own …
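If a global runner setting were exposed, turning the configured value into a runner object at run time could be sketched with a small stdlib helper. `load_runner` is an illustrative name invented here (Kedro's own CLI resolves the `--runner` string with a similar dotted-path lookup); a stdlib class stands in for a real runner class in the example.

```python
import importlib


def load_runner(class_path, **kwargs):
    """Resolve a dotted 'pkg.module.ClassName' string to an instance.

    A minimal stand-in for how a globally configured runner could be
    instantiated, with keyword arguments as the runner args.
    """
    module_path, _, class_name = class_path.rpartition(".")
    if not module_path:
        raise ValueError(f"expected a dotted path, got {class_path!r}")
    cls = getattr(importlib.import_module(module_path), class_name)
    return cls(**kwargs)


# Using a stdlib class as a stand-in for a runner class:
runner = load_runner("collections.OrderedDict")
```

Storing the class path as a string (rather than importing the class in settings.py) keeps the setting serializable and avoids import cycles at configuration time.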
Description
I have built a custom runner and configuration loader. It would be useful to be able to specify the default runner to avoid having to set it on every run.
Context
We have multiple Kedro projects that share code and rely on a custom runner that customizes the dataset-loading behavior for nodes. Our pipelines simply won't work with any other runner.