
PartitionDataset Caching Support #974

Open
lordsoffallen opened this issue Jan 3, 2025 · 3 comments
Labels
Community Issue/PR opened by the open-source community

Comments

@lordsoffallen

Description

I have a node that returns dict[str, Callable] for Kedro to save my partitioned data. I've often had cases where it failed midway due to an edge case I didn't cover, and execution then starts over from the beginning.

Context

I need this to speed up experimentation in Kedro and to avoid the unnecessary costs of re-running the node from scratch.

Possible Implementation

Add a new parameter to PartitionedDataset that skips files which already exist, something like `use_cache: true`.

Possible Alternatives

I can, of course, inherit the class and implement this myself, but I thought it would be a useful feature to have in the core code.

@merelcht merelcht added the Community Issue/PR opened by the open-source community label Jan 3, 2025
@fgassert

There's some discussion of this in #928.

I've written a couple custom datasets for this use case and for parallel processing of partitions, attached here in case they're helpful.
https://gist.github.com/fgassert/c6c9a87c47d2eaffd30d3f72b0ff675a

@lordsoffallen
Author

> There's some discussion of this in #928.
>
> I've written a couple custom datasets for this use case and for parallel processing of partitions, attached here in case they're helpful. https://gist.github.com/fgassert/c6c9a87c47d2eaffd30d3f72b0ff675a

I think they're different. I'm okay with sequential execution, but I wanted support for continuing where it left off. It's easy enough to hack together, but it seemed like a nice feature to have in Kedro.

@fgassert
Copy link

Try the third one, RobustPartitionedDataset? It's patterned after the built-in incremental dataset to address some edge cases. You can set it up like a regular PartitionedDataset, with the additional parameter `behavior: complete_missing`:

mydataset:
  type: <my-project>.datasets.robust_partitioned_dataset.RobustPartitionedDataset
  path: ...
  dataset:
    type: ...
  behavior: complete_missing

https://gist.github.com/fgassert/c6c9a87c47d2eaffd30d3f72b0ff675a#file-robust_partitioned_dataset-py
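The core of a `complete_missing` behavior presumably comes down to diffing the requested partition ids against the files already on disk and only (re)computing the difference. A minimal sketch of that filtering step, with hypothetical names not taken from the gist:

```python
def missing_partitions(
    requested_ids: list[str],
    existing_filenames: list[str],
    suffix: str = "",
) -> list[str]:
    """Return the partition ids that have no file on disk yet.

    `suffix` is the filename extension (e.g. ".parquet") stripped before
    comparing ids against existing files.
    """
    existing = {
        name[: -len(suffix)] if suffix and name.endswith(suffix) else name
        for name in existing_filenames
    }
    # preserve the requested order so runs stay deterministic
    return [pid for pid in requested_ids if pid not in existing]
```

The dataset would then save only the returned ids, which is what lets a rerun after a midway failure pick up where it left off.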
