
Clean versioned dataset automatically/periodically #1658

Closed
LindaWeijiaSun opened this issue Jun 30, 2022 · 4 comments
Labels
Issue: Feature Request (New feature or improvement to existing feature)

Comments

@LindaWeijiaSun

Description

In the catalog.yml file, we can enable versioning by adding versioned: true, so Kedro writes each run's output to a new folder named with a timestamp. Currently I don't need data older than, say, a week. It would be convenient if Kedro had functionality that let users automatically/periodically remove the older data.
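For reference, enabling versioning in catalog.yml looks like this (the dataset name and filepath here are illustrative):

```yaml
my_dataset:
  type: pandas.CSVDataSet
  filepath: data/07_model_output/my_dataset.csv
  versioned: true
```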

Context

I'm working in an on-premises environment with limited storage, so I don't need generated data older than a certain time period. Another context would be removing historical data due to data privacy/protection issues.

Possible Implementation

Add a delete option for datasets in catalog.yml when versioned: true is enabled. A possible parameter could be a retention time, defaulting to infinite (i.e., never delete).

A highly relevant previous issue: Support archiving/deleting old/unused datasets #406

@LindaWeijiaSun added the "Issue: Feature Request" label on Jun 30, 2022
@deepyaman
Member

> I'm working on on-premise environment with limited storage, so I don't need the generated data older than a certain time period. Another context would be to remove historical data due to data privacy/protection issue.

#406 was also prompted by limited storage, though I'd point out that "limited" can still be fairly generous (e.g., 169 GB was already in use in #406). At some point, it's reasonable to want a way to delete data. :D

@noklam
Contributor

noklam commented Aug 24, 2022

Found a related comment in #1076

> I completely missed this issue. I'm excited to see this one happen. We had another use case this week. We are using some versioned datasets and would like to be able to create a plugin to clean up some of the old versions, inspired by sanoid. It would be nice if we could pop some extra config into the catalog to specify a policy of how many to keep; then we could create a separate plugin that cleans these up in a scheduled run, and keep all of our configuration for each dataset together.

```yaml
my_dataset:
  versioned: true
  extras: # whatever place you want to give us to configure plugins/extras
    cleanup_old_versions_plugin:
      frequent: 4
      daily: 7
      year: 1
      month: 1
```

@noklam
Contributor

noklam commented Aug 24, 2022

I have a question too.

How do we know what/where to delete?

  • Looking at the DataCatalog — this is okay, but it will leave "ghost" files behind whenever the catalog changes. I used to have scripts that scanned a certain S3 bucket and deleted files that were too old, while keeping certain flagged files (the flags were stored in an experiment-tracking tool; these are essentially artifact stores).
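The kind of scheduled cleanup script described above can be sketched for a local filesystem. This is a minimal illustration, not Kedro functionality: it assumes versions live in timestamped subfolders of a dataset directory (Kedro's versioned layout uses folder names like `2022-06-30T12.00.00.000Z`), and the function name and retention policy are hypothetical:

```python
import shutil
from datetime import datetime, timedelta, timezone
from pathlib import Path

# Timestamp format used for versioned folder names
# (assumption: names like "2022-06-30T12.00.00.000Z").
VERSION_FORMAT = "%Y-%m-%dT%H.%M.%S.%fZ"


def clean_old_versions(dataset_dir: Path, keep_days: int = 7) -> list:
    """Delete version folders older than keep_days; return removed names."""
    cutoff = datetime.now(timezone.utc) - timedelta(days=keep_days)
    removed = []
    for version_dir in dataset_dir.iterdir():
        if not version_dir.is_dir():
            continue
        try:
            ts = datetime.strptime(version_dir.name, VERSION_FORMAT).replace(
                tzinfo=timezone.utc
            )
        except ValueError:
            continue  # not a timestamped version folder; leave it alone
        if ts < cutoff:
            shutil.rmtree(version_dir)
            removed.append(version_dir.name)
    return removed
```

Note the "ghost file" caveat still applies: the script only sees folders that currently exist on disk under the directory it is pointed at, so versions of datasets removed from the catalog must be handled separately.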

@merelcht
Member

Closing in favour of #1799 so we can collect all thoughts + comments in the same place.
