Background:
Pydiverse pipedag is currently used mostly at the beginning of pipelines, before actual models are trained. In the process of increasing adoption further down the pipeline, a demand was raised for decoupling experimentation with model training from the earlier part of preparing the input. The model training part may also include a thin layer of feature engineering, which is more about selecting, encoding, and combining base feature data; we call it "feature encoding" here. A general pipeline might be split as follows (a rough code sketch of such a split follows the listing):
input preparation:
raw ingestion -> cleaning for inspection -> best possible representation for economic reasoning -> feature engineering
model training:
-> feature encoding(s5) -> model training(s6) -> model evaluation(s7)
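To make the stage labels s5-s7 concrete, here is a minimal sketch of such a split expressed as one pipedag flow. The task bodies, stage names, and column names are purely illustrative; only the Flow/Stage/Table/materialize usage follows the existing pipedag API.

```python
import pandas as pd
from pydiverse.pipedag import Flow, Stage, Table, materialize

@materialize(version="1.0")
def feature_engineering():
    # stand-in for the real feature engineering output
    return Table(pd.DataFrame({"customer_id": [1, 2], "feature_a": [0.1, 0.7]}), "features")

@materialize(input_type=pd.DataFrame, version="1.0")
def feature_encoding(features: pd.DataFrame):
    # thin "feature encoding" layer: select / encode / combine base feature columns
    return Table(features.assign(feature_a_sq=features["feature_a"] ** 2), "encoded")

@materialize(input_type=pd.DataFrame, version="1.0")
def train_model(encoded: pd.DataFrame):
    # placeholder for the actual training step
    return Table(encoded[["customer_id"]].assign(score=0.5), "predictions")

@materialize(input_type=pd.DataFrame, version="1.0")
def evaluate_model(predictions: pd.DataFrame):
    return Table(predictions.describe().reset_index(), "metrics")

with Flow("example_pipeline") as f:
    # input preparation (earlier raw ingestion / cleaning stages omitted)
    with Stage("s4_feature_engineering"):
        t = feature_engineering()
    # model training part
    with Stage("s5_feature_encoding") as s5:
        enc = feature_encoding(t)
    with Stage("s6_model_training") as s6:
        pred = train_model(enc)
    with Stage("s7_model_evaluation") as s7:
        evaluate_model(pred)
```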
Idea:
Similar to the local table cache, the pipedag configuration could include another table store called "persistence_table_store". Given a flow f and a task output t, one could actively create a manually persisted version of t with version_id = f.persist([t]). When executing model training code, one could run f.run([s5,s6,s7], persisted_inputs={t: version_id}). Assuming the persistence_table_store is configured as parquet storage in a certain directory or S3 bucket, the version_id would be a monotonically increasing string (e.g. date and time), which makes it reasonably easy to manually delete, for example, all persisted versions older than 3 months. Since this is done manually, keeping a few or all versions long term is no problem.
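With a flow f like the one sketched above, the proposed calls could look roughly as follows. Note that f.persist and the persisted_inputs argument are the proposed additions and do not exist yet, and the version string format is only an assumption.

```python
# One-off, on the input preparation side: run the flow and then persist the
# feature engineering output t into the persistence_table_store (proposed API).
f.run()
version_id = f.persist([t])   # e.g. "2024-05-17T14-03-22" (assumed format)

# Later, on the model training side: iterate on s5..s7 only, reading t from
# the manually persisted parquet version instead of recomputing earlier stages.
f.run([s5, s6, s7], persisted_inputs={t: version_id})
```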
Interaction:
This may overlap with Support <n> cache slots #75 in terms of addressing a similar demand. However, manually managed purging of versions and automatically managed cache slots are two quite different approaches to offering a solution.
Questions:
It is unclear whether f.run([s5,s6,s7], persisted_inputs={t: version_id}) should check whether version_id is cache valid with respect to the code in the currently running repo checkout, or whether this check is left up to the caller. The most common use case here is to perform this manually versioned persistence on a flatfile representation. It can be assumed that new code typically just adds columns to the flatfile representation. As a consequence, enforcing strict cache validity would be an annoying restriction.
Enforcing an identical position_hash for the task that produced t would be another option, but it has the same downsides as checking full cache validity.
In case the persistence_table_store uses parquet, should the default behavior be that dematerialization for SQL-based tasks throws an error, or should it be attempted using DuckDB? In combination with a parquet-based local_table_cache, it would even be possible to execute queries spanning both persisted and non-persisted tables.
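To illustrate the DuckDB option: DuckDB can query parquet files in place, so dematerializing a persisted parquet version for a SQL-flavored task could, in principle, look roughly like the sketch below. The paths, view name, and columns are made up; only the DuckDB calls themselves are existing API.

```python
import duckdb

con = duckdb.connect()  # in-memory database

# Expose the manually persisted parquet version of t as a view ...
con.execute(
    "CREATE VIEW features AS "
    "SELECT * FROM read_parquet('/data/persisted/2024-05-17T14-03-22/features.parquet')"
)

# ... and run the task's SQL against it; non-persisted tables that are also
# available as parquet (e.g. via a parquet-based local_table_cache) could be
# exposed the same way and joined in the same query.
result = con.execute("SELECT customer_id, feature_a FROM features").df()
```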