feat: new `Snapshot::new_from()` API #549

zachschuermann · 2024-11-27T20:52:30Z

A few quick changes:

- New `Snapshot::new_from(old_snapshot)` API that lets us optimize creating a new snapshot when an old one is already available. For now it does the simplest possible thing and delegates to the existing API, but it lets us start using the new API now and optimize it later, which will then benefit every call site.
- A small change to use this API in a known spot in the `table_changes` module.
- A small new test for the new API.
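
As a rough illustration of the "delegate to the old API" behavior described above, here is a minimal, self-contained sketch. The `Engine` and `Snapshot` types below are simplified stand-ins, not the kernel's real types (the real `try_new` takes a URL and a `&dyn Engine`), and `latest_version` is a hypothetical field that replaces actual log listing.

```rust
type Version = u64;

/// Stand-in for the kernel's engine; `latest_version` is a hypothetical field
/// that replaces a real listing of the table's log.
struct Engine {
    latest_version: Version,
}

#[derive(Debug, Clone, PartialEq)]
struct Snapshot {
    table_root: String,
    version: Version,
}

impl Snapshot {
    /// Simplified stand-in for the existing constructor: resolve `None` to the
    /// latest version and reject versions that do not exist yet.
    fn try_new(
        table_root: String,
        engine: &Engine,
        version: Option<Version>,
    ) -> Result<Self, String> {
        let resolved = version.unwrap_or(engine.latest_version);
        if resolved > engine.latest_version {
            return Err(format!("version {resolved} does not exist"));
        }
        Ok(Snapshot { table_root, version: resolved })
    }

    /// Unoptimized `new_from`: ignore everything in the existing snapshot
    /// except its table root and build the new snapshot from scratch.
    fn new_from(
        existing: &Snapshot,
        engine: &Engine,
        version: Option<Version>,
    ) -> Result<Self, String> {
        Self::try_new(existing.table_root.clone(), engine, version)
    }
}
```

Because callers only see `new_from`, its body could later be swapped for an incremental implementation without touching the call sites.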

resolves #489

zachschuermann · 2024-11-27T20:52:53Z

curious if anyone has naming thoughts!

codecov · 2024-11-27T20:56:48Z

Codecov Report

Attention: Patch coverage is 96.15385% with 1 line in your changes missing coverage. Please review.

Project coverage is 80.64%. Comparing base (953ceed) to head (f8bc074).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| kernel/src/table_changes/mod.rs | 0.00% | 0 Missing and 1 partial ⚠️ |

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #549      +/-   ##
==========================================
+ Coverage   80.61%   80.64%   +0.03%     
==========================================
  Files          67       67              
  Lines       14278    14303      +25     
  Branches    14278    14303      +25     
==========================================
+ Hits        11510    11535      +25     
  Misses       2188     2188              
  Partials      580      580


scovich

A general question I have: What should a listing optimization even look like for a snapshot refresh? If the snapshot is not very old, then we should just LIST to find new commit .json files after the end of the current segment, and not even try to find new checkpoints. Quick, easy.

Also, the "append new deltas" approach is friendly to the "partial P&M query" optimization, which is only applicable if we have a contiguous chain of commits back to the previous snapshot version -- a newer checkpoint would actually force us to do the full P&M query all over, which for a large checkpoint could be annoying.

On the other hand, if there is a newer checkpoint available, then data skipping will be more efficient if we use it (fewer jsons to replay serially and keep track of). This is especially true if a lot of versions have landed since the original snapshot was taken.

Problem is, there's no way to know in advance whether the snapshot is "stale", because staleness depends on the number of versions that have landed since the snapshot was taken, not on elapsed time.

Complicated stuff...
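
To make the trade-off above concrete, here is a hedged sketch of one possible decision rule. Everything in it is an assumption for illustration: `RefreshPlan`, `plan_refresh`, and the `max_appended_commits` threshold are hypothetical names, and the inputs are plain version numbers standing in for the result of a single LIST after the current snapshot's version. It appends new commits when few have landed and switches to a newer checkpoint only when many have landed and one exists.

```rust
/// Hypothetical outcome of planning a snapshot refresh.
#[derive(Debug, PartialEq)]
enum RefreshPlan {
    /// Nothing new landed; the existing snapshot is still fresh.
    Unchanged,
    /// Keep the existing checkpoint and append these commit versions.
    AppendCommits(Vec<u64>),
    /// Start over from a newer checkpoint plus the commits after it.
    RebuildFromCheckpoint { checkpoint_version: u64, commits: Vec<u64> },
}

/// Decide how to refresh, given the versions found by listing the log.
/// Staleness is measured in landed versions, not elapsed time.
fn plan_refresh(
    current_version: u64,
    listed_commits: &[u64],
    listed_checkpoints: &[u64],
    max_appended_commits: usize,
) -> RefreshPlan {
    // Only commits after the current snapshot version are new.
    let new_commits: Vec<u64> = listed_commits
        .iter()
        .copied()
        .filter(|v| *v > current_version)
        .collect();
    if new_commits.is_empty() {
        return RefreshPlan::Unchanged;
    }

    // Newest checkpoint written after the current snapshot, if any.
    let newest_checkpoint = listed_checkpoints
        .iter()
        .copied()
        .filter(|v| *v > current_version)
        .max();

    match newest_checkpoint {
        // Many versions landed and a newer checkpoint exists: use it.
        Some(cp) if new_commits.len() > max_appended_commits => {
            let commits = new_commits.into_iter().filter(|v| *v > cp).collect();
            RefreshPlan::RebuildFromCheckpoint { checkpoint_version: cp, commits }
        }
        // Otherwise keep the contiguous commit chain, which also keeps the
        // partial P&M query applicable.
        _ => RefreshPlan::AppendCommits(new_commits),
    }
}
```

The threshold only becomes meaningful after the listing, which matches the point that staleness cannot be known in advance.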

scovich · 2024-11-27T21:14:55Z

kernel/src/snapshot.rs

+        existing_snapshot: &Snapshot,
+        engine: &dyn Engine,
+        version: Option<Version>,
+    ) -> DeltaResult<Self> {


Seems like the method should take+return Arc<Snapshot> so we have the option to return the same snapshot if we determine it is still fresh?

Maybe even do

pub fn refresh(self: &Arc<Self>, ...) -> DeltaResult<Arc<Self>>

(this would have slightly different intuition than new_from -- refresh specifically assumes I want a newer snapshot, if available, and attempting to request an older version may not even be legal; I'm not sure if it would even make sense to pass an upper bound version for a refresh operation)
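
A minimal sketch of what the `refresh` shape suggested above could look like, under stated assumptions: `Snapshot` here is a stand-in struct holding only a version, `Result<_, String>` replaces `DeltaResult`, and the `latest_version` parameter is a hypothetical substitute for whatever the engine's log listing would report. It is not the kernel's implementation.

```rust
use std::sync::Arc;

type Version = u64;

/// Stand-in for the kernel's `Snapshot`; only the version is modeled.
#[derive(Debug)]
struct Snapshot {
    version: Version,
}

impl Snapshot {
    /// Return the same `Arc` when the snapshot is still fresh, otherwise a new
    /// snapshot at the latest version. Moving backwards is treated as an error.
    fn refresh(self: &Arc<Self>, latest_version: Version) -> Result<Arc<Self>, String> {
        if latest_version < self.version {
            return Err(format!(
                "cannot refresh from version {} back to {}",
                self.version, latest_version
            ));
        }
        if latest_version == self.version {
            // Still fresh: hand back the existing allocation, no new work.
            return Ok(Arc::clone(self));
        }
        // Placeholder for the real work of extending the log segment.
        Ok(Arc::new(Snapshot { version: latest_version }))
    }
}
```

Taking `&Arc<Self>` is what makes the cheap "still fresh" path possible: a method taking `&Snapshot` and returning an owned `Snapshot` would have to clone or rebuild even when nothing changed.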

scovich · 2024-11-27T21:30:50Z

kernel/src/table_changes/mod.rs

@@ -90,7 +90,7 @@ impl TableChanges {
        // supported for every protocol action in the CDF range.
        let start_snapshot =
            Snapshot::try_new(table_root.as_url().clone(), engine, Some(start_version))?;
-        let end_snapshot = Snapshot::try_new(table_root.as_url().clone(), engine, end_version)?;
+        let end_snapshot = Snapshot::new_from(&start_snapshot, engine, end_version)?;


This opens an interesting question... if we knew that new_from would reuse the log checkpoint and just "append" any new commit .json files to the log segment, then we could almost (**) reuse that log segment for the CDF replay by just stripping out its checkpoint files? But that's pretty CDF specific; in the normal case we want a refresh to use the newest checkpoint available because it makes data skipping log replay cheaper. Maybe the CDF case needs a completely different way of creating the end_snapshot, unrelated to this optimization here.

(**) Almost, because the start version might have a checkpoint, in which case stripping the checkpoint out of the log segment would also remove the start version. But then again, do we actually want the older snapshot to be the start version? Or the previous version which the start version is making changes to? Or, maybe we should just restrict the checkpoint search to versions before the start version, specifically so that this optimization can work.

> do we actually want the older snapshot to be the start version?

It would be sufficient to have the older snapshot be start_version-1 as long as we also have access to the commit at start_version. With these, we would start P&M at start_version then continue it on the older snapshot if we don't find anything.

I guess this would look like: snapshot(start_version-1).refresh_with_commits(end_version)

After all, the goal of the start_snapshot is just to ensure that CDF is enabled.
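
The "start P&M at the new commits, then fall back to the older snapshot" idea can be sketched as follows. This is an illustration under assumptions, not kernel code: `PAndM`, `Commit`, and `partial_p_and_m` are hypothetical names, and protocol and metadata are modeled as plain strings.

```rust
/// Hypothetical resolved protocol and metadata (P&M) for a snapshot.
#[derive(Debug, Clone, PartialEq)]
struct PAndM {
    protocol: String,
    metadata: String,
}

/// Hypothetical view of a commit: it may or may not carry a protocol
/// action and/or a metadata action.
struct Commit {
    protocol: Option<String>,
    metadata: Option<String>,
}

/// Resolve P&M for a newer version by replaying only the new commits,
/// newest first, and falling back to the older snapshot's P&M for anything
/// the new commits did not change.
fn partial_p_and_m(base: &PAndM, commits_ascending: &[Commit]) -> PAndM {
    let mut protocol = None;
    let mut metadata = None;
    for commit in commits_ascending.iter().rev() {
        // The newest action of each kind wins, so only fill empty slots.
        if protocol.is_none() {
            protocol = commit.protocol.clone();
        }
        if metadata.is_none() {
            metadata = commit.metadata.clone();
        }
        if protocol.is_some() && metadata.is_some() {
            break;
        }
    }
    PAndM {
        protocol: protocol.unwrap_or_else(|| base.protocol.clone()),
        metadata: metadata.unwrap_or_else(|| base.metadata.clone()),
    }
}
```

This only works when the new commits form a contiguous chain back to the base snapshot's version, which is why a newer checkpoint in between would force a full P&M query.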

OussamaSaoudi-db · 2024-11-29T18:41:27Z

kernel/src/snapshot.rs

@@ -71,6 +72,26 @@ impl Snapshot {
        Self::try_new_from_log_segment(table_root, log_segment, engine)
    }

+    /// Create a new [`Snapshot`] instance from an existing [`Snapshot`]. This is useful when you
+    /// already have a [`Snapshot`] lying around and want to do the minimal work to 'update' the
+    /// snapshot to a later version.


Just to clarify, is this API only for versions later than the existing snapshot?

new Snapshot::new_from() API

f8bc074

github-actions bot assigned zachschuermann Nov 27, 2024

zachschuermann requested review from scovich, nicklan and OussamaSaoudi-db and removed request for scovich and nicklan November 27, 2024 20:52

scovich reviewed Nov 27, 2024

OussamaSaoudi-db reviewed Nov 29, 2024

zachschuermann mentioned this pull request Jan 21, 2025

feat: incremental Snapshot update #651

Open
