changelog: improve description of new datastore/import options
hsanjuan committed Dec 18, 2024
1 parent 9604ac1 commit d3593a3
Showing 2 changed files with 8 additions and 8 deletions.
3 changes: 2 additions & 1 deletion core/coreapi/unixfs.go
@@ -92,7 +92,8 @@ func (api *UnixfsAPI) Add(ctx context.Context, files files.Node, opts ...options
}

bserv := blockservice.New(addblockstore, exch,
blockservice.WriteThrough(cfg.Datastore.WriteThrough.WithDefault(true))) // hash security 001
blockservice.WriteThrough(cfg.Datastore.WriteThrough.WithDefault(true)),
) // hash security 001
dserv := merkledag.NewDAGService(bserv)

// add a sync call to the DagService
13 changes: 6 additions & 7 deletions docs/changelogs/v0.33.md
@@ -31,20 +31,19 @@ If you depended on removed ones, please fill an issue to add them to the upstrea

Onboarding files and directories with `ipfs add --to-files` now requires non-empty names. Due to this, the `--to-files` and `--wrap` options are now mutually exclusive ([#10612](https://github.com/ipfs/kubo/issues/10612)).
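
For illustration, a hedged sketch of the new requirement (file names and MFS paths here are made up for the example, and we assume the destination directory already exists in MFS, e.g. via `ipfs files mkdir -p /photos`):

```console
# The MFS destination must have a non-empty name:
$ ipfs add --to-files /photos/cat.jpg cat.jpg

# Combining --to-files with --wrap is now rejected with an error:
$ ipfs add --wrap --to-files /photos/ cat.jpg
```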

#### New datastore options for faster writes: `WriteThrough`, `BlockKeyCacheSize`
#### New options for faster writes: `WriteThrough`, `BlockKeyCacheSize`, `BatchMaxNodes`, `BatchMaxSize`

Now that Kubo supports [`pebble`](../datastores.md#pebbleds) as a datastore backend, it becomes very useful to expose some additional configuration options for how the blockservice/blockstore/datastore combo behaves.

Usually, LSM-tree based datastore like Pebble or Badger have very fast write performance (blocks are streamed to disk) while incurring in read-amplification penalties (blocks need to be looked up in the index to know where they are on disk). Prior to this version, `BlockService` and `Blockstore` implementations performed a `Has(cid)` for every block that was going to be written, skipping the writes altogether if the block was already present in the datastore.
Usually, LSM-tree-based datastores like Pebble or Badger have very fast write performance (blocks are streamed to disk) while incurring read-amplification penalties (blocks need to be looked up in the index to know where they are on disk), which is especially noticeable on spinning disks.

The performance impact of this `Has()` call can vary. The `Datastore` implementation might include block-caching and things like bloom-filters to speed up lookups and mitigate read-penalties. Our `Blockstore` implementation also includes a bloom-filter (controlled by `BloomFilterSize`, and disabled by default), and a two-queue cache for keys and block sizes. If we assume that most of the blocks added to Kubo are new blocks, not already present in the datastore, or that the datastore itself includes mechanisms to optimize writes and avoid writing the same data twice, the calls to `Has()` at both BlockService and Blockstore layers seem superflous and we have seen it harm performance when importing large amounts of data.
Prior to this version, `BlockService` and `Blockstore` implementations performed a `Has(cid)` for every block that was going to be written, skipping the write altogether if the block was already present in the datastore. The performance impact of this `Has()` call can vary. The `Datastore` implementation itself might include block caching and things like bloom filters to speed up lookups and mitigate read penalties. Our `Blockstore` implementation also supports a bloom filter (controlled by `BloomFilterSize` and disabled by default), and a two-queue cache for keys and block sizes. If we assume that most of the blocks added to Kubo are new blocks, not already present in the datastore, or that the datastore itself includes mechanisms to optimize writes and avoid writing the same data twice, the calls to `Has()` at both the BlockService and Blockstore layers seem superfluous, to the point that they can even harm write performance.

For these reasons, from now on, the default is to use "write through" implementation of Blockservice/Blockstore. We have added a new option `Datastore.WriteThrough`, which defaults to `true`. Previous behaviour can be obtained by switching it to `false`.
For these reasons, from now on, the default is to use a "write-through" mode for the Blockservice and the Blockstore. We have added a new option `Datastore.WriteThrough`, which defaults to `true`. Previous behaviour can be obtained by manually setting it to `false`.
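
For example, the previous behaviour can be restored via `ipfs config` (a sketch; only the key name above is from this release, the rest is standard CLI usage):

```console
# Opt back into the pre-0.33 behaviour of calling Has() before every write:
$ ipfs config --json Datastore.WriteThrough false

# Verify the current setting:
$ ipfs config Datastore.WriteThrough
```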

We have additionally made the size of the two-queue blockstore cache with another option: `Datastore.BlockKeyCacheSize` which defaults to `65536` (64KiB). This option does not appear on the configuration by default, but it can be set manually and also allows to disable this caching layer by setting it to `0`.
We have also made the size of the two-queue blockstore cache configurable with another option: `Datastore.BlockKeyCacheSize`, which defaults to `65536` (64KiB). Additionally, this caching layer can be disabled altogether by setting it to `0`. In particular, this option controls the size of a blockstore caching layer that records whether the blockstore has certain blocks and their sizes (but does not cache the contents, so it stays relatively small in general).
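
As a sketch (the values below are illustrative, not recommendations), the cache can be enlarged or disabled with `ipfs config`:

```console
# Double the default number of cached block keys:
$ ipfs config --json Datastore.BlockKeyCacheSize 131072

# Or disable this caching layer entirely:
$ ipfs config --json Datastore.BlockKeyCacheSize 0
```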

This option controls the size of a blockstore caching layer that records whether the blockstore has certain block and their sizes (not the contents). This was previously an internal option. It is set by default to 64KiB.
This caching layer can be disabled by setting it to `0`. This option is similar to the existing `BloomFilterSize`, which creates another bloom-filter-based wrapper on the blockstore.
Finally, we have added two new options to the `Import` section to control the maximum size of write batches: `BatchMaxNodes` and `BatchMaxSize`. These default to `128` nodes and `20MiB`. Increasing them will batch more items together when importing data with `ipfs dag import`, which can speed things up. It is important to find a balance between available memory (used to hold the batch), disk latencies (when writing the batch) and processing power (when preparing the batch, as nodes are sorted and duplicates removed).
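
A hedged example of tuning these before a large import (the values are illustrative, and we assume `BatchMaxSize` is expressed in bytes, matching the 20MiB default):

```console
# Batch up to 256 nodes or ~40MiB per write (40 * 1024 * 1024 bytes):
$ ipfs config --json Import.BatchMaxNodes 256
$ ipfs config --json Import.BatchMaxSize 41943040

# Larger batches can speed this up, at the cost of memory while each batch is held:
$ ipfs dag import mydata.car
```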

As a reminder, details for all of these options are explained in the [configuration documentation](../config.md).

