
CNDB-9104: Port over chunk cache improvements from DSE #1495

Open
wants to merge 19 commits into base: main

Conversation


@blambov commented Jan 9, 2025

What is the issue

Using buffers of different size in the chunk cache causes fragmentation, which in turn results in excessive memory use and lack of pooling for a large fraction of the buffers used by the chunk cache.

What does this PR fix and why was it fixed

Ports over single-size chunk cache buffers (DB-2904), caching memory addresses (parts of DB-2509) and file cache ids (DB-2489) from DSE.

This does not port any of the BufferPool refactoring in DSE. As C* already has distinct buffer pools for short vs. longer-term buffers, we should already be receiving similar benefits.

The per-entry on-heap overhead of the chunk cache is reduced from ~350 bytes to ~220. As part of this reduction, the patch drops the per-file lists of keys and replaces them with the ability to drop a file id for invalidation: this makes a file's entries in the cache unreachable, and they are reclaimed with some delay through the normal cache eviction.

Before this patch the cache could use on-heap memory if that was the preference of the compressor in use (e.g. Deflate specifies an ON_HEAP preference). This was highly unexpected and put very low limits on the usable cache size. The cache is now changed to always store data off heap.

Also changes the source of some temporary buffers to the short-lived "networking" pool.
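
As a rough illustration of the invalidation-by-file-id approach (a sketch only; the names follow the fileIdMap snippet further down in the diff, and the actual patch may differ):

// Sketch: cache keys embed a per-file id, so invalidating a file only requires dropping that id.
// Entries keyed by the old id can no longer be looked up and are reclaimed, with some delay,
// by the cache's normal eviction.
public void invalidateFile(File file)
{
    fileIdMap.remove(file); // subsequent reads of a file with this name get a fresh id
}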

Checklist before you submit for review

  • Make sure there is a PR in the CNDB project updating the Converged Cassandra version
  • Use NoSpamLogger for log lines that may appear frequently in the logs
  • Verify test results on Butler
  • Test coverage for new/modified code is > 80%
  • Proper code formatting
  • Proper title for each commit starting with the project-issue number, like CNDB-1234
  • Each commit has a meaningful description
  • Each commit is not very long and contains related changes
  • Renames, moves and reformatting are in distinct commits

Using CachingRebuffererTest.calculateMemoryOverhead with 1.5M entries. Cache size set at 4096 MiB.

Bytes on heap per entry: 320
Saves at least 40 bytes per cache entry (12.5%) and 20% of the insertion time.

Bytes on heap per entry: 280
With fileIDs; this has no effect on the performance of the cache.

Bytes on heap per entry: 222
This reverts commit be317b2.

Quality Gate failed

Failed conditions
57.8% Coverage on New Code (required ≥ 80%)

See analysis details on SonarQube Cloud

private final ConcurrentHashMap<File, Long> fileIdMap = new ConcurrentHashMap<>();
private final AtomicLong nextFileId = new AtomicLong(0);

// number of bits required to store the log2 of the chunk size (highestOneBit(highestOneBit(Integer.MAX_VALUE)))

Nit: I don't think you meant to repeat highestOneBit twice in that comment (and highestOneBit(highestOneBit(Integer.MAX_VALUE)) is not 5).

private final static int CHUNK_SIZE_LOG2_BITS = 5;

// number of bits required to store the ready type
private final static int READER_TYPE_BITS = Integer.highestOneBit(ChunkReader.ReaderType.COUNT - 1);

Slightly confused by the logic of this. It works here because ChunkReader.ReaderType.COUNT is 2, but say it was 5: then highestOneBit(4) is 4, but in theory you only need 3 bits for 5 values. Did you mean to put the -1 after the call to highestOneBit?
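
(For illustration only, not part of the patch: one standard way to compute the number of bits needed to hold values 0..COUNT-1, which seems to be what this constant is meant to be, is the following.)

// Sketch only: bits needed to represent values 0..COUNT-1 (assumes COUNT >= 2).
// E.g. COUNT = 2 -> 1 bit, COUNT = 5 -> 3 bits, COUNT = 9 -> 4 bits.
private final static int READER_TYPE_BITS = 32 - Integer.numberOfLeadingZeros(ChunkReader.ReaderType.COUNT - 1);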

* are occupied by log 2 of the chunk size (we assume the chunk size is a power of 2), and the rest of the bits
* are occupied by the fileId counter, which is incremented for each unseen file name.
*/
protected long fileIdFor(File file, ChunkReader.ReaderType type, int chunkSize)

Nit: there is a tad of overloading of "fileId": the fileId returned here is not quite the same thing as what assignFileId returns, technically. Not a big deal, but having slightly separate naming could prevent some future error (some new code using the result of fileIdMap directly).
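
(For context, a rough sketch of how the three values might be packed into the returned id; the shifts and the use of ordinal() here are assumptions for illustration, not necessarily what the patch does. fileIdMap, nextFileId and the *_BITS constants are from the snippets above.)

protected long fileIdFor(File file, ChunkReader.ReaderType type, int chunkSize)
{
    // per-file counter value, assigned on first sight of this file name
    long perFileId = fileIdMap.computeIfAbsent(file, f -> nextFileId.getAndIncrement());
    int chunkSizeLog2 = Integer.numberOfTrailingZeros(chunkSize); // chunk size assumed to be a power of 2
    // lowest bits: log2 of chunk size; next: reader type; remaining bits: per-file counter
    return (((perFileId << READER_TYPE_BITS) | type.ordinal()) << CHUNK_SIZE_LOG2_BITS) | chunkSizeLog2;
}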

if (buffers.length > 1)
return new MultiRegionChunk(position, buffers);
else
return new SingleRegionChunk(position, buffers[0]);
}

public ChunkCache(BufferPool pool, int cacheSizeInMB, Function<ChunkCache, ChunkCacheMetrics> createMetrics)

Nit: I personally find it a tad inconvenient to have the ctor so deep within the class; I would prefer moving it to the beginning (not new to these changes admittedly, but there is a lot of code before this now).

* @param chunkSize the amount of data to read
* @param pages an array of page-sized memory allocations whose total capacity needs to be >= chunkSize
*/
int readScattering(ChunkReader file, long position, int chunkSize, long[] pages)

Any reason this isn't part of MultiRegionChunk (even maybe just inlined into MultiRegionChunk.read)? I would have expected to see it there.

@@ -124,6 +125,9 @@ private static FileChannel openChannel(File file)
try { channel.close(); }
catch (Throwable t2) { t.addSuppressed(t2); }
}

// Invalidate any cache entries that may exist for a previous file with the same name.
ChunkCache.instance.invalidateFile(file);

Most importantly, ChunkCache.instance can be null, which would break here.

Also, I don't have a super good alternative in mind at the moment, but it does feel rather fragile to me that we have to call this everywhere we write files that may be used by the chunk cache. If some code, especially in cndb, decided to write files by other means in some special case, this would be really easy to forget/get wrong (side note: I'm not saying it's the case right now, I don't think it is, but I'm also really not 100% sure).

Admittedly a half-baked idea, but couldn't we, say, shove the file modification time into the cache fileId so we don't need to do this?
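
(A minimal null-safe variant of the call above, sketched for illustration only:)

ChunkCache cache = ChunkCache.instance; // may be null, e.g. when the chunk cache is disabled
if (cache != null)
    cache.invalidateFile(file);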

@@ -130,6 +131,7 @@ public class BufferPool
private static final Logger logger = LoggerFactory.getLogger(BufferPool.class);
private static final NoSpamLogger noSpamLogger = NoSpamLogger.getLogger(logger, 15L, TimeUnit.MINUTES);
private static final ByteBuffer EMPTY_BUFFER = ByteBuffer.allocateDirect(0);
private static boolean DISABLE_COMBINED_ALLOCATION = Boolean.getBoolean("cassandra.bufferpool.disable_combined_allocation");

Nit: can be final?

}

/**
* A chunk with a single memory region. This can be used for reading chunks of up to PageAware.PAGE_SIZE but note

The "This can be used for reading chunks of up to PageAware.PAGE_SIZE" sentence here is a bit confusing, given we use this for reading larger chunks when we're able to get contiguous regions.

if (chunkSize < PageAware.PAGE_SIZE)
return new SingleRegionChunk(position, bufferPool.get(PageAware.PAGE_SIZE, BufferType.OFF_HEAP).limit(chunkSize).slice());

ByteBuffer[] buffers = bufferPool.getMultiple(chunkSize, PageAware.PAGE_SIZE, BufferType.OFF_HEAP);

This code relies on BufferPool.getMultiple behaving in ways that do not seem to be guaranteed by the javadoc of that method. That is, the javadoc of getMultiple is:

    /**
     * Allocate the given amount of memory, where the caller can accept the space to be split into multiple buffers.
     *
     * @param totalSize the total size to be allocated
     * @param chunkSize the minimum size of each buffer returned
     *
     * @return an array of allocated buffers
     */

but that only says that chunkSize is a minimum buffer size and that the allocation may or may not be split into multiple buffers. But here, the code relies on getMultiple behaving in only one of two ways:

  1. it returns a single buffer with the full allocation.
  2. it returns buffers that are all exactly of size chunkSize.

I'll note in particular that the 2nd point is relied upon by MultiRegionChunk.getBuffer, which expects each buffer to be exactly PageAware.PAGE_SIZE.

Now, I'm not 100% sure that getMultiple always behaves as we rely on here, because I don't quite get the difference between BufferPool.get and BufferPool.getAtLeast, but at a minimum, I think we should clarify the stronger specification of getMultiple that we actually rely on.
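
(Sketch only, not part of the patch: one way to make that stronger, assumed contract explicit at the call site; buffers and PageAware.PAGE_SIZE are the ones from the snippet above.)

if (buffers.length > 1)
{
    // the multi-region path assumes every split buffer is exactly one page
    for (ByteBuffer buffer : buffers)
        assert buffer.capacity() == PageAware.PAGE_SIZE
             : "getMultiple is relied upon to split only into exact page-sized buffers";
}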


while (remainingSize >= chunkSize)
{
buffers[idx] = pool.getAtLeast(chunkSize);

For my own curiosity, what is the rationale for when to use getAtLeast versus get? It seems that getAtLeast still returns a buffer whose capacity has been set to chunkSize (or does it not?), and if so, why do we use get instead of getAtLeast in the last call below?

@cassci-bot

❌ Build ds-cassandra-pr-gate/PR-1495 rejected by Butler


341 new test failure(s) in 5 builds
See build details here


Found 341 new test failures

Showing only first 15 new test failures

Test Explanation Branch history Upstream history
...buted.test.IncrementalRepairCoordinatorFastTest regression 🔴🔴🔴🔴🔴 🔵🔵🔵🔵🔵🔵🔵
....test.PreviewRepairCoordinatorNeighbourDownTest regression 🔴🔴🔴🔴🔴 🔵🔵🔵🔵🔵🔵🔵
...ibuted.test.PreviewRepairCoordinatorTimeoutTest regression 🔴🔴🔴🔴🔴 🔵🔵🔵🔵🔵🔵🔵
...rg.apache.cassandra.distributed.test.RepairTest regression 🔴🔴🔴🔴🔴 🔵🔵🔵🔵🔵🔵🔵
...distributed.test.UnableToParseClientMessageTest regression 🔴🔴🔴🔴🔴 🔵🔵🔵🔵🔵🔵🔵
...ributed.test.sai.datamodels.QueryTimeToLiveTest regression 🔴🔴🔴🔴🔴 🔵🔵🔵🔵🔵🔵🔵
...ted.test.sai.datamodels.QueryWriteLifecycleTest regression 🔴🔴🔴🔴🔴 🔵🔵🔵🔵🔵🔵🔵
r.TestReadRepairGuarantees.test_atomic_writes[n... regression 🔴🔴🔴🔴🔴 🔵🔵🔵🔵🔵🔵🔵
r.TestAllowFiltering.test_update_on_collection regression 🔴🔴🔴🔴🔴 🔵🔵🔵🔵🔵🔵🔵
...ControllerConfigTest.testVectorControllerConfig regression 🔴🔴🔴🔴🔴 🔵🔵🔵🔵🔵🔵🔵
...roupByTest.groupByWithDeletesAndSrpOnPartitions regression 🔴🔴🔴🔴🔴 🔵🔵🔵🔵🔵🔵🔵
...sRepairStreamingTest.testWithCompressionEnabled regression 🔴🔴🔴🔴🔴 🔵🔵🔵🔵🔵🔵🔵
...ToolEnableDisableBinaryTest.testMaybeChangeDocs regression 🔴🔴🔴🔴🔴 🔵🔵🔵🔵🔵🔵🔵
o.a.c.d.t.OptimiseStreamsRepairTest.testBasic regression 🔴🔴🔴🔴🔴 🔵🔵🔵🔵🔵🔵🔵
...: provision strategy=MultipleNetworkInterfaces] regression 🔴🔴🔴🔴🔴 🔵🔵🔵🔵🔵🔵🔵

Found 36 known test failures
