
HDDS-11894. PreAllocate and Cache Blocks in OM #7550

Draft · wants to merge 6 commits into base: master

Conversation

@tanvipenumudy (Contributor)

What changes were proposed in this pull request?

[WIP] The design and implementation details have been added to the Jira ticket: HDDS-11894

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-11894

How was this patch tested?

  • Manually tested on a local setup and an in-house cluster.
  • Existing integration and acceptance test flows

@errose28 (Contributor) commented Dec 9, 2024

Thanks for working on this @tanvipenumudy. This looks like a medium-sized improvement, so it might be better to break HDDS-11894 down into subtasks, since the code change here is almost 2k lines. The PR for the top-level Jira can be a markdown design doc instead of a Google doc, for better visibility and easier review.

@tanvipenumudy (Contributor, Author)

Sure @errose28, I believe most of the code changes come from the Grafana dashboards and metrics-related files. I'll create separate tasks for each and move those changes out, thanks!

@sodonnel (Contributor)

I added some comments to the design. I feel the implementation should wait on a more fully featured design, as there are probably some areas that need further thought.

private static final ReplicationConfig RS_6_3_1024 =
ReplicationConfig.fromProto(HddsProtos.ReplicationType.EC, null,
toProto(6, 3, ECReplicationConfig.EcCodec.RS, 1024));
private static final ReplicationConfig XOR_10_4_4096 =
@sodonnel (Contributor)

EC can have RS_10_4_1024k.

I would suggest creating these queues on demand, as there is potentially an infinite number of EC configurations with different chunk sizes. That said, we have only tested 1024k, but in theory any chunk size is possible and the system does not prevent different EC types from being used, so we should avoid adding any code that would prevent it.
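For illustration, a minimal sketch of the on-demand approach using plain JDK types; in the PR the key would presumably be ReplicationConfig and the values the PR's ExpiringAllocatedBlock, and the class and method names here are made up:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedDeque;
import java.util.concurrent.ConcurrentMap;

// Sketch only: queues are created lazily, keyed by the replication config,
// so new data-parity/chunk-size combinations need no code change.
class OnDemandBlockQueues<K, B> {
  private final ConcurrentMap<K, ConcurrentLinkedDeque<B>> blockQueueMap =
      new ConcurrentHashMap<>();

  ConcurrentLinkedDeque<B> getOrCreateQueue(K replicationConfig) {
    // computeIfAbsent is atomic per key, so concurrent callers share one queue instance.
    return blockQueueMap.computeIfAbsent(
        replicationConfig, k -> new ConcurrentLinkedDeque<>());
  }
}
```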

@tanvipenumudy (Contributor, Author)

Thank you @sodonnel for pointing this out, will be incorporating the creation of these queues on demand!

@tanvipenumudy (Contributor, Author)

Could you please clarify a few questions @sodonnel:

  1. Can the other parameters within the EC replication configuration vary alongside the chunk size?
  2. Can we impose specific limitations on the chunk size variations, or are they indefinite?
  3. Will replication configurations with different chunk sizes require separate pipelines, or can they share the same pipeline internally?

@sodonnel (Contributor)

Can the other parameters within the EC replication configuration vary alongside the chunk size?

We have only really tested 3-2, 6-3 and 10-4, but any combination of data and parity within reason is theoretically possible. Therefore we should not impose limits in other parts of the code that may break things in the future.

Can we impose specific limitations over the chunk size variations, or are they indefinite?

In theory, they are indefinite, but out of the box there is a config that limits the EC schemes to the few we have tested, i.e. 3-2, 6-3, and 10-4 with a 1024k chunk size; by overriding the config, a user can do whatever they want.

Will replication configurations with different chunk sizes require separate pipelines, or can they share the same pipeline internally?

Yes, each EC scheme has a different pool of pipelines. Also, an EC pipeline is only ever used for a single container; the pipelines are not long-lived.

Contributor:

For this PR we can limit the caching to only chunk sizes that have been tested. In the longer term, we should stop having pipelines be dependent on the chunk size but only be dependent on selection of data and parity nodes across all chunk sizes.

Contributor:

We should still not have a finite list of EC schemes hard-coded. Ignoring chunk size, it is possible for someone to use any combination of data and parity, and you don't want to have to change this code if someone starts using rs-4-4, 3-3, 15-3, or whatever else they may come up with.

@kerneltime added the performance and om-pre-ratis-execution (PRs related to https://issues.apache.org/jira/browse/HDDS-11897) labels on Jan 7, 2025
@kerneltime self-requested a review on January 7, 2025 06:48
@kerneltime (Contributor) left a comment

Include a screen capture for the dashboard.

Comment on lines +700 to +704
public static final String OZONE_OM_PREFETCH_MAX_BLOCKS = "ozone.om.prefetch.max.blocks";
public static final int OZONE_OM_PREFETCH_MAX_BLOCKS_DEFAULT = 10000;

public static final String OZONE_OM_PREFETCH_BLOCKS_FACTOR = "ozone.om.prefetch.blocks.factor";
public static final int OZONE_OM_PREFETCH_BLOCKS_FACTOR_DEFAULT = 2;
Contributor:

Add comments for what these configurations are.
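For example, the requested documentation might look like the following; the wording is my reading of how the surrounding code uses these settings, not text from the PR, and the exact semantics of the max-blocks bound in particular is an assumption:

```java
/**
 * Assumed meaning: upper bound on the number of pre-allocated blocks the OM
 * keeps cached (per replication config) before it stops prefetching.
 */
public static final String OZONE_OM_PREFETCH_MAX_BLOCKS = "ozone.om.prefetch.max.blocks";
public static final int OZONE_OM_PREFETCH_MAX_BLOCKS_DEFAULT = 10000;

/**
 * Multiplier applied to a client's requested block count when triggering a
 * background prefetch, e.g. a request for N blocks prefetches N * factor.
 */
public static final String OZONE_OM_PREFETCH_BLOCKS_FACTOR = "ozone.om.prefetch.blocks.factor";
public static final int OZONE_OM_PREFETCH_BLOCKS_FACTOR_DEFAULT = 2;
```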


ReplicationConfig replicationConfig,
String serviceID, ExcludeList excludeList) {
ConcurrentLinkedDeque<ExpiringAllocatedBlock> queue = blockQueueMap.get(replicationConfig);
prefetchExecutor.submit(() -> {
Contributor:

Measure the time it takes the executor to run the lambda.
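For illustration, one way to time the lambda on the executor thread itself; this is a self-contained sketch with JDK types only, and the LongAdder stands in for whatever metric the PR actually adds:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.LongAdder;

// Sketch: measure on the executor thread so the timing covers the SCM call
// and queue population, not just task submission.
class PrefetchTimingSketch {
  private final ExecutorService prefetchExecutor = Executors.newSingleThreadExecutor();
  private final LongAdder prefetchTaskTotalMs = new LongAdder(); // stand-in for a real metric

  void submitPrefetch(Runnable fillQueueTask) {
    prefetchExecutor.submit(() -> {
      long startNs = System.nanoTime();
      try {
        fillQueueTask.run(); // the existing logic that allocates from SCM and fills the queue
      } finally {
        prefetchTaskTotalMs.add(TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - startNs));
      }
    });
  }
}
```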

int remainingBlocks = numBlocks - retrievedBlocksCount;
if (remainingBlocks > 0) {
List<AllocatedBlock> newBlocks = scmBlockLocationProtocol.allocateBlock(
scmBlockSize, remainingBlocks, replicationConfig, serviceID, excludeList, clientMachine);
Contributor:

Can we overallocate and populate the cache here as well?
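A sketch of what that could look like at this call site; the surplus handling and the ExpiringAllocatedBlock construction are assumptions for illustration, not code from the PR:

```java
int remainingBlocks = numBlocks - retrievedBlocksCount;
if (remainingBlocks > 0) {
  // Assumption: request extra blocks on the cache-miss path too, hand the
  // caller what it asked for, and push the surplus into the queue.
  int toAllocate = remainingBlocks * blockPrefetchFactor;
  List<AllocatedBlock> newBlocks = scmBlockLocationProtocol.allocateBlock(
      scmBlockSize, toAllocate, replicationConfig, serviceID, excludeList, clientMachine);
  int forCaller = Math.min(remainingBlocks, newBlocks.size());
  for (AllocatedBlock surplus : newBlocks.subList(forCaller, newBlocks.size())) {
    queue.offer(toExpiringAllocatedBlock(surplus)); // hypothetical helper; expiry handling omitted
  }
  // ... the first forCaller blocks are returned to the caller as before ...
}
```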

allocatedBlocks = prefetchClient.getBlocks(scmBlockSize, numBlocks, replicationConfig, serviceID, excludeList,
clientMachine, clusterMap);

prefetchClient.prefetchBlocks(scmBlockSize, numBlocks * blockPrefetchFactor, replicationConfig, serviceID,
Contributor:

Measure the time it takes for prefetchBlocks; the expectation is that this should be really quick since we fork off a new thread, but it is good to validate.

Contributor:

Measure the overall allocateBlock call as well.
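Both measurements could share one pattern; in this sketch the metric sinks are hypothetical names and the existing prefetchBlocks call is left unchanged:

```java
// Sketch only: time the whole allocateBlock path plus the prefetch submission.
long allocateStartNs = System.nanoTime();
allocatedBlocks = prefetchClient.getBlocks(scmBlockSize, numBlocks, replicationConfig, serviceID,
    excludeList, clientMachine, clusterMap);

long prefetchStartNs = System.nanoTime();
// ... existing prefetchClient.prefetchBlocks(...) call, unchanged ...
metrics.addPrefetchSubmitLatencyNs(System.nanoTime() - prefetchStartNs); // expected to be tiny
metrics.addAllocateBlockLatencyNs(System.nanoTime() - allocateStartNs);  // overall allocateBlock call
```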
