
HDDS-11894. PreAllocate and Cache Blocks in OM #7550

Draft · wants to merge 6 commits into base: master

Conversation

@tanvipenumudy (Contributor)

What changes were proposed in this pull request?

[WIP] The design and implementation details have been added to the Jira ticket: HDDS-11894

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-11894

How was this patch tested?

  • Manually tested on a local setup and an in-house cluster.
  • Existing integration and acceptance test flows

@errose28 (Contributor) commented Dec 9, 2024

Thanks for working on this @tanvipenumudy. This looks like a medium-sized improvement, so it might be better to break HDDS-11894 down into subtasks, since the code change here is almost 2k lines. The PR for the top-level Jira can be a markdown design doc instead of a Google doc, for better visibility and easier review.

@tanvipenumudy (Contributor, Author)

Sure @errose28, I believe most of the code changes come from the Grafana dashboards and metrics-related files. I'll create separate tasks for each and move those changes out, thanks!

@sodonnel (Contributor)

I added some comments to the design. I feel the implementation should wait on a more fully featured design, as there are probably some areas that need further thought.

private static final ReplicationConfig RS_6_3_1024 =
ReplicationConfig.fromProto(HddsProtos.ReplicationType.EC, null,
toProto(6, 3, ECReplicationConfig.EcCodec.RS, 1024));
private static final ReplicationConfig XOR_10_4_4096 =
@sodonnel (Contributor)

EC can have RS_10_4_1024k.

I would suggest creating these queues on demand, as there is potentially an infinite number of EC configurations with different chunk sizes. That said, we have only tested 1024k, but in theory any chunk size is possible and the system does not prevent different EC types from being used, so we should avoid adding any code that would prevent it.
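For illustration, a minimal sketch of the on-demand approach using plain JDK types; in the PR the key would presumably be ReplicationConfig and the values the PR's ExpiringAllocatedBlock, and the class and method names here are made up:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedDeque;
import java.util.concurrent.ConcurrentMap;

// Sketch only: queues are created lazily, keyed by the replication config,
// so new data-parity/chunk-size combinations need no code change.
class OnDemandBlockQueues<K, B> {
  private final ConcurrentMap<K, ConcurrentLinkedDeque<B>> blockQueueMap =
      new ConcurrentHashMap<>();

  ConcurrentLinkedDeque<B> getOrCreateQueue(K replicationConfig) {
    // computeIfAbsent is atomic per key, so concurrent callers share one queue instance.
    return blockQueueMap.computeIfAbsent(
        replicationConfig, k -> new ConcurrentLinkedDeque<>());
  }
}
```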

@tanvipenumudy (Contributor, Author)

Thank you @sodonnel for pointing this out, will be incorporating the creation of these queues on demand!

@tanvipenumudy (Contributor, Author)

Could you please clarify a few questions @sodonnel:

  1. Can the other parameters within the EC replication configuration vary alongside the chunk size?
  2. Can we impose specific limitations on the chunk size variations, or are they indefinite?
  3. Will replication configurations with different chunk sizes require separate pipelines, or can they share the same pipeline internally?

@sodonnel (Contributor)

Can the other parameters within the EC replication configuration vary alongside the chunk size?

We have only really tested 3-2, 6-3 and 10-4, but any combination of data and parity within reason is theoretically possible. Therefore we should not impose limits in other parts of the code that may break things in the future.

Can we impose specific limitations over the chunk size variations, or are they indefinite?

In theory, they are indefinite, but out of the box there is a config that limits the EC schemes to the few we have tested, i.e. 3-2, 6-3, and 10-4 with a 1024k chunk size; by overriding the config, a user can do whatever they want.

Will replication configurations with different chunk sizes require separate pipelines, or can they share the same pipeline internally?

Yes, each EC scheme has a different pool of pipelines. Also, an EC pipeline is only ever used for a single container; the pipelines are not long-lived.

Contributor:

For this PR we can limit the caching to only chunk sizes that have been tested. In the longer term, we should stop having pipelines be dependent on the chunk size but only be dependent on selection of data and parity nodes across all chunk sizes.

Contributor:

We should still not have a finite list of EC schemes hard-coded. Ignoring chunk size, it is possible for someone to use any combination of data and parity, and you don't want to have to change this code if someone starts using rs-4-4, 3-3, 15-3, or whatever else they may come up with.

@kerneltime added the performance and om-pre-ratis-execution (PRs related to https://issues.apache.org/jira/browse/HDDS-11897) labels on Jan 7, 2025
@kerneltime self-requested a review on January 7, 2025 06:48
@kerneltime (Contributor) left a comment

Include a screen capture for the dashboard.

Comment on lines +700 to +704
public static final String OZONE_OM_PREFETCH_MAX_BLOCKS = "ozone.om.prefetch.max.blocks";
public static final int OZONE_OM_PREFETCH_MAX_BLOCKS_DEFAULT = 10000;

public static final String OZONE_OM_PREFETCH_BLOCKS_FACTOR = "ozone.om.prefetch.blocks.factor";
public static final int OZONE_OM_PREFETCH_BLOCKS_FACTOR_DEFAULT = 2;
Contributor:

Add comments for what these configurations are.
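For example, the requested documentation might look like the following; the wording is my reading of how the surrounding code uses these settings, not text from the PR, and the exact semantics of the max-blocks bound in particular is an assumption:

```java
/**
 * Assumed meaning: upper bound on the number of pre-allocated blocks the OM
 * keeps cached (per replication config) before it stops prefetching.
 */
public static final String OZONE_OM_PREFETCH_MAX_BLOCKS = "ozone.om.prefetch.max.blocks";
public static final int OZONE_OM_PREFETCH_MAX_BLOCKS_DEFAULT = 10000;

/**
 * Multiplier applied to a client's requested block count when triggering a
 * background prefetch, e.g. a request for N blocks prefetches N * factor.
 */
public static final String OZONE_OM_PREFETCH_BLOCKS_FACTOR = "ozone.om.prefetch.blocks.factor";
public static final int OZONE_OM_PREFETCH_BLOCKS_FACTOR_DEFAULT = 2;
```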


ReplicationConfig replicationConfig,
String serviceID, ExcludeList excludeList) {
ConcurrentLinkedDeque<ExpiringAllocatedBlock> queue = blockQueueMap.get(replicationConfig);
prefetchExecutor.submit(() -> {
Contributor:

Measure the time it takes the executor to run the lambda.
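For illustration, one way to time the lambda on the executor thread itself; this is a self-contained sketch with JDK types only, and the LongAdder stands in for whatever metric the PR actually adds:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.LongAdder;

// Sketch: measure on the executor thread so the timing covers the SCM call
// and queue population, not just task submission.
class PrefetchTimingSketch {
  private final ExecutorService prefetchExecutor = Executors.newSingleThreadExecutor();
  private final LongAdder prefetchTaskTotalMs = new LongAdder(); // stand-in for a real metric

  void submitPrefetch(Runnable fillQueueTask) {
    prefetchExecutor.submit(() -> {
      long startNs = System.nanoTime();
      try {
        fillQueueTask.run(); // the existing logic that allocates from SCM and fills the queue
      } finally {
        prefetchTaskTotalMs.add(TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - startNs));
      }
    });
  }
}
```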

int remainingBlocks = numBlocks - retrievedBlocksCount;
if (remainingBlocks > 0) {
List<AllocatedBlock> newBlocks = scmBlockLocationProtocol.allocateBlock(
scmBlockSize, remainingBlocks, replicationConfig, serviceID, excludeList, clientMachine);
Contributor:

Can we overallocate and populate the cache here as well?
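A sketch of what that could look like at this call site; the surplus handling and the ExpiringAllocatedBlock construction are assumptions for illustration, not code from the PR:

```java
int remainingBlocks = numBlocks - retrievedBlocksCount;
if (remainingBlocks > 0) {
  // Assumption: request extra blocks on the cache-miss path too, hand the
  // caller what it asked for, and push the surplus into the queue.
  int toAllocate = remainingBlocks * blockPrefetchFactor;
  List<AllocatedBlock> newBlocks = scmBlockLocationProtocol.allocateBlock(
      scmBlockSize, toAllocate, replicationConfig, serviceID, excludeList, clientMachine);
  int forCaller = Math.min(remainingBlocks, newBlocks.size());
  for (AllocatedBlock surplus : newBlocks.subList(forCaller, newBlocks.size())) {
    queue.offer(toExpiringAllocatedBlock(surplus)); // hypothetical helper; expiry handling omitted
  }
  // ... the first forCaller blocks are returned to the caller as before ...
}
```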

allocatedBlocks = prefetchClient.getBlocks(scmBlockSize, numBlocks, replicationConfig, serviceID, excludeList,
clientMachine, clusterMap);

prefetchClient.prefetchBlocks(scmBlockSize, numBlocks * blockPrefetchFactor, replicationConfig, serviceID,
Contributor:

Measure the time it takes for prefetchBlocks; the expectation is that this should be really quick since we fork off a new thread, but it is good to validate.

Contributor:

Measure the overall allocateBlock call as well.
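Both measurements could share one pattern; in this sketch the metric sinks are hypothetical names and the existing prefetchBlocks call is left unchanged:

```java
// Sketch only: time the whole allocateBlock path plus the prefetch submission.
long allocateStartNs = System.nanoTime();
allocatedBlocks = prefetchClient.getBlocks(scmBlockSize, numBlocks, replicationConfig, serviceID,
    excludeList, clientMachine, clusterMap);

long prefetchStartNs = System.nanoTime();
// ... existing prefetchClient.prefetchBlocks(...) call, unchanged ...
metrics.addPrefetchSubmitLatencyNs(System.nanoTime() - prefetchStartNs); // expected to be tiny
metrics.addAllocateBlockLatencyNs(System.nanoTime() - allocateStartNs);  // overall allocateBlock call
```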
