HDDS-11894. PreAllocate and Cache Blocks in OM #7550
base: master
Conversation
Thanks for working on this @tanvipenumudy. This looks like a medium-sized improvement, so it might be better to break HDDS-11894 down into subtasks since the code change here is almost 2k lines. The PR for the top-level Jira can be a markdown design doc instead of a Google doc for better visibility and easier review.
Sure @errose28, I believe most of the code changes come from the Grafana dashboards and metrics-related files. I'll create separate tasks for each and move those changes out, thanks!
I added some comments to the design. I feel the implementation should wait on a more fully featured design, as there are probably some areas that need further thought. |
private static final ReplicationConfig RS_6_3_1024 =
    ReplicationConfig.fromProto(HddsProtos.ReplicationType.EC, null,
        toProto(6, 3, ECReplicationConfig.EcCodec.RS, 1024));
private static final ReplicationConfig XOR_10_4_4096 =
EC can have RS_10_4_1024k.
I would suggest creating these queues on demand, as there is potentially an infinite number of EC configurations possible with different chunk sizes. That said, we have only tested 1024k, but in theory any chunk size is possible and the system does not prevent different EC types from being used, so we should avoid adding any code that would prevent it.
Thank you @sodonnel for pointing this out, I will incorporate creating these queues on demand!
Could you please clarify a few questions @sodonnel:
- Can the other parameters within the EC replication configuration vary alongside the chunk size?
- Can we impose specific limitations over the chunk size variations, or are they indefinite?
- Will replication configurations with different chunk sizes require separate pipelines, or can they share the same pipeline internally?
Can the other parameters within the EC replication configuration vary alongside the chunk size?
We have only really tested 3-2, 6-3 and 10-4, but any combination of data and parity within reason is theoretically possible. Therefore we should not impose limits in other parts of the code that may break things in the future.
Can we impose specific limitations over the chunk size variations, or are they indefinite?
In theory they are indefinite. Out of the box, there is a config that limits the EC schemes to the few we have tested, i.e. 3-2, 6-3 and 10-4 with a 1024k chunk size, but by overriding the config a user can do whatever they want.
Will replication configurations with different chunk sizes require separate pipelines, or can they share the same pipeline internally?
Yes, each EC scheme has a different pool of pipelines. Also, an EC pipeline is only ever used for a single container; the pipelines are not long lived.
For this PR we can limit the caching to only the chunk sizes that have been tested. In the longer term, we should stop having pipelines depend on the chunk size and have them depend only on the selection of data and parity nodes across all chunk sizes.
We should still not have a finite list of EC schemes hard-coded. Ignoring chunk size, it is possible for someone to use any combination of data and parity, and you don't want to have to change this code if someone starts using rs-4-4, 3-3, 15-3, or whatever else they may come up with.
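A minimal sketch of what on-demand queue creation could look like, assuming the `blockQueueMap` and `ExpiringAllocatedBlock` types referenced later in this PR; the `getOrCreateQueue` helper and the map type are assumptions, not the PR's actual code:

```java
// Sketch only: create the per-replication-config queue lazily instead of
// hard-coding a fixed set of EC schemes. Assumes blockQueueMap is a
// ConcurrentHashMap<ReplicationConfig, ConcurrentLinkedDeque<ExpiringAllocatedBlock>>.
private ConcurrentLinkedDeque<ExpiringAllocatedBlock> getOrCreateQueue(
    ReplicationConfig replicationConfig) {
  // computeIfAbsent is atomic on ConcurrentHashMap, so concurrent callers that
  // hit the same previously unseen EC scheme share a single queue.
  return blockQueueMap.computeIfAbsent(
      replicationConfig, cfg -> new ConcurrentLinkedDeque<>());
}
```

This keeps arbitrary data-parity-chunk-size combinations working without further code changes.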
Include a screen capture of the dashboard.
public static final String OZONE_OM_PREFETCH_MAX_BLOCKS = "ozone.om.prefetch.max.blocks";
public static final int OZONE_OM_PREFETCH_MAX_BLOCKS_DEFAULT = 10000;

public static final String OZONE_OM_PREFETCH_BLOCKS_FACTOR = "ozone.om.prefetch.blocks.factor";
public static final int OZONE_OM_PREFETCH_BLOCKS_FACTOR_DEFAULT = 2;
Add comments explaining what these configurations do.
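As an illustration, the new properties could be documented along these lines; the descriptions are my reading of this PR's behaviour, not the author's wording:

```java
/**
 * Upper bound on the number of pre-allocated blocks the OM keeps cached
 * per replication config in the prefetch queue.
 */
public static final String OZONE_OM_PREFETCH_MAX_BLOCKS = "ozone.om.prefetch.max.blocks";
public static final int OZONE_OM_PREFETCH_MAX_BLOCKS_DEFAULT = 10000;

/**
 * Multiplier applied to a client's requested block count when the OM asks
 * SCM for extra blocks to refill the prefetch cache.
 */
public static final String OZONE_OM_PREFETCH_BLOCKS_FACTOR = "ozone.om.prefetch.blocks.factor";
public static final int OZONE_OM_PREFETCH_BLOCKS_FACTOR_DEFAULT = 2;
```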
    ReplicationConfig replicationConfig,
    String serviceID, ExcludeList excludeList) {
  ConcurrentLinkedDeque<ExpiringAllocatedBlock> queue = blockQueueMap.get(replicationConfig);
  prefetchExecutor.submit(() -> {
Measure the time it takes the executor to run the lambda.
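A sketch of timing the submitted task, assuming a hypothetical `omBlockPrefetchMetrics` sink and `java.util.concurrent.TimeUnit`; the try body stands in for the existing prefetch logic:

```java
prefetchExecutor.submit(() -> {
  // Capture the start time inside the lambda so we measure execution time of the task itself.
  final long startNs = System.nanoTime();
  try {
    // ... existing block allocation and queue population logic ...
  } finally {
    long elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - startNs);
    omBlockPrefetchMetrics.addPrefetchTaskLatencyMs(elapsedMs); // hypothetical metric
  }
});
```

Capturing the start time before `submit()` instead would also include time the task spent queued in the executor, which may be worth tracking separately.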
int remainingBlocks = numBlocks - retrievedBlocksCount;
if (remainingBlocks > 0) {
  List<AllocatedBlock> newBlocks = scmBlockLocationProtocol.allocateBlock(
      scmBlockSize, remainingBlocks, replicationConfig, serviceID, excludeList, clientMachine);
Can we overallocate and populate the cache here as well?
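One possible shape for over-allocating on this path; `blockPrefetchFactor`, `getOrCreateQueue`, `result`, `expiryMs` and the `ExpiringAllocatedBlock` constructor are assumptions based on the rest of this PR, not its actual API:

```java
int remainingBlocks = numBlocks - retrievedBlocksCount;
if (remainingBlocks > 0) {
  // Ask SCM for more blocks than the caller still needs so the surplus can refill the cache.
  int toAllocate = remainingBlocks * blockPrefetchFactor;
  List<AllocatedBlock> newBlocks = scmBlockLocationProtocol.allocateBlock(
      scmBlockSize, toAllocate, replicationConfig, serviceID, excludeList, clientMachine);

  // Hand back only what the caller asked for...
  int forCaller = Math.min(remainingBlocks, newBlocks.size());
  result.addAll(newBlocks.subList(0, forCaller));

  // ...and push any surplus into the per-replication-config queue.
  ConcurrentLinkedDeque<ExpiringAllocatedBlock> queue = getOrCreateQueue(replicationConfig);
  for (AllocatedBlock surplus : newBlocks.subList(forCaller, newBlocks.size())) {
    queue.offer(new ExpiringAllocatedBlock(surplus, System.currentTimeMillis() + expiryMs));
  }
}
```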
allocatedBlocks = prefetchClient.getBlocks(scmBlockSize, numBlocks, replicationConfig, serviceID, excludeList,
    clientMachine, clusterMap);

prefetchClient.prefetchBlocks(scmBlockSize, numBlocks * blockPrefetchFactor, replicationConfig, serviceID,
Measure the time it takes for prefetchBlocks as well. The expectation is that this should be really quick since we fork off a new thread, but it's good to validate.
Measure the overall allocateBlock call as well.
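A sketch of both measurements at the call site; the metrics object and its methods are placeholders, and the prefetchBlocks argument list is assumed to mirror getBlocks:

```java
// Overall allocateBlock latency as seen by the OM, including the cache-hit path.
long allocateStartNs = System.nanoTime();
allocatedBlocks = prefetchClient.getBlocks(scmBlockSize, numBlocks, replicationConfig, serviceID,
    excludeList, clientMachine, clusterMap);
metrics.addAllocateBlockLatencyNs(System.nanoTime() - allocateStartNs); // placeholder metric

// prefetchBlocks should return almost immediately since it only submits work to
// the executor; timing it here validates that expectation.
long prefetchStartNs = System.nanoTime();
prefetchClient.prefetchBlocks(scmBlockSize, numBlocks * blockPrefetchFactor, replicationConfig, serviceID,
    excludeList, clientMachine, clusterMap); // argument list assumed
metrics.addPrefetchSubmitLatencyNs(System.nanoTime() - prefetchStartNs); // placeholder metric
```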
What changes were proposed in this pull request?
[WIP] The design and implementation details have been added to the Jira ticket: HDDS-11894
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-11894
How was this patch tested?