HDDS-11891. Design doc for find Block missing Key tool #7548

Status: Open, 4 commits to merge into base: master

@xichen01 xichen01 commented Dec 9, 2024

What changes were proposed in this pull request?

Design doc for find Block missing Key tool

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-11891

How was this patch tested?

N/A

@sumitagrawl
Contributor

@xichen01 I have a few questions:

  1. What triggers the check of a file's blocks on the DN? Will all keys in the system be verified?
  2. Recon already has a job that checks for missing containers, so IMO checking the container state in order to check blocks may not be required. However, if a container is not in a healthy state (quasi-closed, deleted, ...) and there is therefore a chance of missing blocks, that can be reported as additional information by querying SCM.
  3. How about having something similar in Recon as well?

@xichen01
Contributor Author

@sumitagrawl Thanks for your questions.

  1. What is the trigger point...

The scope is selected by the “OzoneAddress”: a key, a bucket, a volume, or the whole cluster can be checked.

  2. Recon already have job to check...

The command outputs all missing keys. If we skip the container state check, we would need to get that information from Recon.
We have also encountered missing keys whose containers seem never to have existed in the cluster: there is no record of them in SCM or Recon. Can Recon detect this kind of scenario?

  3. How about having similar ...

In the long run that is possible, but it would require more development; a command-line tool is more flexible and simpler.

@errose28
Contributor

This seems similar to the ozone debug read-replicas tool, except that it is expected to run faster because it is only checking block existence, not block data. Could we just add a flag to that command to tell it to only pull block metadata and accomplish the same result? Also how does the proposed headBlock differ from the existing getBlock request?

@xichen01
Contributor Author

@errose28 Thanks for your questions.

except that it is expected to run faster because it is only checking block existence

Yes, it performs better. In our internal version, with 6 buckets checked in parallel, the total QPS is around 70k, and the main bottleneck is OM's ListKeys.

Could we just add a flag to that command to tell it to only pull block metadata and accomplish the same result ?

Do you mean that we only check the block metadata in the DN's RocksDB and do not check the block file on disk? I think that is possible.

how does the proposed headBlock differ from the existing getBlock request

headBlock (the name may change) checks a batch of blocks per request instead of one, and its return value can be simpler: because it only checks for existence, it returns only the blocks that have a problem.
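
To make the contrast with a per-block request concrete, here is a minimal sketch of the batch semantics described above. It is not the proposed Datanode API: the name `head_blocks`, the in-memory `stored` index, and the `(container_id, local_id)` tuple are all assumptions made for illustration.

```python
from typing import Dict, Iterable, List, Set, Tuple

# Hypothetical block identifier: (container_id, local_id).
BlockID = Tuple[int, int]


def head_blocks(stored: Dict[int, Set[int]],
                requested: Iterable[BlockID]) -> List[BlockID]:
    """Check many blocks in one call and return only the absent ones.

    `stored` is a stand-in for whatever the Datanode would consult
    (for example its block metadata); here it maps a container ID to
    the set of local IDs present in that container.
    """
    missing: List[BlockID] = []
    for container_id, local_id in requested:
        # An unknown container is treated as holding no blocks.
        if local_id not in stored.get(container_id, set()):
            missing.append((container_id, local_id))
    return missing
```

With this shape, a healthy batch yields an empty list, so the response stays small no matter how many blocks were checked in the request.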

@xichen01
Contributor Author

@sumitagrawl @errose28 Is there any update?

Comment on lines 2 to 3
title: Erasure Coding in Ozone
summary: Use Erasure Coding algorithm for efficient storage
Contributor

Please update the title and summary.

Contributor Author

Updated.

@sumitagrawl
Contributor

sumitagrawl commented Jan 7, 2025

The command outputs all missing keys. If we skip the container state check, we would need to get that information from Recon.
We have also encountered missing keys whose containers seem never to have existed in the cluster: there is no record of them in SCM or Recon. Can Recon detect this kind of scenario?

Yes, Recon already has the capability to identify missing containers and report them. It monitors all keys and verifies whether any container is missing for those keys.

At the block level, however, if a block is deleted from the physical disk, there is no direct mechanism to identify this via Recon until the data is read. That said,

there is a DN container scan task that verifies whether the container metadata and the on-disk data are consistent. If they are not, it marks the container as unhealthy so that replication can re-replicate it. (I recall this is disabled by default; I need to recheck.)

cc: @errose28

@xichen01
Contributor Author

xichen01 commented Jan 7, 2025

Thanks for the information.
Recon can handle keys whose containers are abnormal, but if we leave those keys out of the output, the result will cover only part of the keys with missing blocks, which may be ambiguous. To report keys with missing blocks completely, I think the container state check is necessary.
Also, if we want to check blocks on the Datanode, the container state check is hard to bypass.

There is a DN container scan task that verifies whether the container metadata and the on-disk data are consistent.

That relies on the block having been placed correctly in the container and on the DN not having deleted the block incorrectly (i.e., a block of a key that should not have been removed by the normal deletion process). Neither is guaranteed for a cluster that has been upgraded many times and has run for a long time.
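
One way to read the argument above is that each block of a key gets classified in order: first by whether its container is known at all, then by container state, and only then by block existence. The sketch below illustrates that ordering only. The `Finding` categories, the `classify_block` function, and the default set of unavailable states are assumptions, not part of the design doc.

```python
from enum import Enum
from typing import Dict, FrozenSet, Set, Tuple

# Hypothetical block identifier: (container_id, local_id).
BlockID = Tuple[int, int]


class Finding(Enum):
    OK = "ok"
    CONTAINER_UNKNOWN = "container_unknown"          # no record of the container
    CONTAINER_UNAVAILABLE = "container_unavailable"  # state rules out reading
    BLOCK_MISSING = "block_missing"                  # container fine, block absent


# Assumed, illustrative choice of states treated as unavailable.
UNAVAILABLE_STATES: FrozenSet[str] = frozenset({"DELETING", "DELETED"})


def classify_block(block_id: BlockID,
                   container_states: Dict[int, str],
                   stored_blocks: Dict[int, Set[int]],
                   unavailable_states: FrozenSet[str] = UNAVAILABLE_STATES) -> Finding:
    """Classify one block, checking the container before the block itself."""
    container_id, local_id = block_id
    state = container_states.get(container_id)
    if state is None:
        return Finding.CONTAINER_UNKNOWN
    if state in unavailable_states:
        return Finding.CONTAINER_UNAVAILABLE
    if local_id not in stored_blocks.get(container_id, set()):
        return Finding.BLOCK_MISSING
    return Finding.OK
```

Reporting the container-level findings alongside `BLOCK_MISSING` is what keeps the output complete: a key whose container was never recorded would otherwise not appear in a block-only report.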

@kerneltime
Contributor

Can you include the text from https://www.notion.so/meeting-room-Conference-Room-d17916fda32244f2b5edfec93c165cee?pvs=21 in the PR itself? I tried to open it, but I do not have access.

1. Retrieve Key metadata:
- Gather metadata such as the Key name and BlockID (consisting of ContainerID and localID).
- After collecting sufficient Key metadata, organize it for further processing.
- There are three approaches to retrieve metadata (detailed in [Three Approaches to Retrieve Metadata](https://www.notion.so/meeting-room-Conference-Room-d17916fda32244f2b5edfec93c165cee?pvs=21)).
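
The "organize it for further processing" step quoted above could, for example, group the collected block locations by container and split each group into fixed-size batches. This is a sketch under that assumption. `KeyBlock`, `group_by_container`, and `batches` are hypothetical names, and the design doc may organize the metadata differently.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, NamedTuple


class KeyBlock(NamedTuple):
    """One block of a key; BlockID = ContainerID + localID."""
    key_name: str
    container_id: int
    local_id: int


def group_by_container(blocks: Iterable[KeyBlock]) -> Dict[int, List[KeyBlock]]:
    """Group block locations by container ID, preserving input order."""
    grouped: Dict[int, List[KeyBlock]] = defaultdict(list)
    for block in blocks:
        grouped[block.container_id].append(block)
    return dict(grouped)


def batches(items: List[KeyBlock], size: int) -> Iterator[List[KeyBlock]]:
    """Yield consecutive slices of at most `size` items."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]
```

Because every batch carries the key name with each block, a missing block can be mapped back to its key without a second lookup.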
Contributor

The Notion content should be converted to Markdown and included here in the PR itself.

Contributor Author

updated
