DAOS-16930 pool: Share map bulk resources #15780

Merged
merged 2 commits into from
Jan 24, 2025

NiuYawei

Improve concurrent POOL_QUERY, POOL_CONNECT, and POOL_TGT_QUERY_MAP efficiency by giving them a chance to share the same pool map buffer and pool map buffer bulk handle.
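
For readers unfamiliar with the idea, here is a minimal Python sketch of letting concurrent handlers share one serialized map buffer. It is not the DAOS implementation (which is C and also shares a bulk handle). The class `MapBufferPool`, the `serialize` callback, and keying the buffer by map version are all assumptions made for illustration.

```python
import threading
from typing import Callable, Dict, List


class MapBufferPool:
    """Hypothetical sketch: share one serialized map buffer among concurrent users."""

    def __init__(self, serialize: Callable[[int], bytes]) -> None:
        self._serialize = serialize              # builds the buffer for a map version
        self._lock = threading.Lock()
        self._entries: Dict[int, List] = {}      # version -> [buffer, refcount]

    def acquire(self, version: int) -> bytes:
        """Return the shared buffer for `version`, building it only if nobody holds one."""
        with self._lock:
            entry = self._entries.get(version)
            if entry is None:
                entry = [self._serialize(version), 0]
                self._entries[version] = entry
            entry[1] += 1
            return entry[0]

    def release(self, version: int) -> None:
        """Drop one reference; free the buffer when the last user is done."""
        with self._lock:
            entry = self._entries[version]
            entry[1] -= 1
            if entry[1] == 0:
                del self._entries[version]
```

Sharing only happens while requests overlap: once the last holder releases the buffer it is freed, so a later request builds a new one. That matches the "chance to share" wording above rather than a permanent cache.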

Introduce pool space query caching on the service leader to avoid space query flooding. The pool space cache expires after 2 seconds by default. The expiration time can be changed via DAOS_POOL_SPACE_CACHE_INTVL; setting it to zero disables the space cache.
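
The expiration behavior described above can be sketched in Python as a small time-based cache. This is only an illustration under stated assumptions: `SpaceCache`, `interval_from_env`, and the `query` callback are hypothetical names, and the sketch assumes the variable holds a number of seconds. Only the variable name, the 2-second default, and the "zero disables the cache" rule come from the PR description.

```python
import os
import time
from typing import Callable, Mapping, Optional

DEFAULT_INTERVAL_S = 2.0  # default expiration stated in the PR description


def interval_from_env(environ: Mapping[str, str] = os.environ) -> float:
    """Read the expiration time; assumes the value is given in seconds."""
    return float(environ.get("DAOS_POOL_SPACE_CACHE_INTVL", DEFAULT_INTERVAL_S))


class SpaceCache:
    """Hypothetical time-based cache in front of an expensive space query."""

    def __init__(self, query: Callable[[], dict], interval_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._query = query            # the expensive query to avoid repeating
        self._interval_s = interval_s
        self._clock = clock            # injectable so the cache can be tested
        self._value: Optional[dict] = None
        self._stamp = 0.0

    def get(self) -> dict:
        # An interval of zero disables caching: every call runs the query.
        if self._interval_s <= 0:
            return self._query()
        now = self._clock()
        if self._value is None or now - self._stamp >= self._interval_s:
            self._value = self._query()
            self._stamp = now
        return self._value
```

With the default interval, any number of callers arriving within the same 2-second window trigger a single query, at the cost of results that may be up to 2 seconds old.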

Before requesting gatekeeper:

  • Two review approvals and any prior change requests have been resolved.
  • Testing is complete and all tests passed or there is a reason documented in the PR why it should be force landed and forced-landing tag is set.
  • Features: (or Test-tag*) commit pragma was used or there is a reason documented that there are no appropriate tags for this PR.
  • Commit messages follow the guidelines outlined here.
  • Any tests skipped by the ticket being addressed have been run and passed in the PR.

Gatekeeper:

  • You are the appropriate gatekeeper to be landing the patch.
  • The PR has 2 reviews by people familiar with the code, including appropriate owners.
  • Githooks were used. If not, request that user install them and check copyright dates.
  • Checkpatch issues are resolved. Pay particular attention to ones that will show up on future PRs.
  • All builds have passed. Check non-required builds for any new compiler warnings.
  • Sufficient testing is done. Check feature pragmas and test tags and that tests skipped for the ticket are run and now pass with the changes.
  • If applicable, the PR has addressed any potential version compatibility issues.
  • Check the target branch. If it is master branch, should the PR go to a feature branch? If it is a release branch, does it have merge approval in the JIRA ticket?
  • Extra checks if forced landing is requested
    • Review comments are sufficiently resolved, particularly by prior reviewers that requested changes.
    • No new NLT or valgrind warnings. Check the classic view.
    • Quick-build or Quick-functional is not used.
  • Fix the commit message upon landing. Check the standard here. Edit it to create a single commit. If necessary, ask submitter for a new summary.

Signed-off-by: Li Wei <[email protected]>
Co-authored-by: Niu Yawei <[email protected]>
Co-authored-by: Xuezhao Liu <[email protected]>
Co-authored-by: Liang Zhen <[email protected]>
@NiuYawei NiuYawei requested review from a team as code owners January 24, 2025 03:59
@NiuYawei NiuYawei requested a review from gnailzenh January 24, 2025 03:59

Ticket title is 'Pool query fail on some pool with error "DER_NOMEM(-1009): Out of memory"'
Status is 'In Progress'
Labels: 'ALCF,post_acceptance_issues'
https://daosio.atlassian.net/browse/DAOS-16930

Comment on lines +446 to +448
# FIXME disable space cache since some tests need to verify instant pool space
# changing, this global setting to individual test setting once in follow-on PR.
"DAOS_POOL_SPACE_CACHE_INTVL=0",

Seems like we should be implementing this the other way around and adding it only to the tests that need this set in each of their test yamls. @daltonbohning thoughts?

I think this is fine. Setting it to zero makes all tests behave exactly the same as today (no cache for gathering space information, because some tests need to verify space changes immediately, which is not required in the real world). Without the cache there is a scalability issue on Aurora if millions of processes try to get space information at the same time, so the default behavior is to enable the cache even though it is disabled for CI.

I generally agree that we should not change the default behavior in CI just to maintain current behavior. The problem is now we are not testing what will happen in production. So effectively this PR adds a feature but disables it in CI?
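
To make the per-test alternative discussed in this thread concrete, here is a hedged Python sketch of layering test-specific environment settings over shared defaults. `merge_env_vars` is a hypothetical helper, not part of the DAOS test framework; only the `NAME=value` strings are taken from the diff in this PR.

```python
from typing import Dict, List


def merge_env_vars(defaults: List[str], overrides: List[str]) -> List[str]:
    """Merge 'NAME=value' entries; a later entry replaces an earlier one by name."""
    merged: Dict[str, str] = {}
    for item in list(defaults) + list(overrides):
        name, _, value = item.partition("=")
        merged[name] = value           # dict keeps first-seen order of names
    return [f"{name}={value}" for name, value in merged.items()]


# Shared defaults leave the space cache at its production default (enabled).
common_env = ["DAOS_POOL_RF=4", "CRT_EVENT_DELAY=1", "DAOS_VOS_AGG_GAP=25"]

# Only a test that must observe instant space changes disables the cache.
space_test_env = merge_env_vars(common_env, ["DAOS_POOL_SPACE_CACHE_INTVL=0"])
```

Under this arrangement most tests would exercise the cached path that production uses, and only the tests that check immediate space changes would opt out.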

gnailzenh
gnailzenh previously approved these changes Jan 24, 2025
liuxuezhao
liuxuezhao previously approved these changes Jan 24, 2025
liw
liw previously approved these changes Jan 24, 2025
@daosbuild1

Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15780/1/execution/node/1420/log

@@ -443,6 +443,9 @@ class EngineYamlParameters(YamlParameters):
"DAOS_POOL_RF=4",
"CRT_EVENT_DELAY=1",
"DAOS_VOS_AGG_GAP=25",
# FIXME disable space cache since some tests need to verify instant pool space

@daltonbohning daltonbohning left a comment

pylint needs to be resolved

Skip-build: true

Signed-off-by: Dalton Bohning <[email protected]>
@daltonbohning daltonbohning dismissed stale reviews from liw, liuxuezhao, and gnailzenh via d809544 January 24, 2025 16:37
@daltonbohning

pylint needs to be resolved

Pushed a fix with most CI disabled since previous run was good except one known failure: https://build.hpdd.intel.com/blue/organizations/jenkins/daos-stack%2Fdaos/detail/PR-15780/1/pipeline

@mchaarawi mchaarawi merged commit 5f61167 into master Jan 24, 2025
37 of 39 checks passed
@mchaarawi mchaarawi deleted the niu/DAOS-16930 branch January 24, 2025 16:46