DAOS-16930 pool: Share map bulk resources #15780
Improve concurrent POOL_QUERY, POOL_CONNECT, and POOL_TGT_QUERY_MAP efficiency by giving them a chance to share the same pool map buffer and pool map buffer bulk handle.

Introduce pool space query on the service leader to avoid space query flooding. The pool space cache expiration time is 2 seconds by default. It can be changed via DAOS_POOL_SPACE_CACHE_INTVL; setting the expiration time to zero disables the space cache.

Signed-off-by: Li Wei <[email protected]>
Co-authored-by: Niu Yawei <[email protected]>
Co-authored-by: Xuezhao Liu <[email protected]>
Co-authored-by: Liang Zhen <[email protected]>
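As a rough illustration of the buffer sharing described above, the sketch below lets concurrent handlers that need the same pool map version reuse one encoded buffer instead of each building its own. This is not the DAOS C implementation: `MapBufferPool`, `acquire`, `release`, and the `encode` callback are hypothetical names, and the real code also shares a bulk handle, which is not modeled here.

```python
import threading


class MapBufferPool:
    """Hypothetical holder of one encoded map buffer per map version.

    A buffer exists only while at least one handler holds a reference,
    so sharing happens between handlers that overlap in time.
    """

    def __init__(self, encode):
        self._encode = encode        # assumed callable: version -> bytes
        self._lock = threading.Lock()
        self._entries = {}           # version -> [buffer, refcount]
        self.encode_calls = 0        # counts how often a buffer was built

    def acquire(self, version):
        """Return the shared buffer for `version`, building it if absent."""
        with self._lock:
            entry = self._entries.get(version)
            if entry is None:
                self.encode_calls += 1
                entry = [self._encode(version), 0]
                self._entries[version] = entry
            entry[1] += 1
            return entry[0]

    def release(self, version):
        """Drop one reference; free the buffer when nobody holds it."""
        with self._lock:
            entry = self._entries[version]
            entry[1] -= 1
            if entry[1] == 0:
                del self._entries[version]
```

With this shape, two overlapping queries for the same version trigger a single encode, while a query that arrives after all holders have released builds a fresh buffer.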
Ticket title is 'Pool query fail on some pool with error "DER_NOMEM(-1009): Out of memory"'
# FIXME disable space cache since some tests need to verify instant pool space
# changing, this global setting to individual test setting once in follow-on PR.
"DAOS_POOL_SPACE_CACHE_INTVL=0",
Seems like we should be implementing this the other way around and adding it only to the tests that need this set in each of their test yamls. @daltonbohning thoughts?
I think this is fine. Setting it to zero makes all tests behave exactly the same as today: no cache for gathering space information, because some tests need to verify space changes immediately, which is not required in the real world. Without the cache there is a scalability issue on Aurora if millions of processes try to get space information at the same time, so the cache is enabled by default even though it is disabled for CI.
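To make the cache behavior under discussion concrete, here is a minimal sketch of a time-based cache whose interval comes from DAOS_POOL_SPACE_CACHE_INTVL, defaults to 2 seconds, and is disabled by zero. It is not the engine's C code: `SpaceCache`, `read_interval`, and the `query` callback are hypothetical, and treating the variable's value as seconds is an assumption based on the stated default.

```python
import os
import time

DEFAULT_INTERVAL_SEC = 2.0  # default expiration stated in the PR description


def read_interval(environ=os.environ):
    """Read the cache interval; assumes the value is given in seconds."""
    raw = environ.get("DAOS_POOL_SPACE_CACHE_INTVL")
    return DEFAULT_INTERVAL_SEC if raw is None else float(raw)


class SpaceCache:
    """Hypothetical cache in front of an expensive space query."""

    def __init__(self, query, interval, clock=time.monotonic):
        self._query = query        # assumed callable returning space info
        self._interval = interval  # 0 disables caching
        self._clock = clock
        self._value = None
        self._stamp = None         # time of the last real query

    def get(self):
        now = self._clock()
        fresh = (
            self._interval > 0
            and self._stamp is not None
            and now - self._stamp < self._interval
        )
        if not fresh:
            # Expired, never filled, or caching disabled: query again.
            self._value = self._query()
            self._stamp = now
        return self._value
```

With the interval at zero every call reaches the underlying query, which is the behavior the CI setting preserves; with the default, callers arriving within the interval reuse the last result.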
I generally agree that we should not change the default behavior in CI just to maintain current behavior. The problem is now we are not testing what will happen in production. So effectively this PR adds a feature but disables it in CI?
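The per-test alternative raised in this thread could look roughly like the sketch below, where a test's own settings override or extend the global `NAME=value` list. This is an illustration only: `merge_env_vars` is a hypothetical helper, not part of the DAOS test framework, and how test yamls are actually merged is not shown in this PR.

```python
def merge_env_vars(defaults, overrides):
    """Merge two lists of "NAME=value" strings; later entries win.

    Hypothetical helper: a variable already present keeps its original
    position in the result, and new variables are appended.
    """
    merged = {}
    for item in list(defaults) + list(overrides):
        name, _, value = item.partition("=")
        merged[name] = value
    return [f"{name}={value}" for name, value in merged.items()]


# Global defaults stay as in production; only a test that needs
# instant space updates would add the override.
global_env = ["DAOS_POOL_RF=4", "CRT_EVENT_DELAY=1", "DAOS_VOS_AGG_GAP=25"]
test_env = merge_env_vars(global_env, ["DAOS_POOL_SPACE_CACHE_INTVL=0"])
```

Under this arrangement the cache would stay enabled for every test that does not opt out, so CI would exercise the production default.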
Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15780/1/execution/node/1420/log
@@ -443,6 +443,9 @@ class EngineYamlParameters(YamlParameters):
"DAOS_POOL_RF=4",
"CRT_EVENT_DELAY=1",
"DAOS_VOS_AGG_GAP=25",
# FIXME disable space cache since some tests need to verify instant pool space
pylint needs to be resolved
Skip-build: true Signed-off-by: Dalton Bohning <[email protected]>
d809544
Pushed a fix with most CI disabled since the previous run was good except for one known failure: https://build.hpdd.intel.com/blue/organizations/jenkins/daos-stack%2Fdaos/detail/PR-15780/1/pipeline
Before requesting gatekeeper:
Features: (or Test-tag*) commit pragma was used or there is a reason documented that there are no appropriate tags for this PR.
Gatekeeper: