
DAOS-16209 control: Add MD-on-SSD resp flag for display mode #15695

Merged: 15 commits merged into master on Jan 23, 2025

Conversation

@tanabarr (Contributor) commented Jan 7, 2025

Rather than mutating mem_file_bytes to indicate PMem/MD-on-SSD mode in
pool query and create, use an explicit flag in the response instead.
This flag is then used to trigger a display style in the presentation layer.
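
As a rough illustration of the idea (type and field names below are hypothetical and simplified, not the actual DAOS control-plane API), the response can carry mem_file_bytes unconditionally alongside an explicit mode flag, and the display code branches on the flag rather than inferring the mode from a mutated byte count:

```go
package main

import "fmt"

// PoolQueryResp is a hypothetical, simplified response shape: mem_file_bytes
// is always populated, and an explicit flag reports whether MD-on-SSD is active.
type PoolQueryResp struct {
	TotalBytes    uint64
	MemFileBytes  uint64 // no longer overloaded to signal the storage mode
	MdOnSsdActive bool   // explicit mode flag returned by the server
}

// printPoolInfo branches on the flag to choose a display style instead of
// inferring the mode from the mem_file_bytes value.
func printPoolInfo(r *PoolQueryResp) {
	if r.MdOnSsdActive {
		fmt.Printf("MD-on-SSD view: memory-file size %d bytes\n", r.MemFileBytes)
		return
	}
	fmt.Printf("PMem view: total size %d bytes\n", r.TotalBytes)
}

func main() {
	printPoolInfo(&PoolQueryResp{TotalBytes: 1 << 40, MemFileBytes: 1 << 35, MdOnSsdActive: true})
	printPoolInfo(&PoolQueryResp{TotalBytes: 1 << 40, MdOnSsdActive: false})
}
```

In this sketch the server fills mem_file_bytes in both modes; only the flag decides which style the client renders.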

Before requesting gatekeeper:

  • Two review approvals and any prior change requests have been resolved.
  • Testing is complete and all tests passed or there is a reason documented in the PR why it should be force landed and forced-landing tag is set.
  • Features: (or Test-tag*) commit pragma was used or there is a reason documented that there are no appropriate tags for this PR.
  • Commit messages follow the guidelines outlined here.
  • Any tests skipped by the ticket being addressed have been run and passed in the PR.

Gatekeeper:

  • You are the appropriate gatekeeper to be landing the patch.
  • The PR has 2 reviews by people familiar with the code, including appropriate owners.
  • Githooks were used. If not, request that user install them and check copyright dates.
  • Checkpatch issues are resolved. Pay particular attention to ones that will show up on future PRs.
  • All builds have passed. Check non-required builds for any new compiler warnings.
  • Sufficient testing is done. Check feature pragmas and test tags and that tests skipped for the ticket are run and now pass with the changes.
  • If applicable, the PR has addressed any potential version compatibility issues.
  • Check the target branch. If it is the master branch, should the PR go to a feature branch? If it is a release branch, does it have merge approval in the JIRA ticket?
  • Extra checks if forced landing is requested
    • Review comments are sufficiently resolved, particularly by prior reviewers that requested changes.
    • No new NLT or valgrind warnings. Check the classic view.
    • Quick-build or Quick-functional is not used.
  • Fix the commit message upon landing. Check the standard here. Edit it to create a single commit. If necessary, ask submitter for a new summary.

@tanabarr tanabarr self-assigned this Jan 7, 2025

github-actions bot commented Jan 7, 2025

Ticket title is 'Return VOS file capacity in addition to meta blob size on pool query'
Status is 'In Review'
Labels: 'md_on_ssd2'
https://daosio.atlassian.net/browse/DAOS-16209

@daosbuild1 (Collaborator): Test stage Build on Leap 15.5 with Intel-C and TARGET_PREFIX completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15695/1/execution/node/360/log


@daosbuild1 (Collaborator): Test stage Build RPM on EL 9 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15695/1/execution/node/261/log

@daosbuild1 (Collaborator): Test stage Build RPM on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15695/1/execution/node/336/log

@daosbuild1 (Collaborator): Test stage Build RPM on Leap 15.5 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15695/1/execution/node/306/log

@daosbuild1 (Collaborator): Test stage Build DEB on Ubuntu 20.04 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15695/1/execution/node/322/log

@tanabarr force-pushed the tanabarr/control-memfilebytes-mode-mdonssd branch from 95f188d to 5c94599 on January 7, 2025 21:04
@tanabarr changed the title from "DAOS-16209 control: Add MD-on-SSD response flag to trigger display sw…" to "DAOS-16209 control: Add MD-on-SSD resp flag for display mode" on Jan 7, 2025
@tanabarr tanabarr marked this pull request as ready for review January 7, 2025 21:21
@tanabarr tanabarr requested review from a team as code owners January 7, 2025 21:21
@daosbuild1 (Collaborator): Test stage Build DEB on Ubuntu 20.04 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15695/2/execution/node/370/log

@daosbuild1 (Collaborator): Test stage Unit Test on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15695/2/testReport/

@daosbuild1 (Collaborator): Test stage Unit Test with memcheck on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15695/2/testReport/

@tanabarr tanabarr requested review from mjmac, kjacque and knard38 January 8, 2025 11:21
@tanabarr added the control-plane (work on the management infrastructure of the DAOS Control Plane) and meta-on-ssd (Metadata on SSD Feature) labels Jan 8, 2025
@tanabarr tanabarr requested a review from NiuYawei January 8, 2025 11:21
@tanabarr (Contributor, Author) commented Jan 8, 2025

PMem mode output with PR applied:

[tanabarr@wolf-311 daos]$ install-rocky/bin/dmg system query -v -i
Rank UUID                                 Control Address Fault Domain                  State  Reason
---- ----                                 --------------- ------------                  -----  ------
0    39d3d06f-dc11-45bd-8d7e-c09bc2f8dbcf 10.8.3.99:10001 /wolf-311.wolf.hpdd.intel.com Joined
1    77e52fbb-56c2-4b62-a142-eb40d05b594a 10.8.3.99:10001 /wolf-311.wolf.hpdd.intel.com Joined

[tanabarr@wolf-311 daos]$ install-rocky/bin/dmg -i pool create bob -z 50% --mem-ratio 50%
Creating DAOS pool with 50% of all storage
ERROR: dmg: pool create failed: server: code = 620 description = "pool create request contains MD-on-SSD parameters but MD-on-SSD has not been enabled"
ERROR: dmg: server: code = 620 resolution = "either remove MD-on-SSD-specific options from the command request or set bdev_roles in server config file to enable MD-on-SSD"
[tanabarr@wolf-311 daos]$ install-rocky/bin/dmg -i pool create bob -z 50%
Creating DAOS pool with 50% of all storage
Pool created with 38.24%,61.76% storage tier ratio
--------------------------------------------------
  UUID                 : 124b6556-eddb-4a80-9bd8-5c73c3c218cb
  Service Leader       : 0
  Service Ranks        : [0-1]
  Storage Ranks        : [0-1]
  Total Size           : 2.6 TB
  Storage tier 0 (SCM) : 989 GB (494 GB / rank)
  Storage tier 1 (NVMe): 1.6 TB (799 GB / rank)

[tanabarr@wolf-311 daos]$ install-rocky/bin/dmg -i pool query bob -e
Pool 124b6556-eddb-4a80-9bd8-5c73c3c218cb, ntarget=16, disabled=0, leader=0, version=1, state=Ready
Pool health info:
- Enabled ranks: 0-1
- Rebuild idle, 0 objs, 0 recs
Pool space info:
- Target count:16
- Storage tier 0 (SCM):
  Total size: 989 GB
  Free: 939 GB, min:59 GB, max:59 GB, mean:59 GB
- Storage tier 1 (NVME):
  Total size: 1.6 TB
  Free: 1.6 TB, min:100 GB, max:100 GB, mean:100 GB
[tanabarr@wolf-311 daos]$ install-rocky/bin/dmg -i storage query usage
Hosts     SCM-Total SCM-Free SCM-Used NVMe-Total NVMe-Free NVMe-Used
-----     --------- -------- -------- ---------- --------- ---------
localhost 2.1 TB    989 GB   52 %     3.2 TB     1.6 TB    50 %

MD-on-SSD mode output with PR applied:

[tanabarr@wolf-310 daos]$ install-rocky/bin/dmg -i pool create bob -z 50% --mem-ratio 50%
Creating DAOS pool with 50% of all storage
Pool created with 8.65%,91.35% storage tier ratio
-------------------------------------------------
  UUID             : f3931322-14f8-4c47-9b1e-204d3b2f6ac5
  Service Leader   : 0
  Service Ranks    : [0-1]
  Storage Ranks    : [0-1]
  Total Size       : 1.5 TB
  Metadata Storage : 129 GB (64 GB / rank)
  Data Storage     : 1.4 TB (681 GB / rank)
  Memory File Size : 64 GB (32 GB / rank)

[tanabarr@wolf-310 daos]$ install-rocky/bin/dmg -i pool query bob -e
Pool f3931322-14f8-4c47-9b1e-204d3b2f6ac5, ntarget=32, disabled=0, leader=0, version=1, state=Ready
Pool health info:
- Enabled ranks: 0-1
- Rebuild idle, 0 objs, 0 recs
Pool space info:
- Target count:32
- Total memory-file size: 64 GB
- Metadata storage:
  Total size: 129 GB
  Free: 115 GB, min:3.6 GB, max:3.6 GB, mean:3.6 GB
- Data storage:
  Total size: 1.4 TB
  Free: 1.4 TB, min:42 GB, max:42 GB, mean:42 GB
[tanabarr@wolf-310 daos]$ install-rocky/bin/dmg -i storage query usage
Tier Roles
---- -----
T1   data,meta,wal

Rank T1-Total T1-Free T1-Usage
---- -------- ------- --------
0    1.6 TB   749 GB  53 %
1    1.6 TB   749 GB  53 %
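
To make the link between the new flag and the two output styles above concrete, here is a minimal illustrative sketch (the helper name and wiring are assumptions for this example, not taken from the actual dmg pretty-printer) of how a presentation layer could pick row labels from such a flag:

```go
package main

import "fmt"

// tierLabels returns the row labels matching the two dmg output styles shown
// above, selected by the response's explicit MD-on-SSD flag.
func tierLabels(mdOnSsdActive bool) []string {
	if mdOnSsdActive {
		return []string{"Metadata Storage", "Data Storage", "Memory File Size"}
	}
	return []string{"Storage tier 0 (SCM)", "Storage tier 1 (NVMe)"}
}

func main() {
	fmt.Println(tierLabels(false)) // PMem-style labels
	fmt.Println(tierLabels(true))  // MD-on-SSD-style labels
}
```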

@kjacque (Contributor) previously approved these changes Jan 8, 2025 and left a comment:

Thanks for this cleanup.

@tanabarr (Contributor, Author) commented Jan 9, 2025

I added bio.h to srv_drpc.c in order to access bio_configured_nvme(), as we discussed. This enables populating a flag that indicates MD-on-SSD / PMem mode in the pool create and query dRPC responses. It also adds a dependency on libbio for srv_drpc_tests, so to run the test binary I have to prefix the command with "LD_LIBRARY_PATH=install/lib64/daos_srv". How do I adjust things so that run_test.py can run the test with the added dependency? It currently fails with /var/lib/jenkins/jenkins-1/docker_1/workspace/daos-stack_daos_PR-15695@2/build/dev/gcc/src/mgmt/tests/srv_drpc_tests: error while loading shared libraries: libbio.so: cannot open shared object file: No such file or directory (https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15695/2/artifact/unit_test_logs/src-mgmt-tests-srv_drpc_tests_31/output.log/*view*/). @jolivier23 @NiuYawei

@jolivier23 jolivier23 self-requested a review January 16, 2025 15:17
@daosbuild1 (Collaborator): Test stage Functional Hardware Medium completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15695/16/testReport/

…mfilebytes-mode-mdonssd

Signed-off-by: Tom Nabarro <tom.nabarro@intel.com>
Test-tag: hw,medium,DmgPoolQueryTest hw,medium,ListVerboseTest
Allow-unstable-test: true
Signed-off-by: Tom Nabarro <[email protected]>
@daltonbohning (Contributor) left a comment:

ftest LGTM

@tanabarr tanabarr requested review from knard38 and kjacque January 21, 2025 18:32
@tanabarr (Contributor, Author):

https://build.hpdd.intel.com/blue/organizations/jenkins/daos-stack%2Fdaos/detail/PR-15695/16/pipeline passed all functional tests except the ListVerbose and DmgPoolQuery hardware medium tests. https://build.hpdd.intel.com/blue/organizations/jenkins/daos-stack%2Fdaos/detail/PR-15695/17/pipeline failed with known NLT memcheck issues and verifies the ftest-file-only fixes on hardware medium.

@knard38 previously approved these changes Jan 22, 2025
@tanabarr added the forced-landing label (The PR has known failures or has intentionally reduced testing, but should still be landed.) Jan 22, 2025
…mfilebytes-mode-mdonssd

Features: pool
Signed-off-by: Tom Nabarro <[email protected]>
@tanabarr (Contributor, Author):

merged master to resolve conflicts with protobuf files

@daosbuild1 (Collaborator): Test stage Functional Hardware Medium completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15695/18/testReport/

@knard38 (Contributor) commented Jan 23, 2025

merged master to resolve conflicts with protobuf files

I am probably missing something, but I do not see any changes to the protobuf file itself in this commit: I only see diffs in the generated files.

@tanabarr (Contributor, Author):

I am probably missing something, but I do not see any changes to the protobuf file itself in this commit: I only see diffs in the generated files.

Yes, apologies, it was only the generated file that conflicted.

@phender (Contributor) commented Jan 23, 2025

Test stage Functional Hardware Medium completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15695/18/testReport/

The failing container/boundary.py test case is one where 100 pools are created in parallel and then 200 containers are created in parallel. Of the 200 container creates, 19 failed with DaosApiError('Container create returned non-zero. RC: -1004'). The server log reports:

01/23-00:16:36.39 wolf-126 DAOS[292353/0/3154] container ERR  src/container/srv_container.c:1082 cont_create() fbf91e4d/44676c9a: container already exists
...
01/23-00:16:36.50 wolf-126 DAOS[292353/0/3158] container ERR  src/container/srv_container.c:1082 cont_create() 7a4508ce/baa6caaf: container already exists
01/23-00:16:36.51 wolf-126 DAOS[292353/0/3159] container ERR  src/container/srv_container.c:1082 cont_create() 02934df1/ca949600: container already exists
01/23-00:16:36.51 wolf-126 DAOS[292353/0/3160] container ERR  src/container/srv_container.c:1082 cont_create() 81466790/4c79c191: container already exists
01/23-00:16:36.52 wolf-126 DAOS[292353/0/3161] container ERR  src/container/srv_container.c:1082 cont_create() c475df47/2f93bfb5: container already exists
01/23-00:16:36.52 wolf-126 DAOS[292353/0/3162] container ERR  src/container/srv_container.c:1082 cont_create() 5d9f4d7c/e3b65904: container already exists
...

This weekly test historically passes.

Created https://daosio.atlassian.net/browse/DAOS-16981 for this failure.

@phender (Contributor) commented Jan 23, 2025


This does appear to be an issue with threading the pydaos container create instead of using the harness daos container create, which would ensure unique container labels, but there is also no history of this test failing in weekly master test builds. I've kicked off https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15695/19/ to just run the container/boundary.py test to see if it will pass with the changes in this PR.

@daosbuild1 (Collaborator): Test stage NLT on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15695/19/testReport/

@tanabarr tanabarr requested a review from a team January 23, 2025 22:25
@phender (Contributor) commented Jan 23, 2025


The container/boundary.py test passed in https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15695/19/testReport/FTEST_container/BoundaryTest/

@phender phender merged commit 26c2219 into master Jan 23, 2025
53 of 58 checks passed
@phender phender deleted the tanabarr/control-memfilebytes-mode-mdonssd branch January 23, 2025 23:47
@tanabarr (Contributor, Author):

Thanks @phender

Labels
  • control-plane: work on the management infrastructure of the DAOS Control Plane
  • forced-landing: The PR has known failures or has intentionally reduced testing, but should still be landed.
  • meta-on-ssd: Metadata on SSD Feature

8 participants