From time to time, my HA testing on DRBD has shown that some DRBD resources go into the "Consistent" state when the node carrying them is powered off and then powered back on after a prolonged period of time.
Setup 1 -
This happens 7/10 times on the following setup -
3 nodes with 1 disk node and 2 diskless nodes.
Replication and auto-eviction are turned off.
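For reference, auto-eviction was disabled cluster-wide with something like the following (an illustration based on the documented LINSTOR controller property, not a verbatim transcript of our setup):
kubectl exec -n piraeus deployment/piraeus-op-piraeus-operator-cs-controller -- linstor controller set-property DrbdOptions/AutoEvictAllowEviction false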
Test case procedures -
Shut down the ONLY disk node and wait for 30 minutes before powering it back on.
linstor resource list shows some resources in the Consistent state while their Diskless counterpart shows Usage = InUse.
There are only 2 replicas associated with a single resource - Diskless and Disk.
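To spot the affected resources quickly, a one-liner along these lines can filter the listing (illustrative; the grep pattern is mine):
kubectl exec -n piraeus deployment/piraeus-op-piraeus-operator-cs-controller -- linstor resource list | grep -E 'Consistent|Outdated'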
As a workaround:
kubectl exec -n piraeus <ns_pod-name> -c linstor-satellite -- drbdsetup disconnect <Consistent-state-resource-name> <node-id of diskless peer>
kubectl exec -n piraeus <ns_pod-name> -c linstor-satellite -- drbdsetup connect <Consistent-state-resource-name> <node-id of diskless peer>
This brings the resource back into the UpToDate state. BUT sometimes this workaround puts the resource into Outdated, and that becomes an entirely different problem which I don't know how to recover from, since this is the only physical replica available on the cluster and the Diskless resource is connected to it.
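I have not verified this end-to-end, but DRBD's documented escape hatch for promoting a node whose only copy is Outdated is a forced promotion, which would first require stopping the application so the diskless peer drops Primary (pod and resource names are placeholders):
kubectl exec -n piraeus <ns_pod-name> -c linstor-satellite -- drbdadm primary --force <Consistent-state-resource-name>
kubectl exec -n piraeus <ns_pod-name> -c linstor-satellite -- drbdadm secondary <Consistent-state-resource-name>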
Setup 2 -
This issue happens about 2 out of 10 times on a 3-node cluster with 2 disk nodes and 1 diskless node.
Replication is turned on and auto-eviction is turned off.
Test case procedures -
Shut down either disk node and wait for 30 minutes before powering it back on.
linstor resource list shows some resources in the Consistent state.
This one is easier to get by because replication is turned on, so the other replica becomes Primary and starts serving data. I can then use drbdsetup disconnect and connect, and even delete the resource when it goes into an Outdated state.
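Deleting and re-adding the Outdated replica so that LINSTOR resyncs it from the healthy peer would look roughly like this (node, resource, and storage-pool names are placeholders):
kubectl exec -n piraeus deployment/piraeus-op-piraeus-operator-cs-controller -- linstor resource delete <node-name> <resource-name>
kubectl exec -n piraeus deployment/piraeus-op-piraeus-operator-cs-controller -- linstor resource create <node-name> <resource-name> --storage-pool <pool-name>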
However, it is not straightforward if the application replicates data by itself and does not use DRBD for replication. In other words, DRBD replication is turned off for such a resource, which brings us back to a situation similar to Setup 1.
For instance, on a 3-node cluster with 1 disk node and 2 diskless nodes, linstor resource list shows the complete set of resources below; pvc-ae4bb911-a227-4dde-a81a-2911d1c14aac is marked as being in the Consistent state.
What is the output of drbdsetup status on the Satellite pod for the node with the Consistent resource in this situation? It may be that linstor has missed the state change, even though it is fine at the DRBD level.
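For reference, that could be gathered with something like the following (pod and resource names are placeholders):
kubectl exec -n piraeus <ns_pod-name> -c linstor-satellite -- drbdsetup status <Consistent-state-resource-name> --verbose --statistics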
k exec --namespace=piraeus deployment/piraeus-op-piraeus-operator-cs-controller -- linstor r l
+--------------------------------------------------------------------------------------------------------------------------------+
| ResourceName | Node | Port | Usage | Conns | State | CreatedOn |
|================================================================================================================================|
| pvc-5bac2d82-9e42-4e0d-b828-3c598f2d8795 | flex188-126.dr.avaya.com | 7009 | Unused | Ok | UpToDate | 2022-03-22 04:58:52 |
| pvc-5bac2d82-9e42-4e0d-b828-3c598f2d8795 | flex188-128.dr.avaya.com | 7009 | InUse | Ok | Diskless | 2022-03-22 04:58:53 |
| pvc-16c0b34b-bed3-4219-8b9f-415e8d1734fb | flex188-126.dr.avaya.com | 7005 | InUse | Ok | UpToDate | 2022-03-22 04:56:35 |
| pvc-19dc5cea-733a-41b3-bd83-a2c4ea5012da | flex188-126.dr.avaya.com | 7004 | InUse | Ok | UpToDate | 2022-03-22 04:56:31 |
| pvc-32e7a7bf-f0c2-4bca-941b-102780fcf7bd | flex188-126.dr.avaya.com | 7003 | Unused | Ok | UpToDate | 2022-03-22 05:49:52 |
| pvc-32e7a7bf-f0c2-4bca-941b-102780fcf7bd | flex188-128.dr.avaya.com | 7003 | InUse | Ok | Diskless | 2022-03-22 05:49:54 |
| pvc-367dca54-39c8-415e-9633-0295730bbd44 | flex188-126.dr.avaya.com | 7002 | InUse | Ok | UpToDate | 2022-03-21 04:47:47 |
| pvc-69090137-263d-4cca-b402-02fd5f377041 | flex188-126.dr.avaya.com | 7008 | Unused | Ok | UpToDate | 2022-03-22 04:56:48 |
| pvc-69090137-263d-4cca-b402-02fd5f377041 | flex188-128.dr.avaya.com | 7008 | InUse | Ok | Diskless | 2022-03-22 04:56:52 |
| pvc-a36f02aa-da1a-4eaf-b797-b035dfcd5a22 | flex188-126.dr.avaya.com | 7006 | Unused | Ok | UpToDate | 2022-03-22 04:56:36 |
| pvc-a36f02aa-da1a-4eaf-b797-b035dfcd5a22 | flex188-127.dr.avaya.com | 7006 | InUse | Ok | Diskless | 2022-03-22 04:56:38 |
| pvc-ae4bb911-a227-4dde-a81a-2911d1c14aac | flex188-126.dr.avaya.com | 7012 | Unused | Ok | Consistent | 2022-03-22 06:22:05 |
| pvc-ae4bb911-a227-4dde-a81a-2911d1c14aac | flex188-128.dr.avaya.com | 7012 | InUse | Ok | Diskless | 2022-03-22 16:20:24 |
| pvc-c806c751-1efd-405b-bf63-22c7ffc53ede | flex188-126.dr.avaya.com | 7007 | InUse | Ok | UpToDate | 2022-03-22 04:56:48 |
| pvc-cf52d305-926d-40b3-95b8-72c07b623d19 | flex188-126.dr.avaya.com | 7001 | InUse | Ok | UpToDate | 2022-03-21 04:47:46 |
| pvc-d39187c2-d560-42db-8dc3-c6e57505ae72 | flex188-126.dr.avaya.com | 7000 | Unused | Ok | UpToDate | 2022-03-21 04:07:50 |
| pvc-d39187c2-d560-42db-8dc3-c6e57505ae72 | flex188-127.dr.avaya.com | 7000 | InUse | Ok | Diskless | 2022-03-21 04:07:53 |
+--------------------------------------------------------------------------------------------------------------------------------+
k exec -n piraeus piraeus-op-piraeus-operator-ns-node-cv9qs -c linstor-satellite -- drbdadm dstate pvc-ae4bb911-a227-4dde-a81a-2911d1c14aac
Consistent/Diskless
k exec -n piraeus piraeus-op-piraeus-operator-ns-node-cv9qs -c linstor-satellite -- drbdadm cstate pvc-ae4bb911-a227-4dde-a81a-2911d1c14aac
Connected
k exec -n piraeus piraeus-op-piraeus-operator-ns-node-cv9qs -c linstor-satellite -- drbdadm dump pvc-ae4bb911-a227-4dde-a81a-2911d1c14aac
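Since the connection is Connected while the disk state stays Consistent, DRBD's event stream (the same interface the LINSTOR satellite consumes for state tracking) may show whether a state change was missed; a hedged example using the same pod:
k exec -n piraeus piraeus-op-piraeus-operator-ns-node-cv9qs -c linstor-satellite -- drbdsetup events2 --now pvc-ae4bb911-a227-4dde-a81a-2911d1c14aac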
Software version -
k exec --namespace=piraeus deployment/piraeus-op-piraeus-operator-cs-controller -- linstor --version
linstor 1.13.0; GIT-hash: 840cf57c75c166659509e22447b2c0ca6377ee6d
k exec --namespace=piraeus deployment/piraeus-op-piraeus-operator-cs-controller -- drbdadm -V
DRBDADM_BUILDTAG=GIT-hash:\ 087ee6b4961ca154d76e4211223b03149373bed8\ build\ by\ @buildsystem,\ 2022-01-28\ 12:19:33
DRBDADM_API_VERSION=2
DRBD_KERNEL_VERSION_CODE=0x090106
DRBD_KERNEL_VERSION=9.1.6
DRBDADM_VERSION_CODE=0x091402
DRBDADM_VERSION=9.20.2
piraeus-operator-1.8.0
uname -a
4.18.0-348.20.1.el8_5.x86_64 #1 SMP Tue Mar 8 12:56:54 EST 2022 x86_64 x86_64 x86_64 GNU/Linux