Hi guys, I created an AIStore cluster with ais-operator; the cluster has two proxy replicas. I tried to simulate a cluster failure to check whether the cluster kept working, so I performed the following steps:
1. Cordon the node where the primary replica is running, then delete the primary replica aistore-proxy-0.
2. The other replica, aistore-proxy-1, becomes the new primary.
3. Update spec.proxySpec.size=3 in the AIStore CRD object to scale the proxies up.
4. The new replica aistore-proxy-2 fails to join the cluster. According to its log, it tried to contact the original primary, because the primary URL in the global config still points to aistore-proxy-0.
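The steps above correspond to roughly the following commands; the node name, namespace, and CR name here are placeholders for your own, not taken from the report:

```shell
# Placeholder names: node "worker-1", namespace "ais", AIStore CR "aistore".
# 1) Cordon the node hosting the primary, then delete the primary pod.
kubectl cordon worker-1
kubectl delete pod -n ais aistore-proxy-0

# 2) After aistore-proxy-1 takes over as primary, scale the proxies to 3.
kubectl patch aistore aistore -n ais --type merge \
  -p '{"spec":{"proxySpec":{"size":3}}}'

# 3) aistore-proxy-2 comes up and fails to join, still dialing proxy-0.
kubectl logs -n ais aistore-proxy-2
```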
Expected Behavior
The new replica aistore-proxy-2 should connect to the new primary replica aistore-proxy-1 and join the cluster successfully.
Current Behavior
The new replica aistore-proxy-2 failed to join the cluster.
Steps To Reproduce
See the steps in "Describe the bug" above.
Possible Solution
I think it would be a good idea to update the global config with the latest primary URL on the next reconcile.
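As a manual stand-in for that fix, the primary URL could be rewritten in the global config JSON before re-applying it. The key path `proxy.primary_url` matches AIStore's cluster-config layout as I understand it, but treat both it and the helper below as an assumption, not the operator's actual mechanism:

```shell
# Sketch: rewrite the primary URL in an AIStore global-config JSON document.
# Assumed key path: proxy.primary_url (verify against your AIStore version).
set_primary_url() {
  # $1 = new primary URL; reads config JSON on stdin, writes updated JSON.
  python3 -c '
import json, sys
cfg = json.load(sys.stdin)
cfg.setdefault("proxy", {})["primary_url"] = sys.argv[1]
json.dump(cfg, sys.stdout)
' "$1"
}
```

For example, piping the operator-managed global config through `set_primary_url "http://aistore-proxy-1:8080"` and re-applying it would let a freshly scheduled proxy dial the live primary.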
Additional Information/Context
No response
AIStore build/version
latest, ais-operator/latest
Environment details (OS name and version, etc.)
Ubuntu 22.04, K8s v1.30
Right now a new node starts off trying to connect to proxy-0; ideally, if proxy-0 is not primary, it updates the cluster map provided to the new node, including the current primary. But since in your case proxy-0 is not ready, this fails.
To address this, we could have the init container query the proxy service to set the correct primary in the initial config.
However if I understand correctly, this situation comes up because you're asking it to scale up when proxy-0 can't be scheduled onto a running node. There's no real reason proxies need to be a statefulset vs. a deployment, so we could possibly look into updating that and removing any volume bindings that restrict a proxy to a specific node. This way proxy-0 would simply be rescheduled and when proxy-2 comes up, proxy-0 would be ready to receive requests. (Targets are another issue -- inherently very stateful, so cordoning and setting up new PVs is a more risky/manual process.)
To address this, we could have the init container query the proxy service to set the correct primary in the initial config.
Agreed. This is more reasonable than updating the primary URL in the global configuration.
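The init-container idea could be sketched as a step that asks the proxy Service for the current cluster map and extracts the primary's URL. The Service name/port, the `/v1/daemon?what=smap` endpoint, and the `proxy_si` JSON field path below are all assumptions about AIStore's API, not verified against the operator's templates:

```shell
# Hypothetical init-container step: discover the live primary before writing
# the initial config. Endpoint and JSON field names are assumptions.
SMAP_URL="http://aistore-proxy:8080/v1/daemon?what=smap"  # assumed Service/port

primary_url() {
  # Reads cluster-map JSON on stdin and prints the primary proxy's URL.
  # "proxy_si.pub_net.direct_url" is an assumed path into the smap document.
  python3 -c 'import json,sys; print(json.load(sys.stdin)["proxy_si"]["pub_net"]["direct_url"])'
}

# In the real init container this would be something like:
#   curl -fsS "$SMAP_URL" | primary_url
```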
However if I understand correctly, this situation comes up because you're asking it to scale up when proxy-0 can't be scheduled onto a running node.
In fact, I want to mock a node-failure scenario to test the AIStore proxy election process and the impact of the intermediate state on file reads and writes.
so we could possibly look into updating that and removing any volume bindings that restrict a proxy to a specific node.
I am curious: is the data synchronized between proxies just the list of AIS nodes (proxies and targets)?
Just a quick reaction to something that was said earlier:
There's no real reason proxies need to be a statefulset vs. a deployment, so we could possibly look into updating that and removing any volume bindings that restrict a proxy to a specific node.
There's no reason, real or imaginary. Proxies can run anywhere with no restrictions or expectations other than intra-cluster connectivity at low latency.