
AIS-Operator: new proxy replicas cannot join the cluster after deleting the original master replica #208

eahydra opened this issue Jan 18, 2025 · 3 comments

eahydra commented Jan 18, 2025

Is there an existing issue for this?

  • I have searched the existing issues

Describe the bug

Hi guys, I created an aistore cluster with ais-operator; the cluster has two proxy replicas. I wanted to simulate a cluster failure to check whether the cluster would keep working, so I performed the following steps:

  1. Cordon the node where the primary replica is running and delete the primary replica aistore-proxy-0.
  2. The other replica, aistore-proxy-1, becomes the new primary.
  3. Update spec.proxySpec.size=3 in the AIStore CRD object to scale the proxies up to three.
  4. The new replica aistore-proxy-2 fails to join the cluster. According to its log, it keeps trying to connect to the original primary, because the primary URL in the global config still points at aistore-proxy-0 (see the kubectl sketch below).
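
For reference, a minimal reproduction sketch along these lines; the node name, namespace, and CR/resource names are placeholders, and the field path follows the spec.proxySpec.size shown above:

```sh
# 1. Cordon the node hosting the current primary, then delete the primary pod
kubectl cordon <node-running-aistore-proxy-0>
kubectl -n <ais-namespace> delete pod aistore-proxy-0

# 2. aistore-proxy-1 should now be elected primary

# 3. Scale the proxies up via the AIStore custom resource (resource/CR names are placeholders)
kubectl -n <ais-namespace> patch aistore <cluster-name> --type merge \
  -p '{"spec":{"proxySpec":{"size":3}}}'

# 4. Watch the new replica; it never becomes ready because it keeps dialing aistore-proxy-0
kubectl -n <ais-namespace> get pods -w
kubectl -n <ais-namespace> logs aistore-proxy-2
```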

Expected Behavior

The new replica aistore-proxy-2 should connect to the new primary replica aistore-proxy-1 and successfully join the cluster.

Current Behavior

The new replica aistore-proxy-2 failed to join the cluster.

Steps To Reproduce

As described in the Describe the bug section above.

Possible Solution

I think it would be a good idea to update the global config with the latest primary URL on the next reconcile.
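
For context, the stale value should be visible in the global config that the operator renders into a ConfigMap; the ConfigMap name and the exact key are assumptions here, so the grep is deliberately loose:

```sh
# Find the operator-rendered global config and check which primary URL it carries.
# ConfigMap naming is an assumption; adjust to your deployment.
kubectl -n <ais-namespace> get configmaps
kubectl -n <ais-namespace> get configmap <global-config-configmap> -o yaml | grep -i primary
# After the failover this is expected to still reference aistore-proxy-0.
```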

Additional Information/Context

No response

AIStore build/version

latest, ais-operator/latest

Environment details (OS name and version, etc.)

Ubuntu 22.04, K8s v1.30

eahydra added the bug label Jan 18, 2025

aaronnw (Collaborator) commented Jan 21, 2025

Thanks for opening, will take a look.

Right now a new node will start off trying to connect to proxy-0 and, ideally, if proxy-0 is not primary it will update the cluster map provided to the new node, including the current primary. But since in your case proxy-0 is not ready, this fails.

To address this, we could have the init container query the proxy service to set the correct primary in the initial config.

However if I understand correctly, this situation comes up because you're asking it to scale up when proxy-0 can't be scheduled onto a running node. There's no real reason proxies need to be a statefulset vs. a deployment, so we could possibly look into updating that and removing any volume bindings that restrict a proxy to a specific node. This way proxy-0 would simply be rescheduled and when proxy-2 comes up, proxy-0 would be ready to receive requests. (Targets are another issue -- inherently very stateful, so cordoning and setting up new PVs is a more risky/manual process.)
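
For what it's worth, a rough sketch of what such an init-container query could look like; the Service name, port, endpoint, and JSON field names (/v1/daemon?what=smap, proxy_si) are assumptions and would need to be checked against the actual AIS API:

```sh
# Ask any live proxy, via the cluster-internal proxy Service, for the current
# cluster map, and use the reported primary instead of the hardcoded proxy-0 URL.
# Endpoint and field names are assumptions, not verified against the AIS API.
SMAP=$(curl -fsS "http://aistore-proxy:8080/v1/daemon?what=smap")
PRIMARY_URL=$(echo "$SMAP" | jq -r '.proxy_si.public_net.direct_url')
echo "joining via primary: ${PRIMARY_URL}"
```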


eahydra (Author) commented Jan 21, 2025

Thanks for your reply, @aaronnw.

To address this, we could have the init container query the proxy service to set the correct primary in the initial config.

Agreed. This is more reasonable than updating the primary URL in the global configuration.

However if I understand correctly, this situation comes up because you're asking it to scale up when proxy-0 can't be scheduled onto a running node.

In fact, I wanted to simulate a node-failure scenario to test the aistore proxy election process and the impact of the intermediate state on reads and writes.

so we could possibly look into updating that and removing any volume bindings that restrict a proxy to a specific node.

I am curious: is the data synchronized between proxies just the list of AIS nodes (proxies and targets)?

alex-aizman (Member) commented

Just a quick reaction to something that was said earlier:

There's no real reason proxies need to be a statefulset vs. a deployment, so we could possibly look into updating that and removing any volume bindings that restrict a proxy to a specific node.

There's no reason, real or imaginary. Proxies can run anywhere with no restrictions or expectations other than intra-cluster connectivity at low latency.
