Long running preview powered by ArgoRollouts #1661

wanghong230 · 2021-11-17T04:30:12Z

wanghong230
Nov 17, 2021
Collaborator

Copied from a community question:

Revision 1.0 is out
Revision 1.1 is to go out as a preview release
Argo Rollout thinks it’s doing steps that take traffic gradually from 1.0 to 1.1, but by the end of the rollout, 1.1 now has 100% of the traffic.
However, for preview purposes, we don’t want that.
Sure, I could use the above workaround to leave n historical revisions around, and overlay yaml for VS and DR via pipeline for now (until we have declarative end state in Argo itself to accomplish this).
But between applying the final VS/DR yaml and the end of the rollout, a version only intended for preview is actually taking 100% of all traffic.
And if anything goes wonky and the new yaml doesn’t successfully apply, getting back to a sane state could be weirder yet.
How do we solve this? For my current rollout step 0, we do 0% public traffic and only hit it for canary tests. Beyond that, we currently increment by weighted % of traffic, and the final step completes this to 100%. It’s like we need an alternate step config that can complete the rollout without ever taking it to 100%, which again segues back to the need to promote a rollout in some state, but then have a declarative end state allowing multiple “experiments” or something to coexist with their own traffic rules in Istio. Right now, the rollout references the VS and DR by name. Maybe the end state could refer to another VS/DR pair by their names.

wanghong230 · 2021-11-17T04:31:22Z

wanghong230
Nov 17, 2021
Collaborator Author

@jessesuen Let me know if you can shed some light on this conversation.

1 reply

pa4h1u3-BRONGA Nov 23, 2021

@jessesuen we have a few scenarios like this that we need a clean pattern and solution for. As we discussed in our zoom call not long back, this isn't an issue w/ supporting the requirements individually w/ Istio's traffic management capabilities. It's about how we cleanly transition, at the end of the ci/cd pipeline, from the virtual service / destination rule config that undergirds Argo to the config that drives behavior with more than one co-existent version at the end of the build.

Preview mode -- see above. The basic idea is to roll out a new version, and to leverage Argo in doing so, but not to graduate the new version to 100% of traffic upon passing canary tests and triggering a promote on the rollout. The new version will become and remain a long-lived preview version for a time (say ≈1 week).
A twist on this has to do with how we roll out new versions to our stores. We want to push the new version out, but to graduate the user base using Istio traffic shaping after the progressive rollout is complete. Again w/ this use case, we don't want to even temporarily put 100% of the traffic on the new version during progressive delivery.

Initially, current stable takes 100% of traffic.
Canary tests hit target of new, but w/ 0% public traffic (no weighted distribution at this point)
Promote rollout, but NOT cutting over general traffic through % of weight based steps.
After CI/CD build, Istio traffic mgmt should be set so that only specialized QA devices hit the new version.
In <24h, a pilot fleet would have traffic routed to the new version.
In ≈ 1 week, all traffic would cut over to the new version.
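
To make the phases above concrete, here is a rough sketch of how a pipeline step could generate the `http` section of the Istio VirtualService for each phase after the rollout itself is done. Everything specific in it is an assumption for illustration only: the host `my-service`, the subset names `stable` and `preview`, and the `x-device-class` header are hypothetical and don't come from our actual setup.

```python
import json


def build_http_routes(phase, host="my-service"):
    """Return the `http` route list of an Istio VirtualService for one phase.

    Hypothetical names: host, the 'stable'/'preview' subsets, and the
    'x-device-class' header are placeholders, not real config.
    """
    stable = {"destination": {"host": host, "subset": "stable"}, "weight": 100}
    preview = {"destination": {"host": host, "subset": "preview"}, "weight": 100}

    def header_route(value):
        # Requests carrying the header are sent to the preview version only.
        return {
            "match": [{"headers": {"x-device-class": {"exact": value}}}],
            "route": [preview],
        }

    if phase == "qa":
        # Only specialized QA devices reach the new version.
        return [header_route("qa"), {"route": [stable]}]
    if phase == "pilot":
        # QA devices plus the pilot fleet reach the new version.
        return [header_route("qa"), header_route("pilot"), {"route": [stable]}]
    if phase == "full":
        # Roughly a week later: everything cuts over.
        return [{"route": [preview]}]
    raise ValueError(f"unknown phase: {phase}")


if __name__ == "__main__":
    print(json.dumps(build_http_routes("pilot"), indent=2))
```

The catch, and the reason for this thread, is that something outside the Rollout has to apply these routes, and it has to do so without fighting the weights Argo Rollouts writes into the same VirtualService.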

Long-lived A/B testing -- your Experiments are a promising option here, but their use within rollout steps isn't as long-lived as we need for what we do here.

As noted above, it's not an issue of Istio not supporting just these kinds of use cases. The question is how we seamlessly go from our VS/DR objects being wired per the expected "plumbing" for Argo to their state upon build completion, where the drivers of traffic to old vs new shift immediately after progressive delivery, and perhaps a few more times over the next week.

I'm seeing 3 main needs common to the above scenarios:

How do we promote a rollout on the new version without shifting weight fully to the new version? Right now we use a canary step 0 with a match rule to use a header for smoke testing purposes, but in steps 1...n the rollout spec shifts to weight based and gradually climbs from 0 to 100%.
How do we seamlessly shift our VS / DR config from match rules and traffic shaping config needed at progressive delivery time during CD to what is the state upon CD completion?
Tied to this second need, I'd love to have a declarative way to specify an end state -- what traffic rules are to be in place upon promote? upon rollback or undo?
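
The first need can be illustrated with a small toy model (not Argo Rollouts code) of how canary weight moves through a step list. The step keys `setWeight` and `pause` mirror the canary step fields in the Rollout spec; the function name and the step values are made up for this sketch.

```python
def weights_through_rollout(steps):
    """Return [(label, canary_weight)] as a canary rollout walks its steps."""
    weight = 0
    timeline = []
    for index, step in enumerate(steps):
        if "setWeight" in step:
            weight = step["setWeight"]
        # A pause step leaves the current weight unchanged.
        timeline.append((f"step {index}", weight))
    # After the last step the canary is promoted to stable, so it ends up
    # with all traffic no matter what the last setWeight was.
    timeline.append(("promoted", 100))
    return timeline


if __name__ == "__main__":
    example_steps = [
        {"setWeight": 0},   # step 0: smoke tests only, no public traffic
        {"pause": {}},
        {"setWeight": 25},
        {"pause": {"duration": "10m"}},
        {"setWeight": 50},
    ]
    for label, weight in weights_through_rollout(example_steps):
        print(label, weight)
```

Even when the step list stops at 50, the last entry is still 100, which is exactly the behavior we'd like to be able to opt out of for preview releases.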

kostis-codefresh · 2021-11-25T09:34:06Z

kostis-codefresh
Nov 25, 2021
Collaborator

> A twist on this has to do with how we roll out new versions to our stores. We want to push the new version out, but to graduate the user base using Istio traffic shaping after the progressive rollout is complete. Again w/ this use case, we don't want to even temporarily put 100% of the traffic on the new version during progressive delivery.

Maybe I am missing something obvious here, but why then use Argo Rollouts in the first place? Just deploy a second deployment with your "preview" service using your normal deployment method (ArgoCD or whatever) and then use Istio to do whatever traffic splits that you want. And you can keep both versions running as long as you like.
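
As a rough sketch of that alternative, assuming two independently deployed Services (the host names `my-service-stable` and `my-service-preview` are hypothetical), the Istio side reduces to a weighted route that you adjust whenever you want:

```python
def weighted_split(preview_percent,
                   stable_host="my-service-stable",
                   preview_host="my-service-preview"):
    """Build one Istio VirtualService http route splitting traffic by weight.

    Host names are placeholders for two separately deployed Services.
    """
    if not 0 <= preview_percent <= 100:
        raise ValueError("preview_percent must be between 0 and 100")
    return {
        "route": [
            {"destination": {"host": stable_host},
             "weight": 100 - preview_percent},
            {"destination": {"host": preview_host},
             "weight": preview_percent},
        ]
    }


if __name__ == "__main__":
    # Keep the preview at 5% for as long as needed.
    print(weighted_split(5))
```

Nothing here involves Argo Rollouts at all, which is the point of the question.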

On a related note, isn't 1 week a bit excessive for a rollout? If a release has issues, wouldn't you know that in the first hours? Did you have incidents where a release was fine for 5 days and you found issues on the 6th day? And you had to rollback instead of doing a hotfix? And if yes, what percentage of your deployments had this? I am just trying to understand if this is a corner case/nice-to-have feature or you consider this essential functionality.

On another note, Istio is not the only service mesh/gateway out there. We need to make sure that all core functionality of Argo Rollouts is agnostic to the networking layer.

Finally, even if tomorrow Argo Rollouts had a magic way to handle long running releases, there are several questions that need to be answered. For example if in the middle of the week (when there are already two versions running) you deploy a 3rd version, what is the expected result? Would you expect Argo Rollouts to manage 3 versions now? Or keep only the last two and immediately discard the oldest (conflicting with the whole point of progressive delivery)?

1 reply

pa4h1u3-BRONGA Nov 30, 2021

In the case of A/B testing for evaluating impact on biz of various features, content, etc, deploying two services independently may make sense, though the Experiment feature of Argo Rollouts is still attractive here depending on how that evolves. But in the preview scenario, we do still want to benefit from the progressive rollout from version #.n to #.n+1, including the ability to abort #.n+1 if during CD time our canary tests (or analysis templates) fail. But assuming the rollout is promoted, we still need the ability to leave the old version behind and have traffic handled accordingly. It'd be nice to have that end state, so that instead of just assuming the final step takes traffic to 100% weight on #.n+1, we actually point to a traffic shaping config to be used by the end of promote.

As a general software engineering principle, I'm with you on timeframes. For most of our teams, the push is to automate into canary tests / analysis templates and to fail fast. For those teams, by the time canary tests are done, we know if we are doing promote or undo. If something is discovered after that, it's handled as a one-off and the team should own shoring up their automated tests. That said, we do have some biz scenarios where, at least in current state, the rollout is by design very gradual from partial to full fleet, and we need to support this.

Re: being agnostic to networking layer, I agree. I'm simply referencing Istio as that's what we use. Argo Rollouts' attempt to have feature parity across the supported options is good, though the docs do sometimes call out capabilities that are unique to only a subset of the options supported.

Your last question will come down to how concurrent long running versions are expressed declaratively, for rollout time as well as that "end-state" that will persist beyond that. Personally, I would want the rollout YAML to declaratively express desired new state upon success.
