
Reduce Likelihood of Duplicate Dependency Graph Update Workflow PRs #1379

Merged
merged 3 commits into main from kc/repocop-schedule
Jan 6, 2025

Conversation

kelvin-chappell
Member

@kelvin-chappell kelvin-chappell commented Jan 6, 2025

What does this change?

Moves the schedule of the RepoCop and CloudBuster production ECS tasks from 09:30 GMT weekdays to 03:00 GMT weekdays.

Why?

There's a scenario, seen in PRs https://github.com/guardian/contributions-platform/pull/551 and https://github.com/guardian/contributions-platform/pull/552, where the dependency-graph-integrator job runs against stale data.
The job makes decisions based on the service catalogue's github-workflows table, which is updated at midnight GMT.
This means that if a PR is merged between midnight and 09:30 GMT, a duplicate will be created at 09:30, because the state of the repo's GHA workflows is no longer reflected in the github-workflows table.

This is the simplest possible solution, but it doesn't eliminate the problem: PRs merged between midnight and 03:00 GMT will still be duplicated at 03:00. However, there's a far lower likelihood of a PR being merged during those hours.
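The effect of the schedule change can be sketched as follows. The function name, hour values, and cron expressions below are illustrative, not taken from the repo's actual configuration:

```typescript
// Sketch: the duplicate-PR risk window is the gap between the nightly
// github-workflows table refresh (00:00 GMT) and the task's scheduled run.
// A PR merged inside that window is invisible to the next run.

const TABLE_REFRESH_HOUR_UTC = 0; // service catalogue tables refresh at midnight GMT

// Hours during which a merged PR is not yet reflected in the table.
function staleWindowHours(taskRunHourUtc: number): number {
  return taskRunHourUtc - TABLE_REFRESH_HOUR_UTC;
}

// Old schedule, roughly cron(30 9 ? * MON-FRI *): a 9.5-hour risk window.
console.log(staleWindowHours(9.5)); // 9.5
// New schedule, roughly cron(0 3 ? * MON-FRI *): a 3-hour risk window.
console.log(staleWindowHours(3)); // 3
```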

How has it been verified?

Monitor once in production:

  • A successful RepoCop task has run at 03:00 GMT
  • A successful CloudBuster task has run at 03:00 GMT

@kelvin-chappell kelvin-chappell requested review from a team as code owners January 6, 2025 12:17
Contributor

@tjsilver tjsilver left a comment

Is there any reason not to then run repocop more closely after the workflows table is updated? Say, 1am?

@kelvin-chappell
Member Author

> Is there any reason not to then run repocop more closely after the workflows table is updated? Say, 1am?

No, not really. Does the cloudquery job ever have to retry or take a long time to start? I could move it to 3am to give a safe timeframe for successful data population and avoid the 0-2 am BST changeover weirdness

@tjsilver
Contributor

tjsilver commented Jan 6, 2025

> Is there any reason not to then run repocop more closely after the workflows table is updated? Say, 1am?
>
> No, not really. Does the cloudquery job ever have to retry or take a long time to start? I could move it to 3am to give a safe timeframe for successful data population and avoid the 0-2 am BST changeover weirdness

That sounds sensible. What do you think @akash1810 ?

Contributor

@tjsilver tjsilver left a comment

👍

@akash1810
Member

> Does the cloudquery job ever have to retry or take a long time to start? I could move it to 3am to give a safe timeframe for successful data population and avoid the 0-2 am BST changeover weirdness

We don't explicitly retry any failed tasks. There's also no guarantee of how long a task will run for; we can use metrics from previous invocations to get an idea, but accuracy will be within 10 minutes, I'd imagine.

I think switching to being event-driven vs cron-driven would be the ultimate solution here. That is, run when the dependent CloudQuery tasks have successfully completed... whenever that might be. This might be more involved than this change though and shouldn't block getting this out.

@kelvin-chappell
Member Author

> I think switching to being event-driven vs cron-driven would be the ultimate solution here. That is, run when the dependent CloudQuery tasks have successfully completed... whenever that might be. This might be more involved than this change though and shouldn't block getting this out.

👍 Completely agree. I'm in the process of looking at the dependency graph of Repocop to work out the event chain. But this is a simple short-term mitigation.
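The event-driven approach discussed above could be sketched as an EventBridge event pattern that fires when the upstream CloudQuery ECS task stops cleanly, instead of on a fixed cron. The detail-type and fields below come from AWS's standard ECS Task State Change events; the variable name and cluster ARN are hypothetical placeholders, not the repo's actual configuration:

```typescript
// Sketch of an EventBridge event pattern for an event-driven trigger:
// run the downstream task when the upstream ECS task stops successfully.
// The cluster ARN is a placeholder, not a real account or cluster.
const cloudqueryTaskFinished = {
  source: ["aws.ecs"],
  "detail-type": ["ECS Task State Change"],
  detail: {
    lastStatus: ["STOPPED"],
    stopCode: ["EssentialContainerExited"],
    containers: { exitCode: [0] }, // only fire on a clean exit
    clusterArn: ["arn:aws:ecs:eu-west-1:000000000000:cluster/cloudquery"],
  },
};

console.log(JSON.stringify(cloudqueryTaskFinished, null, 2));
```

This trades schedule guesswork for an explicit dependency: the downstream task runs whenever the data is fresh, however long the upstream task takes.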

@kelvin-chappell kelvin-chappell merged commit 5e8f2fd into main Jan 6, 2025
7 checks passed
@kelvin-chappell kelvin-chappell deleted the kc/repocop-schedule branch January 6, 2025 15:04