Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

5000+ runs have neither master, pr_base or pr_head labels #1175

Closed
foolip opened this issue Feb 24, 2019 · 46 comments
Closed

5000+ runs have neither master, pr_base or pr_head labels #1175

foolip opened this issue Feb 24, 2019 · 46 comments
Assignees

Comments

@foolip
Copy link
Member

foolip commented Feb 24, 2019

When the master label was introduced some old runs were labeled, but it seems not enough.

In https://github.com/foolip/ad-hoc-wpt-results-analysis/blob/master/audit.js I listed runs with neither master, pr_base or pr_head labels, and there were more than 5000. Some probably aren't master runs, but some definitely are:
https://wpt.fyi/results/?run_id=5632908932939776

Some are probably from commits on master but incomplete, so fixing this might take some work.

@Hexcles
Copy link
Member

Hexcles commented Feb 24, 2019

This was how old runs were retroactively tagged: https://github.com/web-platform-tests/data-migration/blob/master/tagger/master.go

We used the commit log of WPT as the ground truth. I think the logic is sound, but there might be some gaps because of timing (e.g. tagged the old runs before the prod release, out-of-date local WPT checkout). However, the linked example is a very old run, so there might be bugs in the migration code or it wasn't run until completion.

@foolip
Copy link
Member Author

foolip commented Mar 4, 2019

@Hexcles, in #1195 (comment) I wrote a script to find some mislabeled Safari runs. Those would come up in an analysis like this.

I feel like all runs should have one of the 3 known labels, and if anything useful remains perhaps we should come up with a label for those too so that it's easier to be sure that nothing's gone wrong like this.

@foolip
Copy link
Member Author

foolip commented Apr 1, 2019

This is a problem in practice now. In trying to dust off my scripts for measuring results and interop over time, I see that some of the aligned SHAs I used that time around aren't found now because some of the runs don't have the master label.

https://wpt.fyi/results/?sha=a516e7edca&label=stable is the earliest such run, where the Firefox run from Taskcluster is missing the master label.

@foolip
Copy link
Member Author

foolip commented Apr 1, 2019

Simply not requiring the master label would probably work, but there might still be one-off runs with stable browsers that someone has done that then influence the results.

@Hexcles Hexcles self-assigned this Apr 4, 2019
@Hexcles
Copy link
Member

Hexcles commented Apr 4, 2019

Backfilling labels using web-platform-tests/data-migration@ea07520

@Hexcles
Copy link
Member

Hexcles commented Apr 4, 2019

Note that the backfiller is being tested on staging at the moment; prod isn't affected yet.

@Hexcles
Copy link
Member

Hexcles commented Apr 5, 2019

The heuristics aren't strong enough: the script is incorrectly labelling some supposedly pr_base runs as master. I'll use the test duration as another signal: only consider runs >10min (the rationale is that if a PR run can't take >10min, otherwise the stability check would time out). web-platform-tests/data-migration@423adc8

@Hexcles
Copy link
Member

Hexcles commented Apr 5, 2019

The new heuristics seem to work. I don't see any false positives, and the possibility of false negatives (true master runs taking <10min) is low. ~3000 runs are retroactively labelled as master on prod.

@foolip
Copy link
Member Author

foolip commented Apr 5, 2019

How much is left with none of the three labels, and is the origin of all of those runs clear?

@Hexcles
Copy link
Member

Hexcles commented Apr 5, 2019

We still have a couple of thousand, most of which are from Taskcluster and Azure, and I'm fairly certain they are pr_head or pr_base runs based on the short duration (test execution time should be trustworthy on these two runners). There was a short period when we did not add any labels because of a bug in the webhook.

I don't think backfilling the pr_head and pr_base labels has high value. WDYT?

@foolip
Copy link
Member Author

foolip commented Apr 5, 2019

Nah, adding those labels itself doesn't seem valuable, I was more curious about what would remain. I'd expect at most 100 or so runs triggered by folks trying stuff out by tweaking CI config, but if it's more I'd wonder if we've missed some master runs still.

@stephenmcgruer
Copy link
Contributor

@foolip This seems to be the issue you mentioned regarding old runs not marked as master.

@Hexcles When digging up foolip's https://github.com/foolip/ad-hoc-wpt-results-analysis scripts, I found that we didn't seem to have any runs before 2018-06 coming in as 'master'; which included many in the aligned-shas.txt file (which I was using to try and align my new graphs with his old ones).

https://wpt.fyi/api/runs?sha=b0c75cd8857f9f9596a744fac65c3be5e526aede is an example; I notice that the time_end is totally mangled (0001-01-01T00:00:00Z), maybe thats why our earlier attempts using heuristics missed these?

I'm not sure if we can find a different way to backdate into 2017, but re-opening to discuss.

@stephenmcgruer
Copy link
Contributor

Note that this probably causes #1537

@Hexcles
Copy link
Member

Hexcles commented Oct 29, 2019

Stephen, you are correct. This is a problem caused by the additional heuristic "only consider runs >10min". There was no false positive, but a lot of false negative because back in 2017 we did not record timestamps in wptreport JSON. The TimeStart field for those runs was backfilled with CreatedAt and TimeEnd was left empty (the time you're seeing is the zero value of time.Time in Go).

I could try a more robust heuristic (only apply the >10min rule for runs created after we set up pr_base ) if you think those historical runs are valuable -- note that the infra was not very reliable back then and we went through some massive renaming in the WPT repo as well (especially css).

@foolip
Copy link
Member Author

foolip commented Oct 29, 2019

Given that we have manifests now for all master commits since the beginning, can we just check the results against the manifests and label ones that are complete enough? Or just pick a constant number of results to expect?

@Hexcles
Copy link
Member

Hexcles commented Oct 29, 2019

Yes, but either way would require us to download the full JSON and deserialize it. It'd be very slow and I probably need to write some goroutine magics so that it won't take weeks (currently the labeller runs singly).

@foolip
Copy link
Member Author

foolip commented Oct 29, 2019

How about if I just ingest the runs into my wpt-results repo and from there just sort them by number of results? It'd still be slow, but almost all of the code exists and would just have to churn away for a few days.

foolip added a commit to Ecosystem-Infra/wpt-results-analysis that referenced this issue Oct 29, 2019
@foolip
Copy link
Member Author

foolip commented Oct 30, 2019

There are a lot less than 5000 of these runs now. After Ecosystem-Infra/wpt-results-analysis@b41047e and some local edits I see this output counting the runs in question:

Iterated 729 chrome runs
Iterated 136 edge runs
Iterated 715 firefox runs
Iterated 347 safari runs
Iterated 4 webkitgtk runs
Iterated 1931 runs in total

@stephenmcgruer
Copy link
Contributor

This has bit me for some analysis work I've been doing (before 2018-06, I basically have to drop the 'master' tag to get runs in). I think we should fix it, but I'm happy to own doing so since I am the one mainly impacted. (And @Hexcles has enough on their plate!)

@stephenmcgruer
Copy link
Contributor

I've begun looking at this. Thanks to ad-hoc-wpt-results-analysis, I have a promising approach that revolves around checking the results diff for the run. For example:

run/321670007/results has 36208 insertions
...
run/324880027/results has 62 insertions

What is surprising me, however, is how new a lot of these runs are. Both of the above runs are from this month! There's other weird stuff too; run 321670007 has a created_at that appears to be after its time_end... (Perhaps I just don't know what created_at means.)

For now, however, I'm going to look at the 2017 and 2018 runs (since we know there should be missing master runs from then). Plan is to generate run ids via https://github.com/foolip/ad-hoc-wpt-results-analysis/pull/14/files, then use a datastore fixup to add the master label.

@Hexcles
Copy link
Member

Hexcles commented Nov 24, 2019 via email

@stephenmcgruer
Copy link
Contributor

Ah, that would make sense then.

We should figure out how we're getting recent results with no pr_base/pr_head/master label though, if possible.

@stephenmcgruer
Copy link
Contributor

2017 runs were just tagged in Ecosystem-Infra/wpt-results-analysis#14; there were 251 of them.

@foolip
Copy link
Member Author

foolip commented Dec 4, 2019

Does that resolve this issue?

@stephenmcgruer
Copy link
Contributor

Sorry, no, should have been clearer. I am now working on 2018 runs and there are a lot more of them :).

@stephenmcgruer
Copy link
Contributor

Jan-June 2018 have been tagged in Ecosystem-Infra/wpt-results-analysis#17 (comment); there were 1162 of them.

Remaining work is just to check what is leftover. The expectation is that we will be done to <100 runs that we don't care about.

@stephenmcgruer
Copy link
Contributor

stephenmcgruer commented Dec 5, 2019

Well, I was wrong.

$ wc -l tagless-runs.txt 
948 tagless-runs.txt
$ grep 2019-01 tagless-runs.txt | wc -l
603

Any idea what happened in January @Hexcles ?

@Hexcles
Copy link
Member

Hexcles commented Dec 5, 2019

Are those SHAs actually on the master branch?

@stephenmcgruer
Copy link
Contributor

Have not checked them all for being in the repo in general, but they mostly do not appear to be in the list produced by git tag --list 'merge_pr_*' --format="%(objectname:short)" at least. I get 89 hits against that list from the 948 runs total.

@stephenmcgruer
Copy link
Contributor

They're all junk. One has 30k tests, the rest have <50 (many have 1-10).

2019-01-runs-log.txt

@Hexcles
Copy link
Member

Hexcles commented Dec 5, 2019

I didn't find any major fire during that time. And I'd say since they're not on master we can safely ignore them, in which case, this issue is officially fixed (again)!

Thanks so much, @stephenmcgruer !

@Hexcles Hexcles closed this as completed Dec 5, 2019
@foolip
Copy link
Member Author

foolip commented Dec 5, 2019

Yay, thanks @stephenmcgruer!

Now, I wonder if this resulted in many more aligned runs for historical analysis?

@stephenmcgruer
Copy link
Contributor

Hrm. I expected it to have a substantial impact, but I'm only seeing some more data for stable (2017-08 to 2017-10 has suddenly appeared, and 2018-03 to 2018-06) and no new data for experimental (still starts 2019-01).

I'm going to do some random sampling of the runs I labelled master and try to figure out what is happening.

@stephenmcgruer
Copy link
Contributor

I remembered I am a programmer.

$  python check_labels.py 2017-runs-tagged-master.txt
251 total
251 stable
0 experimental
0 other
python check_labels.py 2018-runs-tagged-master.txt
1162 total
694 stable
468 experimental
0 other

I guess we just didn't have STP in 2018; https://wpt.fyi/api/runs?label=experimental&product=safari&from=2018-01-01&to=2018-12-31 (not even including the master label here).

@foolip
Copy link
Member Author

foolip commented Dec 11, 2019

web-platform-tests/results-collection#112 was the original tracking issue for that and I think there should be some STP runs in 2018.

@stephenmcgruer
Copy link
Contributor

Yes, web-platform-tests/results-collection#112 (comment) implies there should be runs from mid-August 2018 onwards...

@stephenmcgruer
Copy link
Contributor

Flipping through https://wpt.fyi/api/runs?labels=safari&from=2018-08-15&to=2018-09-15 from month to month, however, I'm seeing no traces of them. There are runs from Safari 11 and 12, but both are marked stable.

Interestingly, in that thread you actually linked a sha and said there was a TP run for it, and there isn't any longer... https://wpt.fyi/results/?sha=eea0b54014&label=experimental

I wonder if we purged runs at some point for some reason? @Hexcles any ideas?

@foolip
Copy link
Member Author

foolip commented Dec 11, 2019

https://wpt.fyi/results/?sha=eea0b54014&max-count=10 is suspicious. It shows Safari 11.1 and 12.1 both as stable.

Before safaridriver --version existed the version reported was 12.1, so I'm pretty sure that 12.1 run and probably many others were actually STP. However, figuring out which version of STP they were would at this point be very hard. We could still relabel them as experimental, however, if we can find a rule that slices them out and doesn't capture any real 12.1 stable runs.

@Hexcles
Copy link
Member

Hexcles commented Dec 11, 2019

I concur.

According to https://en.wikipedia.org/wiki/Safari_version_history#Safari_12 , Safari 12.1 wasn't released until March 2019, so 12.1 back in 2018 was definitely a technology preview.

Most of the history can be found in web-platform-tests/wpt#14465. Tl;dr: we went back and forth on how (STP has, or at least had, both a future Safari version and a STP version, e.g. Safari 12.1 Technology Preview 70) and where (we had various parsing and overriding in wpt CLI, results-collection, and wpt.fyi) to best handle and represent STP versions, complicated by the fact there was no easy way to extract the version from CLI.

I think we might have accidentally labelled those runs as stable.

@foolip
Copy link
Member Author

foolip commented Dec 11, 2019

web-platform-tests/results-collection#617 has a bunch of history too. It looks like we fixed the version problem before Safari 12.1 was released, so just listing all 12.1 runs, checking that there's a gap and relabeling the first group of runs should be correct.

@stephenmcgruer
Copy link
Contributor

Ack, will look at doing that. Re-opening to track (though the issue is somewhat misleadingly named now, there's enough history here and this shouldn't be open long). Thanks for all the history folks!

@stephenmcgruer
Copy link
Contributor

Right, so listing all Safari stable runs whose browser version starts with 12.1, I see:

2019-10-08T00:02:10.175Z
...
2019-04-10T18:19:59.884Z
2019-04-10T12:16:44.96Z
2019-04-10T06:16:52.8Z
2019-04-09T12:17:00.64Z
2019-01-05T00:12:16.423Z
2019-01-04T00:12:18.446Z
2019-01-03T00:12:13.941Z
...
2018-08-07T12:34:25.644Z

My plan is to change every run whose browser_version starts with 12.1 and whose start_time is before 2019-02 from stable to experimental. Lmk if anyone disagrees.

@stephenmcgruer
Copy link
Contributor

Script has been run on staging + prod (logs on the PR). https://wpt.fyi/api/runs?label=experimental&product=safari&from=2018-01-01&to=2018-12-31 now shows a bunch of runs, let's see what this has done to the available data for experimental

@stephenmcgruer
Copy link
Contributor

We now have experimental data from 2018-08 to 2019-01. Probably as good as we're going to get :)

@foolip
Copy link
Member Author

foolip commented Jan 3, 2020

Running the audit.js script again, there are now 955 runs remaining, for 558 unique commits. Looks like only 77 of those commits are merge_pr_* commits. That seems like a plausible number of runs that might have been triggered for various verification purposes, and puts a low ceiling to how many real and useful master runs could be among these.

Thanks @stephenmcgruer!

@Hexcles
Copy link
Member

Hexcles commented Jan 3, 2020

Great job, Stephen!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants