Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use weighted sampling for Asia builds #1106

Merged
merged 3 commits into from
Aug 22, 2024
Merged

Conversation

victorlin
Copy link
Member

@victorlin victorlin commented May 3, 2024

tracked by #1141

Description of proposed changes

This replaces the Asia/China/India split with population-based weighted sampling (possible with nextstrain/augur#1454).

Previews

Analysis

I think a good comparison is gisaid/asia/all-time before and after weighted sampling. Some notes from that comparison:

  1. With weighted sampling, more sequences are included for large countries such as Indonesia which did not get the same treatment as China and India prior to this PR.
  2. With weighted sampling, there is a high chance that no sequences are included for small countries such as Maldives.
  3. The sampling difference results in some subtle but maybe significant differences in the frequencies panel.

Questions for reviewers

  1. What can be done to check that the sampling results are appropriate?
  2. ✅ Is it a good idea to weight on population size? I was thinking per-country case counts over time might be better, but not sure where to get that data. (population size yes, case counts no)

Notable comment threads

Checklist

Release checklist

If this pull request introduces new features, complete the following steps:

  • Update docs/src/reference/change_log.md in this pull request to document these changes by the date they were added.

@victorlin victorlin self-assigned this May 3, 2024
@victorlin
Copy link
Member Author

Not ready to be merged but opening for initial review and feedback to shape implementation in nextstrain/augur#1454.

@victorlin victorlin marked this pull request as ready for review May 8, 2024 17:42
genehack
genehack previously approved these changes May 8, 2024
@jameshadfield jameshadfield marked this pull request as draft May 8, 2024 22:08
@joverlee521
Copy link
Contributor

I was thinking per-country case counts over time might be better, but not sure where to get that data.

We ingest country case counts from OWID in forecasts-ncov and upload them to s3://nextstrain-data/files/workflows/forecasts-ncov/cases/global.tsv.gz.

@victorlin victorlin force-pushed the victorlin/weighted-sampling branch 2 times, most recently from 5d2c691 to 6b6cd2b Compare May 10, 2024 18:45
@victorlin victorlin changed the title Use population-based weighted sampling for Asia builds Use weighted sampling for Asia builds May 10, 2024
@trvrb
Copy link
Member

trvrb commented Jun 6, 2024

Is it a good idea to weight on population size? I was thinking per-country case counts over time might be better, but not sure where to get that data.

Case counts no longer mean anything and haven't really meant anything since early 2022. We can see this today in the parsed case counts file country_week_weights.tsv where Afghanistan with a population of 41M reported 638 cases in the week of (2024, 20), while Brunei with a population of 500k reported 175 cases in this same time period. This is 1.5 cases per 100k in Afganistran and 35 cases per 100k in Brunei. This 23-fold difference is going to be almost entirely reporting rate differences.

Similarly, South Korea has 135,331 cases listed in your TSV for (2024,20). This is 265 cases per 100k and so a 176-fold difference in reporting rate relative to Afghanistan.

If you do subsampling based on reported cases it will strongly bias included samples to (wealthier) countries with better surveillance.

For ongoing ncov builds, we should be weighting on population size. I wouldn't worry about admin division population size and would just worry about country-level population sizes in terms of weighting.

I'd start by rolling back 6b6cd2b. Ah! I see this was already discussed here #1106 (comment). Can you update PR with what I should be working from for review?

@trvrb
Copy link
Member

trvrb commented Jun 6, 2024

Following from this, I worked from 1afb6d7 and tried:

python3 scripts/get_population_sizes.py --output data/country_population_weights.tsv

from a freshly updated nextstrain shell command and got the following error

Traceback (most recent call last):
  File "/nextstrain/build/scripts/get_population_sizes.py", line 67, in <module>
    export_imf(args.output)
  File "/nextstrain/build/scripts/get_population_sizes.py", line 13, in export_imf
    workbook = xlrd.open_workbook(file_contents=excel_contents, ignore_workbook_corruption=True)
TypeError: open_workbook() got an unexpected keyword argument 'ignore_workbook_corruption'

and if I drop ignore_workbook_corruption=True in line 13 of the script I get an error about a corrupted workbook:

xlrd.compdoc.CompDocError: Workbook corruption: seen[2] == 4

https://www.cia.gov/the-world-factbook/field/population/country-comparison/#:~:text=DOWNLOAD-,DATA,-Rank looks potentially cleaner.

Another suggestion here: my preference for this sort of thing is to version the data. It's much easier to think about and it's not something that needs to be updated day-to-day. I'd just run a script to prepare country_population_weights.tsv once and not have it be part of the workflow. I'd version this to defaults/country_population_weights.tsv. This particular file is very similar to defaults/mutational_fitness_distance_map. In that case I had used scripts/developer_scripts/parse_mutational_fitness_tsv_into_distance_map, with the developer_scripts/ placement to signal that this is not part of the regular workflow.

@victorlin victorlin force-pushed the victorlin/weighted-sampling branch from 6b6cd2b to 256a1ff Compare June 6, 2024 22:46
@victorlin
Copy link
Member Author

I've dropped 6b6cd2b and updated f038890 to use "the world factbook" as source of population data. Also moved the script and data files into version control (great suggestion).

I'm going to run this through the trial builds to check if the updated data source contains the right country names.

@victorlin victorlin force-pushed the victorlin/weighted-sampling branch from dedef9d to c3a78e9 Compare June 7, 2024 23:56
@victorlin
Copy link
Member Author

Cleaned up the commits and updated trial build links in the PR description. Everything is still draft with a few lingering FIXMEs in the changes. Next week I plan to shift back to nextstrain/augur#1454 and tweak the implementation logic before coming back here.

@victorlin victorlin force-pushed the victorlin/weighted-sampling branch 2 times, most recently from ae7b62e to 40d2994 Compare July 2, 2024 00:19
@victorlin victorlin force-pushed the victorlin/weighted-sampling branch from 40d2994 to 4e5ee67 Compare July 17, 2024 19:48
@victorlin victorlin marked this pull request as ready for review July 17, 2024 19:50
Copy link
Contributor

@huddlej huddlej left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks pretty good to me, @victorlin! Regarding your question:

What can be done to check that the sampling results are appropriate?

The main pattern I'm looking for is that the pie chart sizes for Asian countries in the map view for the GISAID builds are proportional to the actual country populations. The all-time GISAID map view looks consistent with the population sizes. For example, Indonesia has more samples than Japan and Malaysia but fewer than China and India. The Philippines and Vietnam have similar numbers of samples.

Regarding the workflow implementation, I really like how simple you've made this new option to group by weights. My only minor comment would be that additional documentation could help developers and users better understand how to generate these weights and use them in the workflow. For developers, a small section in the ncov README could be enough. For users, an entry in the config reference guide would be most useful.

Altogether, this is excellent, though! Thank you for the hard work you've put into pushing this along over many many months!

scripts/developer_scripts/get_population_weights Outdated Show resolved Hide resolved
defaults/population_weights.tsv Outdated Show resolved Hide resolved
@trvrb
Copy link
Member

trvrb commented Aug 2, 2024

I really like where this is going! However, I ran into an error when trying to run this locally. In this filter step:

augur filter --metadata nextstrain-ncov-private/100k/metadata.tsv.xz
  --include defaults/include.txt
  --exclude defaults/exclude.txt
  --max-date 6M
  --exclude-where 'region!=Asia'
  --group-by country year month
  --group-by-weights defaults/population_weights.tsv
  --subsample-max-sequences 700
  --output-strains results/asia_6m/sample-asia_early.txt
WARNING: Asked to provide at most 700 sequences, but there are 1601 groups.
Sampling with weights defined by defaults/population_weights.tsv.
NOTE: Skipping 101242 groups due to lack of entries in metadata.
NOTE: Weights were not provided for the columns 'year', 'month'. Using equal weights across values in those columns.
ERROR: 27 groups appear in the metadata but are missing from the weights file. Re-run with --output-group-by-missing-weights to continue.

If I then run with adding --output-group-by-missing-weights missing-weights.txt it completes successfully with standard out:

WARNING: Asked to provide at most 700 sequences, but there are 1601 groups.
Sampling with weights defined by defaults/population_weights.tsv.
NOTE: Skipping 101242 groups due to lack of entries in metadata.
NOTE: Weights were not provided for the columns 'month', 'year'. Using equal weights across values in those columns.
WARNING: 27 groups appear in the metadata but are missing from the weights file. Sequences from these groups will be dropped.
All missing groups added to 'missing-weights.txt'.
91656 strains were dropped during filtering
	69341 were dropped because of 'region!=Asia'
	4791 were dropped because they were later than 2024.09 or missing a date
	1 was added back because it was in defaults/include.txt
	17525 were dropped because of subsampling criteria
695 strains passed all filters

The missing weights file looks like:

country	year	month	weight
Myanmar	2020	(2020, 4)	
Myanmar	2020	(2020, 8)	
Myanmar	2021	(2021, 1)	

I can see that Myanmar is missing from population_weights.tsv. Two things:

  1. I think that ncov needs to have --output-group-by-missing-weights
  2. Double check entries in population_weights.tsv to make sure all countries in our metadata are present.

I think the behavior off erroring out is maybe not the best, but I can discuss this over in nextstrain/augur#1454.

@victorlin
Copy link
Member Author

@trvrb thanks for taking it for a spin, sorry that you ran into another error 😞

the behavior off erroring out is maybe not the best, but I can discuss this over in nextstrain/augur#1454.

I think appropriate to discuss here since your scenario is a good example.

The erroring behavior is what I intended. The weights file should be comprehensive, otherwise sequences get dropped due to lack of weights.

For this PR, that means population_weights.tsv should have weights for everything under region=Asia. I assumed running rebuild-gisaid.yml would be sufficient to test this, but that didn't include sequences from Myanmar (not sure why, I'll investigate separately).

The proper fix here is to update the country name mapping in get_population_weights with "Burma": "Myanmar" (your 2nd point).

I think that ncov needs to have --output-group-by-missing-weights

I think differently. --output-group-by-missing-weights is meant more as an immediate workaround rather than something to use by default. If the ncov workflow uses this option by default, it's less likely that we'd catch any discrepancies since WARNINGs in output are not something we go back and check regularly.

Example scenario: right now we know population_weights.tsv is comprehensive, but sometime in the future new sequences are added under a country name that differs from what's in population_weights.tsv. This would still produce "successful" automated builds and one would only notice that those new sequences are being dropped if the Snakemake logs are inspected.

@victorlin
Copy link
Member Author

I assumed running rebuild-gisaid.yml would be sufficient to test this, but that didn't include sequences from Myanmar (not sure why, I'll investigate separately).

I re-ran rebuild-gisaid.yml and it failed with the same error. I suspect that I ran the last trial build before pushing error handling updates so it silently succeeded by dropping all Myanmar sequences.

The error message has been improved:

ERROR: The input metadata contains these values under the following columns that are not covered by 'defaults/population_weights.tsv':
- 'country': ['Myanmar']
Re-run with --output-group-by-missing-weights to continue.

This confirms that Myanmar is the only country missing from population_weights.tsv. I'll push a fix to add Myanmar weight, re-run trial build, and update preview links in the PR description.

@victorlin victorlin force-pushed the victorlin/weighted-sampling branch 3 times, most recently from 2f02c59 to 6c49c89 Compare August 7, 2024 18:23
@trvrb
Copy link
Member

trvrb commented Aug 8, 2024

Thanks for the feedback @victorlin. I have a couple thoughts:

  1. In decent fraction of scenarios, I believe I'm going to want a "default" value for population_weights.tsv. As a simple example, I might want to specify --group-by country and --group-by-weights weights.tsv with entries of
country weight
USA     100
default 1

where I don't want to have to semi-manually make sure to include every possible country as 1, just to be able to specify that I want sampling more intensively from the USA than other countries. In many situations you'll have a category like host where you won't be sure what exactly will come through and you can't enumerate ahead of time. If we were to add weighted sampling here https://next.nextstrain.org/avian-flu/h5n1-cattle-outbreak/genome we'd run into random new hosts continually breaking the build. I'd want a simple default to avoid this erroring.

However, I do totally see why you'd want to require an explicit default value rather than having an implicit default weight for entries in metadata that aren't present in weights.tsv.

  1. It feels a little strange to tie "don't error out on missing values" to --output-group-by-missing-weights. They seem semantically distinct to me. By giving me the option of not erroring out by including --output-group-by-missing-weights missing-weights.txt I wanted to always include this option so that the build wouldn't error out (rather than as a one-off strategy to find and fix missing values).

My suggestion would be to just drop --output-group-by-missing-weights as an option. If there are missing values they should always be printed. I really like:

ERROR: The input metadata contains these values under the following columns that are not covered by 'defaults/population_weights.tsv':
- 'country': ['Myanmar']

This also slightly slims down the very long list of augur filter command line arguments. There's already a bunch of output- options. This also stops the user from having to run the command again with an additional argument.

So, scenarios would be

  • Missing values in weights.tsv with no default set – Error out, print to standard out the missing values that need to be specified
  • Missing values in weights.tsv with default set – Proceed, but still print to standard out the missing values

@victorlin
Copy link
Member Author

@trvrb thanks for the example use case and suggestion – both were insightful. I've implemented in the Augur PR as nextstrain/augur@a00d3b5.

For the example here, the scenarios now look like:

  • ERROR: The input metadata contains these values under the following columns that are not covered by 'defaults/population_weights.tsv':
    - 'country': ['Myanmar']
    To fix this, either:
    (1) specify weights explicitly - add entries to 'defaults/population_weights.tsv' for the values above, or
    (2) specify a default weight - add an entry to 'defaults/population_weights.tsv' with the value 'default' for all columns
    
  • WARNING: The input metadata contains these values under the following columns that are not directly covered by 'defaults/population_weights.tsv':
    - 'country': ['Myanmar']
    The default weight of 1 will be used for all groups defined by those values.
    ...
    

Copy link
Member

@trvrb trvrb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really nice work here. Thanks for humoring me with the requested changes. This looks good to me from ncov perspective. I'd suggest a trial build just to make sure nothing is broken. And then maybe doing a second PR to extend weighted sampling to other regions?

@victorlin victorlin mentioned this pull request Aug 14, 2024
5 tasks
@victorlin
Copy link
Member Author

I'd suggest a trial build just to make sure nothing is broken.

Rebuilding the docker image to run trial builds now. Once that's done, I'll update the links in the PR description. I'll also take a look and note if anything seems off compared to the previous run.

And then maybe doing a second PR to extend weighted sampling to other regions?

I've created an issue to track rollout of weighted sampling in this workflow: #1141

To be used for weighted sampling in a future commit.

I considered the following data sources:

- World Bank Data <https://data.worldbank.org/indicator/SP.POP.TOTL>
- IMF <https://www.imf.org/external/datamapper/NGDP_RPCH@WEO/OEMDC/ADVEC/WEOWORLD>
- CIA The World Factbook <https://www.cia.gov/the-world-factbook/field/population/country-comparison/>
- United Nations World Population Prospects <https://population.un.org/wpp/Download/Standard/CSV/>

The UN data seemed to be the most comprehensive and easy to use.
This replaces the Asia/China/India split with population-based weighted
sampling (possible in Augur version 25.3.0).

This requires changing the geographical grouping resolution from
division to country, but I assume it was only grouped by division in an
attempt to have varying group sizes per country, and that
population-based weighting is an acceptable replacement.
@victorlin victorlin force-pushed the victorlin/weighted-sampling branch from aac80c2 to dc5433f Compare August 22, 2024 18:14
@victorlin victorlin merged commit e69c37e into master Aug 22, 2024
6 of 7 checks passed
@victorlin victorlin deleted the victorlin/weighted-sampling branch August 22, 2024 18:36
@victorlin victorlin mentioned this pull request Aug 26, 2024
2 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants