Fill in more lat/long data from the OpenStreetMap name-to-location API #920

Open
wants to merge 2 commits into
base: master

Conversation

jgadling
Contributor

@jgadling jgadling commented Apr 15, 2022

Increase coverage for default location to latlong mapping

After running all GISAID samples through ncov's location translation pipeline, our system tries to associate a lat/long with each location in the resulting list. The original lat_longs.tsv file maps approximately 50% of locations to a lat/long.

Hoping to increase coverage, I wrote a script to fetch a lat/long from the OpenStreetMap name-to-location search API and this is the resulting file. It gets us up to approximately 75% of locations with a valid lat/long.

These changes are entirely additive - no existing rows have been removed or modified, so it shouldn't produce any backwards-incompatible issues.
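For reviewers who want to confirm the "entirely additive" claim, a check along these lines could work. This is a minimal sketch, not part of the PR: it assumes each row of lat_longs.tsv is tab-separated as resolution, name, latitude, longitude, and the helper names `read_rows` and `check_additive` are hypothetical.

```python
def read_rows(text):
    """Parse lat/long TSV text into a dict keyed by (resolution, name).

    Assumed row layout: resolution<TAB>name<TAB>latitude<TAB>longitude.
    """
    rows = {}
    for line in text.splitlines():
        fields = line.split("\t")
        # Skip blank lines, comments, and rows that are too short to parse.
        if len(fields) < 4 or fields[0].startswith("#"):
            continue
        resolution, name, lat, lon = fields[:4]
        rows[(resolution, name)] = (lat, lon)
    return rows


def check_additive(old_text, new_text):
    """Return (removed, modified) keys; both lists are empty for an additive change."""
    old, new = read_rows(old_text), read_rows(new_text)
    removed = sorted(key for key in old if key not in new)
    modified = sorted(key for key in old if key in new and old[key] != new[key])
    return removed, modified
```

Running `check_additive` on the contents of the file before and after this change should return two empty lists if no existing row was dropped or altered.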

Related issue(s)

Fixes #
Related to #

Testing

What steps should be taken to test the changes you've proposed?
If you added or changed behavior in the codebase, did you update the tests, or do you need help with this?

@trvrb
Member

trvrb commented Apr 15, 2022

Cool! How did you handle issues with different names in GISAID sequences? We'd been attempting to standardize locations for a while, which tend to be quite messy. Just a couple of examples from https://github.com/nextstrain/ncov-ingest/blob/master/source-data/gisaid_annotations.tsv:

Germany/NW-HHU-1083/2021	EPI_ISL_1346721	location	Duesseldorf # previously  (Dusseldorf Health department)
Germany/NW-HHU-3865/2021	EPI_ISL_1990663	location	Duesseldorf # previously  (Düsseldorf Health department (interpreted as patient residence))
USA/ID-BVAMC-740558/2021	EPI_ISL_3156490	location	Ada County # additional_location_info: Ada
USA/ID-BVAMC-740611/2021	EPI_ISL_3156499	location	Bonneville County # additional_location_info: Bonneville
USA/IL-C21WGS0591/2021	EPI_ISL_1965028	location	Kenosha County # previously  (Kenosha County)

Lots of issues with US counties sometimes having "County" and sometimes not.

However, we stopped trying to standardize as there was just too much human labor involved relative to payoff.

Here, do you end up with a bunch of the same lat/longs for the slightly different spellings of the same location?

@jgadling jgadling force-pushed the jgadling/geolocations branch 2 times, most recently from 7879102 to f6b605e Compare April 18, 2022 23:10
@jgadling jgadling force-pushed the jgadling/geolocations branch from f6b605e to 0dafcb1 Compare April 18, 2022 23:22
@jgadling
Contributor Author

I've updated this PR against the latest locations db - the first one was based on an older file, sorry.

Cool! How did you handle issues with different names in GISAID sequences? We'd been attempting to standardize locations for a while, which tend to be quite messy.

Honestly I was unaware of the additional_location_info metadata until you mentioned it. The additional locations in this PR are currently just a brute-force mapping of country/division/location into the OSM country/state/city search fields, and recording whether we have a match.
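To make the brute-force mapping concrete, here is a minimal sketch of such a lookup. It is not the script used for this PR. It assumes the OSM search API in question is Nominatim's `/search` endpoint with its structured `country`, `state`, and `city` parameters; the function names and the `user_agent` argument are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the public Nominatim instance is the search API being used.
NOMINATIM_URL = "https://nominatim.openstreetmap.org/search"


def build_search_url(country, division, location):
    """Map country/division/location onto the country/state/city search fields."""
    params = {
        "country": country,
        "state": division,
        "city": location,
        "format": "json",
        "limit": 1,
    }
    return NOMINATIM_URL + "?" + urllib.parse.urlencode(params)


def parse_lat_long(results):
    """Return (lat, lon) as floats from the first result, or None if no match."""
    if not results:
        return None
    first = results[0]
    return float(first["lat"]), float(first["lon"])


def lookup(country, division, location, user_agent):
    """Query the search endpoint once and return (lat, lon) or None."""
    request = urllib.request.Request(
        build_search_url(country, division, location),
        # The public instance expects an identifying User-Agent header.
        headers={"User-Agent": user_agent},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return parse_lat_long(json.load(response))
```

A batch run against the public instance would also need to be throttled to stay within its usage policy; that part is left out of the sketch.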

Lots of issues with US counties sometimes having "County" and sometimes not.

Yes, and that's not even half of it - many locations in the post-ncov-filtered db have non-standard formats like Prague 1 or Butler County AL. I have a much more aggressive version of the import script that just keeps dropping words off the end of the location field.
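The word-dropping idea can be sketched roughly as follows. This is an illustration rather than the actual import script: `trailing_word_candidates` and `first_match` are hypothetical names, and `lookup` stands for any callable that returns a lat/long pair or `None`.

```python
def trailing_word_candidates(location):
    """List the location name, then progressively shorter versions of it.

    Each step drops one more word from the end of the name.
    """
    words = location.split()
    return [" ".join(words[:count]) for count in range(len(words), 0, -1)]


def first_match(location, lookup):
    """Try each candidate in order and return (candidate, result) for the first hit."""
    for candidate in trailing_word_candidates(location):
        result = lookup(candidate)
        if result is not None:
            return candidate, result
    return None
```

For `Butler County AL` this tries the full name, then `Butler County`, then `Butler`. The shorter the candidate, the higher the risk of matching an unrelated place, which is why the approach is more aggressive.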

I'm not sure what's preferred here - I can update the PR with the output of the more aggressive inclusion script, I can update it to use the additional_location_info data, or really anything else.

However, we stopped trying to standardize as there was just too much human labor involved relative to payoff.

Here, do you end up with a bunch of the same lat/longs for the slightly different spellings of the same location?

That happens somewhat, but it doesn't tend to be a major problem since the OSM search tool is fairly fussy and doesn't fix/hide/handle misspellings.

The much bigger problem here is actually the format of this TSV file. I know a fair amount of work has gone into the location translation scripts to make location names globally unique, but they don't wind up being that unique in the end. We'll have multiple rows in our DB with the same location name but different countries (ex: USA/Mississippi/Union vs Argentina/San Luis/Union), so when importing locations from this TSV file we have to decide whether location\tUnion should represent the Mississippi or the Argentina location. I think we'd be better served by including country, division, and location on every line in this file.
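To show how the collision arises, here is a minimal sketch that lists location names shared by more than one country/division pair. The function name `ambiguous_locations` and the tuple layout of `records` are assumptions made for the example, not something taken from our DB schema.

```python
from collections import defaultdict


def ambiguous_locations(records):
    """Find location names that appear under more than one (country, division).

    `records` is an iterable of (country, division, location) tuples.
    """
    seen = defaultdict(set)
    for country, division, location in records:
        seen[location].add((country, division))
    return {
        name: sorted(places)
        for name, places in seen.items()
        if len(places) > 1
    }
```

Any name this returns cannot be resolved from a `location` row keyed by name alone, which is the case that carrying country and division on each line would address.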

Status: In Review

2 participants