Suggestions for additional datasets #135

Closed
kirienko opened this issue Nov 20, 2020 · 25 comments

Comments

@kirienko

It seems that the level=2 data for Russia comes from this repo, which itself is not being updated. Of course it's easy to say «That's not our issue but theirs.» But no.

@eguidotti
Collaborator

Have you tried to contact them? Can you suggest other data sources we can use instead? Thanks

@kirienko
Author

No, I didn't try. There is an open issue there. And although the repo has been updated since that issue was opened, it doesn't seem to be done on a regular basis (i.e. in an automated fashion).

I would suggest using this source, which looks more automated. But since I'm by no means affiliated with those people, I cannot guarantee it works better.

@kirienko
Author

Well, it turned out that the main JHU repo has actual level=2 data for Russia. I think it's the best choice.

@eguidotti
Collaborator

Thanks for the suggestion. But unfortunately I'm not able to find the historical data for the regions in the JHU repo (see here).
I'm cross-checking the data against the source you suggested so that we can move to it.

@eguidotti
Collaborator

Hi @kirienko. After spending some time, I was not able to validate the data from the other repo. Moreover, I'm afraid it may be discontinued as well in the future. I agree that the best choice would be JHU, since it seems there is no open governmental data for Russia (this is what they said when we tried to contact them some months ago).
I opened an issue at the JHU repo to ask whether they are going to add the Russian regions to the time series dataset. That would be great and easy to integrate. Otherwise, we'd need to do that by hand, which would require a lot of work.

@kirienko
Author

kirienko commented Dec 1, 2020

Hi @eguidotti. Thank you so much for your efforts! I really appreciate your work!

@greg-minshall

hi. i have been pulling and cleaning up the JHU data, more or less since the beginning. just having been pointed in your direction, i was thinking of converting to your data. but, it's true that the data i've been processing might have some data you want. if you want daily changes, you'd want the columns whose names end in "_Changes_1".

my github repository is here, though i wouldn't want to be the first person other than me trying to actually do a build.

if you're interested, and have any questions, please let me know. cheers!

@eguidotti
Collaborator

Hi @greg-minshall and thanks for your message.
Do you have region-wise data for Russia? That would be quite interesting. Also, are you hosting the data you generate somewhere? I couldn't find them. Thanks!

@greg-minshall

hosting is here. yes, there is oblast, etc., for Russia, for example, here -- assuming that's sort of what you are hoping for.

@greg-minshall

btw: the README.org in my repo had old links; i've just changed that (in case you poke around), hopefully correctly. cheers.

@eguidotti
Collaborator

@greg-minshall Well actually, it looks quite interesting! It seems to me the file I could easily integrate in this repository is the following: https://somenumbers.info/covid-19/csvs/coleaned.csv.gz

Just a couple of questions:

  • is the file updated daily?
  • is there a link to the cleaning performed on the JHU data?
  • do you have a title/citation for your project? I'd need to put this in the data sources if we decide to integrate it.

Thanks!

@greg-minshall

@eguidotti coleaned is basically a cleaned up version of the entire run of JHU "daily" files.

yes, the files are updated daily. i think, in the past six months, there have been only a few glitches, maybe one time when something changed in the JHU data.

i don't have a citation. you can just say "Greg Minshall", some such. or, a pointer to the repo.

if coleaned works, that's great, as it's the smallest file, and you'll be the closest to JHU (in terms of me messing with the data).

the cleaning performed? you can look in covid19.org in the repo (not that you'd want to), in a section colean.R. part (compute_intervals) is dealing with early times when, e.g., Australia (i think) was represented for a while by Australia, then by some of the regions, then back to the country, etc.

then (propagating, but also in colean) deals with making sure that every entity has an entry for every date after it first appears. it is very slow, doing lots of dplyr::group_by(), etc. (at some point i'd like to switch to data.table, partly in the hopes it might be faster; and, actually, it's my memory of the complexity of the cleaning code that gives me pause.)
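As a rough illustration of the propagation step described above, here is a minimal Python sketch. It is not the colean.R code (which is R with dplyr), and it assumes that a missing day is filled by carrying the previous cumulative value forward; the thread does not say how the gaps are actually filled. The function name `propagate`, the tuple layout, and the sample numbers are made up for the example.

```python
from datetime import date, timedelta


def propagate(rows, last_day):
    """Give each entity one row per day from its first appearance to last_day.

    rows: iterable of (entity, day, value) tuples with cumulative values.
    Assumption: a missing day repeats the most recent earlier value.
    """
    by_entity = {}
    for entity, day, value in rows:
        by_entity.setdefault(entity, {})[day] = value

    filled = []
    for entity, observed in sorted(by_entity.items()):
        day = min(observed)  # first date on which the entity appears
        current = observed[day]
        while day <= last_day:
            current = observed.get(day, current)  # carry forward if missing
            filled.append((entity, day, current))
            day += timedelta(days=1)
    return filled


# Made-up sample: Moscow has no row for 2021-04-02.
rows = [
    ("Moscow", date(2021, 4, 1), 100),
    ("Moscow", date(2021, 4, 3), 130),
    ("Tula", date(2021, 4, 2), 7),
]
result = propagate(rows, last_day=date(2021, 4, 3))
```

Grouping once into a dictionary avoids rescanning the whole table per entity, which is one plausible reason a repeated group-by approach feels slow on a long run of daily files.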

i also drop some JHU columns

known_excludes <- c("Incidence_Rate", "Case-Fatality_Ratio", "Incident_Rate", "Case_Fatality_Ratio")

as i think i can derive those from the existing data. (though, in fact, i don't.)
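To make the "derive those from the existing data" remark concrete, here is a small Python sketch (the original pipeline is R, and by the author's own account does not do this). It drops the excluded columns and recomputes a case fatality ratio, assuming the ratio is deaths as a percentage of confirmed cases. An incidence rate would additionally need a population figure, which is not shown. The helper names and the sample row are hypothetical.

```python
# Same column names as the known_excludes vector above.
KNOWN_EXCLUDES = {
    "Incidence_Rate",
    "Case-Fatality_Ratio",
    "Incident_Rate",
    "Case_Fatality_Ratio",
}


def drop_excluded(row):
    """Return a copy of the row without the excluded columns."""
    return {key: value for key, value in row.items() if key not in KNOWN_EXCLUDES}


def case_fatality_ratio(row):
    """Deaths as a percentage of confirmed cases; None when there are no cases."""
    confirmed = int(row["Confirmed"])
    if confirmed == 0:
        return None
    return 100.0 * int(row["Deaths"]) / confirmed


# Made-up sample row, with values as strings as they would come out of a CSV reader.
row = {
    "Province_State": "Alabama",
    "Confirmed": "200",
    "Deaths": "5",
    "Case_Fatality_Ratio": "2.5",
}
slim = drop_excluded(row)
cfr = case_fatality_ratio(slim)
```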

there's filtering: remove duplicates, take only the last observation (Last_Update) from a given date.
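The filtering rule above can be sketched in a few lines of Python (again, not the author's R code). The sketch assumes that every row carries a `Last_Update` timestamp in one consistent ISO-like format, so that comparing the strings is the same as comparing the times; the `entity` and `date` keys and the sample values are hypothetical.

```python
def latest_per_day(rows):
    """Keep, for each (entity, date), only the row with the greatest Last_Update.

    rows: iterable of dicts with the keys "entity", "date" and "Last_Update".
    Exact duplicates collapse to one row because only a strictly later
    timestamp replaces the row already kept.
    """
    best = {}
    for row in rows:
        key = (row["entity"], row["date"])
        if key not in best or row["Last_Update"] > best[key]["Last_Update"]:
            best[key] = row
    return list(best.values())


# Made-up sample: one exact duplicate and one later revision for the same day.
rows = [
    {"entity": "Alabama", "date": "2021-04-04",
     "Last_Update": "2021-04-05 04:20:49", "Confirmed": 10},
    {"entity": "Alabama", "date": "2021-04-04",
     "Last_Update": "2021-04-05 04:20:49", "Confirmed": 10},
    {"entity": "Alabama", "date": "2021-04-04",
     "Last_Update": "2021-04-05 09:00:00", "Confirmed": 12},
]
kept = latest_per_day(rows)
```

If the timestamp formats differ between files, they would have to be parsed into real datetimes before comparison.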

i add FIPS and Iso3c columns.

there's also some textual transformations (dosed and friends -- see the file fixups.sed in the repo) on the .csv files, to deal with anomalies early on.

that seems to be about it.

@greg-minshall

i realized in my listing of transformations i missed some bits that used to be in the file fixups.sed, but are now embedded in covid19.org, in a table csvsedtable inside the csvsed header. these mostly normalize names at the Country_Region, Province_State, and Admin2 levels; plus some fiddling with the odd FIPS.
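A table-driven name normalization of the kind described above might look like the following Python sketch. The entries in `FIXUPS` are purely illustrative and are not taken from csvsedtable, whose contents are not reproduced in this thread; only the column names `Country_Region`, `Province_State` and `Admin2` come from the discussion.

```python
import re

# Hypothetical substitution table: (column, pattern, replacement).
FIXUPS = [
    ("Country_Region", r"^Mainland China$", "China"),
    ("Country_Region", r"^Korea, South$", "South Korea"),
    ("Admin2", r"\s+County$", ""),
]

# Compile once so the patterns are not re-parsed for every row.
COMPILED = [(column, re.compile(pattern), repl) for column, pattern, repl in FIXUPS]


def normalize(row):
    """Return a copy of the row with every matching fixup applied in table order."""
    out = dict(row)
    for column, pattern, repl in COMPILED:
        if out.get(column):  # skip missing or empty fields
            out[column] = pattern.sub(repl, out[column])
    return out


fixed = normalize(
    {"Country_Region": "Mainland China", "Province_State": "Hubei", "Admin2": ""}
)
```

Keeping the rules in a table rather than in code means a new anomaly in the source files only needs a new row.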

@eguidotti
Collaborator

Thanks @greg-minshall for the information. I have integrated the data for Russia. Let's wait a couple of hours for the workflow to complete and see if we can close this long-standing issue.

I was also interested in the recovered cases for the USA, but I see only very few observations (dates) for each state. E.g. in https://somenumbers.info/covid-19/csvs/coleaned.csv.gz Alabama has only about 10 observations. Is it the same in the JHU data?

As far as I understand, the other files aggregate the numbers, e.g. they compute the totals for Alabama by summing all entries that have Alabama as the upper level in the combined key. Is that correct? At an early stage of this project I was also aggregating the data in this way, but then I noticed that it usually doesn't work. In my experience, the sums almost never matched the data provided directly for the upper level. For instance, if only one city is missing from the data, the aggregated state-wise counts are biased downward. Moreover, the data released for the upper level may include travelers or cases whose exact location is not known. So unfortunately I won't be able to use the aggregated data.
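The downward bias described above is easy to see in a toy comparison. This Python sketch sums lower-level counts per upper-level unit and subtracts the result from a directly reported total; the function name, the data layout, and all numbers are invented for illustration and do not come from any real dataset.

```python
from collections import defaultdict


def compare_with_reported(lower_rows, reported_totals):
    """Return {upper_unit: reported total minus sum of lower-level counts}.

    lower_rows: iterable of (upper_unit, lower_unit, count).
    reported_totals: dict mapping upper_unit to the directly published count.
    A positive gap means the bottom-up sum undercounts.
    """
    summed = defaultdict(int)
    for upper, _lower, count in lower_rows:
        summed[upper] += count
    return {upper: total - summed[upper] for upper, total in reported_totals.items()}


# Invented numbers: "City 3" (15 cases) is missing from the lower-level data.
lower_rows = [
    ("State A", "City 1", 40),
    ("State A", "City 2", 25),
]
reported_totals = {"State A": 80}
gaps = compare_with_reported(lower_rows, reported_totals)
```

Here the gap for "State A" is 15, the count of the missing city. Cases with no known location would widen the gap in the same direction.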

@greg-minshall

@eguidotti, you're welcome. i hope it helps. let me know.

i only use JHU data.

i think 'Recovered' comes from JHU's csse_covid_19_data/csse_covid_19_daily_reports_us series. i have a (very recent) "issue" in my repo to remove that series. i think they started recording that, then discontinued (probably the data wasn't reliable).

yes, you're right about the aggregation technique. i think when i originally did that work i did some verification. if i look at the JHU data now, for example, for California on 2021-04-04 (csse_covid_19_data/csse_covid_19_daily_reports/04-04-2021.csv), i don't seem to see numbers at the state level, only at the Admin2 (county) level. for Canada, i only see data for the provinces/territories, not for the country as a whole. so, in my experience, there is no "data provided directly for the upper level" (a situation which makes me happy, being a believer in second normal form :).

were you looking at these "daily reports"? or, the more often used "time series" (that's a set of data i don't use so am not familiar with).

@eguidotti
Collaborator

@greg-minshall yes, it works and I'm going to close this issue. Thanks a lot!

> were you looking at these "daily reports"? or, the more often used "time series"

Time series data

> there is no "data provided directly for the upper level"

I guess that's the case for JHU. What I mean by "the data released for the upper level" is data released directly by the government for the upper level (not necessarily in the US, but around the world). In general, when I aggregated data from the lower levels I never got the counts provided for the upper level. Also, in many cases JHU data (aggregated or not) do not match the ones available from open governmental data. That's basically the motivation behind this repo :) We try to pull the data from the official providers whenever possible. But in many cases it is not possible, and work like yours is very useful!

@greg-minshall

@eguidotti ah, "ground truth", or whatever the saying is. no, i decided early on that for me, JHU == Truth.

btw, i've killed off the embarrassing Recovered, et al.

also, if you ever wanted (as a backup, say, to my build process), probably producing a daily coleaned.csv file would be reasonably easy for you to do in-house (using the R script i provide).

@eguidotti
Collaborator

Ok I downloaded your repo as a backup, but I hope everything will go smoothly. Thanks again!

@greg-minshall

is it legal, useful, to post to a closed issue?

anyway, @eguidotti, you might look at this issue on my site.

i won't do anything about this soon, but that data set might also appeal to you (instead of my coleaned.csv). i'll be curious of your thoughts.

cheers.

@eguidotti
Collaborator

eguidotti commented Sep 9, 2021

Hi @greg-minshall, thanks for posting this!
It looks quite interesting to me. Not only for the data itself, but also to standardize the Geospatial ID. I guess this would make it much easier for users to match the data by administrative area with external providers.
I'll reopen this as a reminder for me to get it done, or maybe some volunteer shows up :)
Many thanks!

@eguidotti reopened this Sep 9, 2021
@eguidotti changed the title from "Level 2 for Russia in not updating" to "Suggestions for additional datasets" Sep 9, 2021

@greg-minshall

> It looks quite interesting to me. Not only for the data itself, but also to standardize the Geospatial ID. I guess this would make much easier for users to match the data by administrative area with external providers.

yes, i agree. cheers!

@eguidotti
Collaborator

After months of work... it's done! The new version is available. Please see the changelog

@greg-minshall

Emanuele, congratulations. are you still pulling from my data (for states/provinces/oblasts)? just so i can feel un-guilty if/when my builds break... :)

@eguidotti
Collaborator

Hi Greg, I have switched to the JHU unified dataset as you suggested. Many thanks for your package and your input, it has been very useful!

@greg-minshall

good -- enjoy!

3 participants