Suggestions for additional datasets #135

kirienko · 2020-11-20T18:35:16Z

It seems that the data for Russia at level=2 comes from this repo, which is itself no longer being updated. Of course it's easy to say «That's not our issue but theirs.» But no.

eguidotti · 2020-11-20T18:37:30Z

Have you tried to contact them? Can you suggest other data sources we can use instead? Thanks

kirienko · 2020-11-20T20:00:49Z

No, I didn't try. There is an open issue there. And although the repo has been updated since that issue was opened, it doesn't seem to be done on a regular basis (i.e. in an automated fashion).

I would suggest using this source, which looks more automated. But since I'm by no means affiliated with those people, I cannot guarantee it works better.

kirienko · 2020-11-23T12:37:24Z

Well, it turned out that the main JHU repo has up-to-date level=2 data for Russia. I think it's the best choice.

eguidotti · 2020-11-25T01:13:29Z

Thanks for the suggestion. But unfortunately I'm not able to find the historical data for the regions in the JHU repo (see here).
I'm cross-checking the data against the source you suggested so that we can move to this new one.

eguidotti · 2020-11-30T17:48:37Z

Hi @kirienko. After spending some time, I was not able to validate the data from the other repo. Moreover, I'm afraid it may be discontinued as well in the future. I agree that the best choice would be JHU, since it seems there is no open governmental data for Russia (this is what they said when we tried to contact them some months ago).
I opened an issue in the JHU repo to ask if they are going to add the Russian regions to the time series dataset. That would be great and easy to integrate. Otherwise, we'd need to do that by hand, which would require a lot of work.

kirienko · 2020-12-01T11:06:01Z

Hi @eguidotti. Thank you so much for your efforts! I really appreciate your work!

greg-minshall · 2021-04-18T13:38:38Z

hi. i have been pulling and cleaning up the JHU data, more or less since the beginning. just having been pointed in your direction, i was thinking of converting to your data. but, it's true that the data i've been processing might have some data you want. if you want daily changes, you'd want the columns whose names end in "_Changes_1".

my github repository is here, though i wouldn't want to be the first person other than me trying to actually do a build.

if you're interested, and have any questions, please let me know. cheers!

eguidotti · 2021-04-18T13:51:14Z

Hi @greg-minshall and thanks for your message.
Do you have region-wise data for Russia? That would be quite interesting. Also, are you hosting the data you generate somewhere? I couldn't find them. Thanks!

greg-minshall · 2021-04-18T14:11:53Z

hosting is here. yes, there is oblast, etc., for Russia, for example, here -- assuming that's sort of what you are hoping for.

greg-minshall · 2021-04-18T14:14:52Z

btw: the README.org in my repo had old links; i've just changed that (in case you poke around), hopefully correctly. cheers.

eguidotti · 2021-04-18T16:53:50Z

@greg-minshall Well actually, it looks quite interesting! It seems to me the file I could easily integrate in this repository is the following: https://somenumbers.info/covid-19/csvs/coleaned.csv.gz

Just a couple of questions:

- is the file updated daily?
- is there a link to the cleaning performed on the JHU data?
- do you have a title/citation for your project? I'd need to put this in the data sources if we decide to integrate it.

Thanks!

greg-minshall · 2021-04-18T17:25:13Z

@eguidotti coleaned is basically a cleaned up version of the entire run of JHU "daily" files.

yes, the files are updated daily. i think, in the past six months, there have been only a few glitches, maybe one time when something changed in the JHU data.

i don't have a citation. you can just say "Greg Minshall", some such. or, a pointer to the repo.

if coleaned works, that's great, as it's the smallest file, and you'll be the closest to JHU (in terms of me messing with the data).

the cleaning performed? you can look in covid19.org in the repo (not that you'd want to), in a section colean.R. part (compute_intervals) is dealing with early times when, e.g., Australia (i think) was represented for a while by Australia, then by some of the regions, then back to the country, etc.

then (propagating, but also in colean) deals with making sure that every entity has an entry for every date after it first appears. it is very slow, doing lots of dplyr::group_by(), etc. (at some point i'd like to switch to data.table, partly in the hopes it might be faster; and, actually, it's my memory of the complexity of the cleaning code that gives me pause.)
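
To make the propagation step concrete, here is a minimal Python sketch (not the actual colean.R code, which is R with dplyr). It assumes that a missing day is filled by carrying the entity's last known value forward; the thread only says that every entity gets an entry for every date after it first appears, so the fill rule, the `propagate` name, and the sample numbers are all hypothetical.

```python
from datetime import date, timedelta


def propagate(rows, last_date):
    """Give every entity one row per day from its first appearance to last_date.

    rows is a list of (entity, day, value) tuples. Missing days are filled
    with the most recent earlier value (an assumption, see the lead-in).
    """
    by_entity = {}
    for entity, day, value in rows:
        by_entity.setdefault(entity, {})[day] = value

    filled = []
    for entity in sorted(by_entity):
        observed = by_entity[entity]
        day = min(observed)            # first date the entity appears
        current = observed[day]
        while day <= last_date:
            current = observed.get(day, current)  # carry forward when missing
            filled.append((entity, day, current))
            day += timedelta(days=1)
    return filled


# Hypothetical observations with gaps.
rows = [
    ("Moscow", date(2021, 4, 1), 100),
    ("Moscow", date(2021, 4, 3), 130),
    ("Tatarstan", date(2021, 4, 2), 10),
]
filled = propagate(rows, date(2021, 4, 4))
for entity, day, value in filled:
    print(entity, day.isoformat(), value)
```

A dictionary lookup per entity and day keeps this linear in the number of output rows, which is one way to sidestep repeated grouping passes.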

i also drop some JHU columns

known_excludes <- c("Incidence_Rate", "Case-Fatality_Ratio", "Incident_Rate", "Case_Fatality_Ratio")

as i think i can derive those from the existing data. (though, in fact, i don't.)
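
A small Python sketch of dropping those columns and deriving one of them again. It assumes the case-fatality ratio is a percentage computed as 100 * Deaths / Confirmed; the helper names and the sample row are hypothetical. The incidence rate would additionally need a population figure, so it is left out here.

```python
# Column names copied from the known_excludes vector above.
KNOWN_EXCLUDES = {
    "Incidence_Rate",
    "Case-Fatality_Ratio",
    "Incident_Rate",
    "Case_Fatality_Ratio",
}


def drop_excluded(row):
    """Return a copy of the row without the excluded columns."""
    return {key: value for key, value in row.items() if key not in KNOWN_EXCLUDES}


def case_fatality_ratio(row):
    """Deaths as a percentage of confirmed cases, or None when there are no cases."""
    confirmed = int(row["Confirmed"])
    if confirmed == 0:
        return None
    return 100.0 * int(row["Deaths"]) / confirmed


# Hypothetical row with made-up numbers.
row = {
    "Province_State": "Alabama",
    "Confirmed": "200",
    "Deaths": "5",
    "Case_Fatality_Ratio": "2.5",
}
cleaned = drop_excluded(row)
print(cleaned)
print(case_fatality_ratio(cleaned))
```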

there's filtering: remove duplicates, take only the last observation (Last_Update) from a given date.
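
The filtering rule can be sketched in Python as follows. This is an illustration, not the repo's code: it assumes that "date" means the calendar date of `Last_Update`, that timestamps look like `2021-04-04 22:10:00` (older JHU files use other formats), and that `Combined_Key` identifies the entity. The function name and sample rows are hypothetical.

```python
from datetime import datetime


def last_per_day(rows):
    """Keep, for each (entity, calendar date), the row with the latest Last_Update.

    Exact duplicates share a key, so they collapse into a single row.
    """
    latest = {}
    for row in rows:
        stamp = datetime.strptime(row["Last_Update"], "%Y-%m-%d %H:%M:%S")
        key = (row["Combined_Key"], stamp.date())
        if key not in latest or stamp >= latest[key][0]:
            latest[key] = (stamp, row)
    return [row for _, row in latest.values()]


# Hypothetical rows: two updates on the same day, one exact duplicate.
rows = [
    {"Combined_Key": "Alabama, US", "Last_Update": "2021-04-04 04:20:00", "Confirmed": "500"},
    {"Combined_Key": "Alabama, US", "Last_Update": "2021-04-04 22:10:00", "Confirmed": "510"},
    {"Combined_Key": "Alabama, US", "Last_Update": "2021-04-04 22:10:00", "Confirmed": "510"},
    {"Combined_Key": "Alabama, US", "Last_Update": "2021-04-05 04:20:00", "Confirmed": "520"},
]
kept = last_per_day(rows)
for row in kept:
    print(row["Last_Update"], row["Confirmed"])
```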

i add FIPS and Iso3c columns.

there's also some textual transformations (dosed and friends -- see the file fixups.sed in the repo) on the .csv files, to deal with anomalies early on.

that seems to be about it.

greg-minshall · 2021-04-19T01:58:36Z

i realized in my listing of transformations i missed some bits that used to be in the file fixups.sed, but are now embedded in covid19.org, in a table csvsedtable inside the csvsed header. these mostly normalize names at the Country_Region, Province_State, and Admin2 levels; plus some fiddling with the odd FIPS.
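
A table-driven name normalization of this kind might look like the Python sketch below. The two rules are illustrative guesses based on naming variants seen in early JHU files; they are not taken from csvsedtable, and `NAME_FIXUPS` and `normalize_name` are hypothetical names.

```python
import re

# (pattern, replacement) pairs, applied in order, like a list of sed substitutions.
NAME_FIXUPS = [
    (re.compile(r"^Mainland China$"), "China"),
    (re.compile(r"^UK$"), "United Kingdom"),
]


def normalize_name(name):
    """Trim whitespace, then apply every fixup rule in order."""
    name = name.strip()
    for pattern, replacement in NAME_FIXUPS:
        name = pattern.sub(replacement, name)
    return name


print(normalize_name(" Mainland China "))
print(normalize_name("Russia"))
```

Keeping the rules in one table makes it easy to review them or to apply the same list at the Country_Region, Province_State, and Admin2 levels.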

eguidotti · 2021-04-19T14:37:35Z

Thanks @greg-minshall for the information. I have integrated the data for Russia. Let's wait a couple of hours for the workflow to complete and see if we can close this long-standing issue.

I was also interested in the recovered cases for the USA. But I see only very few observations (dates) for each state. E.g. Alabama has only about 10 observations in https://somenumbers.info/covid-19/csvs/coleaned.csv.gz. Is it the same in the JHU data?

As far as I understand, the other files aggregate the numbers, e.g. they compute the totals for Alabama by summing all entries that include Alabama as the upper level in the combined key. Is that correct? At an early stage of this project, I was also aggregating the data in this way, but then I noticed that it usually doesn't work. In my experience, the aggregated counts almost never matched the data provided directly for the upper level. For instance, if only one city is missing from the data, the aggregated state-wise counts are biased downward. Moreover, the data released for the upper level may include travelers or cases whose exact location is not known. So unfortunately I won't be able to use the aggregated data.
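
The downward bias is easy to reproduce with a toy example. The Python sketch below sums county rows to the state level and compares the result with a separately reported state total; every number is made up and the function name is hypothetical.

```python
from collections import defaultdict


def aggregate_to_upper(rows):
    """Sum lower-level counts into their upper-level unit."""
    totals = defaultdict(int)
    for state, _county, confirmed in rows:
        totals[state] += confirmed
    return dict(totals)


# Hypothetical county rows: one county and all unassigned cases are absent.
county_rows = [
    ("Alabama", "Autauga", 120),
    ("Alabama", "Baldwin", 300),
]
# Hypothetical total published directly for the state.
reported_state_total = {"Alabama": 480}

aggregated = aggregate_to_upper(county_rows)
gap = {
    state: reported_state_total[state] - aggregated.get(state, 0)
    for state in reported_state_total
}
print(aggregated)
print(gap)  # a positive gap means the aggregate undercounts
```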

greg-minshall · 2021-04-19T16:33:11Z

@eguidotti, you're welcome. i hope it helps. let me know.

i only use JHU data.

i think 'Recovered' comes from JHU's csse_covid_19_data/csse_covid_19_daily_reports_us series. i have a (very recent) "issue" in my repo to remove that series. i think they started recording that, then discontinued (probably the data wasn't reliable).

yes, you're right about the aggregation technique. i think when i originally did that work i did some verification. if i look at the JHU data now, for example, for California on 2021-04-04 (csse_covid_19_data/csse_covid_19_daily_reports/04-04-2021.csv), i don't seem to see numbers at the state level, only at the Admin2 (county) level. for Canada, i only see data for the provinces/territories, not for the country as a whole. so, in my experience, there is no "data provided directly for the upper level" (a situation which makes me happy, being a believer in second normal form :).

were you looking at these "daily reports"? or, the more often used "time series" (that's a set of data i don't use so am not familiar with).

eguidotti · 2021-04-20T08:22:23Z

@greg-minshall yes, it works and I'm going to close this issue. Thanks a lot!

> were you looking at these "daily reports"? or, the more often used "time series"

Time series data

> there is no "data provided directly for the upper level"

I guess that's the case for JHU. What I mean by "the data released for the upper level" is data released directly by the government for the upper level (not necessarily in the US, but around the world). In general, when I aggregated data from the lower levels, I never got the counts provided for the upper level. Also, in many cases JHU data (aggregated or not) does not match the data available from open governmental sources. That's basically the motivation behind this repo :) We try to pull the data from the official providers whenever possible. But in many cases it is not possible, and works like yours are very useful!

greg-minshall · 2021-04-20T11:44:28Z

@eguidotti ah, "ground truth", or whatever the saying is. no, i decided early on that for me, JHU == Truth.

btw, i've killed off the embarrassing Recovered, et al.

also, if you ever wanted (as a backup, say, to my build process), probably producing a daily coleaned.csv file would be reasonably easy for you to do in-house (using the R script i provide).

eguidotti · 2021-04-20T13:18:02Z

Ok I downloaded your repo as a backup, but I hope everything will go smoothly. Thanks again!

greg-minshall · 2021-09-09T06:59:23Z

is it legal, useful, to post to a closed issue?

anyway, @eguidotti, you might look at this issue on my site.

i won't do anything about this soon, but that data set might also appeal to you (instead of my coleaned.csv). i'll be curious of your thoughts.

cheers.

eguidotti · 2021-09-09T13:25:40Z

Hi @greg-minshall, thanks for posting this!
It looks quite interesting to me. Not only for the data itself, but also to standardize the Geospatial ID. I guess this would make it much easier for users to match the data by administrative area with external providers.
I'll reopen this as a reminder for me to get it done, or maybe some volunteer shows up :)
Many thanks!

greg-minshall · 2021-09-09T14:58:32Z

> It looks quite interesting to me. Not only for the data itself, but also to standardize the Geospatial ID. I guess this would make it much easier for users to match the data by administrative area with external providers.

yes, i agree. cheers!

eguidotti · 2021-11-09T16:46:22Z

After months of work... it's done! The new version is available. Please see the changelog.

greg-minshall · 2021-11-10T15:44:15Z

Emanuele, congratulations. are you still pulling from my data (for states/provinces/oblasts)? just so i can feel un-guilty if/when my builds break... :)

eguidotti · 2021-11-10T17:29:10Z

Hi Greg, I have switched to the JHU unified dataset as you suggested. Many thanks for your package and your input, it has been very useful!

greg-minshall · 2021-11-10T17:42:42Z

good -- enjoy!

eguidotti mentioned this issue Jan 25, 2021

articles/doc/data #72

Closed

eguidotti closed this as completed Apr 20, 2021

eguidotti reopened this Sep 9, 2021

eguidotti changed the title ~~Level 2 for Russia in not updating~~ Suggestions for additional datasets Sep 9, 2021

eguidotti closed this as completed Nov 9, 2021
