File Reader - Problem with loading the Corona datasets

Hi there,

I have never used data from the web before, so this is new to me. I'm trying to load
the Novel Coronavirus (COVID-19) Cases Data Sets
located at
https://data.humdata.org/dataset/novel-coronavirus-2019-ncov-cases

The download links for the three time series look like this:
https://data.humdata.org/hxlproxy/api/data-preview.csv?url=https%3A%2F%2Fraw.githubusercontent.com%2FCSSEGISandData%2FCOVID-19%2Fmaster%2Fcsse_covid_19_data%2Fcsse_covid_19_time_series%2Ftime_series_19-covid-Confirmed.csv&filename=time_series_2019-ncov-Confirmed.csv

I'm trying to use the File Reader node to load those series into KNIME, but I have no clue how to turn the link into something the File Reader can handle.

After 2 hours of reading forum topics and surfing the web, I hope one of you guys
has a hint for me.

Best regards
Mat

Hi @d4t4v1z

I would read the data directly from the Johns Hopkins GitHub: https://github.com/CSSEGISandData/COVID-19

The file you linked would then be: https://github.com/CSSEGISandData/COVID-19/raw/master/csse_covid_19_data/csse_covid_19_time_series/time_series_19-covid-Confirmed.csv

And you can insert this link into the File Reader.
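
If you want to see where the HDX download link actually points before pasting anything into KNIME: the link carries the original GitHub address in its `url` query parameter, so it can be decoded mechanically. Here is a minimal Python sketch using only the standard library. The function name `extract_source_url` is made up for illustration and is not part of KNIME or HDX.

```python
from urllib.parse import urlparse, parse_qs


def extract_source_url(proxy_url: str) -> str:
    """Return the percent-decoded value of the `url` query parameter."""
    query = parse_qs(urlparse(proxy_url).query)
    return query["url"][0]


# The download link quoted in the first post, split over several lines.
proxy = (
    "https://data.humdata.org/hxlproxy/api/data-preview.csv"
    "?url=https%3A%2F%2Fraw.githubusercontent.com%2FCSSEGISandData%2FCOVID-19"
    "%2Fmaster%2Fcsse_covid_19_data%2Fcsse_covid_19_time_series"
    "%2Ftime_series_19-covid-Confirmed.csv"
    "&filename=time_series_2019-ncov-Confirmed.csv"
)

source = extract_source_url(proxy)
print(source)
```

The printed address is the raw GitHub location of the same CSV file.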

We already provide a workflow on the Hub that parses this file. You can find it here:

It actually parses a REST API which uses this data as well.

Let me know if I can help you more

Hi Iris,

Thanks for helping out, I will try out both workflow options!

Mat

First of all: Thank you for that workflow. Great visualisation of the COVID-19 pandemic.

One remark on this workflow: I have seen that for some reason the last day is removed from the data in the node “COVID-19 overview” (done by the row filter documented as “remove last day”). So as of today, yesterday’s data is not visualised. Any clue why this is done?

In addition to the visualisation, I would like to compare countries with regard to their growth rate. So I would like to align the data to the (typically different) starting dates of these countries. Would it be possible to add that feature?

All fame goes to @paolotamag he made this :slight_smile:
We need to ask him.

Hi @knimediger,

First Question: …So as of today, yesterday’s data is not visualized. Any clue why this is done?

This API updates every hour by checking for new rows in yet another source: a GitHub repository maintained by Johns Hopkins University. It often happens that the last day in the dataset is missing new cases / deaths / recovered for some countries but not for others. When this is the case, I think the visualization of the last day is deceiving, as it shows only partial data, so we preferred to take it out. If you also want to display the most recent day, feel free to remove the row filter.
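
For readers who think more easily in code than in nodes, the idea behind a "remove last day" filter can be sketched in a few lines of Python. This is only an illustration of the logic, not the workflow's actual implementation; the function name, the field names and the numbers are invented, and dates are assumed to be ISO strings (YYYY-MM-DD) so that they sort correctly as text.

```python
def drop_last_day(rows):
    """Remove every row whose date equals the most recent date in the data."""
    if not rows:
        return []
    last_day = max(row["date"] for row in rows)
    return [row for row in rows if row["date"] != last_day]


# Made-up example rows; country B has not reported yet for the last day.
rows = [
    {"date": "2020-03-20", "country": "A", "cases": 100},
    {"date": "2020-03-20", "country": "B", "cases": 40},
    {"date": "2020-03-21", "country": "A", "cases": 130},
]

complete_days = drop_last_day(rows)
print(complete_days)
```

Only the rows for 2020-03-20 remain, which is the last day on which both countries have a value.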

Second Question: …So I would like to align the data to (typically different) starting date of these countries. Would it be possible to add that feature?

We are already doing this in the last line plot in the last view. Check out this Twitter post to see what it looks like:

We use the first day with at least 20 cases as the start date, but if you want to change that, find the row filter in the last component, on top of the Line Plot node “line shifted”, and change “20” to “1”.
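
As a rough sketch of the alignment logic described above (not the component's actual implementation), each country's series is cut so that it starts on the first day that reaches the threshold. The function name `align_to_threshold`, the country labels and all numbers below are invented for illustration.

```python
def align_to_threshold(daily_totals, threshold=20):
    """Return the series starting at the first value that is >= threshold."""
    for index, value in enumerate(daily_totals):
        if value >= threshold:
            return daily_totals[index:]
    return []  # the threshold was never reached


# Made-up cumulative case counts, one value per calendar day.
series = {
    "A": [0, 5, 25, 60, 130],
    "B": [1, 2, 8, 19, 22, 40],
}

aligned = {name: align_to_threshold(values) for name, values in series.items()}
print(aligned)
```

After alignment, position 0 in every list means "first day with at least 20 cases", so the curves can be plotted against days since that point instead of calendar dates. Passing `threshold=1` corresponds to changing “20” to “1”.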

We wrote a nice article in Towards Data Science magazine. You can find it here:

https://towardsdatascience.com/following-the-spread-of-coronavirus-23626940c125

All the best,
Paolo

@paolotamag Thank you for your fast support to add this feature to the visualisation.
Now it is much easier to see the pandemic behaviour in the different countries.

But getting requests resolved creates new ideas: What about normalising the cases by country size (population)? Think about China (1.4 billion people) vs. Italy (60 million people). That should make a difference in the total number of cases, but unfortunately it does not.

Hi @knimediger, I did not add anything, it was already there :slight_smile:
Regarding normalization on country population I do agree it would make things more proportionate.
Feel free to:

  • Download the Workflow
  • Download a table from the internet with population of each country (I do not think it’s provided by the sources I have been using but you can easily find a csv via Google)
  • Blend this new source with the data rows in the workflow using a Joiner node on Country right before the first Component
  • Divide each double column by the value of the new column “Population” using this Math Formula (Multi Column)
  • Make sure the new table header is unchanged
  • Visualize the new normalized data in the line plot by simply re-executing the components
  • Reshare the enhanced workflow (mentioning Normalization in the title) on your KNIME Hub space and give us the link here!
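
The join-and-divide steps above can be sketched in plain Python to show the arithmetic. This is an illustration under assumptions, not the KNIME nodes themselves: the function name is invented, the case counts are made up, the populations are the rounded figures mentioned earlier in this thread, and scaling to cases per 100,000 inhabitants is my own choice to keep the numbers readable.

```python
def normalize_by_population(cases_by_country, population_by_country, per=100_000):
    """Join on country code and divide each case count by the population."""
    normalized = {}
    for code, series in cases_by_country.items():
        population = population_by_country.get(code)
        if not population:
            continue  # no match in the population table, like an inner join
        normalized[code] = [value / population * per for value in series]
    return normalized


# Made-up case counts keyed by 2-letter country code.
cases = {"CN": [1400, 2800], "IT": [60, 600], "XX": [1, 2]}
# Rounded populations as quoted in this thread.
population = {"CN": 1_400_000_000, "IT": 60_000_000}

per_100k = normalize_by_population(cases, population)
print(per_100k)
```

The entry "XX" has no population and is dropped, which is worth checking for in the real data so that no country silently disappears after the Joiner.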

That would be super cool!
Cheers
Paolo

Hello @paolotamag, thanks for the invite to contribute to this great project.

I did some research and found population data on the UN website: data.un.org/Handlers/DownloadHandler.ashx?DataFilter=variableID:12;timeID:84;varID:3&DataMartId=PopDiv&Format=csv&c=2,4,6,7&s=_crEngNameOrderBy:asc,_timeEngNameOrderBy:desc,_varEngNameOrderBy:asc

But I’ve faced some challenges.

My first problem is that I was not able to understand the license this data is published under. So I’m not sure whether it’s allowed to use this data for this purpose.

Nevertheless I tried to follow your instructions. Thank you very much for guiding a novice.
But because of the data I’m already struggling with the first step of joining the two tables.
The UN data uses country names which do not appear in the same way in your ISO table (e.g. the UN table uses just “Afghanistan” instead of “Afghanistan, Islamic Republic of”). I’m sure there is a quite easy way to manage this issue. But I’m already at my wit’s end.

Regarding the license question for this country population dataset, take a look here.

"Terms of Use: All data and metadata provided on UNdata’s website are available free of charge and may be copied freely, duplicated and further distributed provided that UNdata is cited as the reference."

Just add the link to UNdata in the workflow metadata using the description panel in the KNIME Analytics Platform. To learn how to do that, go here and scroll to “Workflow Metadata Editor”. This way you reference them on the Hub page and you are on the safe side.

Regarding the joining operation… I had the same issue when finding the continent names for each country. Country names can differ quite a bit. So do not use country names, use their 2-letter codes! “IT” stands for “Italy”! Join on such code columns, and find another population-by-country table with such codes if yours does not have any!

:slight_smile: Thanks for contributing

Cheers
Paolo

The UN uses 3-digit numeric country codes, known as M49.
These can be converted into readable data using a table that can be downloaded here:
https://unstats.un.org/unsd/methodology/m49/overview/
That table also contains the ISO alpha-3 codes, but NOT the alpha-2 codes that most people think of as country codes due to their use in domain extensions.

Edit: I see now that a similar table is already used in the workflow, this one stemming from datahub.io.
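
To make the code chain concrete, here is a small Python sketch of going from an M49 code to the 2-letter code suggested earlier as the join key. It assumes two lookup tables: one from M49 to alpha-3 (as in the UN table) and a second one from alpha-3 to alpha-2, which has to come from another source. Only three countries are included and the names `M49_TO_ALPHA3`, `ALPHA3_TO_ALPHA2` and `m49_to_alpha2` are invented for this example.

```python
# Excerpt of an M49 -> ISO alpha-3 table. Codes are kept as strings because
# leading zeros are part of the code ("004", not 4).
M49_TO_ALPHA3 = {"004": "AFG", "156": "CHN", "380": "ITA"}

# Excerpt of an ISO alpha-3 -> alpha-2 table from a separate source.
ALPHA3_TO_ALPHA2 = {"AFG": "AF", "CHN": "CN", "ITA": "IT"}


def m49_to_alpha2(code):
    """Translate an M49 code (string or integer) to ISO alpha-2, or None."""
    key = str(code).zfill(3)  # restore leading zeros lost in numeric columns
    alpha3 = M49_TO_ALPHA3.get(key)
    return ALPHA3_TO_ALPHA2.get(alpha3)


print(m49_to_alpha2("380"))
print(m49_to_alpha2(4))
```

The `zfill` step matters when a CSV reader has parsed the M49 column as integers, since "004" then arrives as 4 and would not match the table.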

Hello,

Have you updated the workflow with the normalization?

Johnny

Hi @Johnny_Gel, I will write on this thread when I do :slight_smile:

https://forum.knime.com/t/covid-19-live-visualization-using-guided-analytics

Don’t forget to join us for the Webinar on this workflow!

Register here:

https://www.knime.com/about/events/visualizing-the-spread-ofcovid-19-pandemic-with-knime-online-apr-7-2020

Hey guys, I added normalization to the workflow and much more.

More info in this thread!

https://forum.knime.com/t/covid-19-live-visualization-using-guided-analytics
