Sample data (repositories) for teaching

stelfrich · April 3, 2019, 7:21am

Dear all,

We had several lively discussions at the KNIME Spring Summit about sample data that can be used for teaching. Those discussions revealed a great need in the educators' community.

Since I personally don’t think that starting another repository for sample data is a great idea, I’d like to propose coming up with a selection of resources that you use when teaching data science (which should somehow also be importable into KNIME). Chime in if you know of any resources and I will compile them into a list at the end of this post.

Best,
Stefan

mlauber71 · April 3, 2019, 9:22pm

This is a very good idea. For time series there are two sources (previously hosted by Rob J Hyndman):

https://datamarket.com/data/list/?q=provider:tsdl
(unfortunately I just read that DataMarket has been acquired by Qlik and access will apparently be restricted)

ScottF · April 4, 2019, 12:56am

I believe Michael mentioned data.world during the academics meeting. They have a nice archive of freely available data organized by type - for example, finance, census, sports, and several others. They also have an integration with KNIME, which is handy.

stelfrich · April 4, 2019, 10:08am

Thanks for clearing up my personal confusion! I was under the impression that data.world is more on the private data deposition side of things…

I have silently added https://openml.org to my first post since I was super impressed by what they have built. Unfortunately, the last commit to the integration was in 2014…

DemandEngineer · October 31, 2020, 12:06am

Other than this thread… is there a way to make all of these sources more accessible directly in KNIME? Perhaps organized by type of use (General, Business, Bio, Chem, etc.) and type of data (time series, etc.)… and by whether a dataset is part of a series (company table, contact table, activity table, which have keys to join on). I’m likely hoping for too much…

mlauber71 · October 31, 2020, 1:08pm

I started a series of meta (link) collections about several topics. Of course someone could start one with links to suitable data collections, such as those offered by Kaggle or UCI (https://archive.ics.uci.edu/ml/datasets.php). You have to keep in mind the policies that some sites have about how you are allowed to use and host their data.

stelfrich · November 2, 2020, 4:21pm

Hi @DemandEngineer,

We have (internally) talked about this topic some more in the last couple of weeks. We have come up with the idea of augmenting datasets with descriptions of tasks that they can be used for and some sample workflows. And while our idea is more about collecting challenges/small projects, the result could look pretty similar to what you are describing.
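
To make the idea of annotated datasets a bit more concrete, here is a rough Python sketch of what one catalog record and a simple lookup might look like. Everything in it is an assumption for illustration: the field names, the `find` helper, and the example entries (including their URLs) are hypothetical and do not describe any existing KNIME feature or real dataset.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class DatasetEntry:
    """One catalog record; all field names are illustrative assumptions."""
    name: str
    url: str
    domain: str                              # e.g. "General", "Business", "Bio", "Chem"
    data_type: str                           # e.g. "time series", "tabular"
    license: str                             # usage/redistribution terms, checked per source
    tasks: Tuple[str, ...] = ()              # tasks the data can be used for
    sample_workflows: Tuple[str, ...] = ()   # links to example workflows


def find(catalog: Sequence[DatasetEntry],
         domain: Optional[str] = None,
         data_type: Optional[str] = None,
         task: Optional[str] = None) -> list:
    """Return the entries that match every criterion that is not None."""
    return [
        entry for entry in catalog
        if (domain is None or entry.domain == domain)
        and (data_type is None or entry.data_type == data_type)
        and (task is None or task in entry.tasks)
    ]


# Placeholder entries; names and URLs are made up for this sketch.
CATALOG = [
    DatasetEntry(
        name="example-sales",
        url="https://example.org/sales.csv",
        domain="Business",
        data_type="time series",
        license="unknown",
        tasks=("forecasting",),
    ),
    DatasetEntry(
        name="example-molecules",
        url="https://example.org/molecules.csv",
        domain="Chem",
        data_type="tabular",
        license="unknown",
        tasks=("classification", "clustering"),
    ),
]

business_series = find(CATALOG, domain="Business", data_type="time series")
clustering_sets = find(CATALOG, task="clustering")
```

Keeping the license as an explicit field means the usage and hosting policies of each source stay visible next to the link itself.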

One major part of the idea is to wrap data into components so that you can easily re-use them. This is hindered, however, by the limitation that components are wiped of their data when they are added to a new workflow. This could be circumvented, though, by pulling the data from a central source (like data.world for instance).
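
As a minimal sketch of "pulling the data from a central source", the snippet below downloads a CSV file from a URL at run time instead of storing the data alongside the workflow. It uses only the Python standard library; the function name `fetch_table` is hypothetical, and how such a step would be wired into a component is left open here.

```python
import csv
import io
from urllib.request import urlopen


def fetch_table(url, encoding="utf-8", timeout=30):
    """Download a CSV file from `url` and return its rows as a list of dicts.

    The first line of the file is assumed to be the header row.
    """
    with urlopen(url, timeout=timeout) as response:
        text = response.read().decode(encoding)
    return list(csv.DictReader(io.StringIO(text)))


# Usage (the URL is a placeholder, not a real dataset location):
# rows = fetch_table("https://example.org/datasets/sample.csv")
# print(rows[0])
```

Because the data is fetched on every run, nothing has to be stored inside the component, at the cost of needing network access and a stable URL.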

But the general issue of licensing and sharing policies that @mlauber71 has pointed out still remains. We do, however, have some data for which we can figure out the licensing and redistribution quite quickly.

Best,
Stefan