Data Cleansing and Integration using KNIME

Hi,

I am new to KNIME and have to conduct data cleansing and integration on a dataset. Can anyone help with this? Links, tutorials and, most importantly, examples would be deeply appreciated.

If I posted in the wrong thread, I apologize sincerely.

 

Hi,

Can you be a bit more specific? KNIME allows you to do pretty much everything with the data (inventing non-existent data included :D), so the key to precise help is in knowing what exactly you need to do.

If you have to find out as you go along (nothing wrong with that), you'd be best served by trying the "Learning" link in the menu above.

Cheers
E

Hi,

Thanks for responding.  I have to select a dataset from this page:

http://webarchive.nationalarchives.gov.uk/20160105160709/http://www.ons.gov.uk/ons/about-ons/business-transparency/freedom-of-information/what-can-i-request/published-ad-hoc-data/census/housing-and-accommodation/index.html

From here, I have to pick one dataset and perform data cleansing and integration on it. Can you advise on how to go about this?

Hi,

This does sound a bit like a homework assignment...? That said, you should probably pick the one you can best relate to and/or have the greatest interest in, and then go through the clean-up process roughly described here:

https://en.wikipedia.org/wiki/Data_quality

So you'd be looking at doing something about missing values, outliers, etc. The missing values problem has a node of the same name, but outliers will need to be filtered out against a set threshold with row filters, etc. Sorry, but there's no set way of doing things - this requires you to apply any knowledge you have of the subject matter of the dataset in question, or anything you can reasonably infer from the data.
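If it helps to see those two steps spelled out in code rather than nodes, here is a rough Python sketch: fill missing values, then filter outliers against a fixed threshold. The column name, the sample rows, the median fill and the 10,000 cut-off are all made up for illustration; this is not what the KNIME nodes do internally, and the right threshold depends entirely on your dataset.

```python
from statistics import median

# Hypothetical sample rows; None marks a missing value.
rows = [
    {"area": "A", "households": 120},
    {"area": "B", "households": None},
    {"area": "C", "households": 135},
    {"area": "D", "households": 99999},  # suspiciously large value
    {"area": "E", "households": 110},
]

def fill_missing(rows, column):
    """Replace missing values in `column` with the median of the known values."""
    known = [r[column] for r in rows if r[column] is not None]
    fill = median(known)
    return [{**r, column: fill if r[column] is None else r[column]} for r in rows]

def drop_outliers(rows, column, lower, upper):
    """Keep only rows whose value lies within [lower, upper] (inclusive)."""
    return [r for r in rows if lower <= r[column] <= upper]

filled = fill_missing(rows, "households")
# The bounds are an assumption; pick them from what you know about the data.
cleaned = drop_outliers(filled, "households", lower=0, upper=10_000)
```

Note that the median here is computed before the outlier is removed; whether you filter first or fill first is one of those judgment calls that depends on the data.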

Do you actually know what is meant by the "integration" task in your context? The term typically refers to joining up data from different sources and writing the result out to an analytical table - or alternatively to preparing data for a predictive model. You should probably find that out...
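Assuming "integration" does mean joining data from different sources, the idea can be sketched in a few lines of Python. The two tables and the "area_code" key below are hypothetical; in KNIME you would do this with a joiner-type node rather than code, and you would have to decide what to do with rows that have no match.

```python
# Two hypothetical sources that share an "area_code" key.
tenure = [
    {"area_code": "E01", "owned": 540},
    {"area_code": "E02", "owned": 310},
    {"area_code": "E03", "owned": 275},
]
population = [
    {"area_code": "E01", "residents": 1500},
    {"area_code": "E02", "residents": 980},
]

def inner_join(left, right, key):
    """Combine rows from both tables that share the same value in `key`."""
    # Assumes `key` is unique within `right`.
    index = {r[key]: r for r in right}
    return [{**l, **index[l[key]]} for l in left if l[key] in index]

# E03 has no match in `population`, so an inner join drops it.
joined = inner_join(tenure, population, "area_code")
```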

Cheers
E

Hi,

In response to your PM (came out badly formatted, hence reply in public): the "include by attribute value" option is what you're looking for, combined with "use range checking". "Lower bound" set to 1 ensures positive values for integer columns, for example.
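For anyone else reading along, the logic of that setting is roughly the following, written as a small Python sketch. The function name, the "rooms" column and the sample rows are invented for illustration, and I'm assuming the bounds are inclusive, which is what "lower bound 1 gives positive integers" implies.

```python
def include_by_range(rows, column, lower=None, upper=None):
    """Keep rows whose value in `column` lies within the bounds.

    Bounds are treated as inclusive (an assumption); None means unbounded.
    """
    kept = []
    for row in rows:
        value = row[column]
        if lower is not None and value < lower:
            continue
        if upper is not None and value > upper:
            continue
        kept.append(row)
    return kept

# Hypothetical integer column with some non-positive entries.
rows = [
    {"id": 1, "rooms": 3},
    {"id": 2, "rooms": 0},
    {"id": 3, "rooms": -2},
    {"id": 4, "rooms": 1},
]

# A lower bound of 1 keeps only the positive integer values.
positive = include_by_range(rows, "rooms", lower=1)
```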

Cheers
E

I'd add a general reply into the mix: the KNIME books (Beginner's Luck, KNIME Essentials, etc.) also do a very good job of introducing the data manipulation features of KNIME (in particular how flow variables and loops work, which is poorly documented within KNIME itself). For as little as 25 bucks, it's almost a steal :-) Then there is the examples server, which might introduce you to this or that neat little feature.