Data Profiling in KNIME

Hi,

Are there any existing Nodes or Extensions for profiling data to assess Data Quality? Something along the lines of what other tools like Ataccama DQ Analyzer or eBay's Griffin DQ Service can do.

Thanks,

JGP

José,

You can always tie in specialist tools via KNIME's extension points, but my usual approach would be to build the quality screening I need in KNIME and ignore everything I don't need. The Statistics node is a great starting point for this, but of course you'd expect commercial specialist DQ tools to go above and beyond that. I don't know them well enough to really assess this, though...
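
As a rough illustration of what a home-built screening step could compute, here is a minimal sketch in pandas. It assumes the table is available as a DataFrame (for example inside a Python scripting node); `profile_columns` and the sample data are hypothetical names made up for this example, not part of KNIME or pandas.

```python
import pandas as pd


def profile_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row of basic quality metrics per column (hypothetical helper)."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),          # declared type of each column
        "missing": df.isna().sum(),              # number of missing cells
        "missing_pct": df.isna().mean() * 100,   # share of missing cells in percent
        "distinct": df.nunique(dropna=True),     # distinct non-missing values
    })


# Made-up sample table, only for demonstration
sample = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "city": ["Oslo", None, "Oslo", "Rome"],
})

report = profile_columns(sample)
print(report)
```

This only covers missing values and cardinality; range checks, pattern checks and the like would have to be added per column as needed.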

-E

If you need a quick & easy workaround, convert the index to a new (1st) column in the DataFrame:

df.reset_index(inplace = True)

then name it:

df.rename(columns = { df.columns[0]: "row_index" }, inplace = True)

If you need to restore the row indices in a downstream Python node:

df.set_index('row_index', inplace = True)
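
Put together, the round trip could look like the sketch below. The sample DataFrame and the variable names `flat` and `restored` are made up for illustration. Note that `rename` and `set_index` return a new DataFrame unless the result is assigned or `inplace = True` is passed.

```python
import pandas as pd

# Made-up table whose row index carries the row IDs
df = pd.DataFrame({"value": [10, 20, 30]}, index=["a", "b", "c"])

# Move the index into a new first column and give it a stable name
flat = df.reset_index()
flat = flat.rename(columns={flat.columns[0]: "row_index"})

# Downstream: turn the column back into the row index
restored = flat.set_index("row_index")
restored.index.name = None  # drop the helper name again

print(flat)
print(restored)
```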

How do we integrate the Ataccama DQ tool via a KNIME extension?

For @Prasanthsk's question, see here: How do I report the data quality issues in some file format like csv, xlsx or pdf etc...

Agreed, would love to understand how (if at all) this can be done. Keen on anyone’s thoughts.

Hello @ben_westphal,

I have never used Ataccama, so I'm having trouble understanding how a KNIME-Ataccama integration would work. Can you explain it a bit more? Also, which of its functionalities are missing in KNIME, if any? (I'm guessing that is the reason why you would like to use both KNIME and Ataccama?)

It’s worth mentioning that the Data Explorer node has features for data profiling in case you haven’t tried it out yet.

Br,
Ivan

I would have to get deeper into it… I suspect KNIME can do everything, it just needs to be built. I liked the Ataccama interface, with insights available immediately. Just looking for shortcuts really! I’ll check out the Data Explorer node, thanks.

Ben