Compare the uniqueness of two value columns per class

roberting · March 25, 2022, 1:24pm

Hi all,

The story

I handle tables with product information; that is, technical products described by values such as measures, weights, diameters, etc.
Depending on the kind of product (and also on overall data quality), some columns are filled and others are empty (see the example table; the original has a huge array of technical data columns):

product   length   diameter d1   diameter d2
tube1     100      2.5           ?
tube2     300      2.5           ?
tube2     300      3.0           ?
bend1     ?        2.5           3.0
bend2     50       3.0           3.5
bend3     ?        3.5           4.5

I have successfully implemented a workflow that does the following:
Identify those attributes that are useful to differentiate variants of a class of products; therefore, the attributes …

… must have values for all products of a class (e.g. ‘length’ is useful for tubes, but not for bends)
… ideally are unique for each member of a class (like d1 and d2 for bends); if that is not the case:
… two (or more) attributes in conjunction must be unique for each member (see tubes: ‘length’ and ‘d1’ are each insufficient to differentiate all tubes, but in conjunction they are).

This works perfectly so far: for each class I have a set of attributes that are useful for that class.
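For readers who want to reproduce the selection rules outside KNIME, here is a minimal sketch in plain Python. It is not the actual workflow, only an illustration of the three rules above. The `class` column, the `MISSING` marker for the `?` cells, and the function names are assumptions made for the example; the third tube row is kept as in the table.

```python
from itertools import combinations

MISSING = None  # assumption: stands in for the '?' cells of the example table

# Example table, with a hypothetical 'class' column added for grouping.
rows = [
    {"class": "tube", "length": 100, "d1": 2.5, "d2": MISSING},
    {"class": "tube", "length": 300, "d1": 2.5, "d2": MISSING},
    {"class": "tube", "length": 300, "d1": 3.0, "d2": MISSING},
    {"class": "bend", "length": MISSING, "d1": 2.5, "d2": 3.0},
    {"class": "bend", "length": 50, "d1": 3.0, "d2": 3.5},
    {"class": "bend", "length": MISSING, "d1": 3.5, "d2": 4.5},
]
ATTRIBUTES = ["length", "d1", "d2"]


def group_by_class(table):
    """Collect the rows of each class, keeping first-seen class order."""
    groups = {}
    for row in table:
        groups.setdefault(row["class"], []).append(row)
    return groups


def complete_attributes(members, attributes):
    """Attributes that have a value for every member of the class."""
    return [a for a in attributes
            if all(m[a] is not MISSING for m in members)]


def smallest_unique_sets(members, attributes):
    """Smallest attribute combinations whose values differ for every member."""
    candidates = complete_attributes(members, attributes)
    for size in range(1, len(candidates) + 1):
        hits = [
            combo for combo in combinations(candidates, size)
            if len({tuple(m[a] for a in combo) for m in members}) == len(members)
        ]
        if hits:
            return hits
    return []  # no combination separates all members


useful = {cls: smallest_unique_sets(members, ATTRIBUTES)
          for cls, members in group_by_class(rows).items()}
print(useful)
```

For the example data this reports the pair (`length`, `d1`) for tubes and the two single attributes `d1` and `d2` for bends, which matches the description above.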

The Problem
But now I want to get rid of redundant attributes, to keep the variant axis as lean as possible:
In the example that would be ‘diameter d1’ and ‘d2’ for bends: it would be sufficient to use one of them, since the second does not help to differentiate the variants any further.

My idea was to “Group By” one attribute and see whether the “UniqueCount” aggregation of a second attribute yields ‘1’ for every group, meaning that for each individual value of A there is only one value in B (although the values themselves can differ from group to group). In the example, in the class “bends”, this is the case for diameter d1 compared to d2.
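The group-and-count idea can be written down as a small check. The following is a sketch in plain Python, not a KNIME node configuration; the function name `determines` and the row dictionaries are made up for the illustration.

```python
def determines(rows, a, b):
    """True if every distinct value of column a occurs with exactly one value of b.

    This mirrors grouping by a and checking that the unique count of b is 1
    in every group.
    """
    values_of_b = {}
    for row in rows:
        values_of_b.setdefault(row[a], set()).add(row[b])
    return all(len(values) == 1 for values in values_of_b.values())


# Rows of the example table, restricted to the columns filled for each class.
bends = [
    {"d1": 2.5, "d2": 3.0},
    {"d1": 3.0, "d2": 3.5},
    {"d1": 3.5, "d2": 4.5},
]
tubes = [
    {"length": 100, "d1": 2.5},
    {"length": 300, "d1": 2.5},
    {"length": 300, "d1": 3.0},
]

print(determines(bends, "d1", "d2"))      # d2 adds nothing once d1 is known
print(determines(tubes, "length", "d1"))  # length 300 occurs with two d1 values
```

Note that the check is directional: `determines(rows, a, b)` says that b is redundant once a is used, which does not automatically hold the other way round.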

The problem is that this would have to be done separately within each class and would require nested loops to compare each attribute column with all other attribute columns.
I tried the “Column List Loop Start” node, but it seems the selected columns are not accessible inside the loop (apart from the column currently being iterated).
I also looked into the Java Snippet and Column Expressions nodes to do the loop processing there, but couldn’t manage to access an array variable (containing the list of column names to test) from inside those nodes (only flow variables).
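To make the nested comparison concrete, here is a sketch of the per-class, all-pairs loop in plain Python. It only illustrates the logic, it does not show how to wire it up with KNIME loop nodes. The `class` column, the `attributes_by_class` mapping (the per-class attribute sets from the earlier step) and the function names are assumptions for the example.

```python
from itertools import permutations


def determines(rows, a, b):
    """True if every distinct value of a occurs with exactly one value of b."""
    values_of_b = {}
    for row in rows:
        values_of_b.setdefault(row[a], set()).add(row[b])
    return all(len(values) == 1 for values in values_of_b.values())


def redundant_pairs(rows, class_key, attributes_by_class):
    """List (class, a, b) for every ordered pair where a determines b."""
    groups = {}
    for row in rows:
        groups.setdefault(row[class_key], []).append(row)
    found = []
    for cls, members in groups.items():
        # Ordered pairs, because the check is directional.
        for a, b in permutations(attributes_by_class[cls], 2):
            if determines(members, a, b):
                found.append((cls, a, b))
    return found


rows = [
    {"class": "tube", "length": 100, "d1": 2.5, "d2": None},
    {"class": "tube", "length": 300, "d1": 2.5, "d2": None},
    {"class": "tube", "length": 300, "d1": 3.0, "d2": None},
    {"class": "bend", "length": None, "d1": 2.5, "d2": 3.0},
    {"class": "bend", "length": 50, "d1": 3.0, "d2": 3.5},
    {"class": "bend", "length": None, "d1": 3.5, "d2": 4.5},
]
# Hypothetical result of the earlier step: attributes already known to be useful.
attributes_by_class = {"tube": ["length", "d1"], "bend": ["d1", "d2"]}

pairs = redundant_pairs(rows, "class", attributes_by_class)
print(pairs)
```

For the example data only the bends produce hits, once in each direction, so either d1 or d2 could be dropped there. With many attribute columns the number of ordered pairs grows quadratically, which may matter for wide tables.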

Is there any example or best practice out there to handle this “compare-all-columns-of-a-set-of-columns”-scenario?
Cheers!
Robert

elsamuel · March 25, 2022, 2:22pm

There are a couple of easy options that come to mind for this kind of dimensionality reduction.

The first is:

The second is to use both of the following:

roberting · March 28, 2022, 8:50am

Thanks @elsamuel,

I must admit that I haven’t looked much into the statistical functions of KNIME yet. These look good for numerical values, but I omitted from my examples that I also have to handle columns with alphanumerical codes, e.g. for colors or materials, which also contribute to the variant differentiation.
But I think it might work if I temporarily map those alphanumerics to numbers, perform these operations, and then map them back to the original values.
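A minimal sketch of that temporary mapping, in plain Python rather than KNIME nodes. The function name `encode` and the color codes are made up for the illustration.

```python
def encode(values):
    """Map each distinct value to an integer code, in order of first appearance.

    Returns the encoded list and a lookup table for mapping the codes back.
    """
    codes = {}
    encoded = [codes.setdefault(value, len(codes)) for value in values]
    decode = {code: value for value, code in codes.items()}
    return encoded, decode


colors = ["RAL9010", "RAL7016", "RAL9010", "RAL3000"]  # hypothetical color codes
encoded, decode = encode(colors)
restored = [decode[code] for code in encoded]

print(encoded)
print(restored == colors)
```

One caveat: the integer codes are arbitrary labels without order or distance. That is harmless for uniqueness checks, but any statistic that treats the column as a real numeric measurement would be interpreting the numbering rather than the data.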

Thanks for the hint!
