I handle tables with product information; that is, technical products described by values such as dimensions, weights, diameters, etc.
Depending on the kind of product (and also on overall data quality), some columns are filled and others are empty (see the example table; the original has a huge array of technical data columns):

| product | length | diameter d1 | diameter d2 |
|---|---|---|---|
| tube1 | 100 | 2.5 | ? |
| tube2 | 300 | 2.5 | ? |
| tube2 | 300 | 3.0 | ? |
| bend1 | ? | 2.5 | 3.0 |
| bend2 | 50 | 3.0 | 3.5 |
| bend3 | ? | 3.5 | 4.5 |

I have successfully implemented a workflow that does the following:
Identify those attributes that are useful to differentiate variants of a class of products; therefore, the attributes …
- … must have values for all products of a class (e.g. ‘length’ is useful for tubes, but not for bends)
- … ideally are unique for each member of a class (like d1 and d2 for bends); if that is not the case:
- … two (or more) attributes in conjunction must be unique for each member (see tubes: ‘length’ and ‘d1’ on their own are not sufficient to differentiate all tubes, but in conjunction they are).
This works perfectly so far: for each class I have a set of attributes that are useful for that class.
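
For readers who prefer code, the selection logic described above could be sketched in Python with pandas roughly as follows. This is only an illustration, not the actual KNIME workflow: the `cls` column, the shortened column names `d1`/`d2`, and the helper names `complete_attributes` and `smallest_unique_sets` are all assumptions made for the example.

```python
from itertools import combinations

import pandas as pd

# Example table from above; "cls" is a hypothetical class column,
# and missing values ("?") are represented as None.
df = pd.DataFrame({
    "product": ["tube1", "tube2", "tube2", "bend1", "bend2", "bend3"],
    "cls": ["tube", "tube", "tube", "bend", "bend", "bend"],
    "length": [100, 300, 300, None, 50, None],
    "d1": [2.5, 2.5, 3.0, 2.5, 3.0, 3.5],
    "d2": [None, None, None, 3.0, 3.5, 4.5],
})
ATTRS = ["length", "d1", "d2"]


def complete_attributes(group, attrs):
    """Attributes that have a value for every product in the class."""
    return [a for a in attrs if group[a].notna().all()]


def smallest_unique_sets(group, attrs):
    """Smallest attribute combinations whose values are unique per product."""
    for size in range(1, len(attrs) + 1):
        hits = [
            combo for combo in combinations(attrs, size)
            if not group.duplicated(subset=list(combo)).any()
        ]
        if hits:
            return hits
    return []


useful = {}
for cls, group in df.groupby("cls"):
    complete = complete_attributes(group, ATTRS)
    useful[cls] = smallest_unique_sets(group, complete)

print(useful)
```

For the example data, tubes end up with the single combination of ‘length’ and ‘d1’, while bends get two alternatives, ‘d1’ alone or ‘d2’ alone.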
But now I want to get rid of redundant attributes, to keep the variant axis as lean as possible:
In the example that would be ‘diameter d1’ and ‘d2’ for bends: it would be sufficient to use one of them, since the second does not help to differentiate the variants any further.
My idea was to “Group By” one attribute A and see whether the “UniqueCount” aggregation of a second attribute B yields ‘1’ in every group, meaning that for each individual value of A there is only one value in B (although the values themselves can differ between groups). So in the example, in the class “bends”, this would be the case for diameter d1 compared to d2.
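
Outside of KNIME nodes, the same check can be written as a minimal pandas sketch. The function name `determines` and the small per-class frames are assumptions for illustration; the idea is exactly the group-by plus unique-count test described above.

```python
import pandas as pd

# Per-class slices of the example table (only complete attributes shown).
bends = pd.DataFrame({"d1": [2.5, 3.0, 3.5], "d2": [3.0, 3.5, 4.5]})
tubes = pd.DataFrame({"length": [100, 300, 300], "d1": [2.5, 2.5, 3.0]})


def determines(group, a, b):
    """True if every distinct value of column a maps to exactly one value of b."""
    unique_counts = group.groupby(a)[b].nunique()
    return bool(unique_counts.eq(1).all())


print(determines(bends, "d1", "d2"))      # d1 fixes d2 within bends
print(determines(tubes, "length", "d1"))  # length 300 occurs with two d1 values
```

Note that the check is directional: A determining B does not automatically mean B determines A, so both directions may need testing.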
The problem is that this would have to be done separately within each class and would require nested loops to compare each attribute column with all other attribute columns.
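
If a scripting step is an option, the nested comparison per class is compact to express. The following is a sketch under assumptions, not a tested KNIME solution: the `cls` column, the `useful` dictionary (standing in for the per-class attribute sets I already have), and the helpers `determines` and `drop_redundant` are hypothetical names. It only detects redundancy caused by a single other attribute, not by a combination of attributes.

```python
import pandas as pd

df = pd.DataFrame({
    "product": ["tube1", "tube2", "tube2", "bend1", "bend2", "bend3"],
    "cls": ["tube", "tube", "tube", "bend", "bend", "bend"],
    "length": [100, 300, 300, None, 50, None],
    "d1": [2.5, 2.5, 3.0, 2.5, 3.0, 3.5],
    "d2": [None, None, None, 3.0, 3.5, 4.5],
})

# Assumed result of the existing workflow: useful attributes per class.
useful = {"tube": ["length", "d1"], "bend": ["d1", "d2"]}


def determines(group, a, b):
    """True if every distinct value of column a maps to exactly one value of b."""
    return bool(group.groupby(a)[b].nunique().eq(1).all())


def drop_redundant(group, attrs):
    """Keep an attribute only if no already-kept attribute determines it."""
    kept = []
    for b in attrs:
        if not any(determines(group, a, b) for a in kept):
            kept.append(b)
    return kept


# Outer loop over classes; inner loops compare each attribute with the kept ones.
lean = {
    cls: drop_redundant(df[df["cls"] == cls], attrs)
    for cls, attrs in useful.items()
}
print(lean)
```

Because attributes are checked only against those already kept, two attributes that determine each other (like d1 and d2 for bends) are not both dropped; the first one in the list survives, so the list order acts as a priority.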
I tried the “Column List Loop Start” node, but it seems the selected columns are not accessible inside the loop (apart from the column currently being iterated).
I also looked into the Java Snippet and Column Expressions nodes to do the loop processing, but couldn’t manage to access an array variable (containing the list of column names to test) from inside those nodes (only flow variables).
Is there any example or best practice out there to handle this “compare-all-columns-of-a-set-of-columns”-scenario?