Is it possible to find and replace individual items in an array (also known as a list or set),
be it in a cell or in a variable?
If there were a way to find/replace against a dictionary table, that would be extra cool!
In the replacement nodes that I tried, array cells or variables don’t even show up to be used.
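For illustration only, here is a minimal plain-Python sketch of what a dictionary-based find/replace on a list could look like, independent of any particular node. The function name `replace_items` and the column `length_max` are made up for the example; only the two `diameter_*` names come from this thread.

```python
def replace_items(items, dictionary):
    """Return a new list in which every item found in `dictionary` is replaced."""
    # dict.get falls back to the item itself when there is no dictionary entry.
    return [dictionary.get(item, item) for item in items]


# Hypothetical list of column names ("length_max" is invented for the example).
columns = ["diameter_d1_max", "diameter_d2_max", "length_max"]

# Dictionary: the key is the item to find, the value is its replacement.
lookup = {"diameter_d2_max": "diameter_d1_max"}

replaced = replace_items(columns, lookup)

# Replacing can create duplicates; dict.fromkeys removes them and keeps the order.
deduplicated = list(dict.fromkeys(replaced))
```

After the replacement, `deduplicated` holds each remaining name once, in its original order.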
I need to differentiate between product variants by their features (columns of technical data), and I am tasked to use a) as many features as needed, but b) as few as possible.
I have implemented everything to achieve a) (see the other post mentioned; that workflow is too complex to post), but now I need to optimize and reduce.
The hint from @elsamuel in that thread gave me the idea to use the Linear Correlation node on my columns of product features, which gives me a list of column pairs with their correlation value (see example below).
I filter this list to get the pairs with high correlation and want to use that list as a dictionary to sort out and replace “redundant” features from my variant axis (keeping only one column out of each group of closely correlated columns).
So I thought of using that list as a dictionary on an array of my column names to say, e.g.:
“If there is a value in column ‘diameter_d1_max’, then I don’t need the column ‘diameter_d2_max’, because it likely doesn’t contribute to further differentiation of the product variants”
This way I would shrink the array (or list) of column names, and this list will finally be concatenated into a string that defines the product model in the receiving system; however, this is technically limited to a maximum of 5 features (or columns) per variant dimension axis.
By the way, once that works, I will have to think of some mechanism to prioritize which features to keep and which ones to drop in case of redundancy, but that comes later.
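As a rough sketch of the idea described above, the following plain Python takes a list of column pairs with their correlation values (as the filtered correlation list might look), drops the lower-priority column of every strongly correlated pair, caps the result at 5 features, and joins the names into one string. Everything except the two `diameter_*` names and the limit of 5 is an assumption made for the example: the function name `prune_correlated`, the threshold of 0.9, the columns `length_max` and `weight`, the correlation values, the `|` separator, and the rule that the order of the list expresses priority.

```python
def prune_correlated(columns, correlated_pairs, threshold=0.9, max_features=5):
    """Drop lower-priority columns that correlate strongly with a kept column.

    `columns` is assumed to be ordered by priority (most important first).
    `correlated_pairs` is a list of (column_a, column_b, correlation) tuples.
    """
    # Build a symmetric lookup: column -> set of strongly correlated partners.
    partners = {}
    for col_a, col_b, corr in correlated_pairs:
        if abs(corr) >= threshold:
            partners.setdefault(col_a, set()).add(col_b)
            partners.setdefault(col_b, set()).add(col_a)

    kept = []
    for col in columns:
        # Keep a column only if none of its partners has been kept already.
        if not partners.get(col, set()) & set(kept):
            kept.append(col)

    # The receiving system allows at most 5 features per variant axis.
    return kept[:max_features]


# Hypothetical input data, invented for the example.
columns = ["diameter_d1_max", "diameter_d2_max", "length_max", "weight"]
pairs = [
    ("diameter_d1_max", "diameter_d2_max", 0.98),
    ("diameter_d1_max", "length_max", 0.12),
    ("length_max", "weight", -0.95),
]

kept = prune_correlated(columns, pairs)

# The separator is an assumption; the real format depends on the receiving system.
model_key = "|".join(kept)
```

Because earlier columns win, reordering `columns` is one simple way to express which feature should survive when two of them are redundant.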
@roberting Maybe I’m completely wrong, but your problem sounds to me like the opposite of an anonymization one. If I understand correctly, you are looking for the (minimum) subset of attributes which maximizes the differentiation between products. Subsets of this kind are called “quasi-identifiers”, because they can distinguish most of the items (whereas a key uniquely identifies each item). If the number of attributes (columns) is not very high, you could try the Anonymity Assessment node (part of the Redfield Privacy Nodes). It calculates Distinction and Separation for each possible combination of attributes (i.e. subset). If you have k columns, that means nearly 2^(k+1) calculations, so you’d better keep the number of columns (and rows) involved in the calculation to a minimum. If you sort the resulting table by Separation (descending) and number of attributes in the subset (ascending), you can select the subset(s) with the highest Separation and the lowest number of attributes.
Here’s the output of an Anonymity Assessment node (AAN). The columns are filtered out of a FIFA dataset, just for the example. We can see that all the subsets shown are equally good at separating the rows (in fact, they are keys), but the first 6 are those with the minimum number of attributes.
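To make the two measures more concrete, here is a small plain-Python sketch that mimics the approach on a toy table. It is not the node's implementation. It assumes that Distinction is the share of distinct value combinations among all rows and that Separation is the share of row pairs that differ in at least one attribute of the subset; the node's exact definitions may differ. The function names and the sample rows are invented for the example, and the table needs at least two rows.

```python
from itertools import combinations


def distinction(rows, subset):
    """Share of distinct value combinations among all rows (assumed definition)."""
    projected = [tuple(row[col] for col in subset) for row in rows]
    return len(set(projected)) / len(projected)


def separation(rows, subset):
    """Share of row pairs that differ in at least one attribute (assumed definition)."""
    projected = [tuple(row[col] for col in subset) for row in rows]
    pairs = list(combinations(projected, 2))
    return sum(a != b for a, b in pairs) / len(pairs)


def rank_subsets(rows, columns):
    """Evaluate every non-empty subset of `columns` and sort the results."""
    results = []
    for size in range(1, len(columns) + 1):
        for subset in combinations(columns, size):
            results.append(
                (subset, distinction(rows, subset), separation(rows, subset))
            )
    # Highest Separation first, then the fewest attributes.
    results.sort(key=lambda r: (-r[2], len(r[0])))
    return results


# Toy product table, invented for the example.
rows = [
    {"diameter": 10, "length": 50, "material": "steel"},
    {"diameter": 10, "length": 60, "material": "steel"},
    {"diameter": 12, "length": 50, "material": "brass"},
    {"diameter": 12, "length": 60, "material": "brass"},
]

ranking = rank_subsets(rows, ["diameter", "length", "material"])
best_subset, best_distinction, best_separation = ranking[0]
```

With 3 columns there are 7 non-empty subsets to evaluate; the count doubles with every additional column, which is why the number of columns should be kept small.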
Thanks @duristef,
I will have a look at this. Typically I have between 10 and 30 attribute columns, but not all of them apply to all product classes of the table. I will see how the node can cope with this.