3-character option code analysis

I have a table with one cell that contains all the available option codes for parts of a final product. I need to analyze all the option codes for each component and suggest a combination to decrease the parts quantity. It is OK if some option codes end up being deleted; that is the main idea: complexity reduction. The option codes are 3-character strings such as "AY0", "C50" and so on. So from a cell containing a few dozen option codes I need to extract the best combination, the one that covers the most frequent options. It seems that creating a bag of words and using the TF node is an option, but I'm not sure. Does anyone have any suggestions? Thanks!

Hello @marcelo1983 and welcome to the KNIME community

Could you provide a dummy example of what your input looks like, and an expected output as well?

This will help us focus the possible solutions in the right direction from the beginning.

Thanks in advance :+1:

1 Like

Hello @gonhaddock , thanks for your feedback.

Here is a sample of my data after some processing:

So in the RESTR STRING column I have all the option codes applicable to a given PART. My goal is to find the optimal option code combination that covers the most frequent occurrences.

For example, I have all these available option codes for part 11546377: &MSB., ?, &MM1., &MM3., &JA3., &L4H&C60/C68., &L4G&C60/C68., &Q8E., &L4K&C60/C68., &L4H/L4K&C60/C68., &MSB/MSD., &C60., &C25., &JM4., &LI3/LJ1&MM1., &LI3.

My expected outcome is the “optimal” combination of these codes, let’s say: L4H C60 MM1

The punctuation can be ignored. If KNIME could return 2 to 3 suggested combinations, it would be amazing.
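A minimal sketch of that extraction step in JavaScript, assuming codes are always exactly three alphanumeric characters delimited by `&`, `/`, `.` or `,` (as in the sample above) — the sample string is a fragment of the codes listed in the post:

```javascript
// Hypothetical sketch: extract 3-character option codes from one
// RESTR STRING cell and tally how often each code appears.
// Fragment of the sample cell from the post:
const cell = "&MSB., ?, &MM1., &MM3., &JA3., &L4H&C60/C68., &L4G&C60/C68., &Q8E.";

// Assumes every code is exactly 3 alphanumeric characters; a longer
// alphanumeric run would be mis-split by this simple pattern.
const codes = cell.match(/[A-Z0-9]{3}/g) || [];

// Count occurrences of each code
const counts = {};
for (const c of codes) counts[c] = (counts[c] || 0) + 1;

console.log(counts); // C60 and C68 each appear twice in this fragment
```

The frequency table is then the input for ranking which codes a suggested combination should cover.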

Hello @marcelo1983

You can find some insights in the following post:

If your dataset is large and you are looking for performance, you might prefer a regex-based solution.

Your thoughts about it are welcome.

P.S. This would be your starting preprocessing of the codes. Quantifying combinations would be the next step. A mock dataset would be great for drafting a workable workflow.

Hello @gonhaddock,

Thank you very much for the suggestions. I'm trying the preprocessing and ran into some issues. After a bit of struggle with the regex expressions, I managed to run the workflow using the Regex Extractor node, but it is taking a long time. Actually, I haven't been able to finish the loop successfully yet. Here are some pictures of my data and workflow:

I’m feeding the variable loop with all the possible option codes from a DB:

And I configured the Regex Extractor to find all the 3-character alphanumeric combinations:

And the loop never ends… I'm not sure if I did anything wrong, probably yes.

When running the Column Expressions workflow I get error messages like the ones below:
WARN Column Expressions 4:497 An error occurred in script 1:
Type error: Cannot get property '0' of null (line 1):
column("Available Option Codes").match(/S4Y/)[0]
WARN Column Expressions 4:497 An error occurred in script 1:
Type error: Cannot get property '0' of null (line 1):
column("Available Option Codes").match(/BJ5/)[0]
WARN Column Expressions 4:497 An error occurred in script 1:
Type error: Cannot get property '0' of null (line 1):
column("Available Option Codes").match(/YEF/)[0]
WARN Column Expressions 4:497 Error occurred in script 1: Invalid use of identifier 'E': reserved for predefined function.
column("Available Option Codes").match(/62E/)[0]
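A note on those warnings: in JavaScript (which the Column Expressions node uses), `String.prototype.match()` returns `null` when the pattern is not found, so indexing the result with `[0]` throws exactly the "Cannot get property '0' of null" error above. The last warning suggests `E` is reserved in that node, so a code like `62E` may need different handling when substituted into the expression; that part is a guess from the error text. A null-safe sketch of the lookup (function name is illustrative, not from the node):

```javascript
// Hypothetical null-safe version of the Column Expressions line:
// return the matched code if present, otherwise null instead of throwing.
function findCode(availableCodes, code) {
  const m = availableCodes.match(new RegExp(code));
  return m ? m[0] : null;
}

console.log(findCode("&MSB.&MM1.&C60.", "C60")); // "C60"
console.log(findCode("&MSB.&MM1.&C60.", "S4Y")); // null
```

Inside the node, the equivalent guard would be writing the expression as a ternary on the match result rather than indexing it directly.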

Sorry for the long post. Here is some dummy data:
Available codes:

The idea, as I mentioned, is to find combinations that cover the most frequent codes for each row.

Thanks for all the help!

Hello @marcelo1983
I am currently working on a dummy workflow for your use case, based on my own vision of your challenge.

Feel free to explore other solutions in the meantime.


@gonhaddock , great, thank you so much for this.

Hello @marcelo1983
I worked out a dummy workflow yesterday and uploaded it to the KNIME Hub. The status is draft… some node descriptions are missing. But I think it is a base to start from.

Please take a look into it.

Let me know if it meets the requirements. I tried to make it functional with performance in mind.

I used a tiny R Script to enumerate the possible 3-element combinations without repetition. If R scripting is a problem (or Python is preferred), we can try to develop the task based on pure KNIME base nodes.
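The R snippet itself isn't shown in the thread; as an illustrative equivalent (not the actual script), enumerating all 3-element combinations without repetition from a row's extracted codes could look like this:

```javascript
// Sketch: list every unordered 3-element combination (no repetition)
// from an array of option codes. Three nested index loops with
// strictly increasing indices guarantee order is ignored and no
// element is reused.
function combinations3(codes) {
  const out = [];
  for (let i = 0; i < codes.length; i++)
    for (let j = i + 1; j < codes.length; j++)
      for (let k = j + 1; k < codes.length; k++)
        out.push([codes[i], codes[j], codes[k]]);
  return out;
}

// 4 codes yield C(4,3) = 4 combinations
console.log(combinations3(["L4H", "C60", "MM1", "MSB"]).length); // 4
```

Counting how often each combination's codes co-occur across rows would then give the frequency table the workflow produces.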


P.S. Feel free to adjust the source data tables with data more suitable for your use case, and share it back.

Hello @gonhaddock ,
This is amazing, thank you very much. I tried to follow your thinking process and, despite the advanced maneuvers for a beginner like me, I believe I understood the flow. The R script will not be a problem, but I would love to see the same processing done with KNIME nodes. The output is exactly what I was looking for: a table that shows the combination frequency for each code. I will try to use my data as input and check the results. I will get back to you soon.
Thank you very much for the effort put into this; it helped more than I can thank you for.

1 Like

Hello @marcelo1983
Just to let you know: as you confirmed the validity of the model, I've updated the workflow on the Hub. It's basically the same; I just rearranged the node positions and added labels and comments.

In another sort of subject:

This is an interesting subject when it comes to low code vs. scripting… I've been conceptualizing and experimenting on how to do it. I came up with a KNIME solution based on nested recursive loops. I don't know if other KNIMErs can come up with a more efficient approach :game_die:

I won't implement it because the resulting concept is complex, busy, and not very flexible. You would need to add a new nested loop level for each n-tuple level. On the other hand, two lines of scripted code are suitable for any level…

It can be developed in a component up to a limited n-tuple level. I won’t…
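To illustrate that trade-off (this is a sketch of the scripting side, not the workflow's code): a single recursive function covers any n-tuple level, whereas a node-only build needs an extra nested loop per level.

```javascript
// Sketch: unordered k-element combinations for ANY k, via recursion.
// Either a combination contains the first item (recurse for k-1 on the
// rest) or it doesn't (recurse for k on the rest).
function combinations(items, k) {
  if (k === 0) return [[]];          // one empty combination
  if (items.length < k) return [];   // not enough items left
  const [head, ...rest] = items;
  const withHead = combinations(rest, k - 1).map(c => [head, ...c]);
  return withHead.concat(combinations(rest, k));
}

// Changing the tuple level is just a different k argument:
console.log(combinations(["L4H", "C60", "MM1", "MSB"], 2).length); // C(4,2) = 6
console.log(combinations(["L4H", "C60", "MM1", "MSB"], 3).length); // C(4,3) = 4
```

The point above stands: in nodes, each new `k` means a new nested loop; in script, it's a parameter.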


If I am seeing this problem correctly, then I believe that it could be simplified quite a bit using this custom Join component from @takbb

1 Like

Thanks @iCFO
@takbb 's custom components are an unexplored field for me. I will try to take a look at them to see if they're suitable for this use case.

However, the discussion has turned to the combinatorial preprocessing of the wildcards reference table.


That is how I use it. I typically create lookup columns via the expressions node, then do a contains / wildcard join statement on those fields. I will see if I can pull off some KNIME time this weekend to throw something together.

The trickiest part of the component is that it uses a temporary H2 database to execute the SQL join statement, so some prep may be required in order to recognize or convert data types.

Hello @gonhaddock , thank you very much for the final version. So, I had some challenges with R since my work computer has several corporate restrictions. After some configuration I managed to install the needed packages and run the workflow. I used my data and, since it is a very large sample, the analysis is still running. I'm thinking about shrinking my sample and running separate analyses. I will let you know as soon as it ends.

Combination Test.knwf (1.7 MB)

Here is another possible initial setup for @gonhaddock 's approach without R (if R is giving you trouble). I ran out of time and didn't get to try a count of the possible joins for each match. I am fairly sure there are also some alternate paths for counting all of the joins using something like the Rank, Value Lookup or the Custom Joiner nodes… I would need more time to play with that.

1 Like

Thank you very much @iCFO , I will give it a try with my data and check whether there are processing-time improvements.

Hi @iCFO
this is really good stuff, but your workflow is delivering permutations; for this use case we need combinations without repetition.

We are looking for a low-code arrangement that can easily scale to any n-tuple level.


1 Like

Combination Test.knwf (2.2 MB)

I think I have a dynamic, KNIME-based solution for a combination matrix that removes duplicates ignoring order, but you will need to verify that I understood the goal of the process. There is a single Row Filter (which is annotated) that allows you to change the number of values in the generated list. I went back through the forums doing some searches and couldn't find any successful dynamic KNIME-only solutions, so I couldn't help giving it a shot. It was not as clean or easy as I had anticipated…

I essentially had to build the matrix at full size, rebuild it based on the reduced number of values, then build the possible list duplications and de-duplicate. None of my shortcut tactics panned out.
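As a sketch of that de-duplication step (names are illustrative, not the component's actual logic): sort each tuple into a canonical order so that order is ignored, then keep one representative per sorted key.

```javascript
// Reduce a permutation list to combinations: two tuples that differ
// only in order produce the same sorted key, so only the first
// survives the Set check.
function dedupePermutations(tuples) {
  const seen = new Set();
  const out = [];
  for (const t of tuples) {
    const key = [...t].sort().join("|"); // canonical, order-insensitive key
    if (!seen.has(key)) {
      seen.add(key);
      out.push(t);
    }
  }
  return out;
}

const perms = [["A", "B"], ["B", "A"], ["A", "C"]];
console.log(dedupePermutations(perms).length); // 2 — ["B","A"] is a duplicate of ["A","B"]
```

In node terms this corresponds to sorting the values inside each row's list and then removing duplicate rows on that sorted list.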

Thank you all for your help on this subject; @gonhaddock 's solution was the best fit for this case. I'm closing the thread as solved. Now I have another demand regarding text relevance analysis that might not be pertinent here. Thanks!

1 Like

Hello @iCFO
Thank you for trying. It is a complex solution, and scalability seems hard as well.

I was thinking earlier about using the Rank node in a loop to generate an index matrix. The resulting workflow, I suspect, would be a complex solution as well.

When you can develop it in two lines of code, does it justify investing the effort?
Anyhow, this is a subject for a different forum topic if somebody needs it.


1 Like