Substrings of strings

Hi everybody!

 

I’m new to KNIME and I can’t find a solution for one of my data manipulations.

I have strings of words (example below):

billet maniel dorin
 

The objective is to find all possible combinations of words, with a minimum of 2 words and a maximum of the entire string, keeping the order of the words in the string.

Below are all possible combinations for the 3-word string "billet maniel dorin":

  1. billet maniel
  2. billet dorin
  3. maniel dorin
  4. billet maniel dorin

How can I do this in KNIME?

Thanks in advance for your reply.

I guess you will have to solve this in Java, within a Java Snippet node.

Thank you for your reply Spider.

Does anyone have a suggestion for the code inside the Java Snippet node?

Thanks for your reply

How about the Ngram Creator? You’ll have to use several of them and concatenate their results, though.

Hi Geo,

Thank you for your great idea. I tried using an Ngram Creator to create my word combinations.

However, it only creates combinations of adjacent words, without skipping any word. Below are all the results from the Ngram Creator for the 3-word string "Billet maniel dorin":

Billet maniel

maniel dorin

The combination "billet dorin" is missing.

How can I get it as well?

Thanks in advance for your reply!

 

So this means that n-grams are not the appropriate solution, since they only take adjacent words into account.

Here is a more complex solution, which will work:

  • create two additional empty string columns with Constant Value Column;
  • Strings to Document, using the two empty columns for the authors and the full text, and your actual string column for the title;
  • Bag of Words Creator;
  • Group Loop Start with Document as the group column;
  • Cross Joiner with the top and bottom ports fed from exactly the same source;
  • Rule-based Row Filter, excluding the FALSE rows:
$Term$ = $Term (#1)$ => FALSE
TRUE => TRUE
  • Loop End

Now you have all combinations. Use Term To String on both term columns. The only challenge left is getting rid of the redundant combinations, e.g. billet maniel vs. maniel billet. Probably something can be done with an unpivot-pivot strategy or with a GroupBy ...
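To illustrate what "getting rid of the redundant combinations" amounts to, here is a minimal Java sketch. It is not part of the workflow above, and the class and method names (OrderedPairs, orderedPairs) are made up for the example. It assumes the position of each word in the original string is known: a cross join produces every (i, j) pair, the rule filter drops the pairs where both sides are the same, and keeping only the pairs where the first word comes before the second removes the reversed duplicates.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only.
final class OrderedPairs {

    /** Returns all word pairs whose first word appears before the second in the input. */
    static List<String> orderedPairs(String input) {
        String[] words = input.trim().split("\\s+");
        List<String> pairs = new ArrayList<>();
        // The two nested loops play the role of the cross join: every (i, j) is visited.
        for (int i = 0; i < words.length; i++) {
            for (int j = 0; j < words.length; j++) {
                // i == j is the "same term" case removed by the rule filter;
                // j < i is the reversed duplicate (e.g. "maniel billet").
                if (i < j) {
                    pairs.add(words[i] + " " + words[j]);
                }
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        // Prints: billet maniel, billet dorin, maniel dorin (one per line)
        for (String pair : orderedPairs("billet maniel dorin")) {
            System.out.println(pair);
        }
    }
}
```

In a workflow, the same condition would need some kind of position column for each term, which is an assumption here; the steps listed above do not create one.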

Thank you so much Geo, it works: I have all combinations by pair :)

But what I need is to find all possible combinations of words with a minimum of 2 words and a maximum of the entire string, keeping the order of the words. 

For example, take the following string: "billet maniel morin black"

The expected result is the following list :

  1. billet maniel
  2. billet morin
  3. billet black
  4. maniel morin
  5. maniel black
  6. morin black
  7. billet maniel morin
  8. billet maniel black
  9. billet morin black
  10. maniel morin black
  11. billet maniel morin black

The order is retained and words are grouped by 2, 3 and 4 (4 being the number of words in the initial string). The same pattern must be reproduced for strings of 3, 6 or 8 words ...

Thanks in advance for your help
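For what it is worth, here is a minimal Java sketch of the logic described above, which could serve as a starting point for a Java Snippet. It is only a sketch under a few assumptions: words are separated by whitespace, and the class and method names (WordCombinations, combinations, collect) are invented for the example and are not KNIME API. It builds the combinations size by size, always picking words from left to right so that the original order is kept.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only.
final class WordCombinations {

    /** Returns every order-preserving selection of at least minWords words. */
    static List<String> combinations(String input, int minWords) {
        String[] words = input.trim().split("\\s+");
        List<String> result = new ArrayList<>();
        // Go size by size so the output is grouped by 2, 3, ... up to the full string.
        for (int size = minWords; size <= words.length; size++) {
            collect(words, 0, size, new ArrayList<>(), result);
        }
        return result;
    }

    /** Picks the 'remaining' missing words from positions start..end, left to right. */
    private static void collect(String[] words, int start, int remaining,
                                List<String> current, List<String> result) {
        if (remaining == 0) {
            result.add(String.join(" ", current));
            return;
        }
        // Stop early enough that the remaining slots can still be filled.
        for (int i = start; i <= words.length - remaining; i++) {
            current.add(words[i]);
            collect(words, i + 1, remaining - 1, current, result);
            current.remove(current.size() - 1); // backtrack
        }
    }

    public static void main(String[] args) {
        // Prints the 11 combinations listed above, in the same order.
        for (String c : combinations("billet maniel morin black", 2)) {
            System.out.println(c);
        }
    }
}
```

Note that a string of n words gives 2^n - n - 1 combinations (4 for 3 words, 11 for 4 words, 247 for 8 words), so the output grows quickly with longer strings. To get one combination per row in KNIME, one option might be to return the list as an array (collection) column from the snippet and then split it into rows, for example with an Ungroup node; I have not tested that part.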