Split the text in a loop considering boundaries of words.

I have a new problem that needs a solution, and it is probably of interest to other KNIME users as well.

I have a workflow which, on top of its main task, translates text using the Google Translate API.

Google Translate limits the size of the text it translates through the API, so I made a loop which sends the text in pieces with a limited number of characters. After several mistakes (for example, it turned out that Chinese characters must be counted as two characters rather than one, so a Chinese text must be half as long as a European or Arabic one), I had a workflow which works almost fine. I was happy, but it was a foolish happiness: I cut the text regardless of word boundaries, so mistakes appear in the translation. Not too often, in fact, but they are annoying.
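
The "Chinese characters count double" rule can be expressed as a weighted length. This is only a sketch under assumptions: the function name `weighted_length` is invented here, and using the Unicode East Asian Width property to decide which characters cost 2 is one guess at what the API counts, not something documented in this thread.

```python
import unicodedata


def weighted_length(text: str) -> int:
    """Length of text where wide East Asian characters count as 2."""
    total = 0
    for ch in text:
        # 'W' (wide) and 'F' (fullwidth) cover CJK ideographs.
        # Charging 2 for them is an assumption based on the post above.
        is_wide = unicodedata.east_asian_width(ch) in ("W", "F")
        total += 2 if is_wide else 1
    return total


print(weighted_length("abc"))   # 3
print(weighted_length("中文"))  # 4
```

A length function like this could replace a plain character count when deciding how much text fits in one request.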

So, the question:

How can I split the text in the loop so that each piece has no more than about 1000 characters, but if the 1000th character falls in the middle of a word, only the text up to the end of the last complete word is processed? And, obviously, the next request should start not at character 1001 but at the character following the last translated word.

Sorry for the long and not very clear explanation; even in my native language it sounds a bit strange, but I hope you understand me.

Have a great day!

Hi @DmitryIvanov76

Interesting question (as usual with your posts :smile:)!

This is a quick reply, so please let me know if it is not clear enough and I will develop it further.

What I think you need is to know where the different word separators appear, i.e. create a list of blank separator indexes and then separate words based on a modulo 1000 of these indexes (and not the hard 1000 cut off). To achieve this, I suggest the following trick:

  1. Use a “cell splitter” node to split your sentence into a list of words
  2. Ungroup the list using the “ungroup” node
  3. With the “string manipulation” node, calculate the length of every word and add 1 (the separator has disappeared but needs to be counted too :wink:)
  4. Use the “moving aggregation” node to calculate the cumulative sum of the word lengths
  5. At this point you know when you reach the “less than but nearest” 1000 cut-off based on the cumulative sum of word lengths, and you just need to use modulo 1000 on it to split the words into chunks of “less than but nearest” 1000 characters without chopping any word :smiley: :+1: !
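
Steps 1 to 4 above can be sketched outside KNIME as follows. This is a minimal Python illustration, not the workflow itself; the function name `cumulative_lengths` is made up for the example, and it assumes that words are separated by whitespace.

```python
from itertools import accumulate


def cumulative_lengths(sentence: str) -> list[tuple[str, int]]:
    """Pair each word with the running total of (word length + 1)."""
    words = sentence.split()                   # step 1 and 2: one word per row
    lengths = [len(w) + 1 for w in words]      # step 3: +1 for the lost separator
    return list(zip(words, accumulate(lengths)))  # step 4: cumulative sum


print(cumulative_lengths("split text on word boundaries"))
# [('split', 6), ('text', 11), ('on', 14), ('word', 19), ('boundaries', 30)]
```

The running total tells you, for every word, how many characters of the original text have been consumed once that word is included.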

The spirit of the idea is here. I’m explaining this from memory without KNIME at hand, but it should work. Otherwise, please come back and I’ll explain it better :innocent:

Best wishes!

Ael

Hi @DmitryIvanov76

I’m back. I was too excited by this question to be lazy in the end, so here is, more or less, the idea I suggested in my previous answer:

20210610 Pikairos Split the text in a loop considering boundaries of words.knwf (124.9 KB)

Btw, it is not modulo but the math floor function one needs to use to correctly split the sentences :wink:
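
To illustrate the floor idea outside KNIME, here is a rough Python sketch (not the contents of the attached workflow). The name `floor_chunks` and the whitespace splitting are assumptions made for the example. Each word gets the chunk number floor(cumulative length / limit), and words sharing a number are joined back together.

```python
from itertools import accumulate, groupby


def floor_chunks(sentence: str, limit: int = 1000) -> list[str]:
    """Group words by floor(cumulative length / limit)."""
    words = sentence.split()
    # Running total of word length + 1 (the +1 stands for the separator).
    cumulative = accumulate(len(w) + 1 for w in words)
    # Integer division is the floor function for non-negative numbers.
    labelled = [(total // limit, w) for w, total in zip(words, cumulative)]
    return [
        " ".join(w for _, w in group)
        for _, group in groupby(labelled, key=lambda pair: pair[0])
    ]


print(floor_chunks("aaa bbb ccc ddd", limit=10))
# ['aaa bbb', 'ccc ddd']
```

One caveat with this grouping: a chunk can run past the limit by up to roughly the length of its first word, because the chunk number depends on where each word ends rather than where it starts. If the limit is strict, leaving some margin below the real API limit is a simple workaround.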

This piece of KNIME code could be integrated in a chunk loop if you had several sentences in your initial table.

Hope this helps.

Best wishes,

Ael

Maybe a coding node is also an option for you. (Note: I probably have to sort out the missing values first if the text is not exactly divisible by 1000.) I used the @aworker dataset as an example here (thanks for that!) and reduced the split to 100 words.
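
As an illustration of what a scripting node could do here (this is not the poster's actual code), a greedy splitter keeps adding words to the current chunk until the next word would exceed the limit. The function name `split_on_word_boundaries` is hypothetical, and the sketch assumes whitespace-separated words, so it would not help for text without spaces such as Chinese.

```python
def split_on_word_boundaries(text: str, limit: int = 1000) -> list[str]:
    """Split text into chunks of at most `limit` characters, never mid-word."""
    chunks: list[str] = []
    current = ""
    for word in text.split():
        candidate = word if not current else current + " " + word
        if len(candidate) <= limit or not current:
            # The word fits, or it is a single word longer than the limit,
            # which is kept whole rather than cut.
            current = candidate
        else:
            # The word does not fit: close this chunk and start the next
            # one with the word that would have been split.
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks


print(split_on_word_boundaries("aaa bbb ccc ddd", limit=7))
# ['aaa bbb', 'ccc ddd']
```

Unlike a fixed-position cut, each chunk here starts with the first word that did not fit in the previous one, so nothing is skipped or translated twice.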

Thank you very much!
Brilliant ideas - I will try both approaches tomorrow evening and report on results! :slight_smile:

Thanks @DmitryIvanov76 for your kind reply. Looking forward to reading your comments :wink: !

Have a great day !

Ael