Split the text in a loop considering boundaries of words.

A new problem needs solving, and it is probably of interest to other KNIME users as well.

I have a workflow which, in addition to its main task, can translate text using the Google Translate API.

Google Translate limits the size of the text it translates through the API, so I built a loop that sends the text in pieces with a limited number of characters. I made several mistakes along the way: for example, it turned out that a Chinese character must be counted as two characters rather than one, so a Chinese text must be half as long as a European or Arabic one. In the end I had a workflow that works almost fine. I was happy, but it was a foolish happiness: I cut the text regardless of word boundaries, so mistakes appear in the translation. Not too often, in fact, but they are annoying.
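The character-counting rule described above can be sketched in Python as follows. This only illustrates the observation made in this post (a wide East Asian character costs 2, everything else costs 1). The factor of 2 and the helper name `weighted_length` are assumptions for illustration, not documented behaviour of the Google Translate API.

```python
import unicodedata


def weighted_length(text):
    """Return the length of text, counting wide East Asian characters twice.

    Assumption: characters whose East Asian Width is "W" (wide) or
    "F" (fullwidth) cost 2, all other characters cost 1.
    """
    return sum(
        2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
        for ch in text
    )


print(weighted_length("hello"))  # 5
print(weighted_length("你好"))    # 4
```

A length function like this could replace a plain character count when deciding how much text fits into one request.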

So, the question:

How can I split the text in the loop so that no more than about 1000 characters are processed at a time, but if the 1000-character cut would split a word in half, only the text up to the end of the last complete word is processed? And, obviously, the next request should start not at character 1001 but at the character following the last translated word.
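To make the requirement concrete, here is a minimal Python sketch of the behaviour asked for. It is not a KNIME workflow, and the function name `split_on_word_boundaries` is hypothetical. It assumes words are separated by spaces, tabs or newlines, and it falls back to a hard cut only when a single word is longer than the limit.

```python
def split_on_word_boundaries(text, limit=1000):
    """Split text into chunks of at most `limit` characters without cutting words."""
    chunks = []
    start = 0
    while start < len(text):
        end = start + limit
        if end < len(text):
            # Look one character past the limit: if that character is
            # whitespace, the chunk already ends on a complete word.
            window = text[start:end + 1]
            cut = max(window.rfind(ws) for ws in (" ", "\n", "\t"))
            if cut > 0:
                end = start + cut
            # Otherwise no whitespace was found: a single word exceeds
            # the limit, so it is cut at the limit.
        chunk = text[start:end].strip()
        if chunk:
            chunks.append(chunk)
        start = end
        # The next chunk starts at the first character after the separator.
        while start < len(text) and text[start].isspace():
            start += 1
    return chunks


sentence = "the quick brown fox jumps over the lazy dog"
print(split_on_word_boundaries(sentence, limit=10))
# ['the quick', 'brown fox', 'jumps over', 'the lazy', 'dog']
```

Each chunk stays within the limit, and the next chunk begins with the word that did not fit.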

Sorry for the long and not very clear explanation. Even in my native language it sounds a bit strange, but I hope you understand me.

Have a great day!


Hi @DmitryIvanov76

Interesting question (as usual with your posts :smile:)!

This is a quick reply, so please let me know if it is not clear enough and I will develop it further.

What I think you need is to know where the word separators appear, i.e. create a list of the blank-separator indexes and then separate the words based on a modulo 1000 of these indexes (rather than a hard cut-off at 1000). To achieve this, I suggest the following trick:

  1. Use a “Cell Splitter” node to split your sentence into a list of words
  2. Ungroup the list using the “Ungroup” node
  3. With the “String Manipulation” node, calculate the length of every word and add 1 (the separator has disappeared but needs to be counted too :wink:)
  4. Use the “Moving Aggregation” node to calculate the cumulative sum of the word lengths
  5. At this point you know when you reach the “less than but nearest” 1000 cut-off based on the “cumulative sum of word length”, and you just need to use modulo 1000 to split your words into chunks of “less than but nearest” 1000 characters without chopping any word :smiley: :+1: !
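For readers who think more easily in code, steps 1 to 4 can be mirrored in a few lines of Python. This is only a sketch of the bookkeeping, not of the KNIME nodes themselves, and the variable names are made up for illustration.

```python
from itertools import accumulate

sentence = "the quick brown fox jumps over the lazy dog"

words = sentence.split(" ")             # steps 1-2: one row per word
lengths = [len(w) + 1 for w in words]   # step 3: word length + 1 for the separator
running = list(accumulate(lengths))     # step 4: cumulative sum of the lengths

for word, length, total in zip(words, lengths, running):
    print(f"{word:<6} {length:>2} {total:>3}")
```

The last column tells you, for every word, how many characters have been consumed once that word and its separator are included. That is the value the cut-off in step 5 is based on.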

The spirit of the idea is here. I’m explaining this from memory, without KNIME at hand, but it should work. Otherwise, please come back and I’ll explain it better :innocent:

Best wishes!

Ael


Hi @DmitryIvanov76

I’m back. I was too excited by this question to be lazy about it, so here is, more or less, the idea I suggested in my previous answer:

20210610 Pikairos Split the text in a loop considering boundaries of words.knwf (124.9 KB)

Btw, it is not modulo but the math floor function one needs to use to correctly split the sentences :wink:
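To illustrate why floor rather than modulo is the right operation, here is a small Python sketch. It assumes the floor is applied to the cumulative word length divided by the limit; the attached workflow may differ in its details, and `chunk_by_floor` is a hypothetical name.

```python
from itertools import accumulate, groupby


def chunk_by_floor(text, limit):
    """Group words by floor(cumulative length / limit)."""
    words = text.split(" ")
    running = accumulate(len(w) + 1 for w in words)
    # Floor division gives every word a chunk number that only ever increases.
    keyed = [(total // limit, word) for total, word in zip(running, words)]
    return [
        " ".join(word for _, word in group)
        for _, group in groupby(keyed, key=lambda pair: pair[0])
    ]


sentence = "the quick brown fox jumps over the lazy dog"
print(chunk_by_floor(sentence, 20))
# ['the quick brown', 'fox jumps over the', 'lazy dog']

# Modulo, by contrast, gives a position inside a block, not a chunk number:
totals = accumulate(len(w) + 1 for w in sentence.split(" "))
print([t % 20 for t in totals])
# [4, 10, 16, 0, 6, 11, 15, 0, 4]
```

One caveat of this sketch: a word that straddles a boundary is pushed into the next chunk, so a chunk can end up slightly longer than the limit. For example, with a limit of 10 the second chunk is "quick brown", which has 11 characters. If the limit is strict, leave some safety margin or use a greedy split instead.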

This piece of KNIME code could be integrated in a chunk loop if you had several sentences in your initial table.

Hope this helps.

Best wishes,

Ael


Maybe a coding node is also an option for you. (Note: I would probably have to sort out the missing values first if the text length is not exactly divisible by 1000.) I used @aworker's dataset as an example here (thanks for that!) and reduced the split to 100 words.


Thank you very much!
Brilliant ideas - I will try both approaches tomorrow evening and report on results! :slight_smile:


Thanks @DmitryIvanov76 for your kind reply. Looking forward to reading your comments :wink: !

Have a great day !

Ael

Sorry for the very, very late reply. I got COVID and was unable to work for several days, and then the missed deadlines left me no chance to test the proposed solutions. My apologies.
Now I am back and fully operational.
@aworker, your approach works perfectly with plain text. This is certainly a solution to the problem. I had to deal with some extra challenges specific to my task (when split by space the text loses paragraphs and/or the node fails). Also (I do not know why; maybe something is wrong on my side) HTML-formatted text causes the Cell Splitter node to fail. I will have to search for the right approach to this problem…
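One possible way around the lost paragraphs, sketched in Python rather than with KNIME nodes, is to keep the whitespace as tokens of its own instead of throwing it away when splitting. The function name `split_keep_whitespace` is hypothetical. The sketch does not treat HTML tags specially, so a tag that contains spaces could still be split across two chunks.

```python
import re


def split_keep_whitespace(text, limit=1000):
    """Pack words and whitespace runs into chunks of at most `limit` characters.

    Joining the chunks reproduces the original text exactly, so paragraph
    breaks survive. A single token longer than `limit` becomes its own
    oversized chunk.
    """
    tokens = re.findall(r"\S+|\s+", text)  # words and whitespace runs, in order
    chunks, current = [], ""
    for token in tokens:
        if current and len(current) + len(token) > limit:
            chunks.append(current)
            current = ""
        current += token
    if current:
        chunks.append(current)
    return chunks


text = "First paragraph here.\n\nSecond paragraph follows."
chunks = split_keep_whitespace(text, limit=25)
print("".join(chunks) == text)  # True
```

Because nothing is discarded, the translated pieces can be concatenated again without having to guess where the line breaks were.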
@Daniel_Weikert, I am sorry that I did not ask as soon as you posted your solution. Could you please share the workflow, or at least the Python code that you used?

Thanks again for your time and help!


Hi @DmitryIvanov76

I’m really sorry you got Covid and hope you are now feeling much better and fully recovered.

Thanks for your feedback. My implementation was certainly just the beginning of a possible solution that no doubt needs to be adapted to fulfil your needs, for instance if your text contains HTML tags, as you said.

I do not understand what you mean by “(when split by space the text loses paragraphs and/or the node fails)”. Splitting on spaces should neither lose paragraphs nor make the nodes fail. Which node in particular is failing? If you need extra help, could you please share your workflow or the data on which it is failing?

Best wishes for a prompt recovery !

Ael


Not sure if I can find it, but if so I'll do that.
Best wishes from my side too.
