Extract first 70 characters from a string with output including only full words

Hi guys,

I have a dataset where I would need to extract the first 70 characters from a column of different string lengths, but I would want the output to only have full words in it. For example if a word starts at index 69, the word would be fully included in the output even if it would exceed the limit of 70 characters.

Below is a picture of what I’m trying to achieve. Any help much appreciated!



Desired output

Hi @Ilkka , an expression like this should do:
substr($Input$, 0, indexOfChars($Input$, " .;", 70))

You can do this via the String Manipulation.

Basically I look for the index of the first space, or period (.) or semi-colon (basically the characters that would delimit a word - you can add more), and then I do a substring up to that point.



Perfect, thank you Bruno!

@Ilkka @bruno29a , the expression won’t work in two cases

  • length($Input) < 70
  • length($Input) >= 70 but no word delimiter after the 70th char
    Here’s an example (errors in Row2 and Row4)

You could use the Column Expression node and correct the expression this way
substr(column("Input"), 0, indexOfChars(column("Input")+".", " .;", min(70,length(column("Input")))))

If you prefer the String Manipulation node, I suggest a regex like this one
regexReplace($Input$, "^(.{0,70}\\w*)\\b.*$", "$1")
(same result, except that the regex discards any delimiter at the end of the string)

If you want to get rid of extra spaces and non-word characters and trasform all of them in single spaces:
regexReplace(strip(regexReplace($Input$,"[\\W_ ]+"," ")), "^(.{0,70}\\w*)\\b.*$", "$1")

In Row3 you can appreciate the different behaviour of the two regex

(NB the second regex creates problems if you have diacritical marks in your string)


Hi @duristef , you are absolutely correct, and nice correction with the min() in the substr().

The regex (either one) is probably more complete.


Thank you @duristef!""

1 Like

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.