Extract first 70 characters from a string with output including only full words

Ilkka · April 7, 2022, 11:50am

Hi guys,

I have a dataset where I would need to extract the first 70 characters from a column of different string lengths, but I would want the output to only have full words in it. For example if a word starts at index 69, the word would be fully included in the output even if it would exceed the limit of 70 characters.

Below is a picture of what I’m trying to achieve. Any help much appreciated!

Thanks,
Ilkka

Problem

Desired output

bruno29a · April 7, 2022, 12:04pm

Hi @Ilkka , an expression like this should do:
substr($Input$, 0, indexOfChars($Input$, " .;", 70))

You can do this via the String Manipulation.

Basically I look for the index of the first space, or period (.) or semi-colon (basically the characters that would delimit a word - you can add more), and then I do a substring up to that point.

Results:

Ilkka · April 7, 2022, 12:56pm

Perfect, thank you Bruno!

duristef · April 7, 2022, 5:24pm

@Ilkka @bruno29a , the expression won’t work in two cases

length($Input) < 70
length($Input) >= 70 but no word delimiter after the 70th char
Here’s an example (errors in Row2 and Row4)

immagine1312×246 68 KB

You could use the Column Expression node and correct the expression this way
substr(column("Input"), 0, indexOfChars(column("Input")+".", " .;", min(70,length(column("Input")))))
Result

If you prefer the String Manipulation node, I suggest a regex like this one
regexReplace($Input$, "^(.{0,70}\\w*)\\b.*$", "$1")
(same result, except that the regex discards any delimiter at the end of the string)

If you want to get rid of extra spaces and non-word characters and trasform all of them in single spaces:
regexReplace(strip(regexReplace($Input$,"[\\W_ ]+"," ")), "^(.{0,70}\\w*)\\b.*$", "$1")

In Row3 you can appreciate the different behaviour of the two regex

(NB the second regex creates problems if you have diacritical marks in your string)

bruno29a · April 7, 2022, 7:45pm

Hi @duristef , you are absolutely correct, and nice correction with the min() in the substr().

The regex (either one) is probably more complete.

Ilkka · April 8, 2022, 9:14am

Thank you @duristef!""

system · April 15, 2022, 9:14am

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.