I have a dataset where I would need to extract the first 70 characters from a column of different string lengths, but I would want the output to only have full words in it. For example if a word starts at index 69, the word would be fully included in the output even if it would exceed the limit of 70 characters.
Below is a picture of what I’m trying to achieve. Any help much appreciated!
Hi @Ilkka , an expression like this should do: substr($Input$, 0, indexOfChars($Input$, " .;", 70))
You can do this via the String Manipulation.
Basically I look for the index of the first space, or period (.) or semi-colon (basically the characters that would delimit a word - you can add more), and then I do a substring up to that point.
You could use the Column Expression node and correct the expression this way substr(column("Input"), 0, indexOfChars(column("Input")+".", " .;", min(70,length(column("Input")))))
Result
If you prefer the String Manipulation node, I suggest a regex like this one regexReplace($Input$, "^(.{0,70}\\w*)\\b.*$", "$1")
(same result, except that the regex discards any delimiter at the end of the string)
If you want to get rid of extra spaces and non-word characters and trasform all of them in single spaces: regexReplace(strip(regexReplace($Input$,"[\\W_ ]+"," ")), "^(.{0,70}\\w*)\\b.*$", "$1")
In Row3 you can appreciate the different behaviour of the two regex