Trim list of words from the end of strings

mmelbaghdadi · April 13, 2022, 10:58am

Hi all!

So I have the following table:

and I want to filter the strings from the column “DESCRIPTION”, by removing everything after certain words. In the example above I want to trim the strings after the words “ATTACHING” and “OPT”. So the output would look like this:

For this, I used the node Cell Splitter, where I split the “DESCRIPTION” column on those words and then keep the first column and delete the second one. Something like this:

However, as there are a lot of words that I want to trim on, this method becomes inefficient. Any ideas for a better way to do this? Ideally, I would provide a list of words as input and based on all these words the strings would be trimmed. The workflow that I used and the input file can be found below.

B737 IPC R6 Data Extraction.knwf (42.4 KB)
00000001_001_of_325 v4.xlsx (92.2 KB)

Any help/insight would be appreciated. Thank you!

duristef · April 13, 2022, 12:45pm

hI @mmelbaghdadi ,
maybe this workflow can solve the problem. The words “ATTACHING”, “OPT” etc. must be written in a table, then they are concatenated using the “pipe” as a separator. The resulting string is used in a regexReplace expression
KNIME_temp.knwf (11.9 KB)

mmelbaghdadi · April 13, 2022, 1:43pm

@duristef thank you for your help.

I tested the workflow, it indeed worked with words. Unfortunately, when I used inputs such as: “(V1” and “(CONTINUED)”, it removed the whole string. I assume this is caused by the parenthesis. Is there a way to trim these inputs as well?

duristef · April 13, 2022, 2:27pm

it happens for two reasons

you have parentheses in your “words”. Parentheses (every kind of) have a special meaning inside regex, so you have to escape them inside the string you’re looking for, preceding them by a backslash: \( So for example “(CONTINUED)” must become “(CONTINUED)”. Same is true for other special characters, such as “?”, “*” and “.”.
you are searching for both whole words and first characters of words, and that changes the pattern to be searched for by the regexReplace expression. Try replacing it with
strip(regexReplace($Description$, "^(.*?[\b| ])("+$${Swords}$$+").*$", "$1"))

mmelbaghdadi · April 13, 2022, 3:46pm

Hi @duristef thank you for your response. I have used the new expression but for some reason it still deletes the column.
Here is your workflow (adjusted):
Trim.knwf (12.1 KB)

duristef · April 13, 2022, 3:56pm

You didn’t escape the round brackets, as I told you
immagine

this is the result

mmelbaghdadi · April 14, 2022, 7:27am

@duristef that works for now. Your time and effort are appreciated. Thank you!

system · April 21, 2022, 7:28am

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.