How to do Regex expression EXCEPT in Regex Extractor

IrynaK · July 12, 2021, 10:14pm

I have the following text (example):
A0630 PATTERSON - 1JAN12 TO 31DEC12

I need to extract Patterson from it by excluding certain pieces of the text.

I used the following regex to find the pieces I want to exclude, but cannot find a way to get the remainder:
[a-zA-Z]\d{4}|[0-9]{1,2}[a-zA-Z]{3}[0-9]{2}

Can you help?
Thank you!

bruno29a · July 13, 2021, 3:07pm

Hi @IrynaK, looking at your regex attempt, it looks like the data pretty much follows the format of the line you presented.

If that’s the case, would it make sense to say that you could basically take a substring up to the dash (-) and then remove the A0630? Actually, you could also take the substring between the first space and the dash (-).
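
For readers outside KNIME, the two substring ideas above can be sketched in Python. This is only an illustration under the assumption that the text has exactly the shape of the sample line (an ID, a space, the name, then a dash); the function names are hypothetical and not part of the attached workflow.

```python
def name_before_dash(text: str) -> str:
    # Method 1: keep everything before the first dash, then drop the leading ID token.
    before_dash = text.split("-", 1)[0].strip()
    return before_dash.split(" ", 1)[1].strip()


def name_between_space_and_dash(text: str) -> str:
    # Method 2: keep the text between the first space and the first dash.
    start = text.index(" ") + 1
    end = text.index("-")
    return text[start:end].strip()


sample = "A0630 PATTERSON - 1JAN12 TO 31DEC12"
print(name_before_dash(sample))
print(name_between_space_and_dash(sample))
```

Both functions raise an error when the space or dash is missing, so they only suit data that reliably follows this layout.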

I put a workflow together that uses both methods:

The input is the same as you have:

Configuration and Output of Method 1:

Configuration and Output of Method 2:

That’s not to say that there are no other ways to do this. There are most definitely other ways.

EDIT: Here’s the workflow:
Extract string based on position.knwf (8.3 KB)

IrynaK · July 13, 2021, 3:43pm

Thank you!
The text can have or not have dashes, it is all over the place. It is a free form text. There are just some dates in different formats and ids that I have identified in regex that I need to exclude and want to see the rest of it.

bruno29a · July 13, 2021, 3:46pm

Hi @IrynaK, can we get a few of the variations as examples?

Do you know if what you want to extract will always be 1 word (in which case, can we extract anything between the first space and second space)?

IrynaK · July 13, 2021, 3:53pm

It will be any number of words, or nothing at all, once the dates and the id are extracted. The text may or may not have dates, and it may or may not have ids.
(Don't worry about the different date formats, I have already accounted for those.)

J5667 want to cancel the contract
J7695 20211001-20250930 INV12356

bruno29a · July 13, 2021, 4:10pm

So, the only thing that’s consistent is the “[a-zA-Z]\d{4}” at the beginning? After that, it would seem like anything goes

takbb · July 13, 2021, 4:18pm

Hi @IrynaK, I agree with @bruno29a that it does sound rather open-ended. The joy of trying to discern useful information from free text!

Would it be an option maybe to remove any “word” sequence that also contains at least one digit and see what that leaves behind? Is that what you were originally aiming for?
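
As a rough illustration of that idea (in Python rather than a KNIME node, and with a hypothetical function name), dropping every whitespace-delimited "word" that contains at least one digit could look like this:

```python
import re


def drop_tokens_with_digits(text: str) -> str:
    # \S*\d\S* matches a run of non-space characters that contains at least one digit.
    without_digit_tokens = re.sub(r"\S*\d\S*", " ", text)
    # Collapse the leftover whitespace into single spaces.
    return " ".join(without_digit_tokens.split())


print(drop_tokens_with_digits("A0630 PATTERSON - 1JAN12 TO 31DEC12"))
print(drop_tokens_with_digits("J5667 want to cancel the contract"))
```

Separators such as the dash and "TO" survive this step, so a second clean-up pass may still be needed depending on what counts as noise.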

As @bruno29a has mentioned, ideally you need a good amount of representative sample data to be able to make better suggestions or find any patterns.

As it stands it looks like every row needs a different hand crafted regex!

IrynaK · July 13, 2021, 5:18pm

I have handled the expected formats I want to exclude. My regex says ‘A | B | C’, which finds those parts of text and Regex Extractor extracts them. But what I want is regex for ‘EXCLUDE A | B | C’. The help I am looking for is how to handle EXCLUDE in the Regex Extractor.
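
One general way to get "everything except A | B | C" is to split the text on the pattern and keep the pieces in between, instead of trying to negate the pattern itself. The Python sketch below shows the idea using the regex from the first post; it is an assumption about what the desired output looks like, not a description of any KNIME node, and `remainder_pieces` is a hypothetical name.

```python
import re

# The alternation from the first post: an ID (letter + 4 digits) or a date like 1JAN12.
EXCLUDE = re.compile(r"[a-zA-Z]\d{4}|[0-9]{1,2}[a-zA-Z]{3}[0-9]{2}")


def remainder_pieces(text: str) -> list[str]:
    # Splitting on the excluded patterns leaves the text between the matches.
    pieces = EXCLUDE.split(text)
    # Trim each piece and discard the empty ones.
    return [piece.strip() for piece in pieces if piece.strip()]


print(remainder_pieces("A0630 PATTERSON - 1JAN12 TO 31DEC12"))
print(remainder_pieces("J5667 want to cancel the contract"))
```

Note that the pattern has no word boundaries, so it can also match inside a longer token (for example part of INV12356); adding `\b` around each alternative would be one way to tighten it.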

takbb · July 13, 2021, 5:46pm

Hi @IrynaK,

I’m struggling with that in Regex Extractor too, and more generally with getting an “everything except” match with regex, either inside or outside of KNIME.

I was a little confused because I thought you were saying you ONLY wanted “PATTERSON” from your original sample data. But if you are happy that your original regex does what you need, then maybe using it in String Manipulation, much like @bruno29a did, to “rub out” the matches is a possibility:

regexReplace($column1$, "[a-zA-Z]\\d{4}|[0-9]{1,2}[a-zA-Z]{3}[0-9]{2}","" )

After that, maybe if you need to split on spaces or something this gives you what you need.
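
For comparison, here is roughly the same "rub out" step sketched in Python, followed by a whitespace clean-up. It assumes the same pattern as the expression above; `rub_out` is a hypothetical helper name.

```python
import re

# Same pattern as in the regexReplace call; a raw string avoids doubling the backslash.
PATTERN = r"[a-zA-Z]\d{4}|[0-9]{1,2}[a-zA-Z]{3}[0-9]{2}"


def rub_out(text: str) -> str:
    # Replace every match with an empty string.
    cleaned = re.sub(PATTERN, "", text)
    # Splitting on whitespace and re-joining removes the gaps left behind.
    return " ".join(cleaned.split())


print(rub_out("A0630 PATTERSON - 1JAN12 TO 31DEC12"))
print(rub_out("J5667 want to cancel the contract"))
```

The result is a single string; calling `.split()` on it gives the individual words if they are needed separately.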

bruno29a · July 13, 2021, 6:01pm

I think I get what @IrynaK is saying. “Extract” here means “remove” or “exclude”.

@IrynaK if you have been able to find what you want to extract/remove/exclude, then just replace it with empty string “”.

For example, let’s say after your regex you find the string you want to exclude, and it’s in the second column “string to extract”:

You just need to replace that string with “” like this:

And here’s the result in the third column:

Is this what you wanted?

IrynaK · July 14, 2021, 12:04am

Thank you all for helping! My full multi-condition regex did not work in the String Manipulation node.
However, I have solved it differently.
I used the Strings to Document node and then the Regex Filter node, and it worked!
Thank you everyone again!
