How to do Regex expression EXCEPT in Regex Extractor

I have the following text (example):
A0630 PATTERSON - 1JAN12 TO 31DEC12

I need to extract PATTERSON from it by excluding certain parts (the ID and the dates).

I used the following regex to find the pieces I want to exclude, but cannot find a way to get the remainder:
[a-zA-Z]\d{4}|[0-9]{1,2}[a-zA-Z]{3}[0-9]{2}

Can you help?
Thank you!

Hi @IrynaK , Looking at your regex attempt, it looks like the data format is pretty much in the way it is in the line you presented.

If that’s the case, would it make sense to say that you could basically take a substring up to the dash (-), and then remove the A0630? Actually, you could also take the substring between the first space and the dash (-).
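For readers outside KNIME, the two substring ideas can be sketched in Python. This is only an illustration of the logic, not the contents of the attached workflow; the function names and the fixed ID width of 5 characters are my assumptions, taken from the single sample line.

```python
def up_to_dash_without_id(text: str, id_length: int = 5) -> str:
    """Method 1: cut at the first dash, then drop the leading ID.

    id_length=5 is an assumption based on the sample ID "A0630".
    """
    head = text.split("-", 1)[0]      # everything before the first dash
    return head[id_length:].strip()   # drop the ID, trim spaces


def between_first_space_and_dash(text: str) -> str:
    """Method 2: take what sits between the first space and the first dash."""
    start = text.find(" ")
    end = text.find("-")
    if start == -1 or end == -1 or end < start:
        return ""                     # the expected layout is not present
    return text[start + 1:end].strip()


sample = "A0630 PATTERSON - 1JAN12 TO 31DEC12"
print(up_to_dash_without_id(sample))         # PATTERSON
print(between_first_space_and_dash(sample))  # PATTERSON
```

Both versions rely on the dash being present, so they only fit data that really follows the layout of the sample line.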

I put a workflow together that uses both methods:
image

The input is the same as you have:
image

Configuration and Output of Method 1:


image

Configuration and Output of Method 2:


image

That’s not to say there are no other ways to do this; there most definitely are.

EDIT: Here’s the workflow:
Extract string based on position.knwf (8.3 KB)

Thank you!
The text may or may not contain dashes; it is all over the place. It is free-form text. There are just some dates in different formats and IDs that I have identified with regex; I need to exclude those and see the rest of it.

Hi @IrynaK , can we get a few of the variations as examples?

Do you know if what you want to extract will always be 1 word (in which case, can we extract anything between the first space and second space)?

It will be any number of words, or nothing at all, once the dates and the ID are removed; the text may or may not contain dates, and may or may not contain IDs.
(Don’t worry about the different date formats, I have already accounted for those.)

J5667 want to cancel the contract
J7695 20211001-20250930 INV12356

So, the only thing that’s consistent is the “[a-zA-Z]\d{4}” at the beginning? After that, it would seem like anything goes :sweat_smile:

Hi @IrynaK, I agree with @bruno29a that it does sound rather open-ended. The joy of trying to discern useful information from free text!

Would it be an option maybe to remove any “word” sequence that also contains at least one digit and see what that leaves behind? Is that what you were originally aiming for?
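To make that suggestion concrete, here is a rough Python sketch of "drop every word that contains a digit". Splitting on whitespace is my assumption about what counts as a "word", and the function name is made up for the example.

```python
import re


def drop_words_with_digits(text: str) -> str:
    """Keep only the whitespace-separated words that contain no digit."""
    kept = [word for word in text.split() if not re.search(r"\d", word)]
    return " ".join(kept)


print(drop_words_with_digits("A0630 PATTERSON - 1JAN12 TO 31DEC12"))
# PATTERSON - TO
print(drop_words_with_digits("J5667 want to cancel the contract"))
# want to cancel the contract
```

Note that connectors such as "-" and "TO" survive, so some follow-up cleaning would still be needed for rows like the first one.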

As @bruno29a has mentioned, ideally you need a good amount of representative sample data to be able to make better suggestions or find any patterns.

As it stands it looks like every row needs a different hand crafted regex! :wink:

I have handled the expected formats I want to exclude. My regex says ‘A | B | C’, which finds those parts of the text, and Regex Extractor extracts them. But what I want is a regex for ‘EXCLUDE A | B | C’. The help I am looking for is how to handle EXCLUDE in the Regex Extractor.
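One way to think about ‘EXCLUDE A | B | C’ outside of KNIME is to split the text on the pattern, so that what comes back is everything the pattern did not match. A minimal Python sketch of that idea, using the regex from the first post (the names EXCLUDE and remainder are only illustrative, and this is not a Regex Extractor feature):

```python
import re

# The "A | B" pattern from the first post: an ID or a date like 1JAN12.
EXCLUDE = re.compile(r"[a-zA-Z]\d{4}|[0-9]{1,2}[a-zA-Z]{3}[0-9]{2}")


def remainder(text: str) -> list[str]:
    """Return the non-empty pieces of text that lie between the matches."""
    pieces = EXCLUDE.split(text)
    return [piece.strip() for piece in pieces if piece.strip()]


print(remainder("A0630 PATTERSON - 1JAN12 TO 31DEC12"))
# ['PATTERSON -', 'TO']
print(remainder("J5667 want to cancel the contract"))
# ['want to cancel the contract']
```

Because the pattern is not anchored to word boundaries, it can also match inside longer tokens, so the results are worth checking against real rows.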

Hi @IrynaK,

I’m struggling with that in Regex Extractor too, and more generally with getting an “everything except” out of regex, either inside or outside of KNIME.

I was a little confused because I thought you were saying you ONLY wanted “PATTERSON” from your original sample data, but if you are happy that your original regex does what you need, then maybe using it much like @bruno29a did using String Manipulation instead to “rub out” using regex is a possibility:

regexReplace($column1$, "[a-zA-Z]\\d{4}|[0-9]{1,2}[a-zA-Z]{3}[0-9]{2}", "")

image

image

After that, if you need to split on spaces or similar, this may give you what you need.
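The same "rub out" step looks like this in Python, in case it helps to test the pattern outside KNIME first. It is a sketch of the idea rather than a copy of the node's behaviour; the whitespace tidy-up at the end is my addition. The doubled backslash in the KNIME expression is string escaping, so in a Python raw string it becomes a single \d.

```python
import re

# Same pattern as in the regexReplace expression above.
EXCLUDE = re.compile(r"[a-zA-Z]\d{4}|[0-9]{1,2}[a-zA-Z]{3}[0-9]{2}")


def rub_out(text: str) -> str:
    """Replace every match with an empty string, then collapse the spaces."""
    stripped = EXCLUDE.sub("", text)
    return " ".join(stripped.split())


print(rub_out("A0630 PATTERSON - 1JAN12 TO 31DEC12"))
# PATTERSON - TO
print(rub_out("J5667 want to cancel the contract"))
# want to cancel the contract
```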

I think I get what @IrynaK is saying. “Extract” here means “remove” or “exclude”.

@IrynaK, if you have been able to find what you want to extract/remove/exclude, then just replace it with an empty string “”.

For example, let’s say after your regex you find the string you want to exclude, and it’s in the second column “string to extract”:
image

You just need to replace that string with “” like this:

And here’s the result in the third column:
image

Is this what you wanted?

Thank you all for helping! My full multi-condition regex did not work in the String Manipulation node.
However, I have solved it differently.
I used Strings to Document node and then Regex Filter and it worked!
Thank you everyone again!
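For anyone landing here later, the token-level idea behind that solution can be approximated in Python. I have not seen the actual workflow, so this is a guess at the logic: it assumes the document is split into terms on whitespace and that a term is dropped when the whole term matches the pattern. The names are illustrative only.

```python
import re

# Pattern from the first post; the real workflow used a longer one.
EXCLUDE = re.compile(r"[a-zA-Z]\d{4}|[0-9]{1,2}[a-zA-Z]{3}[0-9]{2}")


def filter_terms(text: str) -> list[str]:
    """Split into terms on whitespace and drop terms that match entirely."""
    return [term for term in text.split() if not EXCLUDE.fullmatch(term)]


print(filter_terms("A0630 PATTERSON - 1JAN12 TO 31DEC12"))
# ['PATTERSON', '-', 'TO']
print(filter_terms("J7695 20211001-20250930 INV12356"))
# ['20211001-20250930', 'INV12356']
```

Matching whole terms avoids partial hits inside longer tokens, which is one plausible reason the term-based route behaved better than a plain replace.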
