I am having difficulty with the Regex match function under string manipulation. I have various postcodes and I want to flag those with the format “Z99 9ZZ” or “ZZ99 9ZZ” as TRUE and the others as FALSE. I have used RegExr to write this expression:
^[A-Z]{2}[0-9]{1,2}\s[A-Z0-9]{3}
Great, it works and matches the format I want and doesn’t match the format I want to exclude. However, when I put that in the RegEx match formula in the string manipulation node, all rows are false. Does Knime use a different kind of RegEx or am I missing something? I’m not too well-versed in RegEx, so it might be me! It’s just strange that it works on the testing sites but not in practice. Thank you!
your expression works for me on second example (ZZ99 9ZZ) while your first example (Z99 9ZZ) only has one letter and in expression you expect 2 letters at beginning (^[A-Z]{2}...). So maybe following expressions will do the trick for you (not a regex expert): regexMatcher( $yourColumnName$, "^[A-Z]{1,2}[0-9]{2}\\s[A-Z0-9]{3}")
Thanks Ivan. That has worked for some! Others are still coming out as False which I would expect ot be true. I tried replacing it with “^[A-Z]{2}.+” and that seems to do the trick. I therefore assume that the person who gave this to me left trailing spaces or something in most of them!
(I also realise I provided the wrong example in the first post; I meant to say ZZ99 or ZZ9, hence the {2}!)
Yes, it seems to be. I haven’t figured out why it is not working for the first formula but is for the .+ version, but at least there are some pointers here. Maybe they will also help someone else with similar issues. Thanks again!
If you are doing regex for UK postcodes, then whilst XX99 9XX and X99 9XX are the major formats, you may also possibly need to handle a few other more unusual ones:
e.g. X9X 9XX (example, the BBC Broadcasting House - where I used to work - in London - has a postcode of W1A 1AA )
The full set of UK postcode formats is
XX9X 9XX
X9X 9XX
X9 9XX
X99 9XX
XX9 9XX
XX99 9XX
The following String Manipulation regex should find all these format, and allow for whitespace at the beginning, end or middle (though it doesn’t allow whitespace within either of the two component sections of the postcode)
This requires the format to conform to the following: ^ start of line \\s* any amount of whitespace (including none) [A-Za-z]{1,2} one or two letters (don’t care about case) (?: then from within the following group, either…
option 1 → [0-9][A-Za-z]{0,1} A single digit followed by zero or one letter, ignoring case | or…
option 2 → [0-9]{2} two digits ) end of group \\s+ one or more whitespace characters [0-9][A-Za-z]{2} a single digit followed by two letters, any case \\s* any amount of whitespace (including none) $ end of line
The above regex is written to be case-insensitive but you might wish to change the letter ranges if that doesn’t suit.
That is what I was looking for, but I hope the people I am dealing with don’t live at the BBC headquarters! However, thanks for the regex; I will save this and use it in future because who knows if there is going to be a funny postcode in my area? I don’t think so, but it is good to have something which covers everything.