Find account number inside string

#1

I have a free form text data field that also includes account numbers. The account number is six numeric digits that seem to always start with 0. Sometimes it is in the front, sometimes it is somewhere in the middle, and sometimes it is near the end.

A field could look like any of the following:
“012345 company A deal 34”
“company A 012345 deal 34”
“company A deal 34 of 012345”

Is there a way to use a node that will scan this free form field and return those six digits? The actual free form fields are far more complicated than what I showed, just trying to illustrate what I meant by where the acct # could be.

Thanks in advance for the help,
Eric

0 Likes

#2

Hi,
This can be done using a regular expression (or regex). A regular expression is a sequence of characters that define a text search pattern. The regex for a six digit number is “(\d{6})” and the regex for a six digit number starting with zero is “(0\d{5})”. To exclude numbers with more digits, the regex is “((?<!\d)(0\d{5})(?!\d))”. This says find numbers starting with zero and containing six digits that are not directly preceded or succeeded by a digit.
Now the task is to find a node that will search a String column using a regex and return the first match. I haven’t been able to find such a node (though I haven’t searched exhaustively of late). However some time ago I did create a metanode that performs this task. See the attached workflow.
Hope this helps,
-Don
find account nums by regex.knwf (16.5 KB)

6 Likes

#3

Holy cow dude, this is amazing. Thank you so much, I can’t wait to try this out in the morning.

3 Likes

#4

I figured I should let you know that this worked perfectly on my data. Thanks for including the difference between 0/d{5} and /d{6}; turns out there were a few that did not start with 0.

I’ve been meaning to learn how to use regex for a while now, I think you spurred me to work further with them.

4 Likes

#5

I’m glad it worked for you. This is a metanode I’ve used on several occasions. Perhaps if there’s enough interest, KNIME can make this a new function in the String Manipulation and/or Column Expression nodes?

1 Like

#6

I think it would be a great idea if they did that. Seems like the sort of thing that would get plenty of use.

0 Likes

#7

Hi @dnaki,

check Metanode you created (which is nice btw) and was wondering about your idea “making this a new function in the String Manipulation and/or Column Expression nodes”. Can you explain it a bit more detailed?

Br,
Ivan

0 Likes

#8

Hi, thanks for the follow-up @ipazin !

The String Manipulation node already contains the function regexMatcher(str,regex), but that returns a boolean value if the whole string matches the given regular expression.

What we want here is to return the first substring in the text that matches a specified capturing group of a regular expression. This allows us to flexibly find substrings (like account numbers or patent numbers, etc) that match given patterns.
The use of capturing groups increases flexibility by also allowing us to optionally specify surrounding text patterns. This is useful when we’re interested in say, the last 4 digits of a social security number

possible function signatures:
findMatch(text, regex, capturingGroup)
findMatch(text, regex, capturingGroup, caseSensitive)

description:
Returns the first substring of the text that matches the specified capturing group of the regular expression. Capturing groups are specified using parentheses. The first parenthesis denotes capturing group 1.

technical notes: In my regex metanode I allow the user to specify whether the regex is case-sensitive, so the second function signature above adds a corresponding parameter.
However, another way to specify case-insensitivity is adding the prefix “(?)” to the regular expression itself. So, technically the second function is not required, it just makes it slightly easier.

In my metanode, if case-sensitive check box is checked, the java code looks like:
Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
But I haven’t tested to see whether this flag overrides the (?) prefix if specified in the regex. The flag also may impose a slight performance penalty, hence the first function signature above allows avoiding this penalty.

Thanks
Don

P.S. I’m an intermediate regex user. Perhaps any advanced users have additional comments / ideas?

0 Likes