Looking to extract string of numbers from messy column values

Hi,

I am new to KNIME and still learning how to achieve some results. I have a column of information that is very messy, no standardization to the formatting of the contents and contains values that are variable in length and format. This column contains number strings that I would like to have in a stand-alone column, removing all of the other information. These number strings are variable in length and may or may not contain punctuation or spaces and may have a letter as part of the string (often at the end). There may also be more than one in a given cell.

I have tried the String Manipulation node using the regexReplace function using an expression like this:“(.*\\d+(?:[.,\\s]\\d+)”,“$1” but no matter how I modify this regex, I can’t get KNIME to give me results. It either gives me nothing in the new column or it simply copies what is in the column to the new column.

Can anyone help me with either a better regex and/or node to achieve this goal of extracting the string of numbers? Below are some example strings where I would like to extract the numbers:

15449L Universal Editions
144-40058 Merion Music, Inc.
6718 Peters; 6718a Peters
eSp 8323
6378 Edition Schott
U.E. 12466 LW Universal
66455 Edition Peters
Z. 7032 Editio Musica volume 1; Z. 7926 Editio Musica volume 2

Thank you! |

Hi @rsmth

Welcome to the KNIME forum. A tricky question. It would be helpful if you could provide not only some sample input data, but also the expected results in your “standalone” column.

gr. Hans

1 Like

@rsmth this is how this could work. You define a minimum length of the number you want to extract and then use this regex to split the text into several (full) numbers. If you want to take care of decimals the regex might have to be different

(?s).*?\b(\d{2,})\b(?:.*?\b(\d{2,})\b)?(?:.*?\b(\d{2,})\b)?

2 Likes

Thank you, @HansS,
In the standalone column, I am hoping for:
15449L
144-40058
6718 6718a
8323
6378
12466
66455
7032 7926
As you see there are a couple where I will would also like to have the final letter attached to the number but I should still be able to work with just the digits alone and achieve what I am looking for, as I suspect developing a regex to be able to capture the letter after the numbers too might result in more problems than it is worth.
In general, it is just the strings I am hoping for but, as mentioned and not well represented in the sample data, is that some numbers have various forms of punctuation or white space leading into or within the number, ex. B-103 or 123 456 789, that also make this challenging.

Thank you again for any thoughts you might have. I will check out the option presented by the other poster as well.

rsmth

1 Like

It would make sense to mention such challenges in the initial round so the solution might try to take that into account. Another approach could be to employ a (local) LLM and tell it to extract the numbers (or more the IDs). Key will be to describe what you want and make sure the resulting data always has the same structure (JSON). Maybe send blocks of data there.

Hi @mlauber71,
Thank you again for this solution! It does look after many of the examples I have. I know I didn’t include all of the variations around these numbers in the main request, as I didn’t even know if what I wanted was possible but would you be able to provide advice for instances where the numbers are part of a string that includes letters either at the beginning or end?

The examples are:

LMP12345

6318a

HL50396090

EJEM00005C Belwin

Is it possible to extract these as well, either as part of the solution you have given or through another method?

This seems to be the largest format group where the numbers weren’t extracted yet.
Thank you in advance for your time!

rsmth

@rsmth here you go:

(?s).*?(?<!\d)(\d{2,})(?!\d)(?:.*?(?<!\d)(\d{2,})(?!\d))?(?:.*?(?<!\d)(\d{2,})(?!\d))?

4 Likes

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.