Regex Parse from String

hi all,
I am trying to parse data from a string as following:

Peanuts 550 G
Peanuts 0,4kg
Water 800 ml
Water 0,1 L

Aim: to parse out the quantity and the units related to the quantity.

I wrote the following regex to prepare for any other formats in the future to be parsed out:

[0-9][0-9][0-9].[mlMLlLmgMGgGkgKG]

However I do not seem to be able to extract only the Quantity+Unit and have it in a new column with string manipulation.

Can someone please help?
Mhanks a lot!
M

Hi @matejek88

You can go quite some different routes here.

Are the records always one word followed by the units? If that’s the case you can also use a Cell Splitter with array size 2 with a space as delimiter.

You can also go for a RegexExtractor with [0-9,]{3,}.[mlMLlLmgMGgGkgKGml]+

Or a string manipultor with regexReplace($column1$,".*(?=[0-9,]{3,}.[mlMLlLmgMGgGkgKGml]+)","")

3 Likes

It is a really good base but I had to add one thing

.(?=[0-9,]{3,}.[mlMLlLmgMGgGkgKGml]+) added an asterix between the quantity and the unit becase either no space or 4 blank space can be expected like , 500g or 499 ML.

Also, I would expect to get 1,2 KG but also 1.2 KG (comma or period delimiter)

How should I go about?

Really appreciate your time @ArjenEX

@ArjenEX

I found more interesting cases in my data.
There can be pieces, and there can be packages too as follows:

Peanut 500 G
Peanut 2 X 45 G
Peanut 2X100g
Peanut 10PCS
7 Peanutdrink 300ml (here I do not need 7!)

So I went to regexr and wrote the expression that works there for me, but when I paste it into String Man. node it is greyed out.

Could you please help why is that?
Here is the expression

\d{1,}[\d\s\w][\s\d\w]

thanks!

Hi,
In Regex the backslash is an escape character, but it also is in our String Manipulation node. So there you need to escape each backslash with another backslash. And remember to surround it with double quotes.
Kind regards
Alexander

1 Like

Thanks a lot! One more thing…how do I make the text that is not needed disappear and keep only the extracted quantities+units?

Really appreciate the quick help!

I’d recommend using regex101.com, write your code, click on Code Generator, select Java and copy the contents of final String regex to KNIME.

3 Likes

Hi,
Can you show a screenshot of your current output?
Kind regards,
Alexander

I cannot really share the output itself, but I can try to explain:

Brand name Product 500 G
Brand name Product 2X 50g

Obviously the brand name and product is changing, but I want to get rid of it and extract the quantity into another column for further manipulation.

You can use a positive lookahead which I showed earlier in this topic, only the regex was different :wink:

1 Like

Alternatively you could also extract the groups with “(…)” and refer to the one you want to replace
br

The thing is my regex captures what I want on my dataset, but the moment I apply positive lookahead, it wants to cut off part of my capture - which I obiously dont want.

Could you please give me an example how that is executed? thanks a mill!

Have a look if regexReplace($column1$,"^\\b[\\d+][\\s][\\w]+|^\\b[\\w]+[\\s]","") gives the desired result on your actual data set.

It does not work, but I think I managed to find the workaround which works on a small dataset but not in a huge, that I need to check still.

Appreciate the patience and effort folks.

@ArjenEX
I would need for a little bit more help on Regex please as some requirements has changed.

I have a table with quantities as following:
6 x 700 mg
2X250ml
500 ML
0,7 Kg
300,50 gr
250 - 300gr

Could you please help me with the regex that only parses out those? I have spent way more time on this and there is always a hickup in my expression, or I could do it but with 2 separate expression but I would like to learn how to simplify things.

Really appreciate the response.

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.