Get 3 values using Regex in KNIME

Hello Team,

Hope someone can help… I am trying to extract three specific values using the Regex Split node in KNIME.

I was able to get two of them but can’t get the quantity. Here’s the data:

Data: 0940720046 Graxa lubrificante sintetica - KLUBERSYNTH BH 72-422, 34039900 110 6656 KG 0,6000 6.046,1667 3.627,70 3.981,40 159,26 353,70 4,00 9,75

Need to get: first - 34039900, second - KG, and third (the one the code does not capture yet) - 0,6000

Here’s the code: .(\d{8})\s[0-9\s]+(KG|L|LT|Tambor|PC)\s[0-9.]+,\d+\s.

I put (KG|L|LT|Tambor|PC) because the unit value varies, but the position of the data is fixed. Hope someone can help me get the third one. Thank you!

Hello @trafalgarlaw
You can test the following code:

(\d+)\s.+?\d+\s([KG|LT|L|Tambor|PC]+)\s(.+?)\s.+

BR

Hello @gonhaddock, sorry I missed it. It seems the code returns the following:

First value: 0940720046, which should be 34039900.
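
The result reported here can be reproduced outside KNIME. The sketch below uses Python rather than the node itself, and rests on two assumptions: that the Regex Split node requires the pattern to match the whole input (approximated with `re.fullmatch`), and that the input starts at the first digit, without the "Data:" label. The pattern uses only constructs that behave the same way in Java and Python regex.

```python
import re

# First sample line from the thread, without the "Data: " label (assumption).
sample = ("0940720046 Graxa lubrificante sintetica - KLUBERSYNTH BH 72-422, "
          "34039900 110 6656 KG 0,6000 6.046,1667 3.627,70 3.981,40 "
          "159,26 353,70 4,00 9,75")

# Pattern from the first reply, copied unchanged.
pattern = r"(\d+)\s.+?\d+\s([KG|LT|L|Tambor|PC]+)\s(.+?)\s.+"

# fullmatch stands in for a node that must match the entire string.
groups = re.fullmatch(pattern, sample).groups()
print(groups)  # ('0940720046', 'KG', '0,6000')
```

The first group is `(\d+)` at the very start of the pattern, so it captures the leading 10-digit number instead of the 8-digit code that comes after the comma.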

I’d appreciate it if you could help. Thank you!

Hello @trafalgarlaw
This code is working on my side:

.+[,]\s(\d+).+?([KG|LT|L|Tambor|PC]+)\s(\S+?)\s.+
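
A quick way to check this pattern against the first sample is sketched below in Python. This is not the KNIME node: `re.fullmatch` is used on the assumption that Regex Split needs the whole string to match, and the sample is assumed to start at the first digit.

```python
import re

# First sample line from the thread, without the "Data: " label (assumption).
sample = ("0940720046 Graxa lubrificante sintetica - KLUBERSYNTH BH 72-422, "
          "34039900 110 6656 KG 0,6000 6.046,1667 3.627,70 3.981,40 "
          "159,26 353,70 4,00 9,75")

# Pattern from the reply above, copied unchanged.
# ".+[,]\s" consumes everything up to a comma that is followed by whitespace,
# so the first capture starts right after "72-422, ".
pattern = r".+[,]\s(\d+).+?([KG|LT|L|Tambor|PC]+)\s(\S+?)\s.+"

groups = re.fullmatch(pattern, sample).groups()
print(groups)  # ('34039900', 'KG', '0,6000')
```

In this sample every other comma sits inside a decimal number and is followed by a digit, so the comma plus whitespace uniquely marks where the 8-digit code begins.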

BR

PS: last-minute edit

Thank you @gonhaddock. It works now on my end!! :)

Hello @gonhaddock, I ran the code on other data, and it does not work for the line below; it returns no results.

Data: 31081800 OEM SYNTHETIC DEO 5W30 27101932 830 6651 L 9.801,0000 13,1454 128.838,07 0, 00 0, 00 0, 00

I’d appreciate it if you could check. Thank you!

That said, I’ve tried the code below and it works! I’m not sure whether it will hold for all data, though, so perhaps you could suggest improvements.

.(\d{8})\s.+?\d+\s([KG|LT|L|Tambor|PC]+)\s(.+?)\s.+.

Hello @trafalgarlaw
The problem here is that regex can be very literal; if the range of cases is wide, it is better to test the pattern against as many samples as possible.
That said, for the first sample the code used [,] as an anchor to locate the first numeric sequence to capture.

In this new version the comma is no longer used (it has been removed); the anchor is now a sequence of eight digits that follows a whitespace character. Otherwise the match could be confused with the numeric sequence at the start of the string, which in the second sample is also eight digits long.

This code works for both examples provided so far:

.+\s(\d{8}).+?([KG|LT|L|Tambor|PC]+)\s(\S+?)\s.+
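
The change from the comma anchor to the whitespace anchor can be compared on both samples with the Python sketch below. It is an approximation, not the KNIME node: `re.fullmatch` assumes the whole string has to match, and the helper name `split` is made up for this illustration.

```python
import re

# Both sample lines from the thread, without the "Data: " label (assumption).
sample_1 = ("0940720046 Graxa lubrificante sintetica - KLUBERSYNTH BH 72-422, "
            "34039900 110 6656 KG 0,6000 6.046,1667 3.627,70 3.981,40 "
            "159,26 353,70 4,00 9,75")
sample_2 = ("31081800 OEM SYNTHETIC DEO 5W30 27101932 830 6651 L 9.801,0000 "
            "13,1454 128.838,07 0, 00 0, 00 0, 00")

# Earlier pattern (comma anchor) and the new one (whitespace + 8 digits).
comma_pattern = r".+[,]\s(\d+).+?([KG|LT|L|Tambor|PC]+)\s(\S+?)\s.+"
space_pattern = r".+\s(\d{8}).+?([KG|LT|L|Tambor|PC]+)\s(\S+?)\s.+"


def split(pattern, text):
    """Return the captured groups, or None if the whole string does not match."""
    match = re.fullmatch(pattern, text)
    return match.groups() if match else None


print(split(comma_pattern, sample_2))  # None
print(split(space_pattern, sample_1))  # ('34039900', 'KG', '0,6000')
print(split(space_pattern, sample_2))  # ('27101932', 'L', '9.801,0000')
```

In the second sample the only commas followed by whitespace are in the trailing "0, 00" values, after the unit, so the comma-anchored pattern has nothing left to match the unit against.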

BR

Just one last question @gonhaddock: which one would you recommend I use, Code 1 or Code 2? I’d also appreciate any insights that would help me assess which one is better in the future. Thank you!

Code 1: .(\d{8})\s.+?\d+\s([KG|LT|L|Tambor|PC]+)\s(.+?)\s.+.

Code 2: .+\s(\d{8}).+?([KG|LT|L|Tambor|PC]+)\s(\S+?)\s.+

Hello @trafalgarlaw
There isn’t an absolute answer to your question, as it depends on the background of the user and the data constraints (in this case, I don’t know either of them).

As a general rule, pick the one that works for your use case and meets the following criteria (not necessarily in this order):

  1. It’s more efficient
  2. It’s easier to track
  3. It’s more robust

Sometimes robustness can require less efficiency…
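
One robustness detail that can be checked directly is the bracketed unit part used in both codes. In regex syntax, `[KG|LT|L|Tambor|PC]+` is a character class, so it accepts any run of those characters (including `|`), not only the five unit names, whereas the parenthesized form from the original question is an alternation. The Python sketch below illustrates the difference; `strict_code_2` is a hypothetical variant that does not appear in the thread, and `re.fullmatch` again assumes whole-string matching.

```python
import re

bracket_form = re.compile(r"[KG|LT|L|Tambor|PC]+")   # character class
group_form = re.compile(r"(?:KG|LT|L|Tambor|PC)")    # alternation

units = ["KG", "L", "LT", "Tambor", "PC"]
non_units = ["GK", "LLL", "Tomba", "|"]

# Both forms accept the real unit names.
print(all(bracket_form.fullmatch(u) and group_form.fullmatch(u) for u in units))
# Only the character class also accepts these made-up strings.
print([bool(bracket_form.fullmatch(s)) for s in non_units])
print([bool(group_form.fullmatch(s)) for s in non_units])

# Hypothetical stricter variant of Code 2 (not from the thread).
strict_code_2 = r".+\s(\d{8}).+?(KG|LT|L|Tambor|PC)\s(\S+?)\s.+"
sample_1 = ("0940720046 Graxa lubrificante sintetica - KLUBERSYNTH BH 72-422, "
            "34039900 110 6656 KG 0,6000 6.046,1667 3.627,70 3.981,40 "
            "159,26 353,70 4,00 9,75")
strict_groups = re.fullmatch(strict_code_2, sample_1).groups()
print(strict_groups)
```

On the two samples in this thread both forms give the same result; the alternation only matters if other rows contain letter runs that happen to fit the character class.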

That said, Code 1 simply cannot work on example 1, because the string starts with a 10-digit sequence; the . at the beginning needs a quantifier (+), which makes it greedy. The \s that follows in Code 2 may look redundant (less efficient), but it adds robustness, especially from my side, not having seen the full range of cases. And so on.
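
The point about the leading `.` can be checked with a short Python sketch. It takes Code 1 exactly as written in the question, adds a `+` after the first dot as one reading of the suggestion above (`code_1_plus` is a name made up here), and compares both with Code 2 on example 1. `re.fullmatch` is used on the assumption that the Regex Split node needs the whole string to match.

```python
import re

# Example 1 from the thread, without the "Data: " label (assumption).
example_1 = ("0940720046 Graxa lubrificante sintetica - KLUBERSYNTH BH 72-422, "
             "34039900 110 6656 KG 0,6000 6.046,1667 3.627,70 3.981,40 "
             "159,26 353,70 4,00 9,75")

code_1 = r".(\d{8})\s.+?\d+\s([KG|LT|L|Tambor|PC]+)\s(.+?)\s.+."
# Same as Code 1, with "+" added after the leading dot (illustrative reading).
code_1_plus = r".+(\d{8})\s.+?\d+\s([KG|LT|L|Tambor|PC]+)\s(.+?)\s.+."
code_2 = r".+\s(\d{8}).+?([KG|LT|L|Tambor|PC]+)\s(\S+?)\s.+"


def split(pattern, text):
    """Return the captured groups, or None if the whole string does not match."""
    match = re.fullmatch(pattern, text)
    return match.groups() if match else None


# A single "." plus eight digits covers only nine of the ten leading digits,
# so the "\s" that follows cannot match.
print(split(code_1, example_1))       # None
print(split(code_1_plus, example_1))  # ('34039900', 'KG', '0,6000')
print(split(code_2, example_1))       # ('34039900', 'KG', '0,6000')
```

With the quantifier, the greedy `.+` consumes as much as it can and backtracks to the last eight-digit run followed by whitespace that still lets the rest of the pattern match.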

Just analyzing this one part of the code brings me back to the first paragraph.

BR

Again, thank you @gonhaddock for your insights!! :)
