Extract $ value from string

I’m a pretty new user to KNIME and having a bit of trouble using the REGEX filter. I have no programming background, so I’m slowly trying to understand it.

I’m currently trying to extract $ values from a column with both text & dollar values coming up in random places.

I tried using the regexsplitter and regexreplacer, but to no avail.

Step 1

Example of Text in Column A (DESCRIPTION)

June 16, 2018 Janitorial Consumables = 24 Jumbo Toilet Tissue; 24 Hand Towel Hard Roll; = $3,500.92 + $324.73 tax = $2,822.65

Required Output in Column B
$3,500.92

Step 2 - Sometimes I have $ signs coming up after text, so I would need the values beforehand as well

Example of Text in Column A (DESCRIPTION)

June 16, 2018 Janitorial Consumables = 24 Jumbo Toilet Tissue; 24 Hand Towel Hard Roll; = 3,500.92 $ + 324.73 tax $ = 2,822.65 $

Required Output in Column B
$3,500.92

I tried regexReplace($DESCRIPTION$,".*?" ,"") for Step 1 but it doesn’t seem to work.

Is there a node(s)/line that would cover both?

Thanks!

Questions:

  • If there are multiple dollar value runs in a single column cell (like in both your examples) is it the case that you only want the first one matched? (This is what you appear to be saying since column B only ever has the first one - but just verifying.)
  • In the dollar-sign trailing version, is there guaranteed to always be 1 space? no space? 2 spaces? unknown number of spaces?
1 Like
  1. I would like to get the biggest value if possible, if there’s more than one $ value in the formula description

  2. There’s no guarantee of spaces before the $ trailing version since there’s different people inputting the numbers.

1 Like

Hmm… so you won’t be able to rank order (e.g get the largest of captured groups) with regex. You could couple the regex with some logic in Java Snippet node, but this would require you to know a little bit of how to write Java.

If every column cell is guaranteed to have N dollar amount (where N == 3 in your above examples,) then you could create a regex in the Regex Split node which would provide you with N new columns of dollar amounts, then use some further nodes to tease out the highest value. For example, the regex for the prefix-notation-style with N=3 would look like:

[^\$]*\$(\d[,\d]*\.\d\d)[^\$]*\$(\d[,\d]*\.\d\d)[^\$]*\$(\d[,\d]*\.\d\d)[^\$]*

and results in Regex Split node output like:

4 Likes

Thanks! That works perfect for the $ sign pre-value.

Is there a regex split for if the $ sign trails the value as well?

Unfortunately you appear to have people doing data entry like 777.12 some text for some reason $ like 324.73 tax $ in your example. Because of this, i can see no clear regex that will solve your problem - but maybe someone who is more adept at regexes might see one.

1 Like

Sounds good! Thanks for all your help!

2 Likes

Hi,

Use the expression below in a String Manipulation node:
removeChars(regexReplace($column1$, ".*(((?<=.*\\= ?)(?: *)\\d.*?(?= *\\$ *\\+))|((?<=.*\\= *\\$ *)\\d.*?(?= *\\+))).*", "\\$$1"), " ")

You can remove the bolded part (\\$) to remove the dollar sign from the output. I added this to provide exactly the same output as what you asked for.

:blush:

P.S.

A+B=C => C >= A AND B (we cannot have negative values). So it does not make sense to me to say that you want the greatest value among the values since it’s always C.

3 Likes

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.