Hi and good day KNIME community
I am working on the cleaning of numerous river water data analysis and I am really new to KNIME. I am happy if anyone can help me on how to insert math formula for the data type that has less than or more than symbols.
The symbols indicates that the value is below detection limit or above detection limit. Thus, they have their own formula to determine the exact value for that. I am really clueless on how to use the Math Rule node. I hope anyone could assist me to solve my problem.
@nuraqlima Welcome to the Forum.
What are you trying to do? How were these columns originally produced? If the data to produce these columns is in the table you can use the Math Formula node to produce the raw number and then the Rule Engine node to set these flags. Can you share some data? That would make it easier to help.
I am trying to convert the value that has the < symbols according to its respective formula. Thus, i do not know how to insert the formula in the Math Formula node. The one that i share before is the raw data that was read from the excel reader actually. I am happy if you would enlighten me more on what type of data example that i can share so that it would be easier for you to help
Below is the example of the formula that I want to insert:-
Iām sorry, weāre not communicating. Describe step by step with the screen shot data you posted exactly what you want to do. For example where does the /2 come from in your last post?
Hi @nuraqlima and @rfeigel , ok Iāve read and now re-read this and I think I understand now what is being asked.
Firstly, with the aid of OCR provided by chatGPT, Iāv reproduced here the data from the screenshot. It might not be entirely accurate, but itās good enough for our purposes:
BOD (mg/L)
BOD INDEX
Ca (mg/L)
Cd (mg/L)
Cl (mg/L)
COD (mg/L)
COD INDEX
Cr (mg/L)
CN (mg/L)
E. coli (cfu/100ml)
Fe (mg/L)
5
79
19.8
<0.001
8
24
70
<0.001
<0.050
540,000
0.74
2
92
14.0
<0.001
3
26
78
<0.001
<0.050
14,000
0.26
4
83
12.3
<0.001
5
16
32
<0.001
<0.050
460,000
0.61
1
61
27.8
<0.001
6
27
37
<0.001
<0.050
900,000
0.82
7
73
18.4
<0.001
9
36
61
<0.001
<0.050
230,000
0.69
8
69
18.9
<0.001
21
29
62
<0.001
<0.050
250,000
0.87
1
45
27.8
<0.001
27
28
61
<0.001
<0.050
280,000
0.87
6
57
30.3
<0.001
6
18
54
<0.001
<0.050
46,000
0.55
9
29
32.7
<0.001
14
28
64
<0.001
<0.050
24,000
0.48
3
73
28.7
<0.001
15
32
63
<0.001
<0.050
140,000
1.24
(@nuraqlima, this is what @rfeigel meant by actually sharing some data, so either upload a sample from Excel, or upload in text form so that it can be copied/pasted and people can then experiment with possible solutions without having to do all the prep work which takes time and effort
So, the columns Cd (mg/L) , Cr (mg/L) and CN (mg/L) may contain a numeric value but they may also contain a literal value <n or >n. The problem faced is that for this reason, these columns are represented as strings and not numerics, and so cannot be directly used within Math Formula node, and require conversion first.
My take on this āconversionā is as follows, (except that Iām not entirely sure if I got the >n correct)
For each of those columns, derive a value as below
where the cell contains
use the formula value to determine the value
notes
n
n
<n
n / 2
>n
(n / 2) +1
(is this correct, it wasnāt clear to me from the handwritten note)
So for example
cell value
formula
calculated value
10
10
<10
10 / 2
5
>10
(10 / 2) +1
6
The calculated value could then be used in any subsequent calculations, but I believe that what is being asked, is how to convert the <n and >n values to actual numeric values based on the above.
@nuraqlima , please can you confirm understanding. Iām sure that we can then give you a variety of methods to achieve this.
i guess for operator ā>ā data the formula doesnāt make sense. The final result for a >10 value canāt be bellow 10. I guess the formula should be n+1.
@nuraqlima:
the general question is what kind of calculation is done afterwards. Is it only for binning data i think it is ok. Is it for further math calculation i would be very cautious handle data in such a way.
I think you are right about the formula for >n, (ie that it canāt be below n, and I should have considered that), but I think the error is possibly just in my usage example of the formula value rather than the formula itself (albeit we donāt know what the formula actually is ).
ān=10ā was probably a poor example on my part, used just to demo the formula rather than to demonstrate the correct range of values.
If I had been using sample such as >0.001, then adding 1 would have put the result above ānā, so it is conceivable that within the expected range it is OK.
Your points are well made though, and yes further explanation is needed.
Hi @takbb, yup correct, that is what I meant , which to convert the non-numerical value to numerical value. The formula for "> "is correct which is (n/2) + 1. Also, thankyou for enlighten me on how i should share my data , I appreciate that a lot
Hi @rfeigel, the columns with the </> produced are not from other columns, it is the value itself in the respective column for the respective parameter which in the non-numerical form, that is the data which given by my supervisor.
I am sorry if my question didnāt clear enough to you
I could just use excel actually to convert the non-numerical value to numerical value which is just insert its formula right, but for my final thesis, i have to include knime workflow , therefore, I am asking . Thank you for trying to help me
Hi @morpheus, for the formula , actually I am also not really sure where it is come from as that is given by supervisor, and I havenāt research on the reason behind the formula yet, but I am sure that is certainly only just for simple calculation and not more than that Thankyou for sharing your knowlegde
As you could have an undefined number of columns that need to be calculated in this way, It uses String Manipulation (Multi-Column) to act on all the String columns. As it is not possible to automatically have String Man (mult) act only on String columns, some additional nodes are needed to pass just the String Columns to that node, and then re-build the table afterwards, so the workflow is more involved than it ought to be!
However, the primary work for the calculations is this rather horrible looking script in String Manipulation (Multi Column):
In the above ā?ā means āthenā and ā:ā means āelseā
So what this says is:
return a string such that:
if currentcolumn value is null (i.e. missing)
--then: return the currentcolumn value (i.e. leave it as missing)
--else:
----if the first character (position 0) of current column value is "<"
------then: convert everything from the second character (position 1) onwards to numeric and divide by 2
------else:
--------if the first character (position 0) of current column value is ">"
----------then: convert everything from the second character (position 1) onwards to numeric and divide by 2 and add 1
----------else: return the current column value unchanged
(further info on this syntax - see āconditional expressionsā section of String Manipulationās āundocumentedā features)
After this node, it passes the result to String to Double to convert the values to Double.
btw, this node could have also returned Double by wrapping the whole script in the toDouble() function except that this wonāt work if there is a chance you have missing values, since the toDouble() function breaks if it is passed ānullā. So it is safer to do the string to number conversion separately.