Data Transformation

Hi! How do I use rule engine to append data? Such as how I want to edit a number 999, to a mean number

Can you provide more detail about what you’re trying to do, preferably with some sample data?

2 Likes

Hi @Jack_Lim

Is it true that in some columns you have (high) values ​​that you want to replace with the average of the other values ​​in that column?
Since it seems that the threshold value of outliers is different in each column, and the data type can also differ, I think there is no other option than to tackle this per column.

With the Rule Engine you can make the outliers missing
afbeelding

With the Missing Value node you can impute the missing value.
afbeelding

gr. Hans


Thanks for coming to help me guys

So now I able to replace the high values but it makes all my other values missing, how do I replace the high values without jeopardizing all the other values in the column?

Frist start with the Rule Engine to make the high values missing (and keep the other values as is).

Next use the Missing Value node to impute the missing value(s) by the e.g. mean.

gr. Hans

1 Like

How do I do that? Will I have to do that column by column since I can only change one column in a rule engine?

Multi-Math.knwf (87.1 KB)

Here is a simple example of the Math Formula (Multi Column) node that you can adjust to your needs.

Edit: I added a second cool trick to use the Multi Column Math to put out Missing Values and then the Missing Values to replace them with the Mean calculation. Just in case you wanted to calculate the Mean excluding the numbers above your threshold.

If you make the Math formula result 0/0 then it outputs Missing Values

2 Likes

I think I am getting it, but when I copied your code to mine, highlighting the number I want to amend, it did not work

You will want to use “Current_Column” in those formulas instead of “waistline” so that it is flexible and points to the column that is currently being addressed in the Multi Column formula.

Example:

if($$CURRENT_COLUMN$$>99, COL_MEAN($$CURRENT_COLUMN$$), $$CURRENT_COLUMN$$)

Only use a specific column name in the formula if you want to keep that criteria hard coded as it moves through the calculation for each column.

I also added another approach that uses the “Numeric Outliers” node to select outliers which would be replaced by Missing Values. You would need to play around with the threshold to see if it would work in your instance, but it would allow you to apply a more flexible threshold setting across multiple columns in the event that the value ranges are less uniform.

Multi-Math.knwf (114.0 KB)

Just to respectfully reiterate my objective right now, I want to highlight the circled numbers here and change them to the mean of the column the circled numbers are in

The basics are pretty clear, but everything else is just guesswork on our side. We have no idea if these are sample values from a much larger dataset, or if you truly only have a 6 row table… We have no idea if this is some one off task, or if the workflow will be updated in the future with new values / changing column names / a different number of columns… We have no context as to these numbers in order to recommend the best ways to flag your outlier values. If you want the solution to be more dynamic, then you need to target these values logically instead of a one off approach of “==999” which will only work for a single column in this single short table. There are a ton of ways to do things in KNIME, but the right way is typically dictated by these unknowns.

In general, it is always advisable to upload an exported workflow that has an attempt built into it, or at least just a “Table Creator” node that has your table of values (as well as the answers to the questions above) if you want a more specifically tailored solution. Unfortunately, very few of us have the time to manually enter table information from an image when we attempt to give back to others in the community forum.

If you want to upload the table, then I would be happy to show you how my workflow could be applied to the math columns. The string column will require a separate node / if statement. The String Manipulation (Multi Column) can do this if you would need that to be able to handle a larger number of columns as well.

As a general KNIME’ing tip - I recommend pulling back and asking yourself “why” before you build something that just accomplishes a single task. KNIME is a powerful tool that allows you to build complex logic based manipulations. Step back and ask yourself “How did I know which columns needed to be addressed”, “How did I decide which values should be replaced by the MEAN”, “Would I want the MEAN to be calculated with our without that value”. Then you build your thought logic into the workflow process with an eye for it to be dynamic enough to adapt to potential changes. This is also the key to being able to re-use / share processes across workflows and apply them to different tasks.

2 Likes

Ah I see, thank you for taking your time to help me, I’m currently doing a assignment requiring a large dataset and very new to KNIME so I’m very puzzled with what I’m doing rn

2 Likes

We are happy to help. I personally always struggled with school assignments like this because they tended to just be myopic tasks that lacked the necessary context to asses the outcome or creatively problem solve. If it is a basic level KNIME intro class, then perhaps the column by column approach via the Rule Engine node @HansS mentioned above might be easiest depending on your specific instructions…

1 Like

Thanks, for my assignment I’m using KNIME to perform data cleaning before transferring it to Tableau, I most likely using @HansS method as it’s more easier and fits into my objective although I’m not also it’ll work in the end as I can only append one column with one rule engine :frowning:

1 Like

You would either need to use a Rule Engine for each column, or create a loop to run each of the columns through the Rule Engine. What was the assignment language that was used to identify the outlier values which should be changed to MEAN?