GroupBy - Covariance

Hello everyone,

I have a problem using the GroupBy Node with Aggregation Covariance.

 

I have the following dataset:

Product Lead_time Repair_time
G123 120 20.0
G555 96 15.8
G555 112 20.7
G820 70 2.4
G123 118 18.6

Now I want to determine the covariance between the columns Lead_time and Repair_time for each Product. In this small example Product G123 has the values for Lead_time: 120, 118 and the Repair_time_ 20.0, 18.6 and so on. (In reality there are thousands of values for each Product)

I thought that I could select Product as Group (in GroupBy Node) and Covariance as Aggregation (With the additional setting: In the Colum Repair_time I set the Paramater Lead_time) . That gave me the following error:

Execute failed: Computing 'Covariance' on column 'Repair_time' failed. Error: sample contains 1 observed points, at least 2 are required.

I didn't found another option for the Observation of two points.

 

Maybe there is another possible (or better fitting) solution?!

Thanks and Greetings,

David

hi there
I don't see a reason why this should not work.. if the formats of the colums are correct and with the missing option highlighted it should run.. why not try a sample set and see what happens..

Hi,

I already tried everything and nothing leads to a solution. Maybe I'm not correctly using this GroupBy Node (I just started to use KNIME).

Here some Extracts of my solution:

1. Output from Column Filter

2. Because it interprets the columns Repair_time and Lean_time as String I used Node "String to Number" with Parsing Type "Double". 2.jpg shows the Output from that (which is also the Input for the GroupBy Node)

3. Using following Options for GroupBy Node (3.1.jpg and 3.2 jpg)

 

These options gave me the Error (described in my first post).

 

Greetings

 

good morning
I was able to reproduce your error..
the cause is simple.. you need more than one dataset per product..
if you for example append your table with the cloned datas your group by node with covariance will work..

So you either add another column with a constant in it and group by this one to get the covariance in general for all products, or you add for every product more data and get the covariance for each product...

Thanks for your reply and the hint.

In my file there are over 150.000 rows for about 250 different products. So there is more than enough data for every product to determine the covariance.

I think the challenge is the group by "product" and getting the covariance for each of this "grouped" products. In other words, I need about 250 covariances (for each product one).

Maybe with the GroupBy Node this is not possible, but I didn't found another option.

Can you try grouping by product and aggregate a COUNT? Then you can check whether there are indeed at least two rows per product.

Thank you for that hint. That is working! As Output I got the 250 products with counts for repair_time and lean_time.

Do you have an idea how I can fix the problem with the covariance?

Well… you either filter out the products that have only single counts…
Or you cheat a bit and append the singels to the table so that they appear twice althogh
With the very same repair and lead time…so zero variance…
I’d prefer the first metod… and explain that the single once are out due to no variance…

Thank you.

How can I filter out rows without that information is getting lost? Because after groupby (using count) the information Lead_time and Repair_time is lost (and replaced by the count).

My idea was to use aggregation "List" in groupby. Then I get following output. (1.jpg). Before we can use node Linear Correlation we have to filter out the rows with just one element. For that I would use Rule-based Filter Row but I have problems with the correct expression language in KNIME. Something like ROWCOUNT($Repair_time$)>=1; to filter elements in the list with just 1 element is not working.

 

Hi,

In screenshot 3.2 you have added a covariance calculation for repair_time ~ lean_time and lean_time ~ repair_time. Did you mean to do this? I would guess only one covariance would be enough. Could this be causing the error produced?

 

Simon

I would use Thor’s method with group by count…
Then a simple rowfilter node on count =1…if needed another for the exeptions…
Do the group by with the multples and if you like, you append the single ones to the tabe…
But keep the cout collumn to check the relevance of the covarance…

After a few workarounds (e.g. rule-based filtering) it worked.

Many Thanks for your help!

good! congratulations!
thinking of it , I would have joined the group by count table to the original and then filter the original table on count..

what was the rulelebased expression you took?
cheers adaptagis
 

That was more or less what I did. I filtered all elements with count>1 out. So i got a list with all the products appearing only once in my file. In the rule-based filter I used the expression

$Product$ LIKE "productXYZ" => TRUE (for every product appearing once)

and excluded the TRUE matches.

 

cheers david2010