Normalizing screening data on a plate by plate basis

gbonamy · April 23, 2009, 4:36am

Is there an easy way to Normalize data using values from the tables to do so. For instance lets assume the following structure:
PlateID;WellID;SampleType;Value
A;A01;NegativeControl;1
A;A02;NegativeControl;2
A;A03;PositiveControl;20
A;A04;PositiveControl;25
A;A05;NegativeControl;3
A;A06;NegativeControl;6
B;A01;NegativeControl;3
etc…

And use the valus from positive and negative controls to normalize each plates (from PlateID) by multiplying each values in that plate by a factro (which in my case would be the median negativeControl value/ media positiveControl Value?

Thanks,

aborg · April 23, 2009, 10:38am

Hi,

Just to be sure, you do not want something like the normalised percent inhibition instead of what you described: x*median(x_neg)/median(x_pos). May I ask in why do you prefer this kind of normalisation?
(The latter is not supported by cellHTS2, but the former is.)

Jay · April 23, 2009, 5:36pm

Hi,

If you needed to use a median for “group-based” normalization then you would have to use R, your own node, or another tool to calculate the median.

In a case where you wanted to do (x-mean(x))/sd(x) at the group level you could use the GROUP BY node to create the mean and std. dev. values by group and then join this information back to the original table and calculate the new fields with the java node for example.

I hope this helps. Does anyone have a simpler way with the existing nodes in Knime?

Best regards,

Jay

Jay · April 23, 2009, 5:37pm

Oops maybe I misunderstood but creating variables coupled with group by should get you close to where you want to go.

gbonamy · April 23, 2009, 7:29pm

Thanks for the prompt answers.

Aborg, you are correct in your interpretation of my question, but bear in mind that this is for each plates, and that our original data file will contain several hundred plates. The reason for doing this is because our screening encompasses many hundreds of plates and because interplate variation is often large. This first normalization provides a way to normalize each plates. Another strategy that we use is to normalize based on the plate median value (ie. x*median(plate_x))

Jay, I am not sure what you mean by creating variables…
I looked into creating variable coupled with groups, unfortunatelly median is not an agregation option available. We prefer to use median over mean because the mean is more susceptible to outliers. Assuming that I have grouped by “PlateID” and by “SampleType” to agregate the “Value” by mean, how do I go about dividing all the “Value” from a specific “PlateID”? Does the following approach sounds corect:

make 2 tables where I only keep the Positve and negative type
merge these 2 tables with my original table based on “PlateID” to add the median value for Positive and for negative calculated during the grouping
create a new column based on x*median(x_neg)/median(x_pos)

For the Java sniplets or any other sniplets, how can I iterate over the rows from within the sniplet?
On a tangent question, I have figured how to use the flow variables, but somehow there are no nodes to generate your own flow variables Is there a way to generate environment variables? Does nayone know how to set up a few variables in a project?

Jay · April 23, 2009, 8:05pm

Sorry for the confusing answers on my part. This solution of course will not use the median but the mean instead.

Pivot your table: rows=PlateID,columns=SampleType,aggregate=mean(Value)
Neg Pos
A 3 22.5
B 3 25

then use the java or math node to make a new column = Neg/Pos and filter out the Neg&Pos columns if you want.
RatioNegPos
A 0.13
B 0.12
(Note there was no Plate B Pos Control so I gave it a value of 25 for the example.)

Then join this new table with the aggreagate value (as the top input for the join node) and your original input table as the secon table as the join node by default uses the rowid as the join index for the first input.

RatioNegPos PlateID WellID SampleType Value
0.13 A A01 NegativeControl 1
0.13 A A04 PositiveControl 25
0.12 B A01 NegativeControl 3

Now you can use the java or math node to create Value*RatioNegPos.

I’m describing this from memory as I don’ thave Knime but that is the process I generaly use inside Knime. Hopefully this post is more helpful then my last ones.

Best regards,

Jay

Jay · April 23, 2009, 8:11pm

How big is your dataset? Can it easily fit into memory? If so you could use R for example to easily complete your task. I think that one or more of the Jython nodes may have access to the whole table as well.

gbonamy · April 23, 2009, 8:44pm

Well the dataset would probably be a few million lines so I don’t think we can fit it into memory. So with the Java Sniplet, there is no ways to itterate over the rows? If so, that is a pretty drastic limitation, which probably should be changed. In addition, do you know if you can itterate over columns without knowing there name?
Thanks,

stoeter · April 23, 2009, 9:22pm

Hi gbonamy,

very interesting question, we have similar task to do. It should be possible like this:

without looping:
-create plateID:sample(Neg/Pos)
-filter Neg/Pos
-aggreate medians of plateID:sample with R node (or use group by node for mean)
'this gives you a list of all pos/neg from all plates
-split plateID and sample
-join to initial dataset using PlateID for join (you might have to do this separately for neg/pos!?)
'this gives you the dataset with additional columns for neg/pos per plate for each value
-use math node to calculate npi

with looping:
-create loop over all plateIDs (value count(plateID) -> rowID -> loop start -> inject varianble plateID)
-filter plateID (variable)
-filter Neg/Pos
-aggreate medians of sample with R node (or use group by node for mean)
'this gives you a list of pos/neg for each single plates = 2 values
-join to plateID (in single iteration) using PlateID for join
’this gives you the data for one plate with additional columns for neg/pos per plate for each value
-use math node to calculate npi
-end loop will collect all plate data from all iterations

… last, you get rid of all unneccessary columns…

I havent tried to create a work flow like this, but I hav solved similar problems and it should be possible with eighter ways. Two things might be a bit more tricky, A) R node passing through the plateID, and B) math node doent work with variables or changing column names.

Good luck,

Martin

Jay · April 23, 2009, 9:30pm

Hi,

Other than the fact that the median is not available as an aggregation function did the above solution work for you? That is my normal way of appending “group-level” aggregates for use in calculations on the individual observations and the pivot node works nicely in your case with the single additional level to your data.

I’m not sure how much memory you have but you might see if the data fits with only 4 columns it may work. The memory and pr0cessing efficiency of the routines used thereafter may impact as well.

If I can ask why do you want to iterate over the rows? Knime “streams” one row at a time. LIke I mentioned I beleive on of the Jython nodes may offer that ability at the expense of memory? That’s just going on memory though as I’ve never used those nodes.

I’m not sure if we can references columns in that manner. What are you looking to accomplish? Write one Java snippet transformation such as taking the log of a numeric variable and then iterate over some list of numeric columns? I think you could write a node to do this. Perhaps the meta nodes could be of help as well? I’m not sure what options you have for referencing columns in the java node for example.

Perhaps someone at Knime could comment?

Best regards,

Jay

Jay · April 23, 2009, 9:34pm

Sorry I’m posting my solution again so that it doesn’t get lost in all the posts. If I understood correctly then this should be simpler then iterating for the case of using the mean.

…

Pivot your table: rows=PlateID,columns=SampleType,aggregate=mean(Value)
Neg Pos
A 3 22.5
B 3 25

then use the java or math node to make a new column = Neg/Pos and filter out the Neg&Pos columns if you want.
RatioNegPos
A 0.13
B 0.12
(Note there was no Plate B Pos Control so I gave it a value of 25 for the example.)

Then join this new table with the aggreagate value (as the top input for the join node) and your original input table as the secon table as the join node by default uses the rowid as the join index for the first input.

RatioNegPos PlateID WellID SampleType Value
0.13 A A01 NegativeControl 1
0.13 A A04 PositiveControl 25
0.12 B A01 NegativeControl 3

Now you can use the java or math node to create Value*RatioNegPos.

I’m going to defer to Knime if possible for this post and my last one right above this one.

aborg · April 23, 2009, 11:00pm

I think that if you not afraid a bit getting dig a bit to R you can easily do it with HiTS. The only think you have to do is change one of the normalisation methods (but be careful about the changed semantics later!). I guess it might be a good feature request for HiTS to enable custom normalisation/scoring methods, thanks for the idea. (Per plate normalisation is done, so this would not be a problem. The closest to your needs is currently this option: normalisation just by the negative control’s medians, without variance adjustment.) Hopefully there will not be memory issues. According to your comment about iterating over the columns… Have you checked my comment here? I tried to explain how to do it. Regarding Java Snippets: I think it is not possible to track the previous rows. For that purpose I would also recommend what Jay told you, the Jython node. Adding environment variables is like iterating over the rows of a table with one row and putting that variable to the stack. The problem with this approach is that you may not have another iteration over nodes. Actually… Have you tried the Conditional Box Plot to find the median? In my opinion the solution stoeter described is absolutely doable (without looping) using Conditional Box Plot instead of the magic with the R node. Good luck, gabor

gbonamy · April 24, 2009, 2:57am

Wao, so many answers! Thank you so much.

Jay: Your solution works, the only problem is the agregation which we need to set to median to remain consistent with our current workflow, and because it is statistically better in our case. Perhaps someone could edit the node to add this agregation method?

stoeter: I get your method, and I like it. My only problem is that I don’t know R very well, and I am not sure how to use it to agregate data. In particular, I am not sure how to load the filtered table table in R to agregate.

aborg: The boxplot solution works in conjunction of stoeter. Yes it would be great if HITS provided you with custom normalization methods, or added to the list of normalization methods.
For the iteration, I think I am more interested about itterating over rows. And I must admit that the loop functions that I found still elude me. Perhaps if you have some examples to share to itterate nodes over rows and/or col that would be great.
Finally I don’t realy speak python. I am mainly a java programmer, and I was hopping that the sniplet would provide more controls to iterate several time over a table rows. This would make develloping new modules more convenient!

Thanks again for all the help. If you have some code example to provide for a R snipplet that would be useful (in the future). But I can do without it for now.
Cheers!

niederle · April 24, 2009, 3:05pm

Hi there,

if you want to give it a shot with R (, try the following to you filtered data set (controls only):

your structure:
PlateID;WellID;SampleType;Value
A;A01;NegativeControl;1
A;A02;NegativeControl;2
A;A03;PositiveControl;20
A;A04;PositiveControl;25
A;A05;NegativeControl;3
A;A06;NegativeControl;6
B;A01;NegativeControl;3

the R-snippet:

-------------------------------------------------------------------------------------------

tapply applies a function (parameter 3) to all values (parameter 1)

of every subgroup (groups given by parameter 2)

R <- tapply(R$“Value”, R$“PlateID”, median)

-------------------------------------------------------------------------------------------

Jay · April 24, 2009, 5:23pm

Hi,

#Calculate the medians across both Plates & Sample Types
R <- tapply(R$“Value”,list(R$“PlateID”,R$“SampleType”),FUN=median)
#Returns a cross tab Columns (Neg,Positive), Rows (A,B)

#Divide Neg by Pos
R <- R[,1]/R[,2]
#This compeltes up to step 2 in the inside Knime example I listed with the means
#You can of course use R to accomplish the entire task
#I’m not sure how this return to Knime but if it returns as two columns and one row let me know
#I’m going off memory again, no apps on this machine

I hope this helps.

Jay