Correlation Filter error

Hi,

I am using the Correlation Filter node and got an error: ERROR Correlation Filter Unable to clone input data at port 0: Java heap space.

I've changed the memory policy to write tables to disk, and the knime.ini file has:

-XX:MaxPermSize=4096m
-Xmx4096m

The correlation matrix is 4640*4640.
I use KNIME 2.6.3 on Ubuntu 12.04.
 
Thanks in advance,
 
Jiohua

How many columns do you have?

Anyway, try KNIME 2.7 - the code that passes the model from node A to node B has been changed.

Regards,

  Bernd

Thanks for your suggestion; 2.7 works for me.

Hi,

Context: using KNIME 3.1.1 and processing a microarray dataset with 11288 columns.

Problem: "Correlation Filter" displays "unable to clone input data at port 1 (correlation model):null"

 

Thanks,

Chhitesh

These models are large because of the quadratic complexity (with 11288 columns the correlation matrix alone has about 127 million entries, roughly 1 GB of doubles) - so even if the code is optimized there is still a boundary, and whatever we (or others) do, you will hit a wall eventually.

11k columns is large. I presume you have very few rows, as otherwise the runtime would also explode?

I ran into this problem recently as well. This was a suitable solution for me:

http://www.inside-r.org/packages/cran/propagate/docs/bigcor

-Brian Muchmore

Thanks Wiswedel.

Yes, I have only 150 samples. I have been executing the nodes on the server, and the correlation matrix took around 5-10 minutes (maybe because of the graph it plots). I don't remember the processor, but the machine has 64 GB of memory.

What would you recommend ?

Thanks, Chhitesh

Hello Brian,

I went through bigcor; it looks like it only computes the correlation matrix, is that right?

Beyond that, I want to filter out highly correlated dimensions.

Thanks,

Chhitesh.

This bigcor sounds interesting. They swap everything to disk (but then the filtering might be slow, as you have quadratic runtime on the values to find 'duplicates').

I would do the filtering by looking at chunks of columns (using a loop): say, a thousand columns at a time, then do the correlation computation + filtering on each chunk. You could also do one last sweep on all the columns that remain.
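A rough sketch of that chunk-wise idea in R (an illustration only, not the KNIME loop itself; YOUR.DATA, the chunk size of 1000 and the 0.7 cutoff are placeholders, and caret's findCorrelation() is used here to pick which columns of each chunk to drop):

library(caret)

chunk.size <- 1000   ## placeholder chunk size
cutoff     <- 0.7    ## placeholder correlation cutoff

## Split the column names into chunks of at most chunk.size columns
col.chunks <- split(names(YOUR.DATA),
                    ceiling(seq_along(names(YOUR.DATA)) / chunk.size))

keep <- character(0)
for (cols in col.chunks) {
  ## Correlation matrix for this chunk only (at most chunk.size x chunk.size)
  cm <- cor(YOUR.DATA[, cols, drop = FALSE], use = "pairwise.complete.obs")
  ## findCorrelation() returns the columns to drop for the given cutoff
  drop.cols <- findCorrelation(cm, cutoff = cutoff, names = TRUE)
  keep <- c(keep, setdiff(cols, drop.cols))
}

## One last sweep over everything that survived the per-chunk filtering
cm.final   <- cor(YOUR.DATA[, keep, drop = FALSE], use = "pairwise.complete.obs")
drop.final <- findCorrelation(cm.final, cutoff = cutoff, names = TRUE)
filtered   <- YOUR.DATA[, setdiff(keep, drop.final), drop = FALSE]

With ~11k columns this means computing a dozen or so 1000x1000 correlation matrices per pass instead of one 11k x 11k matrix, which is what keeps the memory footprint down.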

Apparently an interesting data set - 150 samples with 11k variables. I wonder, though, whether linear correlation is the right thing to check for at such dimensions. How accurate can these correlations be? Wouldn't it be easier to perform a dimension reduction à la PCA first or, even better, to analyze whether the granularity has been well defined?

Yes, let's talk about it.

First, what do you mean by granularity? And how do I test it?

Also, the 11K columns could already have been downsampled from 40K, for example, so significant and purposeful dimension reduction may have already taken place.

Also, PCA masks relationships between specific columns. For example, the relationship between PCA component 1 and PCA component 2 is not the same as the relationship between Gene 1 and Gene 2, although there are dimension reduction techniques that do preserve such information (e.g. CUR matrix decomposition).

Yes, agreed, how accurate are the correlations? Who knows? But if the point is correlation filtering as a first step, then it can be a useful dimension reduction technique. Let's say I have a gigantic matrix of 11K by 11K (or much more), but I am only interested in correlations of r = ±0.7. Because the matrix is so large, it would be great to get rid of a few thousand columns before I calculate correlation p-values or whatever else for (hopefully) more accurate numbers, so I could begin by filtering at r = ±0.4 and then work further with the resulting matrix.

Also, of course, Pearson correlation has assumptions and is only appropriate for certain kinds of data distributions, which needs to be taken into account.

Finally, here is some R code for filtering using bigcor and then bringing the reduced matrix back into a "normal" data frame. It is not extensively tested, and not all of the libraries are needed, but I think it works:

library(propagate)
library(caret)
library(corrplot)
library(ffbase)
## Build the huge correlation matrix in chunks (kept on disk as an ff object)
result <- bigcor(YOUR.DATA, fun = "cor", size = 2000, verbose = TRUE)
dff <- as.ffdf(result)
names(dff) <- names(YOUR.DATA)
rownames(dff) <- names(YOUR.DATA)
namelist <- list()
for (i in 1:ncol(dff)) {
  ## Pull column i into memory
  vals <- dff[, i][]
  ## Set the correlation cutoff: keep the column if any correlation satisfies 0.7 < |r| < 1
  if (any((vals > 0.7 & vals < 1) | (vals < -0.7 & vals > -1), na.rm = TRUE)) {
    ## Record the name of a column that satisfies the cutoff
    namelist[i] <- names(dff)[i]
  }
}
## Drop the NULL entries left by columns that did not satisfy the cutoff
namelist <- namelist[!sapply(namelist, is.null)]
## Subset to the selected rows/columns and pull back into a normal R data frame
results <- dff[unlist(namelist), unlist(namelist)]

 

-Brian Muchmore

If correlation filtering is what you've planned to do, then that's what you've got to do. 

Granularity is the observational unit. Please refer to Hadley Wickham's Tidy Data for further details: https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html