MDS (DistMatrix) node Missing Values Output

Hi,

I have been trying out the MDS (DistMatrix) node to convert a Distance Matrix column into a lower dimensional space with 2 or 3 columns (i.e. MDS Col 1, MDS Col 2, MDS Col 3). However, my output consists mainly of missing value cells in these columns; only a small number of rows have values present.

I have tried changing the Epochs and Output Dimensions (1-3) settings without success.

I have also generated different Distance Matrices (one using numerical columns, the other generated from an RDKit fingerprint), but the result is always the same.

The workflow in which I am observing this problem is the 050006 ComplexSAR Analysis from the KNIME Public Server. I am using the dataset straight from the first node, the SDF Reader node.

Any ideas?

Simon.

Is this bug reproducible?

Simon.

Hi Simon,

I guess I see the point. The MDS nodes ignore rows that contain missing values and produce missing values for those rows as well. The two "Solubility..." columns contain a lot of missing values, resulting in a lot of missing MDS values. The MDS values will be computed properly if you keep only the Molecule column, the RDKit Fingerprint column, and the Distance column (with Tanimoto distances).

To compute e.g. Euclidean distances on the numerical features, the "Solubility..." columns need to be filtered out beforehand as well, so that proper distances can be computed (and later used by the MDS node).
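Outside of KNIME, the same idea can be sketched in a few lines of plain Python. This is only an illustration of the filtering step, not what the KNIME nodes do internally; the table, the column names, and the helper functions `complete_columns` and `euclidean_distance_matrix` are made up for the example.

```python
import math

# Hypothetical toy table: None marks a missing cell, like the "Solubility..." columns.
rows = [
    {"MolWeight": 180.2, "LogP": 1.2, "Solubility": None},
    {"MolWeight": 151.2, "LogP": 0.5, "Solubility": -1.0},
    {"MolWeight": 206.3, "LogP": 3.5, "Solubility": None},
]

def complete_columns(rows, columns):
    """Return only the columns that have a value in every row."""
    return [c for c in columns if all(r[c] is not None for r in rows)]

def euclidean_distance_matrix(rows, columns):
    """Full symmetric matrix of Euclidean distances over the given columns."""
    n = len(rows)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i):
            d = math.sqrt(sum((rows[i][c] - rows[j][c]) ** 2 for c in columns))
            dist[i][j] = dist[j][i] = d
    return dist

# Drop the columns with missing cells first, then compute distances on the rest.
usable = complete_columns(rows, ["MolWeight", "LogP", "Solubility"])
distances = euclidean_distance_matrix(rows, usable)
```

Here the "Solubility" column is dropped before any distance is computed, so every row gets a complete set of distances.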

Attached you will find a small example workflow with MDS values computed on the data set of the SDF Reader node of the "050006_ComplexSARAnalysis" workflow. I hope this helps.

Kilian

Hi Kilian,

Thanks for looking into this, and for spending the time to build a workflow example. From your example, I can see that the missing values in the solubility columns make all the difference!

Maybe I misunderstand how the MDS (DistMatrix) node works, but I assumed that the node uses ONLY the Distance Matrix column to generate the lower dimensional data columns, and that missing values in the solubility columns should therefore make no difference. Can you clarify whether this is the case or not?

I know that the MDS node (in Mining section) uses the numerical columns to generate the lower dimensional data, but assumed the MDS (DistMatrix) version was different in that the numerical columns were not used, and only the Distance Matrix column is used.

If the MDS node is using other data besides the Distance Matrix column, can a tick box be added to the node to select whether you want to use the Distance Matrix column only?

Thanks,

Simon.

You are right, Simon: the "MDS (DistMatrix)" node should not be affected by missing values that are not in the distance column. The node should only work on the distance values and ignore the rest of the columns. In other words, this is incorrect behavior.
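To illustrate why the other columns should not matter, here is a minimal sketch of an MDS that takes nothing but a distance matrix as input. It is a plain gradient descent on the squared stress, written for illustration only, and it is not the algorithm implemented in the KNIME node. The function names and the parameters `epochs`, `lr`, `init` and `seed` are assumptions made for this example.

```python
import math
import random

def stress(points, dist):
    """Sum of squared differences between embedded and target distances."""
    total = 0.0
    for i in range(len(points)):
        for j in range(i):
            total += (math.dist(points[i], points[j]) - dist[i][j]) ** 2
    return total

def mds_from_distances(dist, dims=2, epochs=200, lr=0.05, init=None, seed=0):
    """Embed n items in `dims` dimensions using only the n x n distance matrix."""
    n = len(dist)
    if init is None:
        rng = random.Random(seed)  # seeded random start layout
        points = [[rng.uniform(-1.0, 1.0) for _ in range(dims)] for _ in range(n)]
    else:
        points = [list(p) for p in init]  # copy so the caller's layout is untouched
    for _ in range(epochs):
        grads = [[0.0] * dims for _ in range(n)]
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                d = math.dist(points[i], points[j])
                if d == 0.0:
                    continue  # coincident points give no direction to move in
                scale = 2.0 * (d - dist[i][j]) / d
                for k in range(dims):
                    grads[i][k] += scale * (points[i][k] - points[j][k])
        # Update all points at once after the gradients are complete.
        for i in range(n):
            for k in range(dims):
                points[i][k] -= lr * grads[i][k]
    return points

# Distances between the four corners of a unit square; no other columns exist.
s = math.sqrt(2.0)
square = [
    [0.0, 1.0, s, 1.0],
    [1.0, 0.0, 1.0, s],
    [s, 1.0, 0.0, 1.0],
    [1.0, s, 1.0, 0.0],
]
coords = mds_from_distances(square, dims=2)
print(stress(coords, square))
```

The only input is the distance matrix itself, so a missing value in any other column of the table has no way of influencing the result.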

Kilian

Hi Simon, hi Kilian! The problem will be fixed in the new KNIME version 2.7.0, coming out in December 2012.