# Components of PCA dimensions?

Hello everybody,

I hope this is the right forum for this question. I am learning how to use Principal Component Analysis (PCA) in KNIME, but so far I cannot find the composition of the PCA dimensions. I want to use KNIME for developing QSAR models, so my original dimensions are molecular descriptors correlated to some biological activity.

If I properly understand this, PCA determines a set of “natural dimensions” for the problem by linearly mixing the original dimensions given in the problem. Some reduction in the dimensionality may be achieved by this. So I am taking the “PCA dimensions” reported by the PCA module as the new, reduced dimensionality, provided by the PCA. How can I find out the composition of these new dimensions in terms of the original dimensions (the molecular descriptors)? I want to know so I can propose a physical interpretation for the variation in activity in terms of molecular structure.

Cheers

Victor

Hello Victor -

If I’m understanding what you’re asking, I think you may want to use the “PCA Compute” node instead of just the PCA node. The PCA Compute node has 3 outputs, one of which is the spectral decomposition of the covariance matrix of the input data, which I believe is what you’re looking for.

Dave

Hi Victor,

let me first say that Dave (thanks for the comment) is absolutely right: the eigenvectors (columns of the spectral decomposition matrix) are the directions your original data is projected onto, so their entries tell you how strongly each original feature contributes to each direction of maximal variance.
These vectors describe the directions of maximal variance in your feature space: the first vector points along the direction of maximum variance, the second along the direction of maximum variance orthogonal to the first, and so on.
The eigenvalues describe the extent of the variance along each direction, so if you want to find out which feature combinations contain the most information, considering all features, this is the place to look.
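
To make this concrete, here is a minimal sketch in Python with NumPy. It is not KNIME's implementation, and the descriptor names and numbers are made up for illustration. It computes the spectral decomposition of the covariance matrix and prints, for each principal component, the weight of every original descriptor.

```python
import numpy as np

def pca_loadings(X):
    """Eigenvalues (descending) and eigenvectors (as columns) of the covariance matrix of X."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                  # centre each descriptor
    cov = np.cov(Xc, rowvar=False)           # features x features covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]        # reorder so the largest variance comes first
    return eigvals[order], eigvecs[:, order]

# Hypothetical toy data: 6 molecules x 3 descriptors (names are illustrative only)
descriptors = ["desc_A", "desc_B", "desc_C"]
X = np.array([
    [2.1, 3.0, 0.5],
    [1.8, 2.7, 0.9],
    [3.2, 4.1, 0.2],
    [2.9, 3.8, 0.4],
    [1.2, 2.0, 1.1],
    [2.5, 3.3, 0.6],
])

eigvals, eigvecs = pca_loadings(X)
explained = eigvals / eigvals.sum()          # fraction of total variance per component

for k in range(len(eigvals)):
    # Column k holds the weights of the original descriptors in component k.
    # The overall sign of an eigenvector is arbitrary.
    weights = ", ".join(f"{name}: {w:+.3f}" for name, w in zip(descriptors, eigvecs[:, k]))
    print(f"PC{k + 1} ({explained[k]:.1%} of variance): {weights}")
```

Large absolute weights mark the descriptors that dominate a component. Descriptors measured on very different scales are usually standardised first, otherwise the largest-scale descriptor tends to dominate the first component.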

Besides this, I fear that this is not what you are looking for. If I understand you correctly, you try to find out which feature combinations contain the most information with respect to the activation values.
Let's say that you have two classes of molecules, active and inactive, and you try to find out which features are the most useful to distinguish between active and inactive ones. Then LDA (Linear Discriminant Analysis) is what you are looking for.
LDA tries to project your data onto one dimension (sufficient for a two-class problem) such that the separation between the classes, relative to the spread within each class, is maximized on that axis.
So in your case you would actually need an LDA projection for your two classes. In that case the projection matrix (here only a vector) would tell you which features are most important to distinguish the active and inactive molecules.
http://en.wikipedia.org/wiki/Linear_discriminant_analysis gives some more information on LDA and as far as I know, the paper by Fisher (see links on that page) is the original source for this kind of analysis.
We are planning to implement an LDA node in KNIME, but until this is finished you could help yourself with the R node (there is an LDA implementation in the MASS package).
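
In case it helps, the two-class projection can be sketched in a few lines of Python with NumPy. This is only an illustration of Fisher's discriminant, assuming two classes labelled 0 (inactive) and 1 (active); the function name and the toy data are hypothetical, and this is not the MASS implementation.

```python
import numpy as np

def fisher_lda_direction(X, y):
    """Unit vector w maximising between-class separation relative to within-class scatter."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    X0, X1 = X[y == 0], X[y == 1]
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    # Within-class scatter matrix: sum of the scatter of each class around its own mean
    Sw = (X0 - m0).T @ (X0 - m0) + (X1 - m1).T @ (X1 - m1)
    # Fisher's solution: w is proportional to Sw^{-1} (m1 - m0)
    w = np.linalg.solve(Sw, m1 - m0)
    return w / np.linalg.norm(w)

# Hypothetical toy data: feature 0 separates the classes, feature 1 is mostly noise
X = np.array([
    [1.0, 5.0], [1.2, 3.0], [0.8, 4.0], [1.1, 6.0],   # inactive
    [3.0, 4.5], [3.2, 5.5], [2.9, 3.5], [3.1, 4.0],   # active
])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

w = fisher_lda_direction(X, y)
projection = X @ w                 # one-dimensional coordinates on the discriminant axis

print("discriminant weights:", w)
print("projected inactive:", projection[y == 0])
print("projected active:  ", projection[y == 1])
```

The entries of `w` play the role of the projection vector: a feature with a large absolute weight contributes strongly to separating the two classes. The weights are only comparable across features if the features are on similar scales.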

I hope this helps; if not, don’t hesitate to ask in more detail.

Uwe

Hello,

Thanks, Dave and Uwe, for very informative posts. I’ll look into the spectral decomposition from the PCA Compute node.

Regarding “which feature combinations contain the most information with respect to the activation values”: that is a very interesting idea. I’ll have to read up on LDA. Thanks for the pointers!

This is one of the most welcoming and informative mailing lists I’ve been to.

Best regards

Victor

I see that the post is a bit old. Did you add LDA to KNIME (which is a really nice framework)?

Perhaps this article will help; it seems LDA can be performed with Weka: http://www.soe.ucsc.edu/classes/cmps242/Winter09/slides/A3d-LDA.pdf

Hi, I imported the Weka plugin into KNIME, so can I perform LDA? Is there a node to do it?
Thank you so much!

To perform LDA, I tried ClassificationViaRegression configured with LinearRegression. I have a binary target class SOGLIA_5 (for an early warning problem). The node configuration and the node view (the output) are in the attachment. Does this mean (for example) that feature "e" contains less information than feature der1_TP?

My attachment:

http://dl.dropbox.com/u/7281919/Output_Node_Regression.zip

I cannot attach files using this forum's form. This is the error output:
warning: Parameter 2 to block_class_form_alter() expected to be a reference, value given in /srv/www/htdocs/knime_tech/includes/common.inc on line 2892.

One last question: when using the PCA node, if the number of output PCA dimensions is greater than the initial number of features, is it not useful to use PCA? Thanks in advance!

However, I have to identify which features are really necessary to classify with the highest accuracy. I use the AttributeSelectedClassifier from the Weka plugin, but how should I configure it? Could you please give me some suggestions?

The AttributeSelectedClassifier needs to have a base learner assigned, which is used to train and evaluate a model on a subset of features. The dataset needs to contain numeric columns as well as one nominal column (the class attribute). KNIME comes with its own PCA implementation, which can be found under Mining/PCA, along with many more mining algorithms. If you need more statistics, you may also want to check out KNIME's R integration. I hope this helps you get started.

Can someone add an example of how to combine the nodes in a KNIME workflow?

I would suggest looking at the KNIME Public Example Server and downloading some of the examples in the Data Mining category. I hope this helps you get started with KNIME.

Hi Victor,

I too am using PCA and other chemometrics methods in KNIME, but I have resorted to using the R node and writing snippets of R code for more control. "pcaMethods" from Bioconductor is a useful R package, as it handles missing data and enables cross-validation, which in my humble opinion is highly advisable for assessing the stability and validity of your PCA model. Using R you can extract the scores, loadings and residuals and do what you like with them. Incidentally, an alternative to PCA followed by LDA is to do everything in one go by using PLS and setting the Y variable to 1 or 0 according to class membership. This then becomes PLS discriminant analysis (PLS-DA).
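
To illustrate the PLS-DA idea, here is a rough Python/NumPy sketch of only the first PLS component with a 0/1 response. It is not pcaMethods or any KNIME node, the function name and toy data are hypothetical, and a real analysis would extract further components and cross-validate.

```python
import numpy as np

def plsda_first_component(X, y):
    """First PLS weight vector and scores for a 0/1 class vector y (single-response PLS)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    Xc = X - X.mean(axis=0)          # centre the descriptors
    yc = y - y.mean()                # centre the 0/1 class coding
    w = Xc.T @ yc                    # direction of maximal covariance between X and y
    w = w / np.linalg.norm(w)        # normalise to unit length
    t = Xc @ w                       # scores of each sample on the first component
    return w, t

# Hypothetical toy data: 8 samples x 2 descriptors, class coded as 0 / 1
X = np.array([
    [1.0, 5.0], [1.2, 3.0], [0.8, 4.0], [1.1, 6.0],
    [3.0, 4.5], [3.2, 5.5], [2.9, 3.5], [3.1, 4.0],
])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

w, t = plsda_first_component(X, y)
print("PLS weights per descriptor:", w)
print("mean score, class 0:", t[y == 0].mean())
print("mean score, class 1:", t[y == 1].mean())
```

Unlike PCA, the weight vector here is chosen with the class labels in view, so it points along the descriptor combination that covaries most with class membership.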

Cheers,

Mark

Hi, I'm currently learning about PCA and I'm trying to use KNIME's PCA nodes. My input data feeds into the PCA Compute and PCA Apply nodes. The model (green port) from PCA Compute feeds into PCA Apply and PCA Inversion. Finally, PCA Apply feeds into PCA Inversion.

My questions are:

How do I interpret the data?

How do I know which PCA dimension is associated with its respective predictor?

How do I plot the PCA?

I got this reply from the KNIME team:

PCA is not what you need. With PCA you reduce the dimensionality of the problem but you also lose the interpretability. You can only move back and forth in the two spaces (original <-> PCA) but the PCs are not associated one-to-one with the original coordinates. Maybe you want to use the LDA (from R) or the feature elimination (from KNIME) node to see which features carry the most information.
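
The "back and forth" between the two spaces can be sketched in Python with NumPy as follows. This is an illustration of the underlying linear algebra only, assuming a covariance-based PCA, and not the code of the PCA Apply or PCA Inversion nodes; the toy data are made up.

```python
import numpy as np

# Hypothetical toy data: 6 samples x 3 original features
X = np.array([
    [2.1, 3.0, 0.5],
    [1.8, 2.7, 0.9],
    [3.2, 4.1, 0.2],
    [2.9, 3.8, 0.4],
    [1.2, 2.0, 1.1],
    [2.5, 3.3, 0.6],
])

mean = X.mean(axis=0)
Xc = X - mean
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
V = eigvecs[:, np.argsort(eigvals)[::-1]]    # columns sorted by decreasing variance

k = 2                                        # number of PCA dimensions kept
scores = Xc @ V[:, :k]                       # original space -> PCA space
X_back = scores @ V[:, :k].T + mean          # PCA space -> original space (approximate)
X_full = (Xc @ V) @ V.T + mean               # keeping all dimensions inverts exactly

print("max reconstruction error with k=2:", np.abs(X - X_back).max())
print("max reconstruction error with all dimensions:", np.abs(X - X_full).max())
print("weights of the original features in PC1:", V[:, 0])
```

Each PCA dimension is a weighted mix of all original features (the columns of `V`), which is why there is no one-to-one association between a PCA dimension and a single predictor.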