I'm currently working on a way to extract topics, and their influence over a given time period, from a set of Microsoft PowerPoint files. Right now I have a preprocessed bag of words (BOW), the TF*IDF value for each term, and the output of the Document Vector node.
If I'm not mistaken, the next step would be to apply a (clustering) technique such as non-negative matrix factorization (NMF), LDA or LSI to detect possible topics.
Could someone recommend a way to do this? Is there already a node I could use (I found none), or should I implement my own node with a framework like JML?
I would appreciate any help.
Currently there is no dedicated node for LDA or LSI. To reduce dimensionality or carve out linear correlations between the features (terms) of documents, you can use the PCA nodes, or alternatively apply an SVD to the document-term matrix in R via the nodes provided by the R integration plugin.
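To give a feel for what an SVD extracts from a document-term matrix, here is a minimal sketch in plain Python (not R, and not what any KNIME node does internally). It approximates only the largest singular value and its vectors by power iteration; a full LSI would keep several components from a proper SVD routine. The function name `top_singular_triplet`, the term list and the toy matrix are all made up for illustration.

```python
import math

def top_singular_triplet(matrix, iterations=200):
    """Approximate the largest singular value of `matrix` (rows = documents,
    columns = terms) and its left/right singular vectors by power iteration."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    # Uniform positive start vector over the terms.
    v = [1.0 / math.sqrt(n_cols)] * n_cols
    for _ in range(iterations):
        # u = A v (one value per document)
        u = [sum(row[j] * v[j] for j in range(n_cols)) for row in matrix]
        # w = A^T u (one value per term)
        w = [sum(matrix[i][j] * u[i] for i in range(n_rows)) for j in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    # Recover the singular value and the matching document vector.
    u = [sum(row[j] * v[j] for j in range(n_cols)) for row in matrix]
    sigma = math.sqrt(sum(x * x for x in u))
    if sigma > 0.0:
        u = [x / sigma for x in u]
    return sigma, u, v

# Hypothetical toy data: 3 documents x 4 terms (e.g. TF*IDF weights).
terms = ["cluster", "topic", "sales", "budget"]
doc_term = [
    [3.0, 2.0, 0.0, 0.0],
    [2.0, 3.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
]

sigma, doc_weights, term_weights = top_singular_triplet(doc_term)

# Terms with the largest weights characterise the dominant latent direction.
ranked_terms = sorted(zip(terms, term_weights), key=lambda tw: -abs(tw[1]))
print(sigma, ranked_terms)
```

In this toy matrix the dominant direction loads on the first two terms, which is the kind of term correlation the SVD is meant to surface.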
For clustering there are some nodes provided by KNIME. If you are dealing with high-dimensional data (which I assume is the case with texts), I recommend the Distance Matrix plugin. Compute the pairwise distances once and then use the K-Medoids or Hierarchical Clustering nodes provided by this plugin. After the distance computation you can filter out the feature (term) columns with the Column Filter node, since the K-Medoids and Hierarchical Clustering nodes of the Distance Matrix plugin only require the distance column. Dropping columns that are not used later on improves performance, which helps a lot when you have many columns (terms).
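The idea of computing the distances once and then clustering on the distances alone can be sketched as follows. This is plain Python for illustration, not the implementation behind the KNIME nodes; cosine distance, the naive "first k rows" initialisation, the function names and the toy vectors are all assumptions.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity; all-zero vectors are treated as maximally distant."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)

def distance_matrix(vectors):
    """Compute all pairwise distances once; the term columns are not needed afterwards."""
    n = len(vectors)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dist[i][j] = dist[j][i] = cosine_distance(vectors[i], vectors[j])
    return dist

def assign(dist, medoids):
    """Label each point with the index of its nearest medoid."""
    return [min(range(len(medoids)), key=lambda c: dist[i][medoids[c]])
            for i in range(len(dist))]

def k_medoids(dist, k, max_iter=100):
    """Simple alternating k-medoids that works on the distance matrix only."""
    medoids = list(range(k))  # naive deterministic initialisation (assumption)
    for _ in range(max_iter):
        labels = assign(dist, medoids)
        new_medoids = []
        for c in range(k):
            members = [i for i, lab in enumerate(labels) if lab == c]
            if not members:
                new_medoids.append(medoids[c])  # keep the medoid of an empty cluster
                continue
            # The new medoid is the member with the smallest total distance to the others.
            new_medoids.append(min(members, key=lambda m: sum(dist[m][o] for o in members)))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, assign(dist, medoids)

# Hypothetical document vectors (rows = documents, columns = terms).
doc_vectors = [
    [3.0, 2.0, 0.0, 0.0],
    [2.0, 3.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 2.0],
    [0.0, 0.0, 2.0, 1.0],
]

dist = distance_matrix(doc_vectors)   # computed once
medoids, labels = k_medoids(dist, k=2)  # uses only the distances
print(medoids, labels)
```

Note that `k_medoids` never touches `doc_vectors`, which is why the term columns can be dropped once the distances exist.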
Thank you very much for your quick reply and your suggestions.
I'll have a look at your solution and hope that I will be able to fix my problem!
And by the way, thanks for the great work you have done so far! :)