Switching between tag clouds and other problems in the Topic Extraction with the Elbow Method workflow

Dear KNIME experts and users,

I have used the Topic Extraction with the Elbow Method workflow several times with different datasets and find it really amazing. However, I have come across several problems that I have so far failed to resolve.

  1. I can’t find a way to switch between tag clouds in the last Tag Clouds component. For instance, I have 10 clusters but can open only the last tag cloud in the interactive view. How can I switch to the other clouds in the interactive view?

  2. In my recent investigation, I noticed that the number of clusters equals the number of separate PDF documents uploaded. Is this a coincidence?

  3. The PCA node freezes at about 46% if the number of documents exceeds 10. I left it running for the whole night, but it didn’t progress any further. Is it possible to work with a greater number of documents?

Hello @kate_ice

I have taken a look at the workflow you referred to, but I guess you are using your own data for it. I will try to answer your questions.

  1. There is no way to open the interactive view for all the tag clouds, unless you have KNIME Server and iteratively run the whole component for each topic. However, the output of the component provides tag cloud images for all the topics.
  2. This looks strange, but it could indeed be a coincidence. Perhaps your documents are very different from each other, so they are put into different clusters. Otherwise, it is hard to tell without looking at the data.
  3. In general, PCA is not a very heavy algorithm, but everything depends on the data set you are trying to process. I do not know how big your documents are, but the problem could be that you are getting a very large number of dimensions for the word vectors (the number of columns in the output of the Document Vector node). This matrix is also very sparse, which could be the reason. Another thing you could do is increase the amount of memory available to the KNIME JVM.
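To make point 3 more concrete, here is a rough sketch in plain Python (not KNIME code) of why bag-of-words vectors get wide and sparse. The toy documents, the helper names, and the 50,000-term vocabulary are all made up for illustration. The memory estimate also assumes that PCA builds a dense covariance matrix of 8-byte values with one row and one column per term, which may not match exactly what the PCA node does internally.

```python
def document_term_matrix(docs):
    """Return (vocabulary, rows); rows[i][j] is 1 if term j occurs in document i."""
    vocabulary = sorted({term for doc in docs for term in doc.lower().split()})
    index = {term: j for j, term in enumerate(vocabulary)}
    rows = []
    for doc in docs:
        row = [0] * len(vocabulary)
        for term in doc.lower().split():
            row[index[term]] = 1
        rows.append(row)
    return vocabulary, rows


def sparsity(rows):
    """Fraction of matrix cells that are zero."""
    cells = sum(len(row) for row in rows)
    zeros = sum(row.count(0) for row in rows)
    return zeros / cells


def covariance_bytes(n_columns, bytes_per_value=8):
    """Memory for a dense n_columns x n_columns matrix (assumed layout)."""
    return n_columns * n_columns * bytes_per_value


# Hypothetical toy corpus: every new distinct word adds one more column.
docs = [
    "topic extraction with the elbow method",
    "principal component analysis of document vectors",
    "tag clouds for every topic",
]
vocabulary, rows = document_term_matrix(docs)
print("columns:", len(vocabulary))
print("share of zero cells:", round(sparsity(rows), 2))

# Hypothetical vocabulary size for a set of long PDF documents.
print("dense covariance matrix:", covariance_bytes(50_000) / 1e9, "GB")
```

Under these assumptions the matrix size grows with the square of the number of columns, so removing rare terms before the Document Vector node, or giving the JVM more memory, should both help.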

You can also try changing the settings of the PCA node: decrease the information fraction, or set a fixed number of output dimensions.
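As a rough illustration of what the information fraction controls, the sketch below picks the smallest number of components whose share of the total variance reaches a given fraction. It is plain Python, not the node's implementation, and the function name and eigenvalues are invented for the example.

```python
def components_for_fraction(eigenvalues, fraction):
    """Smallest number of leading components whose variance share reaches `fraction`."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    running = 0.0
    for count, value in enumerate(ordered, start=1):
        running += value
        if running / total >= fraction:
            return count
    return len(ordered)


# Hypothetical eigenvalues (variance captured by each principal component).
eigenvalues = [5.0, 2.0, 1.0, 1.0, 0.5, 0.5]

for fraction in (0.5, 0.85, 0.99):
    print(fraction, "->", components_for_fraction(eigenvalues, fraction), "components")
```

A lower fraction keeps fewer output dimensions, and a fixed number of output dimensions skips this selection step altogether.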

One more note: instead of using Document Vector, you could use the Redfield NLP nodes, the Spacy Vectorizer in particular, to create these vectors. The benefit is that the dimensions of the vectors are fixed and the vectors are dense. This may improve PCA performance as well.


Thank you very much for your reply.

I found a way to copy the tag cloud diagrams. I will go on experimenting, following your clues, because I really see the potential of this workflow for my research in general, not only for the current dataset. I will come back to this thread later and let you know whether it worked with a greater number of documents.

