How to know if all data has been analyzed? Uncategorized data cluster & Frequency

Hello,
I’m new here. I have to analyze 10,000 queries (my starting data set). I’ve created the CSV Reader and the various clusters using the regex extractor.

  1. How do I know if I’ve considered all the data? Is there a way to see which data aren’t in any of my clusters, for example by creating an “uncategorized” cluster?

  2. How can I create clusters by frequency? Given a data set, I’d like to have clusters based on the most frequent words in my list: e.g. the word “dog” appears in 20% of queries, “cat” in 40%, etc.

PS: Can you suggest a method that doesn’t require Python, R, or another language?

Thanks a lot

Hi @Tilux,

Welcome to the KNIME community forum!

The best place for getting example workflows is hub.knime.com. You can search for keywords there and find multiple example workflows that you can download and build on.

Regarding your specific questions

1. How do I know if I’ve considered all the data?

I didn’t quite get what you did with your workflow. Which regex extractor did you use? Are you able to share the workflow so that I can make suggestions?

If you are asking about the number of rows, there are a number of ways to get that information. One of them is via the node monitor, as shown in the screenshot below. You can also hover over the output port (little black triangle) to see the size of the data.

[Screenshot: node monitor]

2. How can I create clusters by frequency? Given a data set, I’d like to have clusters based on the most frequent words in my list: e.g. the word “dog” appears in 20% of queries, “cat” in 40%, etc.

Again, could you please explain this a bit? Maybe with some example data and what you want to get as an output.
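To make the question concrete, here is one possible reading of it as a minimal sketch: for each word, compute the share of queries that contain it. Python is used only to spell out the logic (the original post asks for a route without code, and the same steps would have to be rebuilt with nodes in a workflow). The function name `word_query_share` and the sample queries are made up for illustration.

```python
from collections import Counter


def word_query_share(queries):
    """Return {word: fraction of queries that contain the word}."""
    counts = Counter()
    for query in queries:
        # set() so a word repeated inside one query is counted only once
        counts.update(set(query.lower().split()))
    total = len(queries)
    if total == 0:
        return {}
    return {word: n / total for word, n in counts.items()}


# Made-up sample data, not taken from the real keyword list
sample_queries = [
    "best dog food",
    "cat toys for indoor cats",
    "dog and cat beds",
    "cheap cat litter",
    "bird cage",
]

shares = word_query_share(sample_queries)

# Print words from most to least frequent
for word, share in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{word}: {share:.0%}")
```

Note that this sketch splits on whitespace only, so “cat” and “cats” count as different words; whether they should be merged depends on what the clusters are meant to capture.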

Best,
Temesgen

Hello @temesgen-dadi,
Apologies for my late reply; due to several issues I couldn’t respond earlier. Now I have access to the analytics platform and the plugins.

Unfortunately, I can’t share the workflow. However, you gave me some interesting information. I’ll try to explain myself better:

  • I started with an Excel file containing only one column: the header is named “keywords”, and below it is the list of long-tail keywords. Each row can contain from 2 to 12 words. This data set is an extract of the queries users typed into search engines. So, my first step was to create the CSV Reader and load the file.
  • Second step: create the clusters using the regex extractor nodes. Knowing the topic, I know the likely main clusters; however, I could be missing some interesting clusters that I’m not aware of.
  • This last point brings me to the main question: what if, out of the 10,000, I skipped some keywords because they don’t match any of the clusters I defined at the start? How can I make sure that I’ve clustered all 10,000 long-tail keywords contained in the Excel file?
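The check described in the last bullet can be sketched as follows: test every keyword against all cluster patterns and collect whatever matches none of them into an “uncategorized” bucket. This is only an illustration of the logic in Python, not a KNIME workflow; the patterns, the function name `assign_clusters`, and the sample keywords are all hypothetical.

```python
import re

# Hypothetical cluster definitions: cluster name -> regex
CLUSTER_PATTERNS = {
    "dog": re.compile(r"\bdogs?\b", re.IGNORECASE),
    "cat": re.compile(r"\bcats?\b", re.IGNORECASE),
}


def assign_clusters(keywords, patterns):
    """Split keywords into per-cluster lists plus an uncategorized list."""
    assigned = {name: [] for name in patterns}
    uncategorized = []
    for keyword in keywords:
        hits = [name for name, pattern in patterns.items()
                if pattern.search(keyword)]
        for name in hits:
            # A keyword may belong to more than one cluster
            assigned[name].append(keyword)
        if not hits:
            uncategorized.append(keyword)
    return assigned, uncategorized


# Made-up sample data, not taken from the real keyword list
keywords = [
    "best dog food",
    "cat toys for indoor cats",
    "dog and cat beds",
    "bird cage",
]

assigned, uncategorized = assign_clusters(keywords, CLUSTER_PATTERNS)
covered = len(keywords) - len(uncategorized)
print(f"covered {covered} of {len(keywords)} keywords")
print("uncategorized:", uncategorized)
```

If the uncategorized list is empty, every keyword landed in at least one cluster. Otherwise, its contents are the place to look for clusters that were not anticipated. Coverage is counted from the uncategorized list rather than by summing cluster sizes, because a keyword that matches several patterns would be counted more than once.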

I hope this is clearer, and thank you so much for your help.

Best,
Tilux
