I’m doing a long-tail keyword analysis, and I have a list of 10,000 queries.
Before building the bag of words and calculating the frequencies, I need to treat three words as a single term, e.g.:
go for joe = one keyword/term
joe = one keyword if it is not preceded by “go for”
I also have to consider that “go for joe” is written in different ways:
go 4 joe
go for joe
- How can I extract my three-word keyword as one term?
- How can I distinguish the cases in which there is only “joe”?
Can someone help me?
Thanks in advance
I built a little example workflow for you, which you can find on the KNIME Hub: Extracting a keyword consisting of multiple words and written in different ways – KNIME Hub
The idea is to create a table that lists all the different spellings and defines one representative for all of them. Next, you tag your documents based on this table, so that your three words are converted into one term (Dictionary Tagger with “set named entities unmodifiable” unchecked). Afterwards, you replace every spelling with the representative (Dictionary Replacer).
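Outside of KNIME, the same idea can be sketched in a few lines of Python. This is only an illustration of the dictionary approach, not what the Dictionary Tagger or Dictionary Replacer nodes do internally. The variant table, the representative `go_for_joe`, and the function names are assumptions made up for the example.

```python
import re
from collections import Counter

# Hypothetical variant table: every spelling maps to one representative term.
VARIANTS = {
    "go for joe": "go_for_joe",
    "go 4 joe": "go_for_joe",
}

# Try longer variants first so a longer phrase wins over a shorter one.
_pattern = re.compile(
    r"\b(?:"
    + "|".join(re.escape(v) for v in sorted(VARIANTS, key=len, reverse=True))
    + r")\b",
    flags=re.IGNORECASE,
)

def normalize(query: str) -> str:
    """Replace every known variant with its representative term."""
    query = " ".join(query.split())  # collapse repeated whitespace
    return _pattern.sub(lambda m: VARIANTS[m.group(0).lower()], query)

def count_terms(queries):
    """Bag of words over the normalized queries, split on whitespace."""
    counts = Counter()
    for q in queries:
        counts.update(normalize(q).lower().split())
    return counts

queries = ["go for joe coffee", "Go 4 Joe menu", "joe coffee"]
counts = count_terms(queries)
```

Because the variants are replaced before the text is split into words, a standalone “joe” is still counted as its own term, separate from `go_for_joe`.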
Could this be a solution for your use case?
Have a nice weekend
Thank you so much! The workflow is very clear, and I’m so happy about that.
If I may, I’d like to better understand 2 points:
- I tried to add the Stop Word Filter node to your workflow, before the Bag of Words node, so that tokens such as “the”, “.”, “or”, “as”, … are not counted as terms. However, it didn’t change anything. Can you tell me why?
- Can you tell me the difference between TF absolute and TF relative?
Thank you very much for your help
Have a nice day
You are welcome.
Regarding your questions:
Point 1: In every text preprocessing node you can decide whether to append a new document column with the preprocessed document or to replace the old document column. This is controlled via the Preprocessing tab. Could it be that you are using the default setting, which appends a new column, but then use the original document column for the Bag of Words node?
Point 2: The two measures differ as follows.
Absolute term frequency: the number of times a term occurs in a document.
Relative term frequency: the absolute term frequency divided by the total number of terms in the document.
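As a small illustration of these two definitions, here is a plain-Python sketch. It is not the implementation of the KNIME TF node; the function name `term_frequencies` and the sample document are assumptions made up for the example.

```python
from collections import Counter

def term_frequencies(terms):
    """Return {term: (absolute_tf, relative_tf)} for one document.

    `terms` is the list of terms of a single document, after preprocessing.
    """
    total = len(terms)
    if total == 0:
        return {}
    counts = Counter(terms)
    # Relative TF = absolute TF / number of terms in the document.
    return {term: (n, n / total) for term, n in counts.items()}

# Hypothetical document with four terms.
doc = ["go_for_joe", "coffee", "go_for_joe", "menu"]
tf = term_frequencies(doc)
```

Here `go_for_joe` occurs twice in a document of four terms, so its absolute frequency is 2 and its relative frequency is 0.5. The relative value makes documents of different lengths comparable, while the absolute value is just a raw count.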
You’re right: I now replace the column and it works. To filter out the punctuation, I’ve added the Punctuation Erasure node before the Stop Word Filter node. The workflow works, even with the nodes I’ve added as a supplemental test, and I get it!
For the TF, I have to study more closely which one to use on a case-by-case basis (my case is an analysis of the queries users type into search engines).
Thank you so much @Kathrin, you have been very helpful.
You are welcome
Thank you @Tilux for the update.