Relative frequency of words from multiple text files

Hi guys, I hope you can help me with my problem:

We have multiple text files containing German text. What we need is the relative frequency of certain words in the text, e.g. the relative frequency of an STTS class like CARD or VV-IMP, and we want the result (the relative frequency for the text or texts) to be shown in one single row.

We experimented with the different KNIME modules and filters, but so far the results are not satisfying (screenshot 1).

Maybe another example to describe our goal: 

Example: upload 6 files and process them; the result is a table that shows, for each file, the relative frequency of words from one or several STTS classes. For each file, all the chosen tag classes should be listed in one row.

So if someone knows what we are doing wrong or right, or what we have to insert or change in order to get the result we need, we would be really happy!

Thanks in advance :)

Hello Knimero11,

A combination of loops and the GroupBy node should do the job. With loops you can process one file at a time, calculate the statistics, and then combine them into one file. The GroupBy node lets you pack all results for a file into one row.

Best,
Ferry

Hi Ferry,

first of all, thanks for your response! 

Could you explain in detail which Loop and GroupBy combination I should use? There are quite a lot of loops, and I still haven't got it to work. I have been testing the different nodes for quite some time now, but there is still no solution in sight. Sorry, maybe the solution is very easy, but at the moment I don't see myself solving it on my own any time soon.

Thanks again for your help,

greetings

Hi Knimero,

This depends on the structure of your txt files. If you have simple txt files (not CSV formatted) containing text, the Flat File Document Parser is the node to start with. This node reads all files in the specified directory. The file name is set as meta information, which can be extracted later on.

Next, chain the following nodes: Stanford Tagger, STTS Filter, Bag of Words, TF (absolute), Tags to String (to extract the tags), and GroupBy (group on tag and document, with the sum over the absolute TF as aggregation). This results in one row per tag and document. Remember, each file is one document.

In a parallel branch, use GroupBy to group by document and sum over the absolute TF. Join this sum (the total absolute TF) to the other table by document. Then use Math Formula to divide the first sum by the total absolute TF to get the relative TF of each specific tag.

Finally, use the Pivoting node: group by document (which is the file) and use the tag strings as pivots.
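
To check the arithmetic outside KNIME, here is a minimal Python sketch of the same computation. It is not a KNIME workflow and not an export of one. The input triples, the tag names and the function name relative_tag_frequencies are made up for illustration. It also assumes that the per-document total counts all tagged tokens, not only those of the selected classes; if your total is taken after the STTS Filter, the denominators will differ.

```python
from collections import Counter, defaultdict

# Hypothetical input: (document, token, STTS tag) triples, roughly what a
# tagger would produce. Each file is one document.
tagged_tokens = [
    ("file1.txt", "Geh", "VVIMP"),
    ("file1.txt", "drei", "CARD"),
    ("file1.txt", "Schritte", "NN"),
    ("file1.txt", "vor", "PTKVZ"),
    ("file2.txt", "Zwei", "CARD"),
    ("file2.txt", "und", "KON"),
    ("file2.txt", "zwei", "CARD"),
    ("file2.txt", "ist", "VAFIN"),
    ("file2.txt", "vier", "CARD"),
]

# Tag classes chosen for the result table (illustrative selection).
selected_tags = ["CARD", "VVIMP"]


def relative_tag_frequencies(tokens, tags):
    """Return {document: {tag: relative frequency}}, one entry per document."""
    tag_counts = defaultdict(Counter)  # absolute TF summed per document and tag
    totals = Counter()                 # total absolute TF per document
    for doc, _token, tag in tokens:
        totals[doc] += 1
        if tag in tags:
            tag_counts[doc][tag] += 1
    # Divide each per-tag sum by the document total, then lay the tags out
    # as columns so that every document ends up as a single row.
    return {
        doc: {tag: tag_counts[doc][tag] / totals[doc] for tag in tags}
        for doc in sorted(totals)
    }


rows = relative_tag_frequencies(tagged_tokens, selected_tags)
for doc, freqs in rows.items():
    print(doc, freqs)
```

Each key of rows corresponds to one row of the pivoted table, and each selected tag to one column. A tag that does not occur in a document gets 0.0 rather than a missing value.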

I hope this helps.

Cheers, Kilian

Hi Kilian,

Could you send a workflow example or screenshot, please?

Thanks, Manuela