Unable to get correct Term Frequency in Text Processing

Hi All,

I am trying to do the following but have not been successful. I will need your help:

There is a folder location which contains all the source files (.java files) of a Java application.

1. Prepare a list of all these Java files and create a "Document" out of each of them.

2. Simultaneously prepare a list of all Java classes used in these files.

3. Then find the frequency of each of these Java classes across all the "Documents" I created in step 1.

So if a class "Foo" is used 10 times, I need a frequency of 10.
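Outside of KNIME, the intended result can be sketched in plain Python. This is only an illustration of the goal, not the workflow itself; the folder path and the capitalized-identifier heuristic for spotting class names are assumptions for the example.

```python
import re
from collections import Counter
from pathlib import Path

def class_frequencies(source_dir):
    """Count how often each capitalized identifier appears across
    all .java files under source_dir (a rough stand-in for class usage)."""
    counts = Counter()
    for java_file in Path(source_dir).glob("**/*.java"):
        # Naive heuristic: treat capitalized identifiers as class names;
        # the real workflow would use the parsed class list instead.
        counts.update(re.findall(r"\b[A-Z]\w*\b", java_file.read_text()))
    return counts
```

So if "Foo" appears 10 times in total across all files, `class_frequencies("src")["Foo"]` should be 10.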

Steps 1 and 2 are working fine, but step 3 is not giving the correct results. I am sure I am doing something wrong here.

The "TF" node is listing every occurrence of a class separately and assigning a frequency of 1 to each of them.

Attaching my workflow.

Any help/pointers will be appreciated.

Hi singhmanas,

the "TF" node counts the frequency of terms/words within one document. I guess you want to count the frequency of the terms across the complete corpus (all Java files). To do that, simply use a "GroupBy" node right after the "TF" node to aggregate the per-document frequencies of the terms: in the "Groups" tab of the GroupBy node's dialog, use the "Term" column as the group column; in the "Options" tab, add the "TF abs" column and select "Sum" as the aggregation method. This will sum up all TF values over all documents (Java files).
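For readers outside KNIME, the same aggregation can be sketched with pandas. The table contents below are made-up example values standing in for the TF node's output (one row per document/term pair):

```python
import pandas as pd

# Hypothetical output of the TF node: one row per (document, term)
# pair with its in-document absolute frequency.
tf_table = pd.DataFrame({
    "Document": ["A.java", "B.java", "C.java"],
    "Term":     ["Foo",    "Foo",    "Bar"],
    "TF abs":   [4,        6,        1],
})

# Equivalent of the GroupBy node: group on "Term", sum "TF abs".
corpus_tf = tf_table.groupby("Term", as_index=False)["TF abs"].sum()
```

With these example rows, "Foo" ends up with a corpus-wide frequency of 10 and "Bar" with 1, which is exactly what the GroupBy node produces.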

Cheers, Kilian

Hi Kilian,

That worked perfectly.

Thanks a ton :)
