Unable to get correct Term Frequency in Text Processing

Hi All,

I am trying to do the following but have not been successful. I will need your help:

There is a folder location which contains all the source files (.java files) of a Java application.

1. Prepare a list of all these Java files and create a "Document" out of each of them.

2. Simultaneously prepare a list of all Java classes used in these files.

3. Then find the frequency of each of these Java classes across all the "Documents" I created in step 1.

So if a class "Foo" is used 10 times, I need a frequency of 10.
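Outside of KNIME, the intended result can be sketched in plain Python. This is only an illustration of the goal, not the workflow itself; the folder path and the capitalized-identifier heuristic for spotting class names are assumptions for the example.

```python
import re
from collections import Counter
from pathlib import Path

def class_frequencies(source_dir):
    """Count how often each capitalized identifier appears across
    all .java files under source_dir (a rough stand-in for class usage)."""
    counts = Counter()
    for java_file in Path(source_dir).glob("**/*.java"):
        # Naive heuristic: treat capitalized identifiers as class names;
        # the real workflow would use the parsed class list instead.
        counts.update(re.findall(r"\b[A-Z]\w*\b", java_file.read_text()))
    return counts
```

So if "Foo" appears 10 times in total across all files, `class_frequencies("src")["Foo"]` should be 10.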

Steps 1 and 2 are working fine, but step 3 is not giving the correct results. I am sure I am doing something wrong here.

The "TF" node is listing every occurrence of a class separately and assigning a frequency of 1 to each of them.

Attaching my workflow.

Any help/pointers will be appreciated.

Hi singhmanas,

the "TF" node counts the frequency of terms/words within one document. I guess you want to count the frequency of the terms across the complete corpus (all Java files). To do that, simply use a "GroupBy" node right after the "TF" node to aggregate the per-document frequencies of the terms: in the "Groups" tab of the GroupBy node's dialog, use the "Term" column as the group column; in the "Options" tab, add the "TF abs" column and select "Sum" as the aggregation method. This will sum up all TF values over all documents (Java files).
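For readers outside KNIME, the same aggregation can be sketched with pandas. The table contents below are made-up example values standing in for the TF node's output (one row per document/term pair):

```python
import pandas as pd

# Hypothetical output of the TF node: one row per (document, term)
# pair with its in-document absolute frequency.
tf_table = pd.DataFrame({
    "Document": ["A.java", "B.java", "C.java"],
    "Term":     ["Foo",    "Foo",    "Bar"],
    "TF abs":   [4,        6,        1],
})

# Equivalent of the GroupBy node: group on "Term", sum "TF abs".
corpus_tf = tf_table.groupby("Term", as_index=False)["TF abs"].sum()
```

With these example rows, "Foo" ends up with a corpus-wide frequency of 10 and "Bar" with 1, which is exactly what the GroupBy node produces.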

Cheers, Kilian

Hi Kilian,

That worked perfectly.

Thanks a ton :)
