Text Processing

imashish · April 17, 2018, 7:17pm

Hi,
I have two columns that contain the same kind of information: names of famous brands. For every row, I want to split the values in both columns into words, cross join the rows, and then count the exact word matches for each pair. A flag column holds that count. I have written the splitting and matching part in Python, but the process is taking too long because of the size of my data.
P.S.: I have a 128 GB machine to work with.
My sample data:

ID Col A        Match_With
1  Tata Motors  Tata Motors
2  Pepsi Co     Pepsi Co
3  Tata Cola    Tata Cola

Output data after splitting and matching:

ID Col A        Match With   Col A Words    Matchwith Words  TotalMatchingWord
   Tata Motors  Tata Motors  {Tata,Motors}  {Tata,Motors}    2
   Tata Motors  Pepsi Co     {Tata,Motors}  {Pepsi,Co}       0
3  Tata Motors  Tata Cola    {Tata,Motors}  {Tata,Cola}      1
and so on. I have also uploaded a screenshot of the data.
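
For readers following along, the split-and-match step described above can be sketched in Python roughly as follows. This is a minimal illustration, not the code actually used in this thread: the function names are made up for the example, words are split on whitespace, and matching is case-sensitive.

```python
from itertools import product


def split_words(name):
    """Split a brand name into a set of whitespace-separated words."""
    return set(name.split())


def count_matching_words(a, b):
    """Count the words that appear in both names (exact, case-sensitive)."""
    return len(split_words(a) & split_words(b))


# Sample data from the post; both columns hold the same brand names.
col_a = ["Tata Motors", "Pepsi Co", "Tata Cola"]
match_with = ["Tata Motors", "Pepsi Co", "Tata Cola"]

# Cross join: every Col A row is compared with every Match_With row.
rows = [(a, b, count_matching_words(a, b)) for a, b in product(col_a, match_with)]

for a, b, total in rows:
    print(a, "|", b, "|", total)
```

With n rows this performs n * n comparisons, which is why it becomes slow at 76K rows.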

I have optimized the Python code to an extent, but it is still very slow because I have 76K × 76K row pairs: every row in Col A has to be checked against all 76K rows.

Can someone suggest a node that can apply this operation to data of this size, or another approach to the same problem?
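
One possible alternative approach, sketched here in Python as an assumption rather than a tested solution for this data set, is to build an inverted index from each word to the rows that contain it. Only pairs that share at least one word are ever visited, so pairs with zero matches cost nothing. The function name `matching_word_counts` is hypothetical, and matching is again exact and case-sensitive.

```python
from collections import Counter, defaultdict


def matching_word_counts(col_a, match_with):
    """Return a Counter mapping (i, j) to the number of shared words.

    i indexes col_a and j indexes match_with. Pairs with no shared
    word are simply absent from the result.
    """
    # Inverted index: word -> positions in match_with that contain it.
    index = defaultdict(list)
    for j, name in enumerate(match_with):
        for word in set(name.split()):
            index[word].append(j)

    counts = Counter()
    for i, name in enumerate(col_a):
        for word in set(name.split()):
            # Only rows that contain this word are touched.
            for j in index.get(word, ()):
                counts[(i, j)] += 1
    return counts


names = ["Tata Motors", "Pepsi Co", "Tata Cola"]
counts = matching_word_counts(names, names)
for (i, j), total in sorted(counts.items()):
    print(names[i], "|", names[j], "|", total)
```

How much this helps depends on the data: a word that appears in most rows (for example a very common suffix) still produces close to n * n pairs for that word alone.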

Regards
Ashish

imashish · April 19, 2018, 1:42pm

Hey Martin, thanks again. For that problem I wrote a Java Snippet and it worked.
But there is one more constraint: KNIME is using barely 2 GB of RAM, and I have 128 GB. Is there any way I can increase that so that it processes faster?
I tried changing the ini file to increase the heap memory (-Xmx=g) and set -Dorg.knime.container.cellsinmemory=10000000,
but nothing happened.

Regards
Ashish kumar

julian.bunzel · April 24, 2018, 7:00am

Hey Ashish,

You could also try setting -Xms (the initial amount of memory KNIME starts with).
You can also set your node’s memory policy to “Keep all in memory”. To do so, open the node dialog of the specific node and select the ‘Memory policy’ tab.
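
As a rough illustration, the memory-related lines in knime.ini usually look something like the fragment below. The sizes are placeholder values chosen for the example, not recommendations. JVM heap options take the size directly after the flag with no equals sign, they belong after the -vmargs line, and KNIME has to be restarted before they take effect.

```ini
-vmargs
-Xms8g
-Xmx64g
-Dorg.knime.container.cellsinmemory=10000000
```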

Cheers,

Julian
