Word matching - Text Processing

imashish · April 17, 2018, 7:13pm

Hi,
I have 2 columns that contain the same kind of information, i.e. names of famous brands. I want to extract all the words from every row of both columns and then perform an exact match on those words. To do that, I did a cross join and then counted the number of matching words; I have a flag that counts the matched words. I have written the splitting and matching part in Python, but the process is taking too long because of the size of my data.
P.S.: I have a 128 GB machine to work with.
My sample data:

ID  Col A        Match_With
1   Tata Motors  Tata Motors
2   Pepsi Co     Pepsi Co
3   Tata Cola    Tata Cola

Output data after splitting and matching:

ID  Col A        Match With   Col A Words    Matchwith Words  TotalMatchingWord
1   Tata Motors  Tata Motors  {Tata,Motors}  {Tata,Motors}    2
2   Tata Motors  Pepsi Co     {Tata,Motors}  {Pepsi,Co}       0
3   Tata Motors  Tata Cola    {Tata,Motors}  {Tata,Cola}      1
and so on …

I have optimized the Python code to an extent, but it is still very slow because I have 76K × 76K record pairs: every row in Col A has to be checked against all 76K rows.
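For readers following along, here is a minimal Python sketch of the split-and-match step described above. It is not the original poster's code, and the function names are hypothetical. The first function does the full cross join as described. The second is a swapped-in alternative, an inverted index from word to row, which only visits pairs that share at least one word; it assumes that pairs with zero matching words can be left out of the result.

```python
from collections import defaultdict
from itertools import product


def split_words(name):
    """Split a brand name into a set of words (exact, case-sensitive)."""
    return set(name.split())


def count_matches_cross_join(col_a, match_with):
    """Full cross join: compare every row of col_a with every row of match_with."""
    rows = []
    for a, b in product(col_a, match_with):
        rows.append((a, b, len(split_words(a) & split_words(b))))
    return rows


def count_matches_indexed(col_a, match_with):
    """Hypothetical alternative: only pairs sharing at least one word are produced."""
    index = defaultdict(set)  # word -> row positions in match_with that contain it
    for j, name in enumerate(match_with):
        for word in split_words(name):
            index[word].add(j)

    counts = defaultdict(int)  # (row in col_a, row in match_with) -> shared words
    for i, name in enumerate(col_a):
        for word in split_words(name):
            for j in index.get(word, ()):
                counts[(i, j)] += 1
    return dict(counts)


brands = ["Tata Motors", "Pepsi Co", "Tata Cola"]
full = count_matches_cross_join(brands, brands)   # 9 rows, zero matches included
sparse = count_matches_indexed(brands, brands)    # only pairs with a shared word
```

With brand names, most of the 76K × 76K pairs probably share no word at all, so the indexed variant may do far less work than the cross join. Whether that helps in practice depends on how often common words such as "Co" repeat in the data.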

Can someone suggest whether there is a node that can apply the same operation to data of this size, or another approach to do the same?

Regards
Ashish

Martin_K · April 18, 2018, 10:27am

Hi,

The workflow I have attached is based on “standard” KNIME nodes (no Python code),
without even using any node from the Text Processing repository. Try to adapt it to your needs.
However, the “Cross Joiner” node is still a critical point as far as memory consumption is concerned.

Martin K.

Match_words.knwf (16.9 KB)

imashish · April 18, 2018, 4:40pm

Hi Martin,
The workflow is doing wonders. Thanks for all the help, I really appreciate it. I did tweak the workflow a bit, though.
In the Cell Splitter node I have now used the option to output as a set, so when I export the data to CSV I get the error that the input table should be int or double. Is there any workaround for this?

Regards
Ashish kumar

Martin_K · April 19, 2018, 7:37am

Hi Ashish kumar,

Thank you very much. I suppose you have used the CSV Writer node. Insert a “Split Collection Column”
node between the node that outputs the column of Set type and the CSV Writer, and break the Set column into columns
of a primitive data type such as string, double, or int. Writing the CSV file should then work.
Regards!
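
For anyone doing the export in Python rather than in KNIME, the same idea can be sketched as follows. This is only an illustration of breaking a set-valued column into plain string columns before writing CSV, not the node's implementation; the function name `split_collection` and the `word_1`, `word_2` column names are assumptions.

```python
import csv
import io


def split_collection(rows, set_column):
    """Replace a set-valued column with string columns word_1 .. word_n."""
    width = max((len(row[set_column]) for row in rows), default=0)
    flat_rows = []
    for row in rows:
        words = sorted(row[set_column])  # sets are unordered; sort for stable output
        flat = {k: v for k, v in row.items() if k != set_column}
        for n in range(width):
            # Pad shorter sets with empty strings so every row has the same columns.
            flat[f"word_{n + 1}"] = words[n] if n < len(words) else ""
        flat_rows.append(flat)
    return flat_rows


rows = [
    {"ID": 1, "Col A": "Tata Motors", "Col A Words": {"Tata", "Motors"}},
    {"ID": 2, "Col A": "Pepsi Co", "Col A Words": {"Pepsi", "Co"}},
]
flat_rows = split_collection(rows, "Col A Words")

buffer = io.StringIO()  # stands in for a real output file
writer = csv.DictWriter(buffer, fieldnames=list(flat_rows[0]))
writer.writeheader()
writer.writerows(flat_rows)
csv_text = buffer.getvalue()
```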

Martin K.

imashish · April 19, 2018, 2:27pm

Now the situation is that I have 128 GB of RAM and I want KNIME to use at least 100 GB, but when I run heavy code it only uses a maximum of 2 GB.
I have tried two options, -Xmx and cellsinmemory, but still no luck.

julian.bunzel · April 24, 2018, 7:04am

Hey Ashish,

I answered in another post.

Cheers,

Julian
