Hi ,
I have 2 Columns which contains same information i.e name of famous brands and I want to take out all the word for every row from both the Columns and then then perform the exact match for the words and for that did a cross join and then count the number of matching words and I have flag which counts the match word and for this splitting and matching part I have written the code in python but the process is taking too long because of the Data size I have
P.S: I have a 128 Gb machine to work with
My Sample Data :
ID Col A Match_With
1 Tata Motors Tata Motors
2 Pepsi Co Pepsi Co
3 Tata Cola Tata Cola
Output Data after Splitting and Matching
ID Col A Match With Col A Words Matchwith Words TotalMatchingWord
-
Tata Motors Tata Motors {Tata,Motors} {Tata,Motors} 2
-
Tata Motors pepsi Co {Tata,Motors} {Pepsi , Co} 0
3 Tata Motors Tata Cola {Tata ,Motors} {Tata ,Cola} 1
and so on … I have uploaded a screen shot of data as well
I have optimized the python code to a extent but it is really very slow because I have 76K * 76 K records and every row in colA has to be checked for all 76K rows.
Can someone suggest me if there is any Node available that can process this big data with same operation applied or another approach to process the same .
Regards
Ashish