Compare rows in 2 files

Hi all,
I have a file A and a file B, and for each line of file A I would like to read all the lines of file B, test whether field X of file B contains field X of file A, and if so append field X of file B to file A.
How can I do it?
Thanks in advance

Hello @Albert0_123,

it’s better to share some examples so we don’t have to play a guessing game about what’s a line, what’s a field, and where what should go. Input data and the desired output, with the logic explained if it isn’t obvious, would be appreciated. Dummy data works just fine…

Br,
Ivan

Hi Ivan,
I have two files called A and B, with about 65,000 lines in the first and 154,000 in the second.
In both files there is a column, which I call XA in the first file and XB in the second.
In file A there are XA values that are exactly the same as, or even just a part of, some XB value, and I want to find and mark them.
So if XA=“abcde”, the condition is satisfied by XB=“abcde” or even by XB=“jjjjabcdekkkk”.
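In plain Java, that condition is just a substring test; a minimal sketch with the example values (variable names are illustrative):

```java
public class ContainsCheck {
    public static void main(String[] args) {
        String xa = "abcde";
        // Both XB values below satisfy the condition:
        System.out.println("abcde".contains(xa));          // true: exact match
        System.out.println("jjjjabcdekkkk".contains(xa));  // true: XA is part of XB
    }
}
```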
I have already achieved the goal using the Cross Joiner node, but I wanted to do the same thing using loop nodes.
I did it like this:
Table Row To Variable Loop Start (input: file A) → Excel Reader (file B) → Java Snippet (to test and mark with a found flag) → Loop End.
I’ve seen that it works, but even when working with small chunks of file A, such as 1,000 lines, the workflow fails with Java Heap Space problems.
Am I doing something wrong? Could I use different loop nodes? Are there better ways to get there?
Thanks in advance,
Alberto

[image: loop-based workflow]

Hi Ivan,
the image above shows an example of how I thought I would make my workflow work using a loop…

Hi @Albert0_123,

Suppose you have these 2 rows:
XA    XB
aaa   dddaaaccc
ccc   aaabbbccc

If we test whether each XA is inside each XB, it can result in 4 rows:

dddaaaccc → aaa
aaabbbccc → aaa
dddaaaccc → ccc
aaabbbccc → ccc
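A plain-Java sketch of that cross test, reproducing the four rows above (illustrative only; in KNIME the same thing is done with nodes rather than code):

```java
import java.util.List;

public class CrossContains {
    public static void main(String[] args) {
        List<String> xaValues = List.of("aaa", "ccc");             // file A
        List<String> xbValues = List.of("dddaaaccc", "aaabbbccc"); // file B

        // Test every XB against every XA; each hit yields one output row.
        for (String xa : xaValues) {
            for (String xb : xbValues) {
                if (xb.contains(xa)) {
                    System.out.println(xb + " -> " + xa);
                }
            }
        }
    }
}
```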

Is this what you need?
Regards

Hi, yes, exactly like that.
If I do it with the Cross Joiner it takes hours, but by processing A in blocks of 5,000 rows I can complete it. With the loop nodes I can’t go beyond 500 rows of A per block without running into Java Heap Space errors, so in practice I can’t use them.
I also minimized the number of columns in A and B.
With the Cross Joiner node I have already done and completed the work, but I would like to understand how to use loop nodes efficiently, either ones like those in the figure above or possibly other types as well.
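For what it’s worth, the blocking idea is easy to state outside KNIME too; a rough Java sketch of scanning A in fixed-size chunks (data and chunk size made up), so that only one block’s worth of matches needs to be collected at a time:

```java
import java.util.List;

public class ChunkedScan {
    public static void main(String[] args) {
        List<String> fileA = List.of("aaa", "ccc");             // XA values
        List<String> fileB = List.of("dddaaaccc", "aaabbbccc"); // XB values
        int chunkSize = 1; // in the real case, e.g. 5000 rows of A per block

        // Process file A block by block, as a Chunk Loop Start node would.
        for (int start = 0; start < fileA.size(); start += chunkSize) {
            int end = Math.min(start + chunkSize, fileA.size());
            for (String xa : fileA.subList(start, end)) {
                for (String xb : fileB) {
                    if (xb.contains(xa)) {
                        System.out.println(xb + " -> " + xa);
                    }
                }
            }
            // In KNIME, a Loop End node would append each block's results here.
        }
    }
}
```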
Thank you very much

Br,
Alberto

Hi @Albert0_123

I don’t know the performance of this solution, but you can try it with your data and tell me if it is ok.

Assuming your input data:
[image: input table A]
[image: input table B]
The workflow:


StringsInStrings.knwf (99.7 KB)

The result:
[image: result table]
Best regards

Hi all,
Thank you all for the discussion. Here is my understanding of the issue: it seems to be a keyword-matching problem. So I defined a keyword table (better not too long) and a short-text table (the business texts). The question is: which keywords are contained in each short text?
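If I read the idea right, the keywords can be combined into one alternation pattern so that each short text needs only a single regex pass, with no loop over the keyword table; a plain-Java sketch of that technique (names and data are illustrative, not taken from the attached workflow):

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class KeywordMatch {
    public static void main(String[] args) {
        List<String> keywords = List.of("aaa", "ccc");          // keyword table
        List<String> texts = List.of("dddaaaccc", "aaabbbccc"); // short texts

        // Concatenate all keywords into a single alternation: (aaa|ccc).
        // Pattern.quote guards against regex metacharacters in keywords.
        String regex = keywords.stream()
                .map(Pattern::quote)
                .collect(Collectors.joining("|", "(", ")"));
        Pattern pattern = Pattern.compile(regex);

        // One pass per text finds every contained keyword.
        // Note: find() reports non-overlapping occurrences only.
        for (String text : texts) {
            Matcher m = pattern.matcher(text);
            while (m.find()) {
                System.out.println(text + " -> " + m.group(1));
            }
        }
    }
}
```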

fuzzy_string_search0.1.knwf (17.6 KB)

Due to limited test samples, I don’t know the performance of this workflow. If possible, let me know the result.

BR

PS: Sorry, I didn’t notice the scale of the data (65,000 & 154,000 rows). What I described is feasible for small-scale data, but not for data at that scale.

Hi Hmfa,
Thanks so much for the suggestion, it’s a really good idea. I tried it on a considerable part of my data; it works well and, above all, with very good performance. I estimate around 6 hours to complete the whole processing, which is fine (my version was effectively unusable). In the next few days I’ll run it on the complete database… I’ll keep you informed.
Thanks again
Br
Alberto

Great that I could be of assistance.
Best regards.

Hi tomljh,
nice idea to concatenate the keys into a single pattern… this way you avoid any loop node… the performance must be really excellent, but at the moment I can’t try it because I don’t have the Regex Substring node available. I plan to have it soon… but let me ask right away: is it possible to replace it with other nodes?
Anyway, thanks a lot… I’ll definitely try it on the complete data as soon as I can and let you know.
Br
Alberto

If I understand correctly, the standard node (Regex Split) requires a full match, while in the current case we need partial matching, so it seems that only the community node (Regex Substring) is suitable.
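The difference is the same as Java’s full versus partial regex matching; a small illustration:

```java
import java.util.regex.Pattern;

public class FullVsPartial {
    public static void main(String[] args) {
        String text = "jjjjabcdekkkk";

        // Full matching (Regex Split-style): the pattern must account
        // for the entire cell, so wildcards are needed around the keyword.
        System.out.println(Pattern.matches("abcde", text));     // false
        System.out.println(Pattern.matches(".*abcde.*", text)); // true

        // Partial matching (Regex Substring-style): the pattern may
        // occur anywhere inside the cell.
        System.out.println(Pattern.compile("abcde").matcher(text).find()); // true
    }
}
```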

Hi tomljh,
thanks for the clarifications.
In the meantime I managed to run your workflow with the Regex Substring node on my real data, around 52,000 rows for file A (your Keyword/Dictionary Table) and 154,000 for file B, and I must say that it processed them correctly and very quickly, about twenty minutes in total.
I would say really excellent.
Thank you very much,
Br
Alberto
