Compare rows in 2 files

Hi all,
I have a file A and a file B. For each line of file A I would like to read all the lines of file B, test whether field X of file B contains field X of file A, and if so append field X of file B to file A.
How can I do it?
Thanks in advance

Hello @Albert0_123,

it’s better to share some examples so we don’t have to play a guessing game about what’s a line, what’s a field, and where what should go. Input data and desired output, with the logic explained if it’s not obvious, would be appreciated. Dummy data works just fine…



Hi Ivan,
I have two files called A and B, about 65,000 lines for the first and 154,000 for the second.
In both files there is a column which I call XA for the first and XB for the second.
In file A there are XA values that are exactly the same as, or just a part of, some XB value, and I want to find them and mark them.
So if XA=“abcde”, the condition is satisfied by XB=“abcde” or also by XB=“jjjjabcdekkkk”.
I have already achieved the goal using the Cross Joiner node, but I wanted to do the same thing using loop nodes.
I did it like this:
Table Row to Variable Loop Start (input port: file A) → Excel Reader (file B) → Java Snippet (for testing and marking with a found flag) → Loop End.
It works, but even with small chunks of file A, such as 1,000 lines, the workflow fails with Java Heap Space errors.
Am I doing something wrong? Could I use different loop nodes? Are there better ways to get there?
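For reference, the per-row logic of that loop can be sketched in plain Python (hypothetical column names `XA`/`XB` and dummy data, not the actual KNIME nodes): for each row of A, scan all of B and set a found flag when XB contains XA.

```python
# Sketch of the loop logic: flag each A row whose XA value occurs
# as a substring of at least one XB value in B.
def flag_matches(rows_a, rows_b):
    """Return rows of A with an extra 'found' flag."""
    out = []
    for a in rows_a:
        found = any(a["XA"] in b["XB"] for b in rows_b)
        out.append({**a, "found": found})
    return out

rows_a = [{"XA": "aaa"}, {"XA": "zzz"}]
rows_b = [{"XB": "dddaaaccc"}, {"XB": "aaabbbccc"}]
print(flag_matches(rows_a, rows_b))
```

This is O(|A| × |B|) comparisons either way; the heap problem in the workflow comes from how the loop accumulates intermediate tables, not from the comparison itself.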
Thanks in advance,


Hi Ivan,
in the image above there is an example of how I thought my workflow could work using a loop…

Hi @Albert0_123,

Suppose you have these 2 rows (XA, then XB):

aaa dddaaaccc
ccc aaabbbccc

If XA occurs inside XB, this can result in 4 output rows:

dddaaaccc → aaa
aaabbbccc → aaa
dddaaaccc → ccc
aaabbbccc → ccc
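As a plain-Python sketch (same dummy data, not a KNIME node), this expansion is just a cross join filtered by the substring test:

```python
# Filtered cross join: every (XB, XA) pair where XA occurs inside XB
# yields one output row.
xa_values = ["aaa", "ccc"]
xb_values = ["dddaaaccc", "aaabbbccc"]

pairs = [(xb, xa) for xa in xa_values for xb in xb_values if xa in xb]
for xb, xa in pairs:
    print(f"{xb} -> {xa}")
```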

Is this what you need?


Hi, yes exactly like that.
With the Cross Joiner it takes hours, but using blocks of 5,000 rows from A I can complete it. With the loop nodes I can’t go beyond 500 rows of A per block without running into Java Heap Space errors, so in practice I can’t use them.
I also minimized the number of A and B columns.
With the Cross Joiner node I have already completed the work, but I would like to understand how to use loop nodes efficiently, either the ones in the figure above or possibly other types as well.
Thank you very much



Hi @Albert0_123

I don’t know the performance of this solution, but you can try with your data and tell me if it is ok.

Assuming your input data
The workflow:

StringsInStrings.knwf (99.7 KB)

The result
Best regards


Hi all ,
Thank you all for the discussion. I’ll give my understanding of this issue: it seems to be a keyword-matching problem. So I defined a keyword table (better if the keywords are not too long) and a short-text table (business text). The question is: which keywords are contained in each short text?

fuzzy_string_search0.1.knwf (17.6 KB)

Due to limited test samples, I am not aware of the performance of this workflow. If possible, let me know the result.


PS: Sorry, I didn’t notice the scale of the data. The approach above is feasible for small-scale data, but not at large scale (65,000 & 154,000 rows).

Hi Hmfa,
Thanks so much for the suggestion, it’s a really good idea. I tried it on a considerable part of my data; it works well and above all with very good performance. I estimate around 6 hours to complete the processing, which is fine (my version was effectively unusable). In the next few days I’ll launch it on the complete database… I’ll keep you informed.
Thanks again

Glad I could be of assistance.
Best regards.

Hi Tomljh,
nice idea to concatenate the keys into one row… this way you avoid any loop node, and the performance must be really excellent. At the moment I can’t try it because I don’t have the Regex Substring node available. I plan to have it soon… but right away I’ll ask you: is it possible to replace it with other nodes?
Anyway, thanks a lot… I’ll definitely try it on the complete data as soon as I can and let you know.

If I understand correctly, the standard node (Regex Split) requires a full match, while here we need partial matching, so it seems that only the community node (Regex Substring) is suitable.
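The full-match vs. partial-match distinction can be illustrated with Python’s `re` module as an analogy (not the KNIME nodes themselves): `fullmatch` requires the entire string to match the pattern, while `search` accepts a substring hit anywhere in the string.

```python
import re

pattern = re.compile("aaa")

# Full matching: the entire string must match the pattern.
print(bool(pattern.fullmatch("aaa")))        # matches
print(bool(pattern.fullmatch("jjjaaakkk")))  # no full match

# Partial matching: a substring hit is enough.
print(bool(pattern.search("jjjaaakkk")))     # matches
```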

Hi tomljh,
thanks for the clarifications.
In the meantime I managed to run your workflow with the Regex Substring node on my real data, around 52,000 rows for file A (your Keyword Table / Dictionary Table) and 154,000 for file B, and I must say it processed them correctly and very quickly, about twenty minutes in total.
I would say really excellent.
Thank you very much,
