Introducing the join node alternative for comparison

alex1368 · February 3, 2024, 1:22am

Hello
I use the join node to compare two columns of a table. With large, heavy files it uses a lot of resources and slows down my work.
Which other nodes can I use instead of the join node so that my analysis does not slow down?

rfeigel · February 3, 2024, 4:11am

Have you tried the Column Comparator node?

alex1368 · February 3, 2024, 5:53pm

I think you misunderstood what I meant!!!
I am using the “join” node to extract the data of one column from a very large file, and the matching takes a lot of time.
So I’m looking for an alternative to the “join” node that will be faster.

rfeigel · February 4, 2024, 12:18am

I’m still confused. Why are you using a join to extract one column from a table? What are you “comparing”? A little more detail would be helpful.

kamtaot · February 4, 2024, 5:37am

Hi @alex1368 , If you can post some dummy data of both tables which you are trying to join, and the field that you are trying to extract, that gives better insight and one of the experts will be able to share a solution for the problem. If possible, some statistics like number of rows and columns in each table etc. will give better understanding of the problem.

alex1368 · February 5, 2024, 1:19am

Hello
Sure, will do! I’ll share a workflow that gives a small example of what I need.
This is a very, very small example of my file (my main file has about 45 columns and over 200 million rows).
With this method, I extract my desired targets from the database.
My problem is that this method is slow on very large data and wastes time, so I am asking you to teach me a faster method.
KNIME_project.knwf (15.1 KB)

rfeigel · February 5, 2024, 2:07am

There’s no data in your workflow. You’ve got it pointed to a local mountpoint. Put it in the data folder in the workflow. Please explain in some detail what you’re trying to do.

kamtaot · February 5, 2024, 3:50am

Hi @alex1368 , Like @rfeigel mentioned, please put the data in the workflow and then upload the workflow. That way, the minimum set of data that you want to share with us will come along with the workflow.

alex1368 · February 5, 2024, 1:50pm

Hi friends, I am now uploading the database and target files as follows.
DATABASE.xlsx (8.8 KB)
TARGET.xlsx (8.4 KB)

Daniel_Weikert · February 5, 2024, 5:54pm

Could you do the join in a database? That is probably faster than reading the data into KNIME first and processing it there.
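
For illustration, a minimal sketch of pushing the join into the database so that only the matching rows come back. It uses Python's built-in `sqlite3` with an in-memory database as a stand-in for the real one. The table names loosely mirror the attached files; the column names (`id`, `value`) and the sample rows are assumptions, not taken from the actual data.

```python
import sqlite3

def join_in_database(conn):
    """Run the join inside the database and return only the matched rows."""
    query = """
        SELECT d.id, d.value
        FROM database_table AS d
        INNER JOIN target_table AS t ON t.id = d.id
        ORDER BY d.id
    """
    return conn.execute(query).fetchall()

# In-memory stand-in for the real database (hypothetical schema and rows).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE database_table (id INTEGER PRIMARY KEY, value TEXT)")
conn.execute("CREATE TABLE target_table (id INTEGER PRIMARY KEY)")
conn.executemany(
    "INSERT INTO database_table VALUES (?, ?)",
    [(1, "a"), (2, "b"), (3, "c"), (4, "d")],
)
conn.executemany("INSERT INTO target_table VALUES (?)", [(2,), (4,), (9,)])

# Only rows whose id appears in target_table leave the database.
matched = join_in_database(conn)
print(matched)
```

The point of the sketch is that the large table never has to be read in full: the database does the matching and hands back just the result.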

alex1368 · February 7, 2024, 12:48am

I do exactly that and call the data from the database!
Now I want to learn a faster method and that’s why I asked for your help!

rfeigel · February 7, 2024, 3:39am

I’m inclined to agree with @Daniel_Weikert. I think you’d be better off joining in your database. Knime is a powerful tool, but that doesn’t mean it’s the best for everything. Having said that, have you tried converting the join columns from strings to integers? You should do that before importing into Knime. That should help. How much memory do you have allocated to Knime? The data set you provided is tiny, so it’s hard to do any serious testing. The attached workflow has a variety of joins and filters with a Timer Info node. With such a small dataset I’m not sure how much I trust the timer. Regardless, the join runs faster with integers than with strings, although the overhead of doing the conversion in Knime is probably not worth it.
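
As a rough way to check the string-versus-integer question outside Knime, here is a small sketch that runs the same hash join once with string keys and once with integer keys and times both. The data is made up (numeric strings as keys), the function name `inner_join` is hypothetical, and the timings depend on the machine and the engine, so this only shows how one might measure, not what the result will be.

```python
import timeit

def inner_join(left_rows, right_keys):
    """Hash join: keep left rows whose key (first field) is in right_keys."""
    lookup = set(right_keys)
    return [row for row in left_rows if row[0] in lookup]

# Hypothetical sample data: keys stored as numeric strings.
str_rows = [(str(i), f"value_{i}") for i in range(100_000)]
str_targets = [str(i) for i in range(0, 100_000, 10)]

# Same data with the keys converted to integers once, up front.
int_rows = [(int(key), value) for key, value in str_rows]
int_targets = [int(key) for key in str_targets]

str_result = inner_join(str_rows, str_targets)
int_result = inner_join(int_rows, int_targets)

# Time each variant over a few repetitions; absolute numbers will vary.
str_seconds = timeit.timeit(lambda: inner_join(str_rows, str_targets), number=5)
int_seconds = timeit.timeit(lambda: inner_join(int_rows, int_targets), number=5)
print(f"string keys: {str_seconds:.3f}s, integer keys: {int_seconds:.3f}s")
```

Both variants return the same rows, so any difference in the printed times comes from the key type alone. A realistic test would need far more rows than this.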

Daniel_Weikert · February 7, 2024, 5:14pm

To my knowledge, there is no faster way than doing it inside the source (the database). I would make sure everything there is optimized.

rfeigel · February 8, 2024, 12:44am

You should look at this thread. It isn’t aimed primarily at speed but does have a lot of good info about “big data” joins. With the loops it’s almost certainly slower than a database join, but interesting nevertheless.
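
To give a feel for what a loop-based join on big data looks like in general, here is a minimal sketch: the small target list is held in memory as a set, and the large table is streamed in fixed-size chunks so it never has to be loaded at once. This is an illustration of the general idea only, not the workflow from the linked thread; the function names, the chunk size, and the sample rows are all assumptions.

```python
from itertools import islice

def chunked(iterable, size):
    """Yield successive lists of at most `size` items from `iterable`."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk

def chunked_join(big_rows, target_keys, chunk_size=1000):
    """Keep rows whose key (first field) is a target, one chunk at a time."""
    lookup = set(target_keys)  # the small table fits in memory
    matched = []
    for chunk in chunked(big_rows, chunk_size):
        matched.extend(row for row in chunk if row[0] in lookup)
    return matched

# Hypothetical data: a generator stands in for a table too large to load.
rows = ((i, f"value_{i}") for i in range(10))
result = chunked_join(rows, target_keys=[3, 7, 42], chunk_size=4)
print(result)
```

Memory use stays bounded by the chunk size plus the target set, at the cost of the per-chunk loop overhead.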
