Joiner makes my file big and slow

I have a file of around 25,000 KB with 700K rows. It contains a Product ID column but no product info, so I used the Joiner node to join another table that has the Product ID and one row of info per ID. After the join I remove the unnecessary columns in the Joiner, but the file still grows from 25,000 KB to 492,000 KB, and the output data is split across several sheets. How can I make it smaller and faster, and keep all the data in one sheet?
Thank you very much

May I ask what file type you are reading from / writing to?

How often does this happen?

In general, allocating more memory to KNIME (increasing it from the default 2 GB) is one way to improve performance.
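If you want to try that: the heap limit is set by the `-Xmx` line in the `knime.ini` file in your KNIME installation folder. For example, to raise it to 8 GB, change the existing `-Xmx` entry to:

```
-Xmx8g
```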

@mlauber71 dedicated an article to performance topics - might be worth a read: Mastering KNIME: Unlocking Peak Performance with Expert Tips and Smart Settings | by Markus Lauber | Low Code for Data Science | Medium


Hi @newba,

since you’re talking about “sheets”, I assume your input and output files are Excel files?
If so: Excel has a hard limit of 1,048,576 rows per sheet. If your table has more rows than that, the Excel Writer node has to split the table across multiple sheets… but that should not be a problem here, since you expect only 700k output rows anyway, right?
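(Just to illustrate the mechanics of that splitting outside KNIME — a minimal pandas sketch with an invented file name and a made-up table; conceptually this is the kind of chunking the Writer has to do:)

```python
import pandas as pd

EXCEL_MAX_ROWS = 1_048_576  # hard per-sheet limit of the .xlsx format

# Made-up table that is too big for a single sheet
df = pd.DataFrame({"Product ID": range(2_500_000)})

# One row per sheet is taken by the header, so chunk accordingly
chunk = EXCEL_MAX_ROWS - 1
with pd.ExcelWriter("output.xlsx") as writer:
    for i, start in enumerate(range(0, len(df), chunk)):
        df.iloc[start:start + chunk].to_excel(
            writer, sheet_name=f"Sheet{i + 1}", index=False
        )
```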

Maybe the input tables contain duplicates with respect to the Product ID (i.e. it is not a proper “ID” column in that table):
Since you specify 700k rows and the “Product ID” column as join key, a “normal” join (key–foreign-key join) on the ID should not increase the number of rows. So I suspect one of your input files contains duplicate rows, i.e. at least some Product IDs occur multiple times in one of the inputs. Or the join condition is wrong… Could you show the Joiner configuration and count the unique “Product ID” values in the second (product info) table? Or use the RowID node to make “Product ID” the row ID of the info table; its duplicate check will then complain if the input contains duplicates.
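(If you prefer to verify this with a few lines of code, e.g. in a Python Script node, a quick pandas sketch could look like this — the file name and column name are assumptions:)

```python
import pandas as pd

# Assumed file name — point this at your product info table
info = pd.read_excel("product_info.xlsx")

print("rows:", len(info))
print("distinct Product IDs:", info["Product ID"].nunique())

# Any ID listed here occurs more than once and will multiply the join output
counts = info["Product ID"].value_counts()
print(counts[counts > 1])
```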

For example, the input tables may be:

  1. “PURCHASES”, 700k rows with [Customer, Product ID]
  2. “PRODUCTS”, 200 rows with [Product ID, Product Name]

If table 2 contains duplicate rows (i.e. more than one name for the same Product ID), the join of PURCHASES with PRODUCTS on "Product ID" will produce more than 700k output rows, which appears to be what is happening.
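To make that concrete, here is a tiny pandas sketch (invented toy data) showing how a single duplicated key turns a 1:1 lookup into a 1:n join:

```python
import pandas as pd

purchases = pd.DataFrame({
    "Customer":   ["Ann", "Bob", "Cat"],
    "Product ID": [1, 1, 2],
})
# PRODUCTS with a duplicated key: Product ID 1 appears twice
products = pd.DataFrame({
    "Product ID":   [1, 1, 2],
    "Product Name": ["Widget", "Widget (dup)", "Gadget"],
})

joined = purchases.merge(products, on="Product ID", how="left")
print(len(joined))  # 5 rows, not 3: each purchase of ID 1 matched both info rows
```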


I am reading from and writing to Excel. Thank you for the article, I will have a look.

You are right, there are duplicate rows in my input table. I didn’t remove them in KNIME because I put the duplicate-removal step in Power BI. I had tried to remove duplicates in KNIME before, but it didn’t go well and I was running out of time, so I put the step inside Power BI instead.

You can use the Duplicate Row Filter node to remove the duplicates from the product info table before joining it with the big 700K table. This will speed up your workflow, solve the problem with multiple sheets, and make the Power BI step unnecessary.
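(Continuing the toy example from above — not your actual data — the Duplicate Row Filter roughly corresponds to a `drop_duplicates` on the key column before the merge:)

```python
import pandas as pd

purchases = pd.DataFrame({"Customer": ["Ann", "Bob", "Cat"],
                          "Product ID": [1, 1, 2]})
products = pd.DataFrame({"Product ID": [1, 1, 2],
                         "Product Name": ["Widget", "Widget (dup)", "Gadget"]})

# Equivalent in spirit to the Duplicate Row Filter: keep one row per Product ID
products_unique = products.drop_duplicates(subset="Product ID", keep="first")

joined = purchases.merge(products_unique, on="Product ID", how="left")
print(len(joined))  # back to 3 rows, one per purchase
```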


I tried the Duplicate Row Filter and the file is now so much smaller than before! Thank you so much!
