How to merge large files without memory overflow in KNIME?

I have created a workflow to merge files from two file groups, Group1 and Group2.

This merge would be simple if I could just concatenate all the rows into one file per group, but I'm worried about a memory overflow error since there are too many files and they are too large.

The real data of Group1 has over 10,000 files of 5 MB each.
Group2 has 700 files of 100 MB each (each containing over 100,000 rows).

I tested my workflow, shown in the screenshot below.
(screenshot: test workflow)

Data of the Group1 CSV Reader:
(screenshot)

Data of the Group2 CSV Reader:
(screenshot)

See the first column of both groups, which is the merge key column containing a date&time string.
My task is to build the full data of Group1, which means I need to merge Group2 into Group1.

To achieve this merge, the expected operation in detail would be (a rough code sketch of this logic follows the list):

Step 1. Take the first row of the Group1 CSV.
Step 2. Search the rows of the Group2 CSVs:
If the key column matches, merge the matching Group2 columns into the Group1 row.
If the current CSV has no match on the key column, go to the next CSV of Group2, until a match is found or the last row of the last CSV in Group2 is reached.
Step 3. Go to the next row of the Group1 CSV until the last row, repeating Step 2.
Step 4. Go to the next CSV of Group1 and return to Step 1.
Step 5. End at the last CSV of Group1.
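
Just to make my intended logic concrete, here is a rough Python/pandas sketch of the steps above (the folder paths and the key column name `timestamp` are only placeholders, not my real schema):

```python
import glob
import pandas as pd

# Placeholder paths and key column name, just to illustrate the logic.
group1_files = sorted(glob.glob("Group1/*.csv"))
group2_files = sorted(glob.glob("Group2/*.csv"))
KEY = "timestamp"

merged_rows = []
for g1_path in group1_files:                  # Step 4: next CSV of Group1
    g1 = pd.read_csv(g1_path)
    for _, row in g1.iterrows():              # Steps 1 and 3: row by row
        for g2_path in group2_files:          # Step 2: scan the Group2 CSVs
            g2 = pd.read_csv(g2_path)         # re-read for every row: very slow
            match = g2[g2[KEY] == row[KEY]]
            if not match.empty:
                # Merge the matching Group2 columns into the Group1 row.
                merged_rows.append({**row.to_dict(), **match.iloc[0].to_dict()})
                break

result = pd.DataFrame(merged_rows)
```

Re-reading every Group2 CSV for every Group1 row is obviously very slow, which is exactly why I'm afraid this naive approach won't scale.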

This operation is too complicated for me to build in KNIME.
Or perhaps there is a better way to do this large file merge?

How can I complete this large file merge in KNIME? Please give me some advice.

Thanks in advance!

I have uploaded my test workflow here. You may check it and use the sample data.

XtoY_Merger.knwf (1.1 MB)

Hi @qianyi,

looking at your workflow, you are trying to merge log-file entries into possibly every row of your Group1 data.

In my view this is very time consuming. The best approach may be to reduce the number of files in Group2 by looking at the file modification timestamp (File Meta Info node) to filter out the files you don't have to scan. Nevertheless, the workflow will still be very time consuming.
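
To make that prefiltering idea concrete outside of KNIME, here is a minimal Python sketch (the 7-day cutoff and the folder path are just assumptions for illustration) of the kind of filtering a File Meta Info node plus a row filter could do:

```python
import glob
import os
import time

# Hypothetical: keep only Group2 files modified within the last 7 days,
# so fewer files need to be scanned during the merge.
cutoff = time.time() - 7 * 24 * 3600
group2_files = [
    f for f in glob.glob("Group2/*.csv")
    if os.path.getmtime(f) >= cutoff
]
print(f"{len(group2_files)} of the Group2 files still need to be scanned")
```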

The other option is to concatenate all files of Group1 together, do the same for Group2, and join these two very large files. I'm not sure whether this is feasible given the file size and memory usage.
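
Expressed as a minimal Python/pandas sketch (the paths and the key column name `timestamp` are assumptions), this second option would look roughly like the following; whether it fits in memory with ~70 GB of Group2 data is exactly the open question:

```python
import glob
import pandas as pd

KEY = "timestamp"  # assumed name of the date&time key column

# Concatenate each group into one table.
group1 = pd.concat((pd.read_csv(f) for f in glob.glob("Group1/*.csv")),
                   ignore_index=True)
group2 = pd.concat((pd.read_csv(f) for f in glob.glob("Group2/*.csv")),
                   ignore_index=True)

# One join on the key column, like KNIME's Joiner node would do.
merged = group1.merge(group2, on=KEY, how="left")
merged.to_csv("merged.csv", index=False)
```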


Hi @morpheus,

Thanks for your reply. Today I will first try to merge all the files of both groups to see whether it works well or not.
This is also time consuming because I need to run another workflow to produce these data before the merge. I may report the result next week…

I agree with morpheus. Make two tables, each containing all rows of Group1 and Group2 respectively, then join them on the key column using the Joiner node. KNIME stores large tables on disk by default. What will certainly help here is a fast SSD.

Another thing that could help: streaming execution.
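
To illustrate what streaming means here, a rough pandas sketch (the paths and the key column name are assumptions; it also assumes Group2, perhaps after the timestamp prefiltering suggested above, fits in memory once): Group1 is read in fixed-size chunks, and each chunk is joined and written out immediately, so only one chunk is ever held in memory.

```python
import glob
import pandas as pd

KEY = "timestamp"  # assumed key column name

# Assumption: Group2 fits in memory once; index it on the key
# so each chunk can be joined quickly.
group2 = pd.concat((pd.read_csv(f) for f in glob.glob("Group2/*.csv")),
                   ignore_index=True).set_index(KEY)

first = True
for g1_path in sorted(glob.glob("Group1/*.csv")):
    # Stream Group1 in 50,000-row chunks instead of loading it whole.
    for chunk in pd.read_csv(g1_path, chunksize=50_000):
        merged = chunk.join(group2, on=KEY, how="left")
        merged.to_csv("merged.csv", mode="w" if first else "a",
                      header=first, index=False)
        first = False
```

KNIME's streaming executor applies a similar principle inside a workflow: rows flow through the nodes in batches instead of full tables being materialized at every node.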


Hi @beginner,

I really am a beginner with KNIME … :blush:

It’s good to know that KNIME stores large tables on disk.

I found the "Directory for temporary files" setting in Preferences, where the default is:
C:\Users\UserFolder\AppData\Local\Temp\

Is KNIME storing tables in this temporary directory?

I’m considering changing this directory to D: since the free space on C: is under 30 GB now.
(Please advise me whether 30 GB is enough even when operating on a large table of over 70 GB of data, because C: is the SSD…)

I don’t know about streaming execution, so I will study it. Thanks for this tip!

The data is stored within the workflow, which means in the workspace directory.


That directory can get very large.

With 30 GB of free space and a 70 GB table, you will most likely run into a problem, assuming your workspace is on the SSD. SSDs also tend to get very, very slow when nearly full. I would never go below 10 GB of free space (the best way to ensure that is to increase the over-provisioning of the SSD).

My suggestion: get a big, fast second SSD for KNIME if you often work with large datasets like these.


Hi there @qianyi!

In addition to the advice you have already received, and in case you haven't come across it yet, there is a nice blog post about optimizing KNIME workflows which I highly recommend.

There you will find things like memory options and handling of temporary data, which might interest you the most :wink:

Br,
Ivan


Thanks for your advice. I strongly agree with you that a bigger SSD is necessary :sweat_smile:.

Recently I often work with large data, but I usually split the data into small files/tables for manipulation, so until now no large tables needed to be stored on the SSD…

Hi @ipazin

I already took a glance at that post, and it's very valuable to me :smiley:.
I will refer to it when optimizing and building my workflow.

Thanks!

