Flat File Document Parser Memory Issue

Hello,

I think I am running into a memory issue that I can't figure out. I have already increased my heap size as recommended in other threads, and changed the memory-intensive nodes to 'write all tables to disk' rather than only large tables.

My task: I have about 3,000 text files (most smaller than 1 MB) sorted by category on a server (about 500 categories). Each folder on the server therefore contains between 3 and 10 files, and the largest folder I've seen so far totals 15 MB. I am trying to count how often a phrase appears in each category. The phrase can occur multiple times in each file, so the result is essentially a sum over the folder.
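To make the target result concrete, here is a minimal sketch of the same count outside KNIME. It assumes the files are already local in a layout of one sub-folder per category, that matching is case-insensitive and non-overlapping, and that the files are UTF-8; the function names are made up for illustration and are not part of any KNIME API.

```python
from pathlib import Path
from typing import Dict


def count_phrase_in_text(text: str, phrase: str) -> int:
    """Count non-overlapping, case-insensitive occurrences of phrase in text."""
    if not phrase:
        return 0
    return text.lower().count(phrase.lower())


def count_phrase_by_category(root: Path, phrase: str) -> Dict[str, int]:
    """Sum phrase occurrences over all .txt files in each category folder.

    Assumption: root contains one sub-folder per category.
    """
    totals: Dict[str, int] = {}
    for folder in sorted(p for p in root.iterdir() if p.is_dir()):
        total = 0
        for txt in sorted(folder.glob("*.txt")):
            # Read one file at a time so only a single file is held in memory.
            text = txt.read_text(encoding="utf-8", errors="replace")
            total += count_phrase_in_text(text, phrase)
        totals[folder.name] = total
    return totals
```

Calling `count_phrase_by_category(Path("downloads"), "some phrase")` would return one total per folder, which is the table I want to end up with in KNIME.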

Process: I can successfully generate a list of all the files on the server in each folder/category. I then iterate through each folder/category: I download all the .txt files to my local machine, read them with the Flat File Document Parser, and use the TF node to get my counts. I then delete the files, store the total count, and move on to the next folder.

Problem: This process works well for about 80 iterations. After that, the Flat File Document Parser bogs down. My heap usage never drops below about 6,000 MB, and KNIME becomes unresponsive. I assume I have a memory leak, but the only node that really bogs down is the Flat File Document Parser.

Any advice? Anyone encounter something like this before?

 

Hi ahardy,

One thing you can try is to remove the column containing the documents before the loop end (Column Filter). As far as I understand, you only need the folder/category, the phrase, and the count as columns in one row. Remove all other columns before ending the loop. This reduces the data that has to be collected at the end.

Is the Flat File Parser always bogging down at the same directory? Is it possible that some directories contain very large .txt files?

Cheers, Kilian

Thanks for the response.

Before the loop end I do a GroupBy and sum all the counts for that folder. I then store that aggregated number with a folder identifier and filter out everything else. So I don't think my end data set is the issue.

The Flat File Parser does not always bog down at the same file. I am currently working on driving my loop from a bash script that opens and closes KNIME for each folder iteration. It is not ideal because I lose flexibility, but it seems to be working. I would eventually like to be able to do this completely in KNIME.
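For anyone who wants to try the same restart-per-folder workaround, here is a rough sketch of the driver, written in Python rather than bash. It assumes KNIME's batch application (`org.knime.product.KNIME_BATCH_APPLICATION`) is available in your installation and that the workflow reads a String workflow variable called `folder`; that variable name, the executable path, and the workflow directory are placeholders, not my actual setup.

```python
import subprocess
from typing import Callable, Iterable, List


def build_batch_command(knime_exe: str, workflow_dir: str, folder: str) -> List[str]:
    """Build the command line for one headless KNIME run on a single folder.

    Assumption: the workflow has a String workflow variable named 'folder'.
    Folder names containing commas would break the variable syntax.
    """
    return [
        knime_exe,
        "-nosplash",
        "-reset",   # reset the workflow before executing it
        "-nosave",  # do not save the workflow after execution
        "-application", "org.knime.product.KNIME_BATCH_APPLICATION",
        f"-workflowDir={workflow_dir}",
        f"-workflow.variable=folder,{folder},String",
    ]


def run_all(
    knime_exe: str,
    workflow_dir: str,
    folders: Iterable[str],
    runner: Callable = subprocess.run,
) -> List[str]:
    """Run the workflow once per folder in a fresh process; return failed folders."""
    failed = []
    for folder in folders:
        # A new process per folder means the JVM heap is released after each run.
        result = runner(build_batch_command(knime_exe, workflow_dir, folder), check=False)
        if result.returncode != 0:
            failed.append(folder)
    return failed
```

Because every folder gets its own process, whatever is holding on to memory cannot accumulate across iterations, which is why this helps even though it does not explain the underlying problem.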

Any other ideas?

Next thing you can try is to reduce the number of parallel threads that KNIME is using.

File->Preferences->KNIME->Maximum working threads for all nodes

Set this value to 2.

All parser nodes are parallelized and use all available threads. If the Flat File Reader reads several big files at the same time and creates documents from them, memory can run too low. Try using only 2 threads. In my experiment, peak memory usage was a lot lower. Please let me know if this works for you.
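To illustrate why fewer threads can lower peak memory, here is a small Python sketch of the general idea. It is not how KNIME is implemented internally; the names `parse_all` and `ConcurrencyProbe` are made up for this example. With a pool of 2 workers, at most 2 inputs are being parsed (and held in memory) at any moment.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class ConcurrencyProbe:
    """Records how many tasks were running at the same time (for illustration)."""

    def __init__(self):
        self._lock = threading.Lock()
        self.active = 0
        self.peak = 0

    def __enter__(self):
        with self._lock:
            self.active += 1
            self.peak = max(self.peak, self.active)
        return self

    def __exit__(self, *exc):
        with self._lock:
            self.active -= 1


def parse_all(items, parse, max_threads=2):
    """Apply parse to every item using at most max_threads worker threads.

    Results are returned in input order. Roughly, peak memory is bounded by
    max_threads times the memory needed for the largest single item.
    """
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        return list(pool.map(parse, items))
```

If each call to `parse` enters a shared `ConcurrencyProbe`, its `peak` attribute never exceeds `max_threads`, which is the effect the thread setting has on the parser nodes.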

Cheers, Kilian