While doing some testing I am running into problems with the sorting algorithm...
I am trying to sort on a String column. There may be around 1,000,000 identical Strings in a table with about 250,000,000 rows and roughly 10 columns.
With 16GB of memory (-Xmx) I run into a GC problem after about 6-7 hours. There are also many open files, which causes a separate problem. I have the "cells in memory" variable set to 1,000,000; the only effect I could observe from this variable was that the VM crashes when it is set too high (100,000,000 with 16GB -Xmx).
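For reference, this is roughly how I start KNIME to watch the GC behaviour. The -vmargs part is the usual Eclipse-style way of passing options to the JVM; the log path is just an example:

```
# Hypothetical launcher invocation for diagnosing GC pressure.
# Everything after -vmargs goes to the JVM; the heap size matches
# the 16GB mentioned above, the log path is made up.
./knime -vmargs -Xmx16g \
        -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
        -Xloggc:/tmp/knime-gc.log
```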
Could you please have a look and check whether the number of simultaneously open files can be reduced, and whether the GC (garbage collection) problem can be avoided somehow?
Please let me know if you need any further information (I have log files and sample files to reproduce this but don't want to flood the system with them...).
May I add that I just repeated the sort that is now taking more than 10 hours: I used gawk to modify the input file (translating it so that one row = one record; the same thing is done within KNIME too) and then piped it through /bin/sort. Everything was done within 100 minutes, whereas the Java version is so much slower. Do you know what the reason for this might be? Would it be advisable to develop a different sort algorithm, or should I just use /bin/sort? It seems to me that even with the overhead of writing and reading the data, this would still beat the Java sort...
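The external pipeline was roughly of this shape (file name, separator, and the translation step are placeholders; the real gawk program depends on the input format, and plain awk stands in for gawk here):

```shell
# Sketch of the out-of-JVM sort described above (names are made up).
# awk reshapes the input so that one row = one record, then /bin/sort
# orders by the key column; sort spills to temp files and merges them,
# so it never needs the whole file in memory.
printf 'r1 banana\nr2 apple\nr3 cherry\n' > /tmp/demo_rows.txt  # tiny stand-in for the real file
awk '{ print }' /tmp/demo_rows.txt | sort -k2,2                 # sort on the 2nd field
```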
Ouch. Can you give some more details on the data set, specifically the column types and the average string length?
I've tried to reproduce the problem on a data set with the same dimensionality (250Mio x 10), where the data was generated using the "Data Generator" node. It works as expected: the "Data Generator" node takes 1:44h to execute and the "Sorter" 4:15h.
I also don't have a clue where all the open files could be coming from. The sorter does not open more than a specified number of files (it defaults to 40). Are BLOBs involved?
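As an aside, GNU sort bounds its open files the same way: it writes sorted chunks to temporary files and merges only a limited number of them at once. A small sketch (the tiny buffer and batch size are artificial, chosen just to force the chunk-and-merge behaviour; the sorter's default of 40 plays the role of --batch-size):

```shell
# GNU sort works like an external merge sort: sort chunks into temp
# files, then k-way-merge them. --batch-size caps how many temp files
# are open and merged simultaneously; -S shrinks the in-memory buffer
# so that even this small input gets split into chunks.
seq 100 | shuf > /tmp/demo_unsorted.txt
sort -n --batch-size=4 -S 1k /tmp/demo_unsorted.txt > /tmp/demo_sorted.txt
head -n3 /tmp/demo_sorted.txt
```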
Here are the log entries for the specific nodes:
tests_sort-2011-02-23-03-20.log:2011-02-23 07:05:20,272 INFO KNIME-Worker-0 LocalNodeExecutionJob : FastQReader 0:0:341 End execute (3 hours, 44 mins, 24 secs)
tests_sort-2011-02-23-03-20.log:2011-02-23 16:32:21,393 INFO KNIME-Worker-2 LocalNodeExecutionJob : Sorter 0:0:344 End execute (9 hours, 26 mins, 59 secs)
This is a sample record:
I am sorting on the 2nd field (i.e. column). I also tested with sort | uniq -c and there is at least one record that occurs 12M times...
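The duplicate check was essentially this (the tab separator and column index are assumptions about the sample format; on the real data one key showed up about 12M times):

```shell
# Count how often each value of the 2nd column occurs; the most
# frequent key comes out on top. Separator/column are assumed.
printf 'a\tX\nb\tX\nc\tY\n' > /tmp/demo_recs.txt   # tiny stand-in
cut -f2 /tmp/demo_recs.txt | sort | uniq -c | sort -rn
```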
With respect to the number of file pointers, let's forget about this for the moment... ;)
I am still surprised to see 9 hours instead of 100 minutes...
I can bring the data next week so we can have a closer look...