duplicate checker parameter

baj · November 4, 2010, 11:07am

In the DuplicateChecker there is a constant called MAX_CHUNK_SIZE. Is this somehow changeable?

I am using AppendedRowsWithOptionalIn and running into GC problems (I guess): KNIME crashes and the log says: ParallelGCFailedAllocation

I am appending 3 tables with some 60 million rows in total and many duplicate row_ids...

I am not sure whether changing this constant would help much with the GC problem, but it would most probably solve my "too many open files" problem...

Thanks,

Bernd

thor · November 4, 2010, 11:41am

No, this parameter is not changeable. Changing it would also not fix problems with too many open files, since that is handled by another constant (MAX_STREAMS, which is 50). Also, the number of duplicate keys has no influence on the memory requirements.

wiswedel · November 5, 2010, 9:03am

The concatenate nodes keep a hash of previously seen rowIDs while iterating over the different tables (in order to ensure uniqueness). Are the rowIDs in your case long strings (longer than just "Row1", "Row2")? If so, you could try generating new (short!) rowIDs first using a Java Snippet + RowID node and feed that corrected input to the concatenate node.
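To make the memory argument concrete, here is a minimal sketch of the general idea of keeping a hash of seen rowIDs while iterating over several tables. It is not the KNIME implementation: the class `RowIdDeduplicator`, its methods, and the `_dup` suffix are made up for illustration. The point is only that every distinct rowID stays in memory until the end, so shorter rowIDs mean a smaller set.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Illustrative only; hypothetical names, not KNIME's concatenate code. */
final class RowIdDeduplicator {
    // Every distinct row ID seen so far is held here until iteration ends.
    private final Set<String> seen = new HashSet<>();

    /** Returns the ID unchanged if it is new, otherwise a suffixed variant. */
    String uniquify(String rowId) {
        String candidate = rowId;
        int suffix = 1;
        // Set.add returns false when the element is already present.
        while (!seen.add(candidate)) {
            candidate = rowId + "_dup" + suffix++;
        }
        return candidate;
    }

    /** Concatenates the row IDs of all tables, making each one unique. */
    static List<String> concatenate(List<List<String>> tables) {
        RowIdDeduplicator dedup = new RowIdDeduplicator();
        List<String> out = new ArrayList<>();
        for (List<String> table : tables) {
            for (String id : table) {
                out.add(dedup.uniquify(id));
            }
        }
        return out;
    }
}
```

Because the decision for each row depends on all rows seen before it, the set cannot be discarded or flushed while the tables are still being read.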

This is of course only a workaround. There is an open bug for the concatenate node ... we plan to add an option "Fail on duplicates", where the user then needs to assert that there are no duplicates (we have efficient data structures to test for duplicates at the end -- unfortunately that data structure can't be used for deduplication while it's iterating over the tables).
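One common way to test for duplicates only at the end with bounded memory is to sort the keys in chunks and then merge the sorted chunks, looking for equal neighbours. The sketch below assumes that technique; it is not taken from the KNIME source. In-memory lists stand in for the temporary chunk files, and `ChunkedDuplicateCheck`, `findDuplicate` and `chunkSize` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

/** Illustrative sort-and-merge duplicate test; lists stand in for files. */
final class ChunkedDuplicateCheck {
    /** One sorted chunk plus the key currently at its head. */
    private static final class Cursor implements Comparable<Cursor> {
        private final Iterator<String> rest;
        private String head;

        Cursor(Iterator<String> it) {
            rest = it;
            head = it.next(); // chunks are never empty
        }

        boolean advance() {
            if (!rest.hasNext()) {
                return false;
            }
            head = rest.next();
            return true;
        }

        @Override
        public int compareTo(Cursor other) {
            return head.compareTo(other.head);
        }
    }

    /** Returns a duplicated key, or null if all keys are unique. chunkSize must be >= 1. */
    static String findDuplicate(Iterable<String> keys, int chunkSize) {
        // Phase 1: cut the key stream into sorted chunks of bounded size.
        List<List<String>> chunks = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (String key : keys) {
            current.add(key);
            if (current.size() == chunkSize) {
                Collections.sort(current);
                chunks.add(current);
                current = new ArrayList<>();
            }
        }
        if (!current.isEmpty()) {
            Collections.sort(current);
            chunks.add(current);
        }

        // Phase 2: k-way merge; equal keys come out next to each other.
        PriorityQueue<Cursor> heap = new PriorityQueue<>();
        for (List<String> chunk : chunks) {
            heap.add(new Cursor(chunk.iterator()));
        }
        String previous = null;
        while (!heap.isEmpty()) {
            Cursor c = heap.poll();
            if (c.head.equals(previous)) {
                return c.head;
            }
            previous = c.head;
            if (c.advance()) {
                heap.add(c);
            }
        }
        return null;
    }
}
```

A duplicate only becomes visible during the merge, after all keys have been collected, which is why a structure like this can report "there are duplicates" but cannot rename rows while the tables are still being iterated.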

baj · November 5, 2010, 9:29am

I think I am running into two problems: one is the GC problem that crashes the KNIME/Eclipse application and one with the open file pointers.

The first problem I cannot solve and I have no clue how to get around it... I have a workaround that works on split data sets, i.e. separating the complete data on a per-chromosome basis. And that seems to work...

The problem with the file pointers (this is not described here but in a different, much older issue) can, I strongly believe, be overcome by adjusting the MAX_CHUNK_SIZE constant. I actually tried it and saw that the files become larger and fewer...
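The "larger files and fewer files" observation is simple arithmetic if one assumes that each chunk of row keys ends up in one temporary file. The sketch below uses made-up chunk sizes (the real value of MAX_CHUNK_SIZE is not given in this thread) and a hypothetical class name `ChunkFileEstimate`. Only the 60 million rows and MAX_STREAMS = 50 come from the posts above.

```java
/** Back-of-the-envelope estimate; assumes one temporary file per chunk. */
final class ChunkFileEstimate {
    /** Number of chunk files needed for the given row count (ceiling division). */
    static long chunkFiles(long rows, long chunkSize) {
        return (rows + chunkSize - 1) / chunkSize;
    }

    /** Files open at once if a separate stream limit caps concurrent reads. */
    static long openAtOnce(long files, int maxStreams) {
        return Math.min(files, maxStreams);
    }

    public static void main(String[] args) {
        long rows = 60_000_000L;   // from the first post
        int maxStreams = 50;       // MAX_STREAMS as quoted in the thread
        // Both chunk sizes are hypothetical example values.
        for (long chunkSize : new long[] {100_000L, 1_000_000L}) {
            long files = chunkFiles(rows, chunkSize);
            System.out.println("chunkSize=" + chunkSize
                    + " files=" + files
                    + " openAtOnce=" + openAtOnce(files, maxStreams));
        }
    }
}
```

Under these assumptions a tenfold larger chunk gives a tenfold smaller file count, while the number of files open at the same time stays bounded by the stream limit either way.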

In my case there are two general scenarios:

1. the preprocessing of the files: this has to handle a couple of hundred million rows (and only up to 10 or 20 columns) in a timely fashion. These tasks clearly don't need to check for duplicated IDs, and for me this is a great waste of time. It actually boils down to several hours of unnecessary CPU time (and I/O time). The latter is especially nasty because my temp folder is on a remote disk with a 100MB connection...

2. post-processing: here I am dealing with only hundreds of thousands of rows and potentially ~100 columns (usually far fewer). Here I am interested in interacting with the data (i.e. highlighting/brushing).

My proposal would be to have a variable like the "expert_mode" for deactivating the duplicate checker, which would simply skip the call to addRowKeyForDuplicateCheck(key) in DataContainer.java

Would this be something that you could consider discussing?
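As a rough illustration of what such a switch could look like, here is a self-contained sketch. Nothing in it is existing KNIME API: `SketchContainer`, the system property name `sketch.skipDuplicateCheck`, and the in-memory set are all hypothetical stand-ins, and the real DataContainer is certainly more involved.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical container with an opt-out for the duplicate key check. */
final class SketchContainer {
    private final boolean checkDuplicates;
    private final Set<String> keys = new HashSet<>();
    private final List<String> rows = new ArrayList<>();

    SketchContainer(boolean checkDuplicates) {
        this.checkDuplicates = checkDuplicates;
    }

    /** Reads the (made-up) property; the check stays on unless it is "true". */
    static SketchContainer fromSystemProperty() {
        boolean skip = Boolean.getBoolean("sketch.skipDuplicateCheck");
        return new SketchContainer(!skip);
    }

    /** Adds a row; rejects duplicate keys only when the check is enabled. */
    void addRow(String key) {
        if (checkDuplicates && !keys.add(key)) {
            throw new IllegalArgumentException("Duplicate row key: " + key);
        }
        rows.add(key);
    }

    int size() {
        return rows.size();
    }
}
```

With the check disabled, no keys are stored at all, which is where the saved CPU and I/O time would come from. The price is that duplicate keys pass through silently, so the caller has to guarantee uniqueness.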