Row Concatenate and >10M rows

Hi all,

Further to Bug 2078, I am seeing similar behaviour with the Row Concatenate node for large datasets (I have ~16.4 million rows of short-read sequence data to process) with KNIME 2.0.3 and KNIME 2.1.0.

I’ve worked around the problem via an external SQL database, but thought that a simple fix might be available.

Thanks in advance,

Hi,

Thanks for reporting this. We have opened a bug and will fix this in future versions. Just to clarify what is causing this problem: tables in KNIME have unique row IDs, so nodes that generate new IDs (such as RowID, Concatenate, or File Reader) need to ensure that the IDs are indeed unique (and if they are not, they need to be “uniquified”). This is done via a hash set that keeps all RowIDs in main memory, which is what becomes expensive for tables of this size.
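
To picture the cost, the uniquification can be sketched roughly like the Java below (illustrative only, not the actual KNIME source; the class name and the suffix scheme are made up). The key point is that every RowID ever emitted has to stay in main memory, which is what breaks down at tens of millions of rows.

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch of hash-set based RowID uniquification (not KNIME source code).
// Every key ever emitted is remembered, so memory grows linearly with the row count.
public final class RowIdUniquifier {

    private final Set<String> seen = new HashSet<>();

    /** Returns the key itself if unseen, otherwise the key with a suffix appended. */
    public String makeUnique(final String key) {
        if (seen.add(key)) {
            return key;                          // first occurrence, keep as-is
        }
        int i = 1;
        String candidate = key + "_dup" + i;     // hypothetical suffix scheme
        while (!seen.add(candidate)) {
            i++;
            candidate = key + "_dup" + i;
        }
        return candidate;
    }
}
```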

Our fix for the RowID node (fix #2078) is to leave the responsibility to the user: the user can specify that the generated IDs are unique (if they are not, however, the node will fail during execution). This does not require the expensive hash set. The fix for the Concatenate node will be of a similar kind.
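
Conceptually, the opt-out could look like the following sketch (again only illustrative, assuming a boolean dialog option called assumeUnique; this is not the actual node code). When the user vouches for uniqueness, the set is never allocated; if the promise is broken, the duplicate surfaces as a failure later during execution.

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch only -- not the actual node implementation.
// 'assumeUnique' stands in for the new dialog option described above.
public final class KeyHandler {

    private final Set<String> seen; // only allocated when uniqueness is NOT guaranteed

    public KeyHandler(final boolean assumeUnique) {
        this.seen = assumeUnique ? null : new HashSet<>();
    }

    public String handle(final String key) {
        if (seen == null) {
            // User asserted uniqueness: no hash set, constant memory. A duplicate
            // key would only show up later as an execution failure.
            return key;
        }
        // Old behaviour: remember every key and rename duplicates.
        String candidate = key;
        int i = 1;
        while (!seen.add(candidate)) {
            candidate = key + "_dup" + i++;      // hypothetical suffix scheme
        }
        return candidate;
    }
}
```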

Thanks,
Bernd

Could you include an option to re-initialize the row indices? That way we wouldn’t have to take care of this ourselves.
Meaning: if table one has two rows with IDs id_1 and id_3, and table two has id_2, id_3, and id_4, then after merging they could have id_1, id_2, id_3, id_4, id_5.
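
In other words, something like the following sketch (illustrative Java, not KNIME code): the concatenation ignores the incoming keys and assigns fresh sequential IDs, so only a counter is needed instead of a hash set of all keys.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of re-initializing RowIDs during concatenation (not KNIME code).
// Incoming IDs are discarded and fresh sequential IDs are assigned, so only a counter
// is required. Example from above: {id_1, id_3} + {id_2, id_3, id_4} -> id_1 ... id_5.
public final class ReinitializingConcatenator {

    /** Concatenates the key lists of several tables, assigning new IDs id_1, id_2, ... */
    public static List<String> concatenate(final List<List<String>> tables) {
        List<String> result = new ArrayList<>();
        long counter = 1;
        for (List<String> table : tables) {
            for (String ignoredOldKey : table) {
                result.add("id_" + counter++);   // old key discarded, new ID always unique
            }
        }
        return result;
    }
}
```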

I guess you can conclude from my suggestion that I am not sure what to do with the unique row IDs. Is there a description of when and where they are used?

Thanks,

Bernd

“Is there a description of when and where they are used?”
You mean, what the unique RowIDs are used for? That’s a built-in requirement of KNIME. They are used, e.g., to determine the set of records to hilite (which is why they need to be unique).
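
For intuition, hiliting can be pictured as a shared set of selected RowIDs that all views consult; below is a minimal conceptual sketch (not the actual KNIME HiLiteHandler API). If two rows shared an ID, hiliting one of them would unavoidably select the other as well.

```java
import java.util.HashSet;
import java.util.Set;

// Conceptual sketch of why RowIDs must be unique: hilite state is communicated
// between views as a set of RowIDs, so each ID must identify exactly one record.
// (Not the actual KNIME HiLiteHandler API.)
public final class SimpleHiliteModel {

    private final Set<String> hilitKeys = new HashSet<>();

    public void hilite(final String rowId) {
        hilitKeys.add(rowId);   // a non-unique ID would select several rows at once
    }

    public boolean isHilit(final String rowId) {
        return hilitKeys.contains(rowId);
    }
}
```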