Speedup for ARFF Reader


(I hope this is the right forum to post feedback on the KNIME source code. I also wanted to ask whether there is a more "direct" way to suggest patches.)

I ran into a potential bottleneck in the ARFF Reader.

In the function extractNominalValues all values are extracted from the ARFF Header and stored in a Vector. To ensure that the values are unique it is verified that each value occurs only once.

This verification step is pretty expensive (quadratic, I think, since each new value is compared against all values collected so far) for large domains. However, it can easily be made more efficient by replacing the Vector with a Set.
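A rough sketch of what I mean, in plain Java. The class and method names (NominalValueCheck, collectWithVector, collectWithSet) are made up for illustration and are not the actual KNIME code, and I am only assuming that a duplicate should be reported as an error:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.Vector;

public class NominalValueCheck {

    // Vector-based check: contains() scans the whole Vector for every
    // new value, so n values cost on the order of n*n comparisons.
    static Vector<String> collectWithVector(List<String> raw) {
        Vector<String> values = new Vector<>();
        for (String v : raw) {
            if (values.contains(v)) {
                throw new IllegalArgumentException("Duplicate nominal value: " + v);
            }
            values.add(v);
        }
        return values;
    }

    // Set-based check: HashSet.add() returns false if the value is
    // already present, so each value needs only one hash lookup.
    static Set<String> collectWithSet(List<String> raw) {
        Set<String> values = new HashSet<>();
        for (String v : raw) {
            if (!values.add(v)) {
                throw new IllegalArgumentException("Duplicate nominal value: " + v);
            }
        }
        return values;
    }

    public static void main(String[] args) {
        List<String> raw = Arrays.asList("red", "green", "blue");
        System.out.println(collectWithVector(raw).size());
        System.out.println(collectWithSet(raw).size());
    }
}
```

Both variants accept and reject the same inputs; only the cost of the duplicate check differs.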

As far as I know, the order of the attributes is crucial in the context of Weka (nodes). As I do not know whether KNIME respects the order internally, I'm not sure whether to suggest to use a HashSet or an OrderdSet.


cheers Ingo

suggest to use a HashSet or an OrderdSet

I guess you meant LinkedHashSet.

True, there is room for improvement. We'll see if we can improve this a bit. Usually domains are rather small (<100 values) and in those cases you won't notice a difference between using an array or a set.
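For illustration, a minimal sketch of how a LinkedHashSet keeps the values unique while preserving the order in which they appear in the header. The class and method names (OrderedNominalValues, uniqueInHeaderOrder) are hypothetical and this is not the ARFF Reader's actual code. Throwing on a duplicate is also just an assumption:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class OrderedNominalValues {

    // Collects the values in the order given and rejects duplicates.
    // LinkedHashSet iterates in insertion order, unlike a plain HashSet.
    static List<String> uniqueInHeaderOrder(List<String> raw) {
        Set<String> seen = new LinkedHashSet<>();
        for (String v : raw) {
            if (!seen.add(v)) {
                throw new IllegalArgumentException("Duplicate nominal value: " + v);
            }
        }
        // Copy into a list in case callers need index-based access.
        return new ArrayList<>(seen);
    }

    public static void main(String[] args) {
        List<String> raw = Arrays.asList("low", "medium", "high");
        System.out.println(uniqueInHeaderOrder(raw));
    }
}
```

The duplicate check stays a constant-time lookup per value, and the resulting list has the same order as the input.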