Potential bug or inappropriate usage of bit vector exclusive-or

While extending my BitSet integration and BitVector extension, I've either come across a bug or managed to use something in a way it wasn't intended. 

 

If I have a BitVector column where the lengths vary, the GroupBy node fails on the "Bit vector exclusive-or" aggregation method if the first BitVector is shorter than a subsequent one. 

 

Here is the console output:

 

ERROR     GroupBy                            Execute failed: Index ('4') too large for vector of length 4
DEBUG     GroupBy                            Execute failed: Index ('4') too large for vector of length 4
java.lang.ArrayIndexOutOfBoundsException: Index ('4') too large for vector of length 4
    at org.knime.core.data.vector.bitvector.DenseBitVector.get(DenseBitVector.java:549)
    at org.knime.core.data.vector.bitvector.DenseBitVectorCellFactory.get(DenseBitVectorCellFactory.java:312)
    at org.knime.base.data.aggregation.bitvector.BitVectorXOrOperator.computeInternal(BitVectorXOrOperator.java:120)
    at org.knime.base.data.aggregation.AggregationOperator.computeInternal(AggregationOperator.java:343)
    at org.knime.base.data.aggregation.AggregationOperator.compute(AggregationOperator.java:306)
    at org.knime.base.node.preproc.groupby.BigGroupByTable.createGroupByTable(BigGroupByTable.java:272)
    at org.knime.base.node.preproc.groupby.GroupByTable.<init>(GroupByTable.java:217)
    at org.knime.base.node.preproc.groupby.GroupByTable.<init>(GroupByTable.java:120)
    at org.knime.base.node.preproc.groupby.BigGroupByTable.<init>(BigGroupByTable.java:118)
    at org.knime.base.node.preproc.groupby.GroupByNodeModel.createGroupByTable(GroupByNodeModel.java:683)
    at org.knime.base.node.preproc.groupby.GroupByNodeModel.createGroupByTable(GroupByNodeModel.java:646)
    at org.knime.base.node.preproc.groupby.GroupByNodeModel.createGroupByTable(GroupByNodeModel.java:627)
    at org.knime.base.node.preproc.groupby.GroupByNodeModel.execute(GroupByNodeModel.java:609)
    at org.knime.core.node.NodeModel.executeModel(NodeModel.java:555)
    at org.knime.core.node.Node.invokeFullyNodeModelExecute(Node.java:1131)
    at org.knime.core.node.Node.execute(Node.java:927)
    at org.knime.core.node.workflow.NativeNodeContainer.performExecuteNode(NativeNodeContainer.java:559)
    at org.knime.core.node.exec.LocalNodeExecutionJob.mainExecute(LocalNodeExecutionJob.java:95)
    at org.knime.core.node.workflow.NodeExecutionJob.internalRun(NodeExecutionJob.java:179)
    at org.knime.core.node.workflow.NodeExecutionJob.run(NodeExecutionJob.java:110)
    at org.knime.core.util.ThreadUtils$RunnableWithContextImpl.runWithContext(ThreadUtils.java:328)
    at org.knime.core.util.ThreadUtils$RunnableWithContext.run(ThreadUtils.java:204)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
    at java.util.concurrent.FutureTask.run(FutureTask.java:166)
    at org.knime.core.util.ThreadPool$MyFuture.run(ThreadPool.java:125)
    at org.knime.core.util.ThreadPool$Worker.run(ThreadPool.java:248)

I have attached an example workflow. 

Let me know if you need anything else to recreate the issue. I'm using the 2.11.2 SDK. 

 

Cheers

Sam

 

Hi Sam,

The current implementation assumes that all BitVectors in a column have the same length. That is why the node fails if you have a column with BitVectors of different lengths. I wonder whether BitVectors of varying length make sense. How do you compare them? Would you just ignore missing positions?

Thanks

Tobias

When dealing with BitSets (java.util.BitSet) of different lengths, all the bits between the length of the smaller and the length of the larger are assumed to be false. 

Here is an example of the behaviour for BitSet:

BitStrings
setOne {68} - length: 69 - size: 128
setTwo {68, 256} - length: 257 - size: 320

Operations
setOne.xor(setTwo) = {256} - length: 257 - size: 320
setOne.and(setTwo) = {68} - length: 69 - size: 128
setOne.or(setTwo) = {68, 256} - length: 257 - size: 320
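
For anyone who wants to reproduce the listing above, here is a minimal sketch using only java.util.BitSet. The class name BitSetLengthDemo and the helper methods are made up for this post. Each helper works on a copy, so setOne is not modified between operations.

```java
import java.util.BitSet;

class BitSetLengthDemo {

    /** Builds a BitSet with the given bit positions set. */
    static BitSet of(int... positions) {
        BitSet bits = new BitSet();
        for (int p : positions) {
            bits.set(p);
        }
        return bits;
    }

    /** Returns a XOR b without modifying either argument. */
    static BitSet xor(BitSet a, BitSet b) {
        BitSet result = (BitSet) a.clone();
        result.xor(b);
        return result;
    }

    /** Returns a AND b without modifying either argument. */
    static BitSet and(BitSet a, BitSet b) {
        BitSet result = (BitSet) a.clone();
        result.and(b);
        return result;
    }

    /** Returns a OR b without modifying either argument. */
    static BitSet or(BitSet a, BitSet b) {
        BitSet result = (BitSet) a.clone();
        result.or(b);
        return result;
    }

    public static void main(String[] args) {
        BitSet setOne = of(68);
        BitSet setTwo = of(68, 256);

        // Bits beyond the shorter set's length are treated as false.
        BitSet x = xor(setOne, setTwo);
        BitSet a = and(setOne, setTwo);
        BitSet o = or(setOne, setTwo);

        System.out.println("xor = " + x + " - length: " + x.length());
        System.out.println("and = " + a + " - length: " + a.length());
        System.out.println("or  = " + o + " - length: " + o.length());
    }
}
```

I left size() out of the printout on purpose: it reports the allocated storage, which depends on the implementation and on whether the set was cloned, so it is not a reliable indicator of the logical length.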

-----

Whether I should be creating a column with BitVectors of varying length is another point. When a descriptive fingerprint is used, its length is fixed and known for all rows at the start, so creating a fixed-length fingerprint consistently is no problem. 

How about when using a BitString to represent a list of atoms where the molecules differ per row? The length could be dictated by the number of atoms in the molecule. So when grouping on a molecule ID each group would be given a set of BitStrings with the same length but different groups would have different lengths. This shouldn't pose a problem for the aggregator. 

Variable-length BitStrings also cause a problem when expanding into columns. The table would need to be scanned first to find the maximum length, unless the user defines the length in the dialog (the node I built to expand a BitSet offers both options). 
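
Roughly, the two options look like this. This is only a sketch on plain boolean arrays, not the code of my node; the class BitVectorExpander and its methods are hypothetical names, and rejecting rows longer than a user-defined width is my assumption about sensible behaviour.

```java
import java.util.Arrays;
import java.util.List;

class BitVectorExpander {

    /** First pass over the table: find the longest bit vector. */
    static int maxLength(List<boolean[]> rows) {
        int max = 0;
        for (boolean[] row : rows) {
            max = Math.max(max, row.length);
        }
        return max;
    }

    /**
     * Expands every row to exactly 'width' columns, padding short rows with false.
     * 'width' is either user-defined or the result of maxLength(rows).
     */
    static boolean[][] expand(List<boolean[]> rows, int width) {
        boolean[][] table = new boolean[rows.size()][];
        for (int r = 0; r < rows.size(); r++) {
            boolean[] row = rows.get(r);
            if (row.length > width) {
                // Assumption: fail loudly rather than silently truncate.
                throw new IllegalArgumentException("Row " + r + " has length "
                        + row.length + " but only " + width + " columns were requested");
            }
            // Arrays.copyOf pads the new positions with false.
            table[r] = Arrays.copyOf(row, width);
        }
        return table;
    }
}
```

With a user-defined width the first pass can be skipped, which matters for large tables.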

Creating variable length fingerprints may fall into the category of just because I can do it doesn't mean I should. 

Hi Sam,

Thanks for the answer. The size of the BitVector can differ between groups as long as it is the same within a group. I will open a feature request to also support BitVectors with varying lengths. If this is not easy to achieve, the node should at least throw a more legible error message.
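
To make the two options concrete, here is a rough sketch in plain Java on boolean arrays. It is not the actual GroupBy operator code; the class VaryingLengthXor and both method names are invented for illustration. xorPadded treats positions missing from a shorter vector as false (the java.util.BitSet convention described above), while xorStrict keeps the same-length requirement but reports which row violates it.

```java
import java.util.List;

class VaryingLengthXor {

    /** XORs all vectors; positions past the end of a shorter vector count as false. */
    static boolean[] xorPadded(List<boolean[]> vectors) {
        int maxLength = 0;
        for (boolean[] v : vectors) {
            maxLength = Math.max(maxLength, v.length);
        }
        boolean[] result = new boolean[maxLength];
        for (boolean[] v : vectors) {
            // Only iterate over positions that exist in this vector.
            for (int i = 0; i < v.length; i++) {
                result[i] ^= v[i];
            }
        }
        return result;
    }

    /** XORs all vectors but rejects mixed lengths with a descriptive message. */
    static boolean[] xorStrict(List<boolean[]> vectors) {
        if (vectors.isEmpty()) {
            return new boolean[0];
        }
        int expected = vectors.get(0).length;
        boolean[] result = new boolean[expected];
        for (int row = 0; row < vectors.size(); row++) {
            boolean[] v = vectors.get(row);
            if (v.length != expected) {
                throw new IllegalArgumentException("Bit vector " + row
                        + " in the group has length " + v.length
                        + " but the first bit vector has length " + expected
                        + "; all bit vectors in a group must have the same length");
            }
            for (int i = 0; i < expected; i++) {
                result[i] ^= v[i];
            }
        }
        return result;
    }
}
```

With a first vector of length 4 followed by one of length 5, as in the reported case, xorPadded returns a vector of length 5 and xorStrict fails with a message that names the offending vector.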

Bye,

Tobias