Options for applying an operation to individual cells of a large table

Aswin · May 16, 2019, 4:24pm

Hi knimers,

suppose we have a table with may rows and many similar columns. How does one efficiently apply an operation to every individual cell?

For example, suppose we have a 2000x500 table of Doubles. We want to multiply every cell in that table with 2. We could use a multi-column math node, but let’s try to avoid that node for now. We put our multiplication in a Java Snippet to make the solutions perfectly general. A beginning Knime-user will likely go for a Column List Loop:

It works, but it is very slow. On my pc, it needs 201 seconds, more than 3 minutes.

If we put the Column List Loop inside the Chunk Loop, the whole thing is a bit faster:

This one is done in 51 seconds.

A more efficient option is to loop over the table row by row, but then the row needs to be transposed within the loop, and then transposed back after the operation. To get our original row IDs back, we have to apply some node gymnastics:

Much faster: 25 seconds. But kind of ugly. Alternatively, we can assume that the table consists of single-member groups and use a Group Loop. The name of the group is our rowID.

It looks somewhat more elegant, but is slightly slower: 28 seconds.

What if we put all the values in a single column using an unpivot node, apply our operation, and then pivot again?

Done in 30 seconds. It also looks quite elegant. Can this be made faster by using chunks?

Apparently yes: only 17 seconds.

The slow step in the unpivot - pivot method seems to be the pivot step. If we can replace this with some GroupBy sorcery, we may be able to speed it up even more:

Done in 11 seconds. With chunking:

…we can reduce this to 8 seconds.

So this is pretty fast, but it also looks pretty ugly. Remember, we are trying to do something very simple: apply an operation to each individual cell of a multi-column table, so it would be nice if this simplicity has a simple representation in the workflow. Using Group Loop is easier than a GroupBy node and looks more elegant:

Elegant, but not faster: 28 seconds.

The last option combines all columns in a list/collection cell before applying the multiplication. This requires us to rewrite the code in the Java Snippet to operate on an Array instead of single values:

Java snippet content:

Double[] tbl = $AggregatedValues$;
int n = tbl.length;
for(int t=0; t<n; t++) {
    tbl[t] = tbl[t] * 2.0;
}

return tbl;

This is the fastest solution by far: 2 seconds.

How does this compare with the Math Formula (Multi Column) node? Surprisingly, this is significantly slower than our optimal method: 14 seconds!

Best,
Aswin

p.s. the workflow with all the different methods:
KNIME_project.knwf (285.1 KB)

beginner · May 16, 2019, 4:55pm

Awesome post. KNIME isn’t great at matrix operations. Fully agree. You either have to be creative as you were or pay the serialization price. Another trivial option is a python script node:

output_table = input_table * 2

On my PC for 2000x500 that sadly takes 7.7 seconds. So not the fastest. Of course a lot of time is spent on serialization. Would be cool if you could try this as well on your machine. My CPU is old and it could matter a lot if you have a much never one.

Aswin · May 16, 2019, 5:09pm

@beginner Good point, I forgot about that option. On my pc, performing the operation in Python takes 7.2 seconds, faster than I expected. In pure Python this would be near instantaneous… Indeed serialization is the slow step here, even for really small tables (see for example here).

With R it is even faster: 3.5 seconds. Puzzling.

ipazin · May 16, 2019, 9:53pm

Nice analysis. This is a bookmark topic
Ivan

Aswin · May 17, 2019, 8:04am

I attached the workflow with all the different methods to the first post. I’d be very interested if somebody can come up with a faster/more elegant way!

Aswin · May 17, 2019, 8:23am

The fastest option above is cheating a bit because most nodes cannot directly manipulate list cell elements, so we would be restricted to using Java snippets. A more conventional variation would be:

This method takes 9 seconds.

beginner · May 17, 2019, 10:00am

I don’t have a better solution but a quick search found that there is a java library similar to numpy:

nd4j which is actually part of dl4j which is an extension already available in knime.

My suggestion for KNIME devs would be to use that library to make knime nodes for matrix calculations. That would ultimately be the desired solution.

Aswin · May 17, 2019, 12:18pm

Using parallel chunks does not help much: 8 seconds instead of 9. Someone once told me that java snippets cannot be executed in parallel, maybe that is the reason.

Because the Parallel Chunk loop renames the rowIDs, some extra complexity is necessary here in order to get the old rowIDs back.

Aswin · May 22, 2019, 9:12am

Instead of a Java Snippet one could use a Column Expressions node. It is almost as fast (3 seconds) but not any simpler because it is still required to store the result of the multiplication in a list cell which needs to be expanded afterwards.

Column expression code:

dum = arrayCreate(0.0)
dum = arrayRemove(dum, 0)
for(x in columnNames()) {dum = arrayAdd(dum, column(x)*2)}
dum

(by the way, is there a more elegant way to create an empty array in a column expressions node?)

p.s. kudos to @ipazin whose comment in a different thread gave me this idea

ipazin · May 22, 2019, 10:30am

Hi!

tried but failed couple of times

Br,
Ivan

system · November 20, 2019, 10:30pm

This topic was automatically closed 182 days after the last reply. New replies are no longer allowed.