Benchmarking KNIME with XY data

Dear Knimers,

Suppose I have a table with 10 million rows of “XY - data”:
[image: preview of the XY table]

…and let’s say I want to remove the edges, i.e. the rows with X < 100.0 or X > 999900.0. I could use a Row Filter:

This takes 16.1 seconds on my computer.

I can do the same thing using a Rule-based Row Filter (19.2 seconds) and a Java Snippet Row Filter (27.8 seconds).

I could also combine all the rows into a pair of lists:

[image: the same data combined into a pair of list cells]

In that case I could slice the data using a Java Snippet:

// First pass: count how many rows fall inside the range
int l = c_X.length;
int num = 0;
for (int t = 0; t < l; t++) {
	num += ((c_X[t] >= 100.0) && (c_X[t] <= 999900.0)) ? 1 : 0;
}

// Second pass: copy the matching X/Y pairs into the output arrays
Double[] Xresult = new Double[num];
Double[] Yresult = new Double[num];
int u = 0;
for (int t = 0; t < l; t++) {
	if ((c_X[t] >= 100.0) && (c_X[t] <= 999900.0)) {
		Xresult[u] = c_X[t];
		Yresult[u] = c_Y[t];
		u++;
	}
}
out_X = Xresult;
out_Y = Yresult;

Significantly faster: 5.7 seconds. The risk with this method is that it is very easy to slice the two lists differently by mistake, leaving the user with two inconsistent lists without noticing. I tried to avoid that by making a list of XY pairs, but I did not succeed in using such a list in a Java Snippet (unsupported column type).
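For what it’s worth, outside KNIME the same pair-of-lists slicing can be kept consistent by deriving both slices from a single boolean mask, so X and Y cannot get out of sync. A minimal NumPy sketch (the array names and toy values are made up for illustration):

```python
import numpy as np

# Hypothetical X/Y arrays standing in for the two list cells
x = np.array([50.0, 100.0, 500000.0, 999900.0, 999999.0])
y = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# One mask drives both slices, so the results stay aligned
mask = (x >= 100.0) & (x <= 999900.0)
x_result = x[mask]
y_result = y[mask]

print(x_result.tolist())
print(y_result.tolist())
```

Because both slices come from the same mask, the two output arrays always have the same length and ordering.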

Now, let’s try the same slice in pure Python on a pandas DataFrame. This takes only 0.5 seconds:

import time  # df is a pandas DataFrame with the 10 million rows of X/Y data

tm1 = time.time()
df = df[(df['X'] >= 100.0) & (df['X'] <= 999900.0)]
tm2 = time.time()
print(tm2 - tm1)

Using the exact same Python code in a Python Script node in KNIME takes 152 seconds.

Changing to the new columnar table backend changes the execution times a bit, but not spectacularly:
Row filter: 13.5 s
Rule-based row filter: 12.5 s
Java snippet Row Filter: 22.0 s
Using a pair of lists and a Java Snippet: 9.8 s
Using a Python Script node: 156 s

Interestingly, using a pair of lists has become significantly worse.

Knime 4.3.3 on Ubuntu 20.04, Intel i7-10710U (12) @ 2.000GHz, 64 GB RAM, -Xmx24576m, -Dorg.knime.container.cellsinmemory=25000000

Cheers
Aswin


Hi @Aswin,

that’s interesting, thanks for sharing. A few comments:

  1. Regarding the Python scripting node performance: there are some tricks to make it faster with the current implementation (e.g. use the Arrow backend in the preference pages, adjust the chunk size in the Python node, etc.), but you will hit limits. That’s why we’re currently working on making use of the new columnar backend in combination with the Python scripting node - this will speed things up. The bottleneck with the current implementation is not the execution of the script, but rather the transfer of the data from KNIME to Python and back. Out of curiosity: did you also include data import and data export when benchmarking your pure Python script?

  2. Using the new columnar backend improved performance by ~30%, if I see that correctly? There will be further improvements, especially to nodes like Row Filter or GroupBy, once we actually implement the new columnar backend in the nodes themselves. At the moment we have a kind of “legacy mode” which makes the current implementations of our nodes compatible with the new columnar backend. Also, there are some parameters you can adjust on the preference page of the new columnar backend, e.g. increasing the off-heap memory used to cache data. That might help speed things up further.

Thanks again for this feedback - soonish you’ll be able to re-run some of the benchmarks and get more satisfying results. :slight_smile:

Christian


Dear @christian.dietz ,

  • Indeed, the columnar backend is a bit faster for regular tables, but not for the pair-of-lists case. The pair-of-lists format sliced with a Java Snippet with the old backend is the fastest method.
  • I had already set the rows-per-chunk for the Python Script node to 10,000,000.
  • Using Arrow reduced the Python Script node execution time to 62.0 seconds (together with the columnar backend) - still kind of sluggish, but nevertheless an impressive improvement
  • I did not include data import and export in my pure-Python benchmark, because I wanted to compare the execution time of a Row Filter node in a KNIME workflow on the one hand with a line of slicing code in a hypothetical equivalent pure-Python data processing script on the other.
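For completeness, a benchmark that does include import and export would look something like the sketch below (the file names and table size are made up; a smaller table is used here for brevity):

```python
import time

import numpy as np
import pandas as pd

# Build a stand-in for the XY table (100k rows instead of 10M for brevity)
n = 100_000
df = pd.DataFrame({'X': np.linspace(0.0, 1_000_000.0, n),
                   'Y': np.random.rand(n)})
df.to_csv('xy.csv', index=False)  # hypothetical input file

# Time the full round trip: read, filter, write
tm1 = time.time()
df = pd.read_csv('xy.csv')
df = df[(df['X'] >= 100.0) & (df['X'] <= 999900.0)]
df.to_csv('xy_filtered.csv', index=False)
tm2 = time.time()
print(tm2 - tm1)
```

On a 10-million-row table the I/O would likely dominate the 0.5 s slicing step, which makes the comparison with a full KNIME workflow (which persists its results) somewhat fairer.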

I am aware that, in contrast with a slicing instruction in a Python script, KNIME automatically stores the result of the Row Filter… unless streaming is used. But somehow streaming no longer works in my setup. Did that happen because I installed the columnar backend? I could imagine that streaming is row-based and conflicts with the columnar backend, but streaming is not available even when I don’t use the columnar backend.

Looking forward to the future improvements!

Best
Aswin