To back-test improvements, I created a workflow that exponentially increases the data (one row with some JSON). I have run into the issue before that KNIME cannot finish because of the data volume, but I am equally unable to abort the processing.
Whilst I am able to close KNIME and no background process keeps running, some sort of lock file seems to stay in place, blocking the workspace. Hence, I am forced to restart my whole system.
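In case it helps anyone in the same situation: since KNIME is Eclipse-based, the workspace lock is normally the hidden `.metadata/.lock` file inside the workspace folder, and removing it once KNIME has fully exited should free the workspace without a reboot. A minimal sketch (the workspace path is just a placeholder for your own setup):

```python
from pathlib import Path

# Placeholder path -- point this at your own KNIME workspace.
workspace = Path(r"D:\knime-workspace")
lock_file = workspace / ".metadata" / ".lock"

# Only remove the lock once you are certain no KNIME process is running.
if lock_file.exists():
    lock_file.unlink()
    print(f"Removed stale workspace lock: {lock_file}")
else:
    print("No lock file found.")
```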
I tested my system's limits, and around 10 million data points seems to be the point where it starts to struggle executing the workflow. Worth mentioning that this performance issue has been raised in other posts before.
Interestingly, though, several posts mention the same arbitrary “sound barrier” of around 10 million rows. That feels like too much of a coincidence, and I do recall (going back several major KNIME versions by now) that handling several tens of millions of rows with hundreds of columns wasn't even a problem on my old MacBook Pro from 2016 with a mere 16 GB of memory.
Increase the count in the Recursive Loop End for larger steps and the count of the counting loop for finer granularity.
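To get a feeling for what those two counts do to the row count, here is my back-of-the-envelope reading of the workflow. Both the doubling per recursion and the linear effect of the counting loop are assumptions about how the loops are wired, so adjust if your copy grows differently:

```python
# Rough arithmetic for the test workflow's row growth.
# Assumption: each recursion appends the table to itself, i.e. doubles the
# row count; the counting loop multiplies the result linearly.
start_rows = 1          # the single JSON row the workflow starts from
recursions = 20         # Recursive Loop End count -> coarse steps (x2 each)
repeats = 10            # counting loop count -> fine granularity (linear)

total_rows = start_rows * (2 ** recursions) * repeats
print(f"{total_rows:,} rows")   # 10,485,760 -- roughly the 10 million mark
```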
By the way, here are my system and KNIME configs. I also tested different Xmx settings, and KNIME copes equally well with 2 GB and 50 GB of RAM allocated, showing only marginal performance differences when executing the workflow.
64 GB DDR5-6000 MHz (tCAS-tRCD-tRP-tRAS: 32-38-38-96)
Note: Pretty fast memory, which Ryzen needs so as not to “choke” itself
Two NVMe SSDs (PCIe x4, 16.0 GT/s @ x4 16.0 GT/s), WD Black SN850X, 2 TB each
Note: They do not saturate the PCIe Gen 5 lanes hooked up directly to the CPU!
NVIDIA GeForce RTX 3080 Ti with 12 GB GDDR6X (PCIe v4.0 x16 (16.0 GT/s) @ x16 (5.0 GT/s))
Note: Not saturating the PCIe Gen 5 lanes either
Windows 11 Pro
Note: TRIM enabled, debloated (all stubborn Microsoft and other bloatware removed) and further tuning done.
The saving process seems to be divided into several steps: upon triggering it, CPU but not disk usage increases, which leads me to assume compression is happening. Though I'd have thought all data was already available and compressed in KNIME tables. Disk usage actually never spikes during saving.
PPS: My workspace is located on the 2nd SSD, separate from the OS. Still, I see data being cached in the user's cache folder. That further supports the approach of splitting disk load, but equally raises the question of possible data-transfer bottlenecks. Worth noting that disk usage didn't exceed 1 %, even though I deliberately tried to put strain on disk utilization before.
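Side note, and purely an assumption on my part: if the caching under the user profile is a concern, the JVM's temporary directory can be redirected to the second SSD via the standard java.io.tmpdir property in knime.ini (path below is a placeholder). Whether KNIME's cache folder actually honours it I haven't verified; KNIME also has its own temp-directory setting in the preferences.

```
-vmargs
-Djava.io.tmpdir=D:\knime-temp
```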
@mlauber71 I did execute the inner loop and faced little to no compute bottleneck. Though, that workflow is only for testing and doesn't serve any apparent real-life purpose.
Switching from the row-based to the columnar backend, I happened to notice a stark difference in processing time as well. Using the benchmark nodes with only one iteration, the new columnar backend shows a clear regression.
@mwiegand is the performance problem you have mainly present when you use loops with a large amount of data, or is there a general problem with large datasets?
What happens if you run the workflow I have provided?
The issue of not being able to abort processing is mostly present when using loops. I just had a data set with around 19 million rows (around 1.8 GB as a KNIME table file), containing XML, from which I extracted data using XPath. Aborting didn't happen immediately but did conclude after some time.
Trying to save the data from my provided test workflow, disk usage briefly went up, both on the disk where the workspace resides and on the Windows one, but then went flat while the writer node was still working.
Yes, most likely. Do you have your modified test workflow?
Also, if the data is not too exotic, it might help to use Parquet files in chunks - not a solution, I know, but maybe it is more stable, and you can later use it as one file.
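If you go down that route in a Python Script node (or outside KNIME entirely), here is a rough sketch of what I mean by chunks. Column names, chunk size and paths are just placeholders:

```python
import os
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds

# Toy stand-in for the big table; replace with your real data frame.
df = pd.DataFrame({"id": range(2_500_000), "payload": "{}"})

chunk_size = 1_000_000              # rows per Parquet file (arbitrary choice)
out_dir = "big_table_parquet"
os.makedirs(out_dir, exist_ok=True)

# Write the frame as several smaller Parquet files ("chunks") ...
for i, start in enumerate(range(0, len(df), chunk_size)):
    chunk = df.iloc[start:start + chunk_size]
    pq.write_table(pa.Table.from_pandas(chunk), f"{out_dir}/part-{i:04d}.parquet")

# ... and treat the whole folder as one logical table when reading it back.
full = ds.dataset(out_dir, format="parquet").to_table().to_pandas()
```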
It's the initial one; I always update it, except if that would be stupid.
I ran the workflow with -Dknime.compress.io=true and, presto, the saved data got condensed to an astonishingly small 70 MB. Wonder why yours is over 400 MB, though.
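In case anyone wants to reproduce this: the flag is a JVM system property, so it goes below the -vmargs line of knime.ini, roughly like this (the Xmx value is just an example, use whatever you normally allocate):

```
-vmargs
-Xmx50g
-Dknime.compress.io=true
```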