Unable to abort processing, Workspace blocked & requiring reboot

Hi,

In order to back-test improvements, I created a workflow that exponentially increases the data (starting from one row with some JSON). I have run into this before: KNIME cannot finish due to the sheer amount of data, but I am equally unable to abort the processing.
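
The growth logic is roughly the following; a minimal pandas sketch of the idea, not the actual KNIME workflow, and the payload is made up:

```python
import pandas as pd

# One row carrying a JSON payload (illustrative).
df = pd.DataFrame({"payload": ['{"id": 1, "value": "some JSON"}']})

# Each pass concatenates the table with itself, so the row count
# doubles every iteration: 1, 2, 4, ..., 2**n rows.
for _ in range(20):  # 2**20 = ~1 million rows
    df = pd.concat([df, df], ignore_index=True)

print(len(df))
```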

Whilst I am able to close KNIME, and no background process is left running, some sort of lock file seems to remain in place, blocking the workspace. Hence, I am forced to restart my whole system.

Best
Mike

Hi @mwiegand and thank you for reporting this.
Is it possible to provide us with an example workflow with which we can reproduce the problem?

Hi Armin,

here you go. Let the KNIME drag race begin!

I tested my system's limits, and 10 million data points seems to be the point where it starts to struggle executing the workflow. Worth mentioning that this performance issue has been raised in other posts before.

Interestingly, though, several posts mention the same arbitrary “sound barrier” of around 10 million rows. That feels like too much of a coincidence, and I do recall, going back several major KNIME versions by now, that handling several tens of millions of rows with hundreds of columns wasn’t even a problem on my old MacBook Pro from 2016 with a mere 16 GB of memory.

Increase the count in the Recursive Loop End for larger steps, and the one in the Counting Loop Start for finer granularity.
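
Assuming each counting-loop pass appends one copy of the current table and the recursive loop feeds the result back in (an assumption about the workflow's structure, to illustrate the two knobs):

```python
# Hypothetical model of the two loop counts (values are illustrative):
recursive_steps = 10  # Recursive Loop End count: coarse growth steps
counting_factor = 4   # Counting Loop count: fine-grained multiplier

rows = 1
for _ in range(recursive_steps):
    rows *= counting_factor  # table grows by this factor per step

print(rows)  # 4**10 = 1,048,576 rows
```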

By the way, here is my system and KNIME config. I also tested different Xmx settings, and KNIME copes comparably with 2 GB and 50 GB of RAM allocation, showing only marginal performance differences executing the workflow.

-startup
plugins/org.eclipse.equinox.launcher_1.6.400.v20210924-0641.jar
--launcher.library
plugins/org.eclipse.equinox.launcher.win32.win32.x86_64_1.2.700.v20221108-1024
-vm
plugins/org.knime.binary.jre.win32.x86_64_17.0.5.20221116/jre/bin/server/jvm.dll
--launcher.defaultAction
openFile
-vmargs
-Djava.security.properties=plugins/org.knime.binary.jre.win32.x86_64_17.0.5.20221116/security.properties
-Dorg.apache.cxf.bus.factory=org.knime.cxf.core.fragment.KNIMECXFBusFactory
-Djdk.httpclient.allowRestrictedHeaders=content-length
-Darrow.enable_unsafe_memory_access=true
-Darrow.memory.debug.allocator=false
-Darrow.enable_null_check_for_get=false
--add-opens=java.security.jgss/sun.security.jgss.krb5=ALL-UNNAMED
--add-exports=java.security.jgss/sun.security.jgss=ALL-UNNAMED
--add-exports=java.security.jgss/sun.security.jgss.spi=ALL-UNNAMED
--add-exports=java.security.jgss/sun.security.krb5.internal=ALL-UNNAMED
--add-exports=java.security.jgss/sun.security.krb5=ALL-UNNAMED
--add-opens=java.xml/com.sun.org.apache.xerces.internal.parsers=ALL-UNNAMED
--add-opens=java.xml/com.sun.org.apache.xerces.internal.util=ALL-UNNAMED
-Djdk.httpclient.allowRestrictedHeaders=content-length
-Dorg.apache.cxf.bus.factory=org.knime.cxf.core.fragment.KNIMECXFBusFactory
-Dorg.apache.cxf.transport.http.forceURLConnection=true
-server
-Dsun.java2d.d3d=false
-Dosgi.classloader.lock=classname
-XX:+UnlockDiagnosticVMOptions
-Dsun.net.client.defaultReadTimeout=0
-XX:CompileCommand=exclude,javax/swing/text/GlyphView,getBreakSpot
-Dknime.xml.disable_external_entities=true
-Dcomm.disable_dynamic_service=true
--add-opens=java.base/java.lang=ALL-UNNAMED
--add-opens=java.base/java.lang.invoke=ALL-UNNAMED
--add-opens=java.base/java.net=ALL-UNNAMED
--add-opens=java.base/java.nio=ALL-UNNAMED
--add-opens=java.base/java.nio.channels=ALL-UNNAMED
--add-opens=java.base/java.util=ALL-UNNAMED
--add-opens=java.base/sun.nio.ch=ALL-UNNAMED
--add-opens=java.base/sun.nio=ALL-UNNAMED
--add-opens=java.desktop/javax.swing.plaf.basic=ALL-UNNAMED
--add-opens=java.base/sun.net.www.protocol.http=ALL-UNNAMED
--add-opens=java.base/sun.net.www.protocol.https=ALL-UNNAMED
-Xmx50g
-Dorg.eclipse.swt.browser.IEVersion=11001
-Dsun.awt.noerasebackground=true
-Dequinox.statechange.timeout=30000
-Dorg.knime.container.cellsinmemory=10000000
-Dknime.compress.io=false
  • AMD Ryzen 7950X (16 cores, 32 threads)
  • 64 GB DDR5 6000 MHz (tCAS-tRCD-tRP-tRAS: 32-38-38-96)
    Note: Pretty fast memory, which Ryzen needs so as not to “choke” itself
  • Two NVMe drives (PCIe x4 16.0 GT/s @ x4 16.0 GT/s), WD Black SN850X, 2 TB each
    Note: They do not saturate the PCIe Gen 5 lanes hooked up directly to the CPU!
  • NVIDIA GeForce RTX 3080 Ti with 12 GB GDDR6X (PCIe v4.0 x16 (16.0 GT/s) @ x16 (5.0 GT/s))
    Note: Not saturating the PCIe Gen 5 lanes
  • Windows 11 Pro
    Note: TRIM enabled; debloated, so all stubborn Microsoft and other bloatware got removed, plus further tuning done.

The saving process seems to be divided into several steps: upon triggering it, CPU but not disk usage increases, leading me to assume compression is happening. Though, I’d have believed all data was already available and compressed in KNIME tables. Disk usage actually never spikes during the save.
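
That CPU-first pattern would be consistent with in-memory compression happening before any bytes reach the disk. A small illustration of the effect, with zlib merely standing in for whatever codec KNIME actually uses (an assumption):

```python
import time
import zlib

# 200 MB of highly compressible bytes (illustrative data).
data = b"x" * (200 * 1024 * 1024)

t0 = time.perf_counter()
compressed = zlib.compress(data, level=6)  # CPU-bound; no disk I/O yet
t1 = time.perf_counter()

print(f"compressed {len(data) >> 20} MB -> {len(compressed) >> 20} MB "
      f"in {t1 - t0:.2f} s before a single byte is written")
```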

PPS: I have my workspace located on the 2nd SSD, separate from the OS. Still, I see data being cached in the user's cache folder. That further strengthens the approach of splitting disk load, but equally raises the question of possible data-transfer bottlenecks. Worth noting that disk usage didn’t exceed 1 %, and I have tried putting strain on disk utilization before.

Best
Mike


@mwiegand out of curiosity: have you tried running this workflow and comparing its several parts, maybe one by one? What does this do to your system?

@mlauber71 I did execute the inner loop, facing little to no compute bottleneck. Though, that workflow is only for testing and doesn't serve any apparent real-life purpose.

Switching from the row-based to the columnar backend, I happened to notice a stark difference in processing time as well. Using the benchmark nodes and running only one iteration, the new columnar backend shows a stark regression: roughly 4× overall (97.2 s vs. 22.6 s).

The Concatenate alone went from 0 to >600 ms!!!

Default: Row-Based Backend

| Iteration | Start Time | End Time | Execution Time (s) | Node name / ID | Node Execution Time (ms) |
| --- | --- | --- | --- | --- | --- |
| 1 | 2024-02-28T12:11:46.369+01:00[Europe/Berlin] | 2024-02-28T12:12:08.974+01:00[Europe/Berlin] | 22.605 | Recursive Loop Start 3:3 | 341 |
| 1 | 2024-02-28T12:11:46.369+01:00[Europe/Berlin] | 2024-02-28T12:12:08.974+01:00[Europe/Berlin] | 22.605 | Counting Loop Start 3:5 | 0 |
| 1 | 2024-02-28T12:11:46.369+01:00[Europe/Berlin] | 2024-02-28T12:12:08.974+01:00[Europe/Berlin] | 22.605 | Concatenate 3:7 | 0 |
| 1 | 2024-02-28T12:11:46.369+01:00[Europe/Berlin] | 2024-02-28T12:12:08.974+01:00[Europe/Berlin] | 22.605 | Loop End 3:6 | 5334 |
| 1 | 2024-02-28T12:11:46.369+01:00[Europe/Berlin] | 2024-02-28T12:12:08.974+01:00[Europe/Berlin] | 22.605 | Recursive Loop End 3:4 | 5242 |

Columnar Backend

| Iteration | Start Time | End Time | Execution Time (s) | Node name / ID | Node Execution Time (ms) |
| --- | --- | --- | --- | --- | --- |
| 1 | 2024-02-28T12:08:37.788+01:00[Europe/Berlin] | 2024-02-28T12:10:15.021+01:00[Europe/Berlin] | 97.233 | Recursive Loop Start 3:3 | 3455 |
| 1 | 2024-02-28T12:08:37.788+01:00[Europe/Berlin] | 2024-02-28T12:10:15.021+01:00[Europe/Berlin] | 97.233 | Counting Loop Start 3:5 | 0 |
| 1 | 2024-02-28T12:08:37.788+01:00[Europe/Berlin] | 2024-02-28T12:10:15.021+01:00[Europe/Berlin] | 97.233 | Concatenate 3:7 | 623 |
| 1 | 2024-02-28T12:08:37.788+01:00[Europe/Berlin] | 2024-02-28T12:10:15.021+01:00[Europe/Berlin] | 97.233 | Loop End 3:6 | 13593 |
| 1 | 2024-02-28T12:08:37.788+01:00[Europe/Berlin] | 2024-02-28T12:10:15.021+01:00[Europe/Berlin] | 97.233 | Recursive Loop End 3:4 | 36609 |

I updated the test workflow.

Cheers
Mike

@mwiegand the question is: are you testing very large loops in general, or large files in KNIME?

@mlauber71 what do you mean by large loops?

@mwiegand is the performance problem mainly present when you use loops with a large amount of data, or is there a general problem with large datasets?

What happens if you run the workflow I have provided?

@mwiegand it might be that this is a Windows thing, as you have mentioned before?

My M1 with the generic Apple Silicon version of KNIME 5.2.1 and 24 GB RAM did this:

| Iteration | Start Time | End Time | Execution Time (s) | Node name / ID | Node Execution Time (ms) |
| --- | --- | --- | --- | --- | --- |
| 1 | 2024-02-28T17:05:11.526+01:00[Europe/Berlin] | 2024-02-28T17:05:48.100+01:00[Europe/Berlin] | 36.574 | Recursive Loop Start 3:3 | 508 |
| 1 | 2024-02-28T17:05:11.526+01:00[Europe/Berlin] | 2024-02-28T17:05:48.100+01:00[Europe/Berlin] | 36.574 | Counting Loop Start 3:5 | 1 |
| 1 | 2024-02-28T17:05:11.526+01:00[Europe/Berlin] | 2024-02-28T17:05:48.100+01:00[Europe/Berlin] | 36.574 | Concatenate 3:7 | 1 |
| 1 | 2024-02-28T17:05:11.526+01:00[Europe/Berlin] | 2024-02-28T17:05:48.100+01:00[Europe/Berlin] | 36.574 | Loop End 3:6 | 7321 |
| 1 | 2024-02-28T17:05:11.526+01:00[Europe/Berlin] | 2024-02-28T17:05:48.100+01:00[Europe/Berlin] | 36.574 | Recursive Loop End 3:4 | 8304 |

The resulting file is 470 MB as a KNIME table, which is large but not extraordinarily large …

The issue, not being able to abort processing, is mostly present when using loops. I just had a data set with around 19 million rows (around 1.8 GB as a KNIME table file) containing XML, from which I extracted data using XPath. Aborting didn’t happen immediately, but it did conclude after some time.
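
(Conceptually, the extraction was of this kind; a minimal lxml sketch where the XML structure and the XPath expression are invented, since the real work used KNIME's XPath node:)

```python
from lxml import etree

# Illustrative only: element names and the XPath are made up.
xml = "<record><item id='1'>foo</item><item id='2'>bar</item></record>"
root = etree.fromstring(xml)

# Pull one attribute out of every matching element.
ids = root.xpath("//item/@id")
print(ids)  # ['1', '2']
```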

Trying to save the data from my provided test workflow, disk usage briefly went up on both the workspace disk and the Windows disk, but then went flat while the writer node was still working.

Shockingly, it produced a 7+ GB file while yours is a mere 470 MB :exploding_head: Could it relate to my knime.ini setting that disables compression?

I haven’t had the opportunity to test your workflows yet. Busy times, as usual.

Best
Mike

Yes, most likely. Do you have your modified test workflow?

Also, if the data is not too exotic, it might help to write Parquet files in chunks. Not a solution, I know, but maybe it is more stable, and you can later use them as one file.
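
The chunking idea could look like this outside of KNIME (a minimal pyarrow sketch; paths, schema, and chunk sizes are all illustrative assumptions):

```python
import os
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds

os.makedirs("out", exist_ok=True)

# Write the data in chunks, one Parquet file per chunk (sizes illustrative).
for i in range(10):
    chunk = pa.table({"row": list(range(i * 1000, (i + 1) * 1000))})
    pq.write_table(chunk, f"out/part-{i:04d}.parquet")

# Read the whole directory back as one logical table later.
table = ds.dataset("out", format="parquet").to_table()
print(table.num_rows)  # 10000
```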

It’s the initial one; I always update it, except when that would be stupid :wink:

I ran the workflow with -Dknime.compress.io=true and presto, the saved data got condensed to an astonishingly small 70 MB. I wonder why yours is 400+ MB, though.

I also happened to notice that the write process blocks any data preview.

But then I noticed it won’t load at all. Same for the stats data on port 0 of the benchmark end node.

Neither saving and retrying, nor closing KNIME, nor resetting and re-executing the nodes resolved it.