Knime hangs/very slow when running the LDA node and other text processing nodes for large text files

terryatct · June 26, 2018, 8:21pm

Hello all. I’m using Knime to do text processing against helpdesk comments between customers and engineers, with the goal of extracting common themes. The difficulty is that Knime is extremely slow or hangs on some nodes. So far, the nodes below take anywhere from 30+ minutes to several hours, when Knime doesn’t hang entirely:

Flat File Document Parser
Term Frequency
Topic Extractor
Chi-Square Keyword Extractor

I’ve done the following to improve performance:

Consulted the blog post “Optimizing Knime workflows for performance”.

My knime.ini file is:
-startup
plugins/org.eclipse.equinox.launcher_1.3.200.v20160318-1642.jar
--launcher.library
plugins/org.eclipse.equinox.launcher.win32.win32.x86_64_1.1.400.v20160518-1444
-vm
plugins/org.knime.binary.jre.win32.x86_64_1.8.0.152-01/jre/bin
--launcher.defaultAction
openFile
-vmargs
-XX:MaxPermSize=2048m
-Dorg.knime.container.cellsinmemory=10000000
–Dknime.compress.io=false
-server
-Dsun.java2d.d3d=false
-Dosgi.classloader.lock=classname
-XX:+UnlockDiagnosticVMOptions
-XX:+UnsyncloadClass
-Dsun.net.client.defaultReadTimeout=0
-XX:CompileCommand=exclude,javax/swing/text/GlyphView,getBreakSpot
-Xmx10240m
-Dorg.eclipse.swt.browser.IEVersion=10001
-Dsun.awt.noerasebackground=true
-Dequinox.statechange.timeout=30000
Knime is running on a laptop with 16 GB of memory.
Modified long-running nodes to use “keep all in memory” instead of “keep only small tables in memory”.
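
As a quick sanity check on a knime.ini like the one above, the following is a minimal Python sketch (not part of Knime; the function name `check_ini` is hypothetical). It reports the heap size set by `-Xmx` and flags option lines that begin with a typographic dash rather than an ASCII hyphen. Such dashes can creep in when settings are copied from a web page, or they may simply be introduced by forum formatting, so a flagged line is only a hint to look at the real file.

```python
import re

# Typographic dashes that are easy to mistake for an ASCII hyphen.
DASHES = ("\u2013", "\u2014")  # en dash, em dash


def check_ini(text):
    """Return (heap_mb, suspicious_lines) for the contents of a knime.ini file.

    heap_mb is the -Xmx value converted to megabytes, or None if absent.
    suspicious_lines lists option lines that start with a non-ASCII dash.
    """
    heap_mb = None
    suspicious = []
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith(DASHES):
            suspicious.append(line)
        match = re.fullmatch(r"-Xmx(\d+)([kKmMgG]?)", line)
        if match:
            value, unit = int(match.group(1)), match.group(2).lower()
            # Convert bytes/KB/MB/GB to megabytes.
            factor = {"": 1 / (1024 * 1024), "k": 1 / 1024, "m": 1, "g": 1024}[unit]
            heap_mb = value * factor
    return heap_mb, suspicious


if __name__ == "__main__":
    sample = "-vmargs\n-Xmx10240m\n\u2013Dknime.compress.io=false\n"
    heap, flagged = check_ini(sample)
    print("heap (MB):", heap)
    print("lines starting with a typographic dash:", flagged)
```

To run it against a real file, read the file as UTF-8 text and pass the contents to `check_ini`. Comparing the reported heap with the machine's physical memory (16 GB here) shows how much is left for the operating system and other programs.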

A CSV file covering a one-year date range is 90 MB and typically crashes the LDA node, while a one-day sample (only 27 KB) took 20 minutes to complete. What would be the expected completion time for these nodes?

When it crashes, there is usually a GC overhead limit error such as those below. I’ve tried raising and lowering the GC param value.
ERROR Keygraph Keyword Extractor 0:79 Execute failed: GC overhead limit exceeded
WARN Chi-Square Keyword Extractor 0:78 Execution canceled
ERROR Topic Extractor (Parallel LDA) 0:61 Execute failed: GC overhead limit exceeded

Unfortunately there are a lot of custom stopwords that I need to filter out, so the work is extremely iterative. It would be helpful if I could complete the workflow from start to finish in a shorter timeframe.
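
One way to shorten each iteration while the stopword list is still changing is to run the workflow on a reproducible random subset of the CSV rather than the full year. The sketch below is only an illustration under stated assumptions: the file has a single header row, the function name `sample_rows` and the file names in the usage note are hypothetical, and nothing here is specific to Knime.

```python
import csv
import io
import random


def sample_rows(src, dst, fraction, seed=0):
    """Copy the header plus a random fraction of data rows from src to dst.

    src and dst are open text file objects. Returns the number of data
    rows written. A fixed seed makes the sample repeatable between runs.
    """
    rng = random.Random(seed)
    reader = csv.reader(src)
    writer = csv.writer(dst)
    header = next(reader, None)
    if header is None:  # empty input: nothing to copy
        return 0
    writer.writerow(header)
    kept = 0
    for row in reader:
        # rng.random() is in [0, 1), so fraction=1.0 keeps every row.
        if rng.random() < fraction:
            writer.writerow(row)
            kept += 1
    return kept


if __name__ == "__main__":
    source = io.StringIO("id,comment\n1,printer offline\n2,password reset\n")
    target = io.StringIO()
    print("rows kept:", sample_rows(source, target, fraction=0.5, seed=42))
```

With real files, open both with `newline=""` as the `csv` module documentation recommends, for example `open("comments.csv", newline="", encoding="utf-8")` for reading and `open("comments_sample.csv", "w", newline="", encoding="utf-8")` for writing, and point the Flat File Document Parser at the smaller file.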

Thanks, Terry

julian.bunzel · June 27, 2018, 7:12pm

Hey Terry,

Thanks for the information. I will look into the issues you reported. The LDA node shouldn’t take that long. We already have a ticket in our ticket system regarding the LDA node hanging.
Did you try closing and reopening the workflow and then re-executing the LDA node? Sometimes that works, but admittedly that’s not how it should be.
I will get back to you if I have any news.

Cheers,

Julian
