Hello all. I’m using Knime for text processing of helpdesk comments between customers and engineers, with the goal of extracting common themes. The difficulty is that Knime is extremely slow or hangs on some nodes. So far, the nodes below take anywhere from 30+ minutes to several hours, when Knime doesn’t hang entirely.
- Flat File Document Parser
- Term Frequency
- Topic Extractor
- Chi-Square Keyword Extractor
I’ve done the following to improve performance:
- Consulted the blog post “Optimizing Knime workflows for performance”
- My knime.ini file is:
-startup
plugins/org.eclipse.equinox.launcher_1.3.200.v20160318-1642.jar
--launcher.library
plugins/org.eclipse.equinox.launcher.win32.win32.x86_64_1.1.400.v20160518-1444
-vm
plugins/org.knime.binary.jre.win32.x86_64_1.8.0.152-01/jre/bin
--launcher.defaultAction
openFile
-vmargs
-XX:MaxPermSize=2048m
-Dorg.knime.container.cellsinmemory=10000000
-Dknime.compress.io=false
-server
-Dsun.java2d.d3d=false
-Dosgi.classloader.lock=classname
-XX:+UnlockDiagnosticVMOptions
-XX:+UnsyncloadClass
-Dsun.net.client.defaultReadTimeout=0
-XX:CompileCommand=exclude,javax/swing/text/GlyphView,getBreakSpot
-Xmx10240m
-Dorg.eclipse.swt.browser.IEVersion=10001
-Dsun.awt.noerasebackground=true
-Dequinox.statechange.timeout=30000
- Knime is running on a laptop with 16 GB of memory
- Modified long-running nodes to use “keep all in memory” instead of “keep only small tables in memory”
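Since typographic dashes can creep into option lines when a knime.ini is edited or pasted through a web form, a quick sanity check of the actual file on disk may be worthwhile. The following is only a minimal sketch: `check_ini` is a hypothetical helper name, and it assumes that every option line should start with a plain ASCII hyphen and that the heap limit is given as a single `-Xmx` line. It reports lines that start with a typographic dash and the configured maximum heap in MB.

```python
import re

# Typographic dashes that editors or web forms sometimes substitute for "-" or "--".
BAD_DASHES = ("\u2013", "\u2014", "\u2212")


def check_ini(text):
    """Inspect the contents of a knime.ini file (hypothetical helper).

    Returns (problems, max_heap_mb): problems is a list of (line_number, line)
    pairs for lines that begin with a typographic dash, and max_heap_mb is the
    -Xmx value converted to MB, or None if no -Xmx line is present.
    """
    problems = []
    max_heap_mb = None
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if line.startswith(BAD_DASHES):
            problems.append((number, line))
        match = re.fullmatch(r"-Xmx(\d+)([kKmMgG]?)", line)
        if match:
            value, unit = int(match.group(1)), match.group(2).lower()
            # Convert bytes / KB / MB / GB to MB.
            factor = {"": 1 / (1024 * 1024), "k": 1 / 1024, "m": 1, "g": 1024}[unit]
            max_heap_mb = value * factor
    return problems, max_heap_mb


if __name__ == "__main__":
    # Small inline sample; in practice, read the real file instead, e.g.
    # open("knime.ini", encoding="utf-8").read()
    sample = "-vmargs\n\u2013Dknime.compress.io=false\n-Xmx10240m\n"
    problems, heap = check_ini(sample)
    print("Lines starting with a typographic dash:", problems)
    print("Max heap (MB):", heap)
```

If the check flags nothing on the real file, the dashes are just a display artifact and the options themselves are being read as written.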
A CSV file covering a one-year date range is about 90 MB and typically crashes the LDA node, while a one-day sample (only 27 KB) took 20 minutes to complete. What would be the expected completion time for these nodes?
When a node crashes, it usually reports a “GC overhead limit exceeded” error like the ones below. I’ve tried raising and lowering the GC parameter value.
ERROR Keygraph Keyword Extractor 0:79 Execute failed: GC overhead limit exceeded
WARN Chi-Square Keyword Extractor 0:78 Execution canceled
ERROR Topic Extractor (Parallel LDA) 0:61 Execute failed: GC overhead limit exceeded
Unfortunately there are a lot of custom stopwords that I need to filter out, so the work is extremely iterative. It would be helpful if I could complete the workflow from start to finish in a shorter timeframe.
Thanks, Terry