KNIME hangs or runs very slowly when executing the LDA node and other text processing nodes on large text files

Hello all. I’m using KNIME to do text processing on helpdesk comments between customers and engineers, with the goal of extracting common themes. The difficulty is that KNIME is extremely slow or hangs on some nodes. So far, each of the nodes below takes anywhere from 30+ minutes to several hours, when KNIME does not hang entirely:

  • Flat File Document Parser
  • Term Frequency
  • Topic Extractor
  • Chi-Square Keyword Extractor

I’ve done the following to improve performance:

  • My knime.ini file is:
    -startup
    plugins/org.eclipse.equinox.launcher_1.3.200.v20160318-1642.jar
    --launcher.library
    plugins/org.eclipse.equinox.launcher.win32.win32.x86_64_1.1.400.v20160518-1444
    -vm
    plugins/org.knime.binary.jre.win32.x86_64_1.8.0.152-01/jre/bin
    --launcher.defaultAction
    openFile
    -vmargs
    -XX:MaxPermSize=2048m
    -Dorg.knime.container.cellsinmemory=10000000
    -Dknime.compress.io=false
    -server
    -Dsun.java2d.d3d=false
    -Dosgi.classloader.lock=classname
    -XX:+UnlockDiagnosticVMOptions
    -XX:+UnsyncloadClass
    -Dsun.net.client.defaultReadTimeout=0
    -XX:CompileCommand=exclude,javax/swing/text/GlyphView,getBreakSpot
    -Xmx10240m
    -Dorg.eclipse.swt.browser.IEVersion=10001
    -Dsun.awt.noerasebackground=true
    -Dequinox.statechange.timeout=30000
  • KNIME is running on a laptop with 16 GB of memory.
  • I have changed the long-running nodes to use “keep all in memory” instead of “keep only small tables in memory”.
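A minimal sketch for sanity-checking a knime.ini of this kind. It flags options that begin with a typographic dash (which can creep in when settings are copied from a web page and which the launcher may not recognize as options), flags -XX:MaxPermSize (the Java 8 JVM ignores it because PermGen was removed), and reports the -Xmx heap size in MB. The function name check_ini_lines and the sample lines are hypothetical, for illustration only.

```python
import re

# Dash-like characters that are not the ASCII hyphen expected by the launcher/JVM.
TYPOGRAPHIC_DASHES = "\u2013\u2014\u2212"


def check_ini_lines(lines):
    """Return (problems, xmx_mb) for an iterable of knime.ini lines.

    problems is a list of (line_number, text, reason) tuples;
    xmx_mb is the last -Xmx value found, converted to megabytes, or None.
    """
    problems = []
    xmx_mb = None
    for number, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line:
            continue
        if line[0] in TYPOGRAPHIC_DASHES:
            problems.append((number, line, "starts with a typographic dash"))
        if line.startswith("-XX:MaxPermSize"):
            problems.append((number, line, "PermGen option, ignored by Java 8 and later"))
        match = re.fullmatch(r"-Xmx(\d+)([kKmMgG]?)", line)
        if match:
            value, unit = int(match.group(1)), match.group(2).lower()
            # No unit suffix means bytes.
            factor = {"": 1 / (1024 * 1024), "k": 1 / 1024, "m": 1, "g": 1024}[unit]
            xmx_mb = value * factor
    return problems, xmx_mb


if __name__ == "__main__":
    # Hypothetical sample; in practice pass the lines read from knime.ini.
    sample = ["-vmargs", "\u2013Dknime.compress.io=false", "-XX:MaxPermSize=2048m", "-Xmx10240m"]
    found, heap = check_ini_lines(sample)
    for number, text, reason in found:
        print(f"line {number}: {text!r}: {reason}")
    print(f"max heap: {heap} MB")
```

To check a real file, pass open(path, encoding="utf-8") as the lines argument, where path is the location of knime.ini in the KNIME installation folder.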

A CSV file covering a one-year date range is about 90 MB and typically crashes the LDA node, while a one-day sample (only 27 KB) took 20 minutes to complete. What would be the expected completion time for these nodes?

I’ve tried raising and lowering the GC parameter value. When a node crashes, it usually reports a GC overhead limit error such as the following:
ERROR Keygraph Keyword Extractor 0:79 Execute failed: GC overhead limit exceeded
WARN Chi-Square Keyword Extractor 0:78 Execution canceled
ERROR Topic Extractor (Parallel LDA) 0:61 Execute failed: GC overhead limit exceeded
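To keep track of which nodes fail this way across repeated runs, messages like these can be tallied with a short script. This is only a sketch: it assumes each line has the form "LEVEL node name node-id message" exactly as pasted above, whereas lines in the actual KNIME log file may carry extra fields such as timestamps. The function name gc_failures is hypothetical.

```python
import re
from collections import Counter

# LEVEL, node name, node id (such as 0:79), then the message text.
LOG_LINE = re.compile(r"^(ERROR|WARN)\s+(.+?)\s+(\d+:\d+)\s+(.*)$")


def gc_failures(lines):
    """Count 'GC overhead limit exceeded' messages per node."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match and "GC overhead limit exceeded" in match.group(4):
            counts[f"{match.group(2)} ({match.group(3)})"] += 1
    return counts


if __name__ == "__main__":
    console = [
        "ERROR Keygraph Keyword Extractor 0:79 Execute failed: GC overhead limit exceeded",
        "WARN Chi-Square Keyword Extractor 0:78 Execution canceled",
        "ERROR Topic Extractor (Parallel LDA) 0:61 Execute failed: GC overhead limit exceeded",
    ]
    for node, count in gc_failures(console).items():
        print(f"{node}: {count}")
```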

Unfortunately there are a lot of custom stopwords that I need to filter out, so the work is extremely iterative. It would be helpful if I could complete the workflow from start to finish in a shorter timeframe.

Thanks, Terry

Hey Terry,

Thanks for the information. I will look into the issues you reported. The LDA node shouldn’t take that long. We already have a ticket in our ticket system about the LDA node hanging.
Did you try closing and reopening the workflow and then re-executing the LDA node? Sometimes that works, although that’s not how it should be.
I will get back to you as soon as I have news.

Cheers,

Julian