Potential deadlock in SWT Display thread detected

Edlueze · May 4, 2020, 10:03am

I’ve started getting a “Potential deadlock in SWT Display thread detected” message in my knime.log file when I start running a large workflow. I recently tried to simplify the workflow by adding about 5 inner loops, but that’s when the problems started. I’m attaching my knime.log. Any pointers to a solution would be welcome!

knime_001.log (97.7 KB)

Edlueze · May 4, 2020, 11:27am

To add a bit of context to this problem, my workflow has been grouped into about 10 metanodes surrounded by an outer loop. If I start each metanode one by one, then do a final “run all” for the outer loop, the workflow seems to work fine. I only get the problem if I try to “run all” at the beginning. I looked for a memory problem in the log file but didn’t see one. I’m running the workflow on a dedicated 32 GB machine (with 20 GB devoted to KNIME in the knime.ini file).

quaeler · May 4, 2020, 3:32pm

Lame as the message is, is there an actual data or functional failure occurring with it (e.g. KNIME failing to complete the workflow execution or producing incorrect results)?

Edlueze · May 5, 2020, 2:02am

Hi Quaeler - thanks for your response! My problem is more of a failure to start than a failure to finish. The “Potential deadlock in SWT Display thread detected” occurs within the first 2 minutes of execution (execution normally takes 8 hours). The cursor in the KNIME GUI just spins, and the nodes don’t reach the RUNNING state.

Edlueze · May 5, 2020, 3:04am

There is a lot of CPU activity, but a check of the database shows that no data is being written.

Edlueze · May 5, 2020, 9:16am

I’ve still not found a good solution to this problem. But can anybody tell me, if I simply add more resources will this problem go away? I’m currently running 4 cores x 32 GB. If I ran 8 cores x 64 GB should things get better?

quaeler · May 5, 2020, 2:58pm

Hmm… yes, that’s a true problem then (if the nodes are not reaching the RUNNING state). Adding more computing resources will almost certainly not do anything to address this; it is probably a code issue that you’ve found (e.g. two different dependent things both vying and blocking on a UI update from different threads).
IIRC, that “Potential deadlock” is a check being done in KNIME code (as opposed to something being emitted by Eclipse libraries). It looks like the stack trace is being logged at DEBUG level; could you enable that level of logging and report back with the stack traces?
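
For readers unfamiliar with this kind of check, here is a minimal sketch of how a UI-thread watchdog can work in general. It is not KNIME's actual implementation: the class `UiWatchdogSketch`, its methods, and the printed message are hypothetical, and a single-thread executor stands in for the SWT Display thread. The idea is to queue a no-op probe on the UI thread and, if it does not run within a timeout, log the stacks of all threads.

```java
import java.util.Map;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Hypothetical watchdog; a single-thread executor stands in for the UI thread. */
class UiWatchdogSketch {

    /** Returns true if the "UI" thread ran a no-op probe within timeoutMs. */
    static boolean isResponsive(ExecutorService uiThread, long timeoutMs)
            throws InterruptedException {
        CountDownLatch probeRan = new CountDownLatch(1);
        // The probe is queued behind whatever the UI thread is currently doing.
        uiThread.execute(probeRan::countDown);
        return probeRan.await(timeoutMs, TimeUnit.MILLISECONDS);
    }

    /** Formats the stack of every live thread, similar in spirit to a thread dump. */
    static String dumpAllThreads() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<Thread, StackTraceElement[]> e : Thread.getAllStackTraces().entrySet()) {
            sb.append('"').append(e.getKey().getName()).append("\"\n");
            for (StackTraceElement frame : e.getValue()) {
                sb.append("\tat ").append(frame).append('\n');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        ExecutorService ui = Executors.newSingleThreadExecutor();
        CountDownLatch release = new CountDownLatch(1);
        // Simulate a UI thread that is stuck in a long-running call.
        ui.execute(() -> {
            try {
                release.await();
            } catch (InterruptedException ignored) {
                Thread.currentThread().interrupt();
            }
        });
        if (!isResponsive(ui, 200)) {
            System.out.println("UI thread did not respond in time; thread dump follows");
            System.out.print(dumpAllThreads());
        }
        release.countDown();
        ui.shutdown();
    }
}
```

A check like this can only say that the UI thread was busy for longer than the timeout, which is why the message is worded as a "potential" deadlock: a long but finite operation looks the same as a real one.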

Edlueze · May 6, 2020, 1:23am

Hi quaeler - the logging in the log file I already posted was being done at the DEBUG level. But I will run it again and send you a complete sequence from starting KNIME, to opening the workflow, to running. Is there other logging I can send you? Perhaps from the Eclipse side?

Is it possible the problem is related to the large number of nodes I schedule at the beginning of a run? If I pack the MetaNodes into independent workflows that are then called locally, will that help to stagger the run? I ran a quick experiment with this but hit “Execute failed: Java heap space” (java.lang.OutOfMemoryError: Java heap space) and didn’t investigate further.

quaeler · May 6, 2020, 1:41am

I’m sorry, I had forgotten you’d originally posted the log. I see it contains a thread dump, which I’ll look at this evening. Thanks.

quaeler · May 6, 2020, 2:39am

Hmmm… from a brief survey, it’s hard to tell whether the captured stack trace of the main thread is just a moment in time or something frozen that way. The main thread is releasing the reentrant lock that is the workflow lock some 10ish frames above the dump point, while the KNIME-Worker-8-Row Filter 0:1476 thread is blocked waiting on that lock.
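
As a rough illustration of that situation (one thread holding a reentrant lock while a worker waits on it), here is a minimal sketch using `java.util.concurrent.locks.ReentrantLock`. The class `WorkflowLockSketch` and the variable `workflowLock` are hypothetical names for illustration only; this is not KNIME code, and the worker uses a timed `tryLock` so the example terminates instead of hanging.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical demo of a worker thread waiting on a lock held by another thread. */
class WorkflowLockSketch {

    /** Starts a worker that tries to take the lock; reports whether it succeeded in time. */
    static boolean workerCanAcquire(ReentrantLock lock, long timeoutMs)
            throws InterruptedException {
        boolean[] acquired = new boolean[1];
        Thread worker = new Thread(() -> {
            try {
                // A plain lock() would block indefinitely; tryLock gives up after the timeout.
                if (lock.tryLock(timeoutMs, TimeUnit.MILLISECONDS)) {
                    acquired[0] = true;
                    lock.unlock();
                }
            } catch (InterruptedException ignored) {
                Thread.currentThread().interrupt();
            }
        }, "worker");
        worker.start();
        worker.join(); // join() makes the worker's write to acquired[0] visible here
        return acquired[0];
    }

    public static void main(String[] args) throws Exception {
        ReentrantLock workflowLock = new ReentrantLock();
        workflowLock.lock();
        try {
            System.out.println("while main holds the lock: "
                    + workerCanAcquire(workflowLock, 200));
        } finally {
            workflowLock.unlock();
        }
        System.out.println("after main releases it: "
                + workerCanAcquire(workflowLock, 200));
    }
}
```

A single thread dump taken during the first call would show the worker waiting on the lock, exactly as it would in a genuine deadlock. That is why one snapshot cannot distinguish "slow" from "stuck"; two or more dumps taken some time apart usually can.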

The main thread definitely has a lot of nested calls (the workflow notifies its wrapped-up nodes, who notify their wrapped-up nodes, … who tell 2 friends, who tell 2 friends, …). Perhaps as a(n ugly) workaround you could re-ugly your workflow and take everything out of their wrappings?

In the meantime, it needs to be triaged further by dev.

Edlueze · May 6, 2020, 11:34am

Thanks quaeler - your analysis was very helpful. I couldn’t spot the wrapped up nodes from the log file (how did you find that?) but that was the problem. “Unwrapping” seemed to help and now all my nodes start running within seconds. I put “unwrapping” in inverted commas because this should have only been a cosmetic change - the workflow logic shouldn’t have changed a bit. My workflow now looks hideously ugly, but it is running. I’ll follow up more tomorrow. Just wanted to say thanks and let you know that I’m back on track.

Edlueze · May 6, 2020, 1:24pm

I wanted to close out this discussion with an overview of what I was trying to do.

I have a lot of reporting nodes running off a common set of data and databases. To keep everything tidy, I created MetaNodes to pass through the common connections. I tried to wire up 6 “Report Data” MetaNodes in what looks like a series configuration but what, in fact, is parallel. Each of these “Report Data” MetaNodes has an inner loop. But each also passes through the upstream data directly to the next MetaNode, so all MetaNodes can run together. I don’t think this configuration would work with Component MetaNodes (as the whole node needs to finish before enabling the output), but regular MetaNodes should work fine.

Unfortunately this configuration doesn’t seem to work, and I needed to switch to a hideously ugly vertical configuration. But logically there should be no difference in how the workflow is run. You can see the “before” and “after” in the attached screenshots.

Let me know if you spot something that would allow me to go back to my original configuration.

Thanks again!

quaeler · May 6, 2020, 2:58pm

I’m glad that there’s at least a functional middle ground to get you working again until the bug is sussed out.

The hint in the log that the wrapped up nodes are likely involved is this portion of the stack trace of the “main” thread:

	at org.knime.core.node.workflow.WorkflowManager.markForExecutionAllAffectedNodes(WorkflowManager.java:2061)
	at org.knime.core.node.workflow.WorkflowManager.markAndQueueNodeAndPredecessors(WorkflowManager.java:2771)
	at org.knime.core.node.workflow.WorkflowManager.markAndQueuePredecessors(WorkflowManager.java:2731)
	at org.knime.core.node.workflow.WorkflowManager.markAndQueuePredecessors(WorkflowManager.java:2723)
	at org.knime.core.node.workflow.WorkflowManager.markForExecutionAllAffectedNodes(WorkflowManager.java:2058)
	at org.knime.core.node.workflow.WorkflowManager.markAndQueueNodeAndPredecessors(WorkflowManager.java:2771)
	at org.knime.core.node.workflow.WorkflowManager.markAndQueuePredecessors(WorkflowManager.java:2731)
	at org.knime.core.node.workflow.WorkflowManager.markForExecutionAllAffectedNodes(WorkflowManager.java:2058)
	at org.knime.core.node.workflow.WorkflowManager.markAndQueueNodeAndPredecessors(WorkflowManager.java:2771)
	at org.knime.core.node.workflow.WorkflowManager.markAndQueuePredecessors(WorkflowManager.java:2731)
	at org.knime.core.node.workflow.WorkflowManager.markForExecutionAllAffectedNodes(WorkflowManager.java:2058)
	at org.knime.core.node.workflow.WorkflowManager.markAndQueueNodeAndPredecessors(WorkflowManager.java:2771)
	at org.knime.core.node.workflow.WorkflowManager.markAndQueuePredecessors(WorkflowManager.java:2731)
	at org.knime.core.node.workflow.WorkflowManager.markForExecutionAllAffectedNodes(WorkflowManager.java:2058)
	at org.knime.core.node.workflow.WorkflowManager.markForExecutionAllAffectedNodes(WorkflowManager.java:2047)
	at org.knime.core.node.workflow.WorkflowManager.markAndQueueNodeAndPredecessors(WorkflowManager.java:2771)
	at org.knime.core.node.workflow.WorkflowManager.markAndQueuePredecessors(WorkflowManager.java:2731)
	at org.knime.core.node.workflow.WorkflowManager.markAndQueuePredecessors(WorkflowManager.java:2723)
	at org.knime.core.node.workflow.WorkflowManager.markForExecutionAllAffectedNodes(WorkflowManager.java:2058)
	at org.knime.core.node.workflow.WorkflowManager.markForExecutionAllAffectedNodes(WorkflowManager.java:2047)
	at org.knime.core.node.workflow.WorkflowManager.markAndQueueNodeAndPredecessors(WorkflowManager.java:2771)
	at org.knime.core.node.workflow.WorkflowManager.markAndQueuePredecessors(WorkflowManager.java:2731)
	at org.knime.core.node.workflow.WorkflowManager.markAndQueuePredecessors(WorkflowManager.java:2723)
	at org.knime.core.node.workflow.WorkflowManager.markForExecutionAllAffectedNodes(WorkflowManager.java:2058)
	at org.knime.core.node.workflow.WorkflowManager.markForExecutionAllAffectedNodes(WorkflowManager.java:2047)
	at org.knime.core.node.workflow.WorkflowManager.markAndQueueNodeAndPredecessors(WorkflowManager.java:2771)
	at org.knime.core.node.workflow.WorkflowManager.markAndQueuePredecessors(WorkflowManager.java:2731)
	at org.knime.core.node.workflow.WorkflowManager.markForExecutionAllAffectedNodes(WorkflowManager.java:2058)
	at org.knime.core.node.workflow.WorkflowManager.markAndQueueNodeAndPredecessors(WorkflowManager.java:2771)
	at org.knime.core.node.workflow.WorkflowManager.markAndQueuePredecessors(WorkflowManager.java:2731)
	at org.knime.core.node.workflow.WorkflowManager.markAndQueuePredecessors(WorkflowManager.java:2723)
	at org.knime.core.node.workflow.WorkflowManager.markForExecutionAllAffectedNodes(WorkflowManager.java:2058)
	at org.knime.core.node.workflow.WorkflowManager.markForExecutionAllAffectedNodes(WorkflowManager.java:2047)
	at org.knime.core.node.workflow.WorkflowManager.markAndQueueNodeAndPredecessors(WorkflowManager.java:2771)
	at org.knime.core.node.workflow.WorkflowManager.markAndQueuePredecessors(WorkflowManager.java:2731)
	at org.knime.core.node.workflow.WorkflowManager.markAndQueuePredecessors(WorkflowManager.java:2723)
	at org.knime.core.node.workflow.WorkflowManager.markForExecutionAllAffectedNodes(WorkflowManager.java:2058)
	at org.knime.core.node.workflow.WorkflowManager.markForExecutionAllAffectedNodes(WorkflowManager.java:2047)
	at org.knime.core.node.workflow.WorkflowManager.markAndQueueNodeAndPredecessors(WorkflowManager.java:2771)
	at org.knime.core.node.workflow.WorkflowManager.markAndQueuePredecessors(WorkflowManager.java:2731)
	at org.knime.core.node.workflow.WorkflowManager.markAndQueuePredecessors(WorkflowManager.java:2723)
	at org.knime.core.node.workflow.WorkflowManager.markForExecutionAllAffectedNodes(WorkflowManager.java:2058)
	at org.knime.core.node.workflow.WorkflowManager.markAndQueueNodeAndPredecessors(WorkflowManager.java:2771)
	at org.knime.core.node.workflow.WorkflowManager.markAndQueuePredecessors(WorkflowManager.java:2731)

A wrapped up set of nodes is described by a WorkflowManager that is a child of its parent (or the root) WorkflowManager, so seeing this in the stack trace looks like a parent marking its child, marking its child, marking its …
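
To make the parent-marks-child pattern concrete, here is a minimal sketch with a hypothetical `NestedContainerSketch` class. It is not KNIME's `WorkflowManager` and only borrows the idea of a method name from the trace; it shows how one "mark for execution" call recurses once per nesting level, so the stack depth grows with how deeply the nodes are wrapped.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a nested workflow container; not KNIME's WorkflowManager. */
class NestedContainerSketch {
    final String name;
    final List<NestedContainerSketch> children = new ArrayList<>();
    boolean marked;

    NestedContainerSketch(String name) {
        this.name = name;
    }

    /** Creates a nested container inside this one and returns it. */
    NestedContainerSketch addChild(String childName) {
        NestedContainerSketch child = new NestedContainerSketch(childName);
        children.add(child);
        return child;
    }

    /** Marks this container and, recursively, every nested one; returns the deepest level reached. */
    int markForExecution(int depth) {
        marked = true;
        int deepest = depth;
        for (NestedContainerSketch child : children) {
            // Each nesting level adds another frame of the same method to the call stack.
            deepest = Math.max(deepest, child.markForExecution(depth + 1));
        }
        return deepest;
    }

    public static void main(String[] args) {
        NestedContainerSketch root = new NestedContainerSketch("root");
        NestedContainerSketch current = root;
        for (int i = 1; i <= 6; i++) {
            current = current.addChild("metanode-" + i);
        }
        System.out.println("deepest recursion level: " + root.markForExecution(0));
    }
}
```

A repeating run of the same few frames in a trace is therefore what deep nesting looks like on its own, and does not prove a cycle or a bug.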

Edlueze · May 7, 2020, 5:11am

I created a simple test workflow to see if I could reproduce the problem, but so far KNIME is working as expected. I may have created a short-circuit or infinite loop in my original workflow, but even that might be of concern as something KNIME should have caught. I’ll continue to be watchful.

In the meantime, here is my test workflow with screenshots:

MetaNode Inner Loops 001.knwf (53.1 KB)

wiswedel · May 11, 2020, 8:16pm

(I am one of the KNIME guys; probably the one on whose desk this issue would land once we can reproduce it…)

Thanks for sharing all the insights and attempting to reproduce the issue. I can confirm all your observations. The call stack as extracted by @quaeler looks complex but still “normal” given the nesting you have in the workflow. I poked around with the workflow you attached in your last message but it all behaves well.

Once you have more insights/a recipe I am happy to dig deeper.

Bernd
