Apache Spark - Issue with Preview

Hi,
I have a problem with Apache Spark. When I want to view the tables in Preview, it takes a very long time or they do not load at all. Is this a bug? I didn’t have this problem a few weeks ago…
Regards

Hi @MelanieTU,

Spark executes all tasks lazily. This means that the last node in a chain of nodes, the one that actually exports data from Spark (such as Spark to Parquet, Spark to Table, or loading the preview), triggers the execution of all the nodes and jobs before it. Depending on the task, this can take some time. Spark optimizes the execution of the whole plan, and it needs all the pieces of the puzzle to do so.

In KNIME, the Spark nodes upstream of your current node might appear to execute very fast, while the last one takes all the time that the entire chain of subtasks needs in Spark.
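As a rough analogy in plain Python (not actual PySpark code), lazy evaluation means that defining transformations costs nothing; only a terminal "action" at the end pulls data through the whole chain:

```python
# Plain-Python analogy for Spark's lazy evaluation (not PySpark itself).
# Generator expressions build up a pipeline of transformations without
# running them; only a terminal step pulls data through the chain.

data = range(1_000_000)                      # "source" node
filtered = (x for x in data if x % 2 == 0)   # transformation: defined, not run
squared = (x * x for x in filtered)          # another transformation: still not run

# Nothing has executed yet. Only this "action" (comparable to loading the
# preview, Spark to Parquet, or Spark to Table) triggers the actual work:
first_five = [next(squared) for _ in range(5)]
print(first_five)  # [0, 4, 16, 36, 64]
```

This is why the last node in the chain is the one that appears slow: it is the point where all the deferred work actually happens.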

Maybe something in your workflow changed and now takes longer. Or did you upgrade KNIME, so that the same workflow and data now take longer?

Cheers,
Sascha


Hi @sascha.wolke ,
Yes, I have just added a PySpark script to my workflow. All PySpark scripts in the workflow run fast, including the newly added PySpark node. Only when I want to look at the table in preview mode of the last, newly added script after it has been executed does it take forever, or the table is not loaded at all…

Regards

Hi @MelanieTU,

If you have several Spark / PySpark Script nodes connected, maybe start with the first one and open its preview. Then continue with the following ones, and you might identify the slow one.

Another option is the Spark UI. You can find it by right-clicking the context-creating node (Create Spark Context (Livy), Create Databricks Environment, or Create Local Big Data Environment) and selecting Spark context. Somewhere at the top there should be a link to the Spark Web UI.

Cheers,
Sascha


Hi @sascha.wolke,
I have now executed every single PySpark node in my workflow individually and looked at the Spark UI. The problem is with the last PySpark node (see picture). How can I solve the problem, or what could be the reasons that viewing the table in preview mode takes so long for this script? I understand the “lazy” thing… but with all my other scripts it takes at most 1-2 seconds until I can display 100 rows…

Regards

Hi @MelanieTU,

The important step is to open and load the preview at the output port of every single PySpark node.

There might be many reasons why the script does not finish. I guess you have a small dataset? Can you share the workflow or the PySpark code? Perhaps someone can identify a possible problem.
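One common pattern worth checking (again a plain-Python analogy, not your actual workflow): a preview of only 100 rows can still force the full upstream computation if the plan contains a step such as a sort or a global aggregation that needs all the data before it can emit a single row.

```python
# Analogy: taking the first few rows is cheap for a streaming step like a
# filter, but a sort (comparable to Spark's orderBy) must consume the
# entire input before it can yield even one row.
import itertools

rows = range(1_000_000)

# Cheap preview: the filter yields matching rows as it goes.
cheap = list(itertools.islice((x for x in rows if x % 3 == 0), 5))
print(cheap)  # [0, 3, 6, 9, 12]

# Expensive preview: sorted() materializes all 1,000,000 values first,
# even though only 5 of them are kept afterwards.
expensive = sorted(rows, reverse=True)[:5]
print(expensive)  # [999999, 999998, 999997, 999996, 999995]
```

So if the new script adds something like a sort, a join that multiplies rows, or a full aggregation, that alone could explain why this one preview is so much slower than the others.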

Cheers,
Sascha
