How to reduce table conversion time for Spark-related nodes

Hello KNIME support team and users.

I am trying to export the result of Spark preprocessing nodes (e.g. “Spark Normalizer”, “Spark Column Rename”, etc.) to a “Writer” node.

In this case, it seems essential to use a node like “Spark to Table” to convert the data.

Is there any way to skip this intermediate “Spark to Table” step and export directly to the “Writer” node?

Spark is useful for handling big data, but when I use a node like “Spark to Table” to convert the resulting table, that node takes far too long to execute.

Your answer will be appreciated.

What kind of data size are we talking about, and how long does it take?
br

Hi @Daniel_Weikert,

I know that if I want to load a large amount of data from a DB or from local storage (I loaded 100 million rows as a Parquet file locally) and connect it to a Spark node, I have to go through the “Table to Spark” node.

Also, after the “Table to Spark” node, I need to go through several Spark preprocessing nodes, and then through the “Spark to Table” node so that a “Writer” node can save the data as CSV or Parquet on my personal PC.

These “Table to Spark” and “Spark to Table” nodes took about 15 minutes to convert the 100 million rows. Is there any way to reduce this time?

Are conversion nodes like “Table to Spark” mandatory? Are there any other options?

Thanks.

@JaeHwanChoi where is this Spark process being done? On your local machine or on a server?

You need to take into account how Spark works with lazy evaluation (Lazy evaluation - Wikipedia). That means the processing only starts once you ask for something to be done, like writing the data to a table. The other Spark nodes, which seem very fast, are just ‘plans’ until execution starts.
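
To make that concrete outside of KNIME, here is a minimal PySpark sketch of the same behaviour (the paths and column names are made up for illustration):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lazy-eval-demo").getOrCreate()

# These are only transformations: Spark just records an execution plan here,
# nothing is read or computed yet (which is why such steps appear "fast").
df = (spark.read.parquet("/data/input.parquet")           # placeholder path
        .withColumn("price_norm", F.col("price") / 100)   # placeholder column
        .filter(F.col("price_norm") > 0))

# Only an action (write, count, collect, ...) triggers the actual work,
# so the final write/conversion step carries most of the runtime.
df.write.mode("overwrite").parquet("/data/result.parquet")  # placeholder path
```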

Using Spark always comes with an additional cost for the machine setting it up and starting the workers; the benefit is that you can process very large amounts of data by using the memory of your (hopefully many) distributed servers. The performance very much depends on which resources you have in terms of RAM and processors, and how many …
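
If you manage the Spark setup yourself, those are the kind of knobs that determine performance; a sketch with placeholder numbers (not recommendations), to be sized to whatever RAM and cores you actually have:

```python
from pyspark.sql import SparkSession

# Placeholder values - adjust to the hardware that is really available.
spark = (SparkSession.builder
         .appName("sized-session")
         .config("spark.executor.instances", "4")  # how many workers
         .config("spark.executor.cores", "4")      # cores per worker
         .config("spark.executor.memory", "8g")    # RAM per worker
         .config("spark.driver.memory", "4g")
         .getOrCreate())
```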

You can also write the result to a big data system, e.g. a table with Hive or Impala, which would be a standard procedure.
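
In plain Spark code that standard route would look roughly like the sketch below; it assumes Hive support is available on the cluster, and the path and table name are made up:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("write-to-hive")
         .enableHiveSupport()   # needs a Hive metastore on the cluster
         .getOrCreate())

df = spark.read.parquet("/data/result.parquet")  # placeholder path

# Persist the result as a Hive table on the cluster instead of
# pulling all rows back into KNIME first.
df.write.mode("overwrite").saveAsTable("analytics.result_table")  # placeholder name
```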

Thank you for your response, @mlauber71.

Currently, the Spark process is used on both the local machine and the server.

Does this mean that, given the way Spark works, nodes that move data into and out of a table (“Table to Spark”, “Spark to Table”) are mandatory?

The reason I want to convert to a table is that I need to put the result into object storage as a file.

If I don’t use a big data system like Hive or Impala, is it inevitable that the speed will be slow?

Hi @JaeHwanChoi,

The Table to Spark and Spark to Table nodes transfer the data between the Spark cluster and the KNIME AP/Server, which might take some time on large data.

There are dedicated nodes to read/write the data using Spark on the cluster, like Parquet to Spark and Spark to Parquet. You can read/write them in KNIME later on using the Parquet nodes.
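
Conceptually, this is the same idea as reading and writing Parquet directly inside a Spark job, as in the rough sketch below (paths and column names are placeholders); the rows then never have to travel through KNIME:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("cluster-side-io").getOrCreate()

# Read the Parquet data directly into Spark on the cluster ...
df = spark.read.parquet("hdfs:///data/input.parquet")    # placeholder path

# ... run the preprocessing there ...
df = df.withColumn("price_norm", F.col("price") / 100)   # placeholder column

# ... and write the result back as Parquet on the cluster, so the data
# does not pass through the KNIME Analytics Platform at all.
df.write.mode("overwrite").parquet("hdfs:///data/result.parquet")  # placeholder path
```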

Cheers,
Sascha

@JaeHwanChoi a basic question: do you actually need big data technology because your data is so large it would not fit on a ‘normal’ server/computer, and if yes, do you have the resources to do it? The Create Local Big Data Environment – KNIME Community Hub is just there to demonstrate and develop, not to be used in real production.

You could set up a Spark cluster on a single machine (like here: Read ORC file into KNIME's Python node – KNIME Community Hub), but I do not think that would bring any benefit in speed; it might just take more resources to handle.
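
For reference, a “Spark cluster on one machine” essentially means a local-master session like this sketch:

```python
from pyspark.sql import SparkSession

# A "cluster" on one machine: all local cores, no distributed workers.
# It still pays Spark's scheduling and serialization overhead, which is
# why it rarely beats plain KNIME/Python processing on the same box.
spark = (SparkSession.builder
         .master("local[*]")
         .appName("local-spark")
         .getOrCreate())
```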

I think you might want to just do the transformations on your machine/server with traditional KNIME (or Python) procedures. Or you could use a database.

Concerning performance there is this:
