Some questions about Spark nodes.

Hello.

I have some questions about Spark nodes.
They may be obvious, but if you can answer them, it will help me understand more completely.

  1. How do I use the nodes below?

    • Persist Spark DataFrame/RDD
    • Unpersist Spark DataFrame/RDD
  2. There are two nodes with the same functionality, but one is a plain Spark node and the other uses MLlib.
    What is the difference between them, and which of the two is better?
    (example)

    • Spark Decision Tree Learner
    • Spark Decision Tree Learner (MLlib)
  3. Is there a good way to use the “Spark Missing Value” node on specific columns only?

Hi @hhkim -

The persist/unpersist nodes are really about caching data for improved performance in Spark. The persist node is probably the more commonly used of the two. You can see a nice example use case here where a dataset is persisted specifically before it is looped over repeatedly:

Of course, once you no longer need the data to be persisted, you can unpersist it to keep your Spark instance tidier.
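If it helps to see the idea outside of KNIME, here is a minimal PySpark sketch of the same pattern; the file path and the "score" column are invented purely for illustration:

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical input; in KNIME this would arrive from an upstream Spark node.
df = spark.read.parquet("/tmp/example_data.parquet")

# Cache the data so the read/transform plan is not re-executed
# every time the DataFrame is used inside the loop.
df.persist(StorageLevel.MEMORY_AND_DISK)

for threshold in [0.1, 0.5, 0.9]:
    # Each iteration reuses the cached data instead of recomputing it.
    print(threshold, df.filter(df["score"] > threshold).count())

# Release the cached blocks once they are no longer needed.
df.unpersist()
```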

The difference between the two Spark Decision Trees is which library is used to implement the DT algorithm. The former uses spark.ml and the latter uses MLlib, and they write different types of models at their output ports. I don’t know that one is better than the other per se, but I can tell you that in our L4 courses we primarily teach using the spark.ml version.
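To make the library difference concrete, here is a rough PySpark sketch of how the two APIs are called (the column names, parameters, and the `train_df`/`train_rdd` inputs are placeholders, not anything KNIME-specific):

```python
from pyspark.ml.classification import DecisionTreeClassifier  # spark.ml, DataFrame-based
from pyspark.mllib.tree import DecisionTree                    # MLlib, RDD-based (older API)

# spark.ml: expects a DataFrame with a vector "features" column and a "label" column.
dt_ml = DecisionTreeClassifier(labelCol="label", featuresCol="features", maxDepth=5)
# model_ml = dt_ml.fit(train_df)   # -> DecisionTreeClassificationModel

# MLlib: expects an RDD of LabeledPoint and is trained via a static method.
# model_mllib = DecisionTree.trainClassifier(
#     train_rdd,                   # RDD[LabeledPoint]
#     numClasses=2,
#     categoricalFeaturesInfo={},
#     maxDepth=5,
# )                                # -> DecisionTreeModel
```

The different model objects are also why the two KNIME nodes have different output port types and are not interchangeable downstream.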

I don’t really understand what you mean by your third question. In general, whether using Spark or not, the Missing Value node is used to either impute or remove data based on whether something is missing or not. Maybe you have an example you’re thinking of?


@hhkim the question of when to persist and unpersist data in Spark has to do with the concept of lazy evaluation. Spark will not immediately execute a command but ‘make a plan’ and only execute once data has to be written or transferred. After it is finished, the data will be ‘forgotten’ and only the plan remains.
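A small PySpark sketch of that lazy behaviour, with toy data just for illustration:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 2.0), (2, 4.0), (3, 6.0)], ["id", "value"])

# Transformations only extend the execution plan; nothing is computed yet.
doubled = df.withColumn("value2", F.col("value") * 2)
filtered = doubled.filter(F.col("value2") > 5)

# An action such as count() triggers the whole plan...
print(filtered.count())

# ...and a second action re-runs the plan from the start,
# because the intermediate result was not persisted.
print(filtered.collect())
```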

In KNIME, for example, if you are in the process of developing things, or you use Spark data in a loop, it might make sense to persist at an intermediate stage so the whole plan does not have to be executed again.

You can find an example here where a persist is used just before the loop.

Persist can also be used to make a process more stable. But it comes with a cost in the form of memory and time (to collect the data and organise the storage in memory or on disk), so you might have to plan ahead whether persisting is the right move.
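In plain Spark terms, that trade-off shows up in the storage level you pick when persisting; a rough sketch with toy data:

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(1_000_000)  # toy data, just for illustration

# Pick ONE storage level per DataFrame; unpersist() first if you want to change it.
df.persist(StorageLevel.MEMORY_ONLY)        # fast, but recomputes partitions that don't fit in memory
# df.persist(StorageLevel.MEMORY_AND_DISK)  # keeps what fits in memory, spills the rest to disk
# df.persist(StorageLevel.DISK_ONLY)        # cheapest on memory, slowest to read back

df.count()      # the first action after persist() pays the cost of materialising the cache
df.unpersist()  # free the memory/disk again once the data is no longer needed
```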

Also, if you want to combine Spark and machine learning, taking a look at the H2O.ai integration in KNIME might be of interest.


Thanks for your reply.
What I meant with my third question is this: if my dataset contains empty columns and I use that data as input for “Spark Missing Value”, I get an error because of the empty columns.

However, this node does not have an option to select columns.
So, to remove missing values from specific columns, I split off those columns, remove the missing values, and combine them again.

It’s a bit cumbersome because I have to go through several steps to get the data I want, so I was wondering if there is a simpler way.
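In plain Spark terms, what I am after is something like this column-wise handling (just a sketch with made-up column names):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, None, 3.0), (2, 5.0, None), (3, 6.0, 7.0)],
    ["id", "a", "b"],
)

# Drop rows that have a missing value only in the selected column(s)...
cleaned = df.na.drop(subset=["a"])

# ...or impute only the selected columns, leaving the rest untouched.
imputed = df.na.fill({"a": 0.0, "b": -1.0})

cleaned.show()
imputed.show()
```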

@hhkim Spark-based ML models tend to be very picky when it comes to missing values, NaNs, and continuous variables. This is why I go through all these steps to make sure I have only ‘clean’ double values without any missings. Sometimes, even when a model computes at first, it might fail once you try to apply it. In my experience, it also depends a lot on the Spark version.

So you might want to go through all these cleaning steps before you calculate your model.
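As a rough sketch of that kind of cleaning in plain PySpark (toy data and column names are made up): cast everything to double, turn NaN into ordinary missings, and then drop or impute what is left:

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.getOrCreate()

# Toy data: a number stored as string, a NaN, and an ordinary missing value.
df = spark.createDataFrame(
    [("1.0", 2.0, 3.0), ("4.0", float("nan"), 5.0), ("6.0", 7.0, None)],
    ["x1", "x2", "x3"],
)
feature_cols = ["x1", "x2", "x3"]

for c in feature_cols:
    # Force the column to a real double, then turn NaN into null so that
    # NaN and ordinary missings are handled the same way afterwards.
    as_double = F.col(c).cast(DoubleType())
    df = df.withColumn(c, F.when(F.isnan(as_double), None).otherwise(as_double))

# Finally drop (or impute) rows that still contain missings before training.
clean = df.na.drop(subset=feature_cols)
clean.show()
```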

