What does “Number of rows to validate on” mean in PySpark?

Hello!

This might be a basic question, but what does “Number of rows to validate on” mean in a PySpark Script?

Does increasing the number make it faster?

I would be grateful for an answer.

Hi @JaeHwanChoi -

This number is essentially how many rows of the dataset you want to test your script on. Clicking the “Validate on Cluster” button will pass that number of rows (in this case, 100) through your script and return the results in the console below. This lets you do a quick QC of your script before running it on potentially much larger datasets.
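To make that concrete, here's a minimal PySpark sketch of the idea. Everything here (the Spark session, `df`, and `my_transform`) is hypothetical and just stands in for your own script and data; it is not the node's actual internals.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("validation-sketch").getOrCreate()

# Stand-in for the full input table (hypothetical data).
df = spark.range(1000).withColumn("value", F.col("id") * 2)

def my_transform(frame):
    # Stand-in for whatever your PySpark script actually computes.
    return frame.filter(F.col("value") > 100)

# "Validate on 100 rows": run the same logic on just the first 100 rows
# to confirm the code works before touching the full dataset.
preview = my_transform(df.limit(100))
preview.show()
```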

Increasing the number doesn’t have an effect on the speed of the node execution. 🙂

Thank you for your reply, @ScottF!

So assuming my current dataset has 1000 rows and 5 columns: if I specify “Number of rows to validate on” as 200, are you saying that the analytic code in my script will use only 200 of the 1000 rows and show the results in the console below?

Is there any difference between the time it takes to get a result with “Number of rows to validate on” set to 200 inside the script, and the time it takes to run the script on all 1000 rows?

Your answer would be appreciated.

The validation is just there to help you understand if your code works “in the moment” on a small subsample of your data. In the realm of Spark, you might be dealing with billions of rows of data, so doing a quick check with 200 rows can give you peace of mind before you run the node “for real”.

Changing this number has no effect on performance when you execute the node with the full dataset.
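Continuing the hypothetical sketch above, the real execution simply runs the same logic without any limit:

```python
# Executing the node "for real": the same transform runs over the full
# DataFrame, so the validation row count plays no part here.
full_result = my_transform(df)       # all 1,000 rows this time
print(full_result.count())           # Spark processes the entire dataset
```

The validation run can of course finish faster simply because it touches fewer rows, but the setting itself never changes how the full execution behaves.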

Thank you for your kind response, @ScottF!

Thanks to you, I understand it perfectly now.
