Date & Time format forced conversion issue in Pyspark Script node

Hello KNIME Support.

When using the PySpark Script node, the format of a Date & Time column is forcibly changed when the dataframe is exported through the Output Port.

In the image below, when I run the code within the PySpark Script node, the Date & Time format of the Date column is output as 2010 - 02 - 05 00:00:000.

However, when I run that PySpark Script node and export the spark dataframe through the Output Port, the Date column changes to the following.

I need to use the format “2010 - 02 - 05 00:00:000”, which is the format of the Date column when the code is executed inside the PySpark Script node, but the output is not what I want.

To change the column to the format I want, I would need a node for Spark Date & Time conversion, which doesn’t exist.

Also, I can’t use Spark to Table and then the Date & Time conversion node, because the amount of data is so large that converting it to a table for preprocessing would be very inefficient.

Unfortunately, I can’t share the code and data used due to ongoing customer security issues.

Is there a way around this, or is it simply an unsupported format within Pyspark Script?

Any answers would be greatly appreciated.

@JaeHwanChoi maybe you can elaborate on what it is you want to do, and maybe you can create an example using the local Big Data environment of KNIME to demonstrate how a Date/Time variable is being handled and where you experience problems.

Where would the results from the PySpark nodes be stored? Would this require a Date/Time variable in a generic format (in which case it might just be a question of how the results are displayed), or would you rather have a string that represents the Date/Time without the possibility of using it as such?

Hi @JaeHwanChoi,

Most of the time, date & time columns are stored efficiently without any formatting, e.g. when you store them as Parquet. If you need to format them, you might convert the column to a string column using pyspark.sql.functions.date_format — PySpark 3.5.0 documentation, or use a Spark SQL snippet with the patterns from Datetime patterns - Spark 3.5.0 Documentation.

Cheers,
Sascha

