Dates changing when read from Spark

Hello!
I am using an HDFS connection and Livy for Spark, and I am reading data from *.parquet files on HDFS.
The data is partitioned by date (researchdate column). I can see all the partitions in the file explorer of the “Parquet to Spark” node.
But when I pull the data through the “Spark SQL Query” node, like:

SELECT count(*), researchdate FROM #table#
GROUP BY researchdate
ORDER BY researchdate DESC

the top row shows the date minus one, not the actual latest date. For example, today is 2021-08-19, but it shows the latest date as 2021-08-18.
It is strange, because the count(*) value is equal to the value I get from the same query in Zeppelin, for example, but the date differs.

Any ideas why KNIME might change the date?

Hi @sin_aa, and welcome to the KNIME Community.

I don’t think KNIME is changing the date, or any data for that matter. KNIME just reads and displays what it gets from the server.

Could it be that the data was cached? Try to disconnect and re-run the query.
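
If caching turns out to be the culprit, clearing it from Spark SQL might look like this (just a sketch; I am assuming you can run these statements where you run your queries, and that #table# resolves to the temp view behind your data):

CLEAR CACHE;            -- drops everything cached in the current Spark session
REFRESH TABLE #table#;  -- re-reads the file listing and metadata for the table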

Also, from KNIME, can you run:
SELECT MAX(researchdate) FROM #table#

And see what you get?


Hello, thanks!
I tried restarting, disconnecting, and so on; the result is the same.
SELECT MAX(researchdate) FROM #table# gives “2021-08-18”,
but the same query in Zeppelin gives “2021-08-19”…

It is also strange that current_date() returns 2021-08-18 (yesterday) in KNIME.

Hi @sin_aa,

welcome to the KNIME community! Spark does some timezone conversions depending on the Spark version you use. Is your cluster running on UTC or some local timezone? The Create Spark Context (Livy) node has a time tab in the configuration dialog that might help.

See spark.sql.session.timeZone in the Spark documentation.
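
As a quick check, you can also inspect and override the setting from Spark SQL itself (a minimal sketch; SET is standard Spark SQL, but I am assuming the KNIME node lets you run such statements):

SET spark.sql.session.timeZone;        -- shows the current session timezone
SET spark.sql.session.timeZone = UTC;  -- overrides it for this session only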

Cheers
Sascha


Hi @sin_aa, based on the info you provided (the MAX and current_date results), and based on what @sascha.wolke mentioned, your date data is probably stored with a timezone, and Spark converts the date to the requester’s timezone before returning it.

So it returns the date according to your KNIME session’s timezone setting, while your Zeppelin session must be set to a different timezone. That is why you get the same count on both systems, but different dates.
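
To illustrate the effect (a made-up example; the epoch value and timezone are only for demonstration), the same instant can land on two different dates depending on the session timezone:

SET spark.sql.session.timeZone = UTC;
SELECT to_date(from_unixtime(1629338400));   -- 2021-08-19 (02:00 UTC)

SET spark.sql.session.timeZone = America/Los_Angeles;
SELECT to_date(from_unixtime(1629338400));   -- 2021-08-18 (19:00 local time)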


Thanks a lot, @bruno29a and @sascha.wolke!

The problem was solved by adding these settings to the “Create Spark Context (Livy)” node:

spark.sql.session.timeZone | UTC
spark.executor.extraJavaOptions | -Duser.timezone=UTC
spark.driver.extraJavaOptions | -Duser.timezone=UTC
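
For anyone landing here later: a quick sanity check after applying these settings (reusing the queries from this thread) is

SELECT current_date();                   -- should now match the cluster's UTC date
SELECT MAX(researchdate) FROM #table#;   -- should now match the Zeppelin result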
