Hello!
I am using an HDFS connection and Livy for Spark, and I am reading data from *.parquet files on HDFS.
The data is partitioned by date (researchdate column). I can see all partitions in the file explorer of the “Parquet to Spark” node.
But when I pull the data through the “Spark SQL Query” node, like:
SELECT count(*), researchdate FROM #table#
group by researchdate
order by researchdate desc
At the top I see the previous day, not the latest date. For example, today is 2021-08-19, but the latest date shown is 2021-08-18.
It is strange, because the count(*) value equals the value I get from the same query in Zeppelin, for example. Only the date differs.
Hello, thx!
I tried restarting, disconnecting and so on, but the result is the same. SELECT MAX(researchdate) FROM #table# gives “2021-08-18”.
But the same query in Zeppelin gives “2021-08-19”…
It is strange, but current_date() returns 2021-08-18 in KNIME (yesterday).
Welcome to the KNIME community! Spark does some timezone conversions depending on the Spark version you use. Is your cluster running on UTC or some local timezone? The Create Spark Context (Livy) node has a time tab in the configuration dialog that might help.
Hi @sin_aa, based on the info you provided (MAX and current date) and on what @sascha.wolke mentioned, your date data is probably stored with a timezone, and Spark converts the date to the timezone of the requester before returning it.
So it returned the date according to the timezone your KNIME is set to. Your Zeppelin must be set to a different timezone. That is why you get the same counts on both systems but different dates.
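To make the effect concrete, here is a minimal plain-Python sketch (no Spark involved) of how one and the same instant falls on two different calendar dates depending on the timezone used to render it. The UTC+3 offset and the 00:30 timestamp are assumptions chosen for illustration only; the thread does not say which timezones the cluster, KNIME or Zeppelin actually use. The helper `date_in_zone` is hypothetical as well.

```python
from datetime import datetime, timedelta, timezone

# Assumed zones for illustration; the real cluster/client zones are unknown.
UTC = timezone.utc
LOCAL = timezone(timedelta(hours=3), name="UTC+03:00")


def date_in_zone(instant: datetime, tz: timezone) -> str:
    """Return the calendar date of an absolute instant as seen in tz."""
    return instant.astimezone(tz).date().isoformat()


# One absolute instant: 00:30 on 2021-08-19 in the assumed UTC+3 zone,
# which is 21:30 on 2021-08-18 in UTC.
instant = datetime(2021, 8, 19, 0, 30, tzinfo=LOCAL)

date_local = date_in_zone(instant, LOCAL)
date_utc = date_in_zone(instant, UTC)

print(date_local)  # 2021-08-19
print(date_utc)    # 2021-08-18
```

The rows are identical in both cases, so the counts match; only the date label shifts by one day. In Spark SQL the rendering zone is normally governed by the `spark.sql.session.timeZone` setting, so comparing that value between the Livy session and Zeppelin would be a reasonable first check. Whether the time tab of the Create Spark Context (Livy) node maps to exactly this setting is an assumption here.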