Hi all,
I think this post is important/describes a quite severe bug in handling date× but if it is to long tl;dr:
-
The “Guess data type…” feature suggests a completely wrong combination of “YYYY” for a week-based-year format with “dd” or “MM” which has no use but corrupts the data. Dates are then of by around one year. Additionally if users type this wrong combinations in Knime freely allows this. This should be changed.
-
additionally the description of nodes should include a short 3 bulletpoint summary of what a node uses for calculation. Often its just a blackbox one has to trust.
after having a quite bad error in data described here:
https://forum.knime.com/t/serious-error-in-date-time-to-string-node/43211/3
i studied the last 2 days how KNIME handles dates and Times more closely and i feel that some effort should be undertaken to make this more error-free and in the documentation of nodes more consistent.
One quick example with the String to Date&Time node:
If you are not sure which Date format is the right one and clikc on autoguess it gives you:
an error message regarding time, which is a bit nonsensical in our case since data has Date&Time but you think well that is not a problem i can fix this to Date&Time. But while you are distraced with the not so useful red message, a severe bug has been introduced to your date: “YYYY” which is a so called “week-based-year” format has been combined with an exact “dd” day of month and “MM” month format. The 2 should never be combined and almost guarantess wrong data but more to this later. So i thought no prolbem and changed it to “Date&Time”:
Notice how the wrong combination of “YYYY” with “MM” and “dd” stays while the user sees the red ink and thinks: “Well yes no problem”.
While in reality what has happened with “YYYY” now is that the digits for the year follow the rules of the the week-based-year and are now combined with an precises day of month “dd” and exact Month “MM” - which never should be done and leads with almost certainty to errors.
The week-based-year format “YYYY” is intended for a week-based-year time format so somehting pretty exotic like:
2022-W21 or (this reads calendar week 21 of 2022)
2022-W21-5 (weekday 5 of Week 21…)
This formats are used when someone “thinks in Weeks” more than in years. So the rule is that a week is never allowed to broken into “parts” just because it’s end of the year.
You can read this up on the internet for “week-based-year” i can not write all this here. But basically for this format there are this 4 rules:
- a week goes from weekday 1 (Mo) till 7 (Sunday)
- all days of a week get the same year attached - no matter in what year they actually are
- a year consists of 52 or 53 weeks
- Week 1 of a new year is when the “normal calendar” (gregorian) has a new year and the first thursday of that year will be in week 1 and all the days of that week starting from monday get the year of the new year assigned. Every day before that will be moved to the old year and gets the year of the old year attached.
So “YYYY” attaches with a 86% chance either the last 1-3 days of the old year or 1-3 day of the new year a wrong year. And it does this without regrets since it “says” “a week has a year attached and that is important days don’t really count.”
As i said It’s fine if you use it with a truly week based date like:
2022-W51
But what makes absolutely no sense is to use it in conjunction with “dd” or “MM” this produces just wrong dates around ending of year, beginning of the new year.
And this is what happened in my case (in the other thread).
the date i stumbled across was the 1st january 2022.
So according to the rules the first thursday of the year 2022 is the 6th of jan. So the week from the 3rd to the 9th of january is week 1 (w1) of the new year.
And the january 2nd and first get the year of 2021 assigned. Leading to the wrong timestamps for those days of 01.01.2021 and 02.01.2021 (european format: dd.MM.YYYY)
So while in the week based format just 3 days stay in the week and therefore have a wrong year assigned (but since they in the right week its just slightly off) this produces now timestamps that are 365ish days wrong.
By now it should be clear that “YYYY” belongs strictly to dateformat that is weekbased (has soemthing like “W17” in it) and can NEVER be mixed with “dd” or “MM”. It’s not unexpected or quirky it’s plain wrong. It just creates wrong timestamps with no use to think of.
If one does it he basically just activates a lottery that the last 3 days of a year the 29th-31st or the first 3 days of the new year 1-3. of january will get randomly a wrong year attached.
Again: If someone uses something like: YYYY-MM-dd they clearly don’t know what they are doing (like me for the last year) it just leads to corrupted data.
It’s quite disastrous that the autoguess feature guesses this wrong format.
I find this in the pulldown menues of this node and the similar node too: String to Date&Time. I don’t know that it is because i (unfortunately) used this the last year, or if this is for other useers the same.
What should be done about this?
-
obviously: If a user types in something YYYY only allow weekbased formats with something like “W52”. If the user types “dd” or “MM”. it should be blocked and the user alerted that this is wrong and corrupts the dates.
-
If someone with solid knowledge of java date&time formats could go through the nodes dealing with date and time and have look if everything is as it should be. Sometimes a bit quirky stuff goes on i have not found another error, but i found it a bit worrisome that the String to Date&Time node cuts of the seconds for example even if the format specifically included “HH:mm:ss” and the input data has seconds. They are all 00 but the minutes are all 00 too and don’t get cut off… This post is long enough.
For financial data or any time series data the timestamp is basically everything otherwise it’s like having wrong price data.
And what would be really good if in the future the descriptions of the nodes would include at the top in the help on what a node depends to work. Just that the person who wrote it includes a short 3 bulletpoints what this node uses: This node uses: java standard editon 8 all known problems of DateTimeFormatter lead to problems here. Or this node uses the libraries of Python XXX for numerical calculations… No excruciating details just 3 sentences what’s the fundament of the node. It’s right now a pretty opaque black box often and it doesn’t fell safe for me anymore. To be fair the Date&Time to String and vice versa did link to the java DateTime formatter page but other nodes often don’t have any explanation of what they are doing under the hood. … Sometimes the explanations even contain a warning about the week-based-year like in the Extract Date&Time node but this so important info is completely missing in the aforemetined nodes…
So i like Knime but this has been a bit of a sobering experience for me the last days…
Have a good one…