An example what can go wrong with DatesTimes in Knime: the String to Date&Time node

Hi all,

I think this post is important/describes a quite severe bug in handling date&times but if it is to long tl;dr:

  • The “Guess data type…” feature suggests a completely wrong combination of “YYYY” for a week-based-year format with “dd” or “MM” which has no use but corrupts the data. Dates are then of by around one year. Additionally if users type this wrong combinations in Knime freely allows this. This should be changed.

  • additionally the description of nodes should include a short 3 bulletpoint summary of what a node uses for calculation. Often its just a blackbox one has to trust.

after having a quite bad error in data described here:
https://forum.knime.com/t/serious-error-in-date-time-to-string-node/43211/3

i studied the last 2 days how KNIME handles dates and Times more closely and i feel that some effort should be undertaken to make this more error-free and in the documentation of nodes more consistent.

One quick example with the String to Date&Time node:

If you are not sure which Date format is the right one and clikc on autoguess it gives you:
AutoGuessWrong1

an error message regarding time, which is a bit nonsensical in our case since data has Date&Time but you think well that is not a problem i can fix this to Date&Time. But while you are distraced with the not so useful red message, a severe bug has been introduced to your date: “YYYY” which is a so called “week-based-year” format has been combined with an exact “dd” day of month and “MM” month format. The 2 should never be combined and almost guarantess wrong data but more to this later. So i thought no prolbem and changed it to “Date&Time”:

AutoGuessWrong
Notice how the wrong combination of “YYYY” with “MM” and “dd” stays while the user sees the red ink and thinks: “Well yes no problem”.

While in reality what has happened with “YYYY” now is that the digits for the year follow the rules of the the week-based-year and are now combined with an precises day of month “dd” and exact Month “MM” - which never should be done and leads with almost certainty to errors.

The week-based-year format “YYYY” is intended for a week-based-year time format so somehting pretty exotic like:
2022-W21 or (this reads calendar week 21 of 2022)
2022-W21-5 (weekday 5 of Week 21…)

This formats are used when someone “thinks in Weeks” more than in years. So the rule is that a week is never allowed to broken into “parts” just because it’s end of the year.
You can read this up on the internet for “week-based-year” i can not write all this here. But basically for this format there are this 4 rules:

  • a week goes from weekday 1 (Mo) till 7 (Sunday)
  • all days of a week get the same year attached - no matter in what year they actually are
  • a year consists of 52 or 53 weeks
  • Week 1 of a new year is when the “normal calendar” (gregorian) has a new year and the first thursday of that year will be in week 1 and all the days of that week starting from monday get the year of the new year assigned. Every day before that will be moved to the old year and gets the year of the old year attached.

So “YYYY” attaches with a 86% chance either the last 1-3 days of the old year or 1-3 day of the new year a wrong year. And it does this without regrets since it “says” “a week has a year attached and that is important days don’t really count.”

As i said It’s fine if you use it with a truly week based date like:
2022-W51

But what makes absolutely no sense is to use it in conjunction with “dd” or “MM” this produces just wrong dates around ending of year, beginning of the new year.

And this is what happened in my case (in the other thread).
the date i stumbled across was the 1st january 2022.

So according to the rules the first thursday of the year 2022 is the 6th of jan. So the week from the 3rd to the 9th of january is week 1 (w1) of the new year.

And the january 2nd and first get the year of 2021 assigned. Leading to the wrong timestamps for those days of 01.01.2021 and 02.01.2021 (european format: dd.MM.YYYY)

So while in the week based format just 3 days stay in the week and therefore have a wrong year assigned (but since they in the right week its just slightly off) this produces now timestamps that are 365ish days wrong.

By now it should be clear that “YYYY” belongs strictly to dateformat that is weekbased (has soemthing like “W17” in it) and can NEVER be mixed with “dd” or “MM”. It’s not unexpected or quirky it’s plain wrong. It just creates wrong timestamps with no use to think of.

If one does it he basically just activates a lottery that the last 3 days of a year the 29th-31st or the first 3 days of the new year 1-3. of january will get randomly a wrong year attached.

Again: If someone uses something like: YYYY-MM-dd they clearly don’t know what they are doing (like me for the last year) it just leads to corrupted data.

It’s quite disastrous that the autoguess feature guesses this wrong format.
I find this in the pulldown menues of this node and the similar node too: String to Date&Time. I don’t know that it is because i (unfortunately) used this the last year, or if this is for other useers the same.

What should be done about this?

  • obviously: If a user types in something YYYY only allow weekbased formats with something like “W52”. If the user types “dd” or “MM”. it should be blocked and the user alerted that this is wrong and corrupts the dates.

  • If someone with solid knowledge of java date&time formats could go through the nodes dealing with date and time and have look if everything is as it should be. Sometimes a bit quirky stuff goes on i have not found another error, but i found it a bit worrisome that the String to Date&Time node cuts of the seconds for example even if the format specifically included “HH:mm:ss” and the input data has seconds. They are all 00 but the minutes are all 00 too and don’t get cut off… This post is long enough.

For financial data or any time series data the timestamp is basically everything otherwise it’s like having wrong price data.

And what would be really good if in the future the descriptions of the nodes would include at the top in the help on what a node depends to work. Just that the person who wrote it includes a short 3 bulletpoints what this node uses: This node uses: java standard editon 8 all known problems of DateTimeFormatter lead to problems here. Or this node uses the libraries of Python XXX for numerical calculations… No excruciating details just 3 sentences what’s the fundament of the node. It’s right now a pretty opaque black box often and it doesn’t fell safe for me anymore. To be fair the Date&Time to String and vice versa did link to the java DateTime formatter page but other nodes often don’t have any explanation of what they are doing under the hood. … Sometimes the explanations even contain a warning about the week-based-year like in the Extract Date&Time node but this so important info is completely missing in the aforemetined nodes…

So i like Knime but this has been a bit of a sobering experience for me the last days…

Have a good one…

5 Likes

Hello @TotalDataLoss,

thanks for taking the time to dig into that problem and pointing it out. You are absolutely right: YYYY should not be used together with dd or MM.

I just wanted to give you an explanation how the “Guess data type and format” feature works and why this issue does not affect other KNIME users in general.
Guessing the exact format just from a date&time string is very difficult (rather impossible) as there are different formats fitting a particular string, e.g. “2022-10-05” could be “yyyy-MM-dd” or “yyyy-dd-MM”. By looking as a human on a string, you might say it must clearly be that particular format but while in one country it is more common to use “yyyy-MM-dd”, in another country it might be “yyyy-dd-MM”.
So, instead of guessing the format from the string, the String To Date&Time node has a pre-filled list of common date&time formats (the combobox where you can enter a format). When clicking the “Guess data type and format” button, the node (which uses Java under the hood) goes through each of the formats and tries to apply it to the string. If there is a mismatch, Java will complain. If not, the node guesses this format being the correct one.

So, how could it be that the node guesses a format for you combining “YYYY” with “dd”? By default, the pre-filled list of formats does not contain such a format. However, if a user enters and uses a custom format, the node will remember that (so that in future it is able to guess that format which might be commonly used one by that particular user and, thus, makes life much easier for him). That said, this format is guessed to you because you entered it previously yourself. In general this means that this format is not suggested to other KNIME users (unless they enter it manually themselves).

I would go with Scott and still not declare this as a bug in the node. Date&time is a very complex topic. In general, we don’t do any additional “format check” but trust on Java to decide if a format is valid or not and on the user to make sure a manually entered format is correct. Sadly, “YYYY-MM-dd” seems to be valid for Java and for a user it is in that case hard to know that this will turn out in a incorrect result.

Here are a few thougths from my side what I think we could do to improve:

  • We could add an explicit check for that particular combination of YYYY and MM etc. Though, there might be many formats that are actually invalid (or should not be used) but accepted by Java. It’s hard to check all these formats. So this approach might be a bit shortsighted.
  • We could make it more clear that the node just gives a guess and that a guessed format should always be double-checked.
  • We could change the behavior of the node so that manually entered formats are not used for guessing. I think that this feature is useful many users though.

I hope this post helps you understanding the issue. Happy to hear your thoughts and maybe other suggestions for improvements.

Best,
Simon

5 Likes

Hi Simon,
thanks for picking up on this. The philosophy of KNIME seems to be that it is an mere interface for java or another programming language. And if java does something it y must be right.

For me it is plain wrong when “YYYY” gets mixed with “MM” or “DD” - it not useful. It’s creates wrong data. For me it is a bug of the java DateTimeFormatter Edition 8. But you know my stance on this.

My problem is more now for transparency. If KNIME nodes are "simply a “pass-through” for every mistake that then might happen in the actual language executing it, what i wrote should really really being done for the nodes:

And what would be really good if in the future the descriptions of the nodes would include at the top in the help on what a node depends to work. Just that the person who wrote it includes a short 3 bulletpoints what this node uses: This node uses: java standard editon 8 all known problems of DateTimeFormatter lead to problems here. Or this node uses the libraries of Python XXX for numerical calculations…

Otherwise one willl end up maximizing errors: The errors form Java or every other language executing it and on top then the potential errors of the Knime-Layer. It would be so much more transparent if every nodes get on top what it needs to function properly.

On a personal note: The reason why i installed KNIME in the first place was that Panda produced erroneous data sets when data sets became large.

I thought more that KNIME could become the trusted solution for handling large sets of data.
The idea that a node might use internally Pandas something else not very reliable (just as an example) scares me… The whole idea was to get away from the (often buggish) libraries.

I am a bit old-school on this and very much focused on reliable data…
On the other hand testing is very time consuming so it would be good if one “knows what uses what” in Knime and is already oriented…

Greetings!

1 Like

The first part about better documentation is paramount for me, but here are some more random thoughts/questions about this:

  • Is there a way (including more hackerish ways) to find out what a node uses internally? Would be good to have something immediatley.
  • Do you have some insight why the seconds are beeing cut off? according to the specified format for the time part (HH:mm:ss) this is wrong too?
  • Since the guess feature just repeats what a user has typed in, it would bee good if i could delete values from it (like now the wrong “YYYY”)
  • For your thoughts according to warning the user: Yes that should definitely beeing done if a user types in “YYYY” and then adds “dd” or “MM” (but please) don 't just add this and don’t change the documentation… And the hint with the guessed format is Ok too…

Ok that’s it for now…
have a good one!

@TotalDataLoss, you rose very interesting topic. So far, I’ll recommend to connect @mlauber71 to the discussion. He is very knowledgeable person in terms of libraries KNIME can use. And also look at Date Extractor — NodePit
node. It automatically recognizes number of date formats and could be less frustrating.

2 Likes

@izaychik63: Hi thanks for your reply will check this out tomorrow. If you have connection to @mlauber71 bring him in here all for it…

Till later

Thank you for the compliments @izaychik63. I have made a few remarks about date and time variables in the other thread and provided some links:

Could you give an example. I think the seconds should be stored it might just be that they are not immediately visible. You could check using the Extract Date&Time Fields node.

What you could also do is just use Java code to create Date and Time variables and also manipulate them:

And then about the guessing and the week-based year. From your screenshot and what I tried I see the red warnings around Time and Date and the execution would not immediately give a result, though it would let me extract a time.

So the recommendation still stands to use extra caution with all things relating to date and time (and timezones), err on the save sind, read the documentation and test fringe cases. And if you transfer date and time variables from external sources do it as string or ‘readable’ numbers - or establish a single source of truth by using just one data storage like KNIME table, Parquet or H2 or SQLite databases (they all come with their own set of quirks concerning date and time variables)

2 Likes

Hi all,

I will just add a few answers to your questions @TotalDataLoss

This has already bin discussed in this topic.

There is currently no way to delete entries via the AP. You find them stored in your workspace in <workspace>/.metadata/knime/history_string_to_date_formats.txt. Just delete the lines you want to be removed.

As KNIME AP is open source, you could check the code base on GitHub. For the String to Date&Time node, you can find the code here: knime-base/org.knime.time/src/org/knime/time/node/convert/stringtodatetime at master · knime/knime-base · GitHub

Best,
Simon

3 Likes

Hi Simon,

There is currently no way to delete entries via the AP. You find them stored in your workspace in <workspace>/.metadata/knime/history_string_to_date_formats.txt. Just delete the lines you want to be removed.

This worked well…Thx.

As KNIME AP is open source, you could check the code base on GitHub. For the String to Date&Time node, you can find the code here: knime-base/org.knime.time/src/org/knime/time/node/convert/stringtodatetime at master · knime/knime-base · GitHub

Interesting link. As a non coder i would clearly prefer that the documentation of the node would contain the 3 most importatn bulletpoint of what is going on under the hood. But this was nevertheless interesting. But since i am not used to github to much, i was not able to find the “column expressions node” searching for javascript didn’t help neither…

But Thanks anyways, have a good one!

1 Like

Hi Mark always good to hear from you.
You can see this behaviour in the Workflow i posted in the other thread (in the string to date&time node). You are right, they are still there but one can not see them. Even tho the format HH:mm:ss specifically includes the seconds the java renderer truncates them, according to the link Simon provided this.
From your post on the other thread: some things i did already know. But i did learn 2 really interesting tricks. Your trick with the human readable dateformat: much more handy than unix timestamps in some cases! I adjusted it a bit: Year*1 Million. That still fits in integer and gives enough space down till HH. Great stuff, used it already in an error check for the workflow. The second trick i took from your workflow you posted: how to add the missing zero in front of string with the rules engine. I just thought about it today and and then i found it in your workflow: perfect. Thanks for your input!

The Java snippet looks very interesting too i will have a look at this when i have time at my hand (and more fit than now)

Have a good evening till next time!

1 Like

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.