Bug: "Execute failed: Failed to check for duplicate row IDs" in Joiner

Marlin · February 18, 2015, 10:58am

Problem 1: This error appears. Whatever this error is talking about, it's not talking about something in my area of responsibility. Dealing with row IDs should actually never be my responsibility at all, but I accepted the fact that it often is. The Joiner is different, though, as it already has a handling strategy. If I had to validate the strategy manually, that would make the strategy obsolete, and just be ridiculous. So, if this message is not about something I'm responsible for, that means it's talking about the Joiner itself: it informs me about some implementation error.

Problem 2: This error apparently waited somewhere for 18 hours until the Joiner was already wrapping up everything, popping up only then as if to mock me. I suppose nobody thought about this, as this error message seems to be one of those with a "should never happen" comment... well, it did happen.

This bug may be related to the Joiner filling up my temp folder with 60G of temporary files until the hard drive was full. According to itself, it was already wrapping things up at 99%, so maybe a full disk is handled correctly almost everywhere but not in some late part? Maybe it's right that there's an error message, but it's the wrong one?

Sadly I can't provide you with an error log. Writing to a log seems to be difficult without somewhere to put it...

thor · February 18, 2015, 12:48pm

So the real problem seems to be your full hard disk...

The reason the error pops up so late is that we cannot check for a duplicate immediately when a row ID is created, especially for very large tables, because we cannot keep all row IDs in memory. We check against the last few tens of thousands of rows or so, but the complete check is performed at the end. Having said that, the row ID duplicate checker also needs some disk space, and if you don't have any left then it may not work correctly any more.
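
A minimal sketch of the kind of two-stage check described above, for illustration only. It is not KNIME's actual implementation: the class name `RowIdChecker`, the window and chunk sizes, and the sort-and-merge approach for the final pass are all assumptions. It also assumes row IDs are strings without newlines.

```python
import heapq
import tempfile
from collections import deque


class DuplicateRowIdError(Exception):
    """Raised when the same row ID is seen twice."""


class RowIdChecker:
    """Hypothetical two-stage duplicate checker (not KNIME's real code)."""

    def __init__(self, window_size=50_000, chunk_size=100_000):
        self.window_size = window_size  # how many recent IDs stay in memory
        self.chunk_size = chunk_size    # how many IDs are buffered before spilling
        self._recent = deque()
        self._recent_set = set()
        self._buffer = []
        self._chunk_files = []

    def add(self, row_id):
        # Stage 1: cheap check against the most recent IDs only.
        if row_id in self._recent_set:
            raise DuplicateRowIdError(f"duplicate row ID (early check): {row_id}")
        self._recent.append(row_id)
        self._recent_set.add(row_id)
        if len(self._recent) > self.window_size:
            self._recent_set.discard(self._recent.popleft())
        # Remember the ID for the final pass; spill to disk when the buffer is full.
        self._buffer.append(row_id)
        if len(self._buffer) >= self.chunk_size:
            self._spill()

    def _spill(self):
        # Needs free disk space: each chunk is written sorted to a temp file.
        f = tempfile.TemporaryFile(mode="w+", encoding="utf-8")
        for rid in sorted(self._buffer):
            f.write(rid + "\n")
        f.seek(0)
        self._chunk_files.append(f)
        self._buffer = []

    def finish(self):
        # Stage 2: merge the sorted chunks; any duplicates end up adjacent.
        streams = [(line.rstrip("\n") for line in f) for f in self._chunk_files]
        streams.append(iter(sorted(self._buffer)))
        previous = None
        try:
            for rid in heapq.merge(*streams):
                if rid == previous:
                    raise DuplicateRowIdError(f"duplicate row ID (final check): {rid}")
                previous = rid
        finally:
            for f in self._chunk_files:
                f.close()
```

In this sketch a duplicate that falls outside the in-memory window is only caught in `finish()`, which is why the error can surface at the very end, and `_spill()` is the part that would fail on a full disk.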

Marlin · February 18, 2015, 1:51pm

Of course my full disk is a real problem, and of course it makes stuff fail in awkward ways. I'm not blaming KNIME for that. But that's no excuse for these two glaring design flaws:

1. If there is a strategy, make it work. You're appending garbage to identifiers at so many places, why not here?
2. Don't do vital checks in one of the most computationally expensive nodes at the very end. Oh, that would make it more expensive? Then let me point to flaw 1 for a way out...

thor · February 18, 2015, 8:54pm

I don't see a design flaw here. I have already pointed out why we cannot perform full duplicate row ID checks right when you add a new row. And I don't get your comment on the strategy.

Marlin · February 19, 2015, 2:27pm

Every node has some strategy to handle duplicate identifiers, be it row IDs or column names. Almost all strategies involve appending something to the identifier. There's the "(#1)" suffix, there's the "(Iter #50)" one, there's the combination of column names in pivots, the concatenation of the used method and possibly a star in GroupBys, the appending of a number to every row in a Loop End...
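
To make the suffix idea concrete, here is a rough sketch of resolving duplicates by appending "(#n)". The function name `uniquify` and the exact suffix format are assumptions for illustration, not what any particular node does internally.

```python
def uniquify(ids):
    """Return a list where repeated IDs get a ' (#n)' suffix (hypothetical helper)."""
    seen = set()    # every ID emitted so far
    counters = {}   # last suffix number tried per original ID
    result = []
    for rid in ids:
        candidate = rid
        n = counters.get(rid, 0)
        # Keep bumping the suffix until the candidate is unused.
        while candidate in seen:
            n += 1
            candidate = f"{rid} (#{n})"
        counters[rid] = n
        seen.add(candidate)
        result.append(candidate)
    return result
```

For example, `uniquify(["Row0", "Row1", "Row0", "Row0"])` gives `["Row0", "Row1", "Row0 (#1)", "Row0 (#2)"]`. Note that this version keeps every ID it has seen in memory.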

And the worst strategy is to fail - unless explicitly told to do so, as in a Loop End. You can get away with failing when the problem is in a column name, and if it can be changed directly. For example, I don't have a problem with the Column Aggregator doing it. But row IDs? Why would I care about row IDs unless I'm a database? The only time I cared about row IDs was either when I needed a second column that some node wouldn't allow, as an ad-hoc joining column in a Loop End (Column append) or whenever some node cared too much about it. In other words, as a hack. Many nodes even ignore it. For example, I can't use a GroupBy to just count row IDs, I have to use a dummy column. It would be nice to use the row ID column for something sometimes, but the handling strategies mess them up anyway, so regular columns are the better way.

Bottom line: row IDs are a flaw, but a bearable one most of the time. I don't expect you to change anything about them.

Now, the normal strategy of a Joiner when it comes to row IDs is concatenation. Which is fine because I don't care. Just whatever, get on with the important stuff: Joining. If there's some error in the Joining, ok, I understand that. If the environment is unsuitable, ok, that's my fault. But row IDs? There is no, never was, and never will be a reason to care about those. Warn me or whatever. You're warning about crap all the time. Unless we're talking about row IDs, which produce errors, because they are soo important.

Yes, I admit, the passive voice of this particular error message may have been the icing on the cake of my frustration. It says someone failed, as if implying the failing one is me, but without being willing to address me.

This is the wrong strategy. Failing is the wrong strategy. We people in the real world care about results. The results may be wrong, but wrong results we know about are better than an error that just tells us something was wrong somewhere.

Now I may have pissed you off a bit... but maybe I can help with a suggestion: Keep these brick walls. Nourish them. But only in the free version. As I said: heavy users will care. Yes, that's a nasty way of doing it, but at least it's something your users can't hack around as easily as around some of your other commercial additions.

agaunt · January 9, 2018, 9:52am

I ran into the same error message and I understand Marlin's frustration about the late error message. In my opinion, it should be possible in the Joiner node to allow for setting completely new row IDs, as in the Column Appender node. Then this check would be obsolete.
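
Roughly what I have in mind, as a sketch only: the function `join_with_fresh_ids` and the "Row<n>" naming are made up for illustration and have nothing to do with how the Joiner is actually implemented. The point is that IDs taken from a running counter cannot collide, so no duplicate check is needed.

```python
from itertools import count


def join_with_fresh_ids(left, right, key):
    """Inner-join two lists of dict rows and assign new sequential row IDs.

    Hypothetical helper: input row IDs are ignored entirely.
    """
    # Index the right table by join key.
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    counter = count()
    for lrow in left:
        for rrow in index.get(lrow[key], []):
            # IDs come from a running counter, so they are unique by construction.
            yield f"Row{next(counter)}", {**lrow, **rrow}
```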