Potential bug in the Joiner node

Hello,

Since version 4.6.0, KNIME’s changelog has announced enhancement “AP-17553: Allow keeping RowIDs when joining”. This appears to have been done in the context of this forum topic

Unfortunately, it appears that, despite the new “Keep row keys” option in the Joiner node, the recent Joiner node still forces you to concatenate row identifiers even if the row identifiers continue to ensure uniqueness in the joined set.

Let’s take an example. You want to add a column from another data set (hereafter named the “right set”) to an original set (hereafter named the “left set”), based on an attribute column in the original set, which happens to uniquely identify the rows in the right set. This join is equivalent to a simple left lookup, which will and should not change the number of rows in the left set.

Let’s also suppose that the cardinality of this attribute column (which allows us to include additional information from the right set) between both sets is “many to one”, in other words, for one row in the right set, you have many rows in the left set.

If you try to perform a LEFT OUTER JOIN using Joiner node (ticking “Matching rows” and “Left unmatched rows”) while matching on the said attribute column, Joiner forces you to tick the “Concatenate original keys” option. If you try to tick the “Keep row keys” option, Joiner refuses to be executed by throwing an error beforehand, such as: “Cannot reuse input row keys for output. The Join criteria do not explicitly ensure the equality of row keys: the output table contains matched rows but row keys are not among the matched the join columns”.

This error message is confusing because when I ask Joiner to split the results into 3 sets (matched, left unmatched, right unmatched), the right unmatched rows set has zero observations. This confirms that what happend was a LEFT OUTER JOIN. Please note that the matching criteria did not actually specify the Row Identifiers in the left set but only an attribute column therein.

If you use the now deprecated Joiner node to perform the same transaction, the row keys remain indeed unchanged in the joined set and they continue to ensure uniqueness of the rows in that set.

Given the behaviour of the deprecated Joiner node, why does the new Joiner node still force the concatenation of row identifiers ? Could this be due to a bug ?

Kind regards,
Georges

Hi @Geo -

Thanks for the detailed feedback. I’ve asked internally about this - let me see what I can find out.

1 Like

Hi @Geo,

Thanks for the feedback. In principle you’re right: In the scenario you describe we could keep the row keys, because the attribute you use for joining uniquely identifies the row in the right set (like a foreign key).

Unfortunately, the uniqueness of the attribute is something that the joiner currently does not see and we decided not to compute it for performance reasons. Since we cannot safely assume it, the “Keep Row Keys” option is sometimes overly cautious and disabled even though it could be enabled. Maybe using the Cell Replacer would be a good alternative for you?

As you probably see, row key clashes could occur without the attribute being unique. And at the moment, there’s nothing that prevents you from adding a row to the right set that destroys the uniqueness of the attribute you’re joining on in which case there could be more than one join partner for a row in the left set and simply reusing the left row key would lead to row key clashes.

As for the behavior of the deprecated Joiner, I’m also getting concatenated row keys. I’ve prepared a little example workflow: Keeping Row Keys Example – KNIME Hub

Hope that helps,
Carl

Hi @CarlWitt

Thank you for posting the examples, which clears things up.

When you define RowID in both tables with “keep row keys” and match on RowIDs instead of the columns, then no concatenating takes place, which is desired behaviour. Now when matching on the columns instead, it always concatenates the row keys, even with the Joiner (deprecated) node.

Right now, in the new Joiner node, the error is thrown before it can be executed. Does this not kind of defeat the relevance of the Keep row keys option because it effectively only works when matching is done on RowIDs ?

Would it not be more useful to throw a warning instead informing the user that this choice (i.e. “keep row keys” with matching on anything else than row keys) could negatively impact performance? This would then lead to an error after execution if uniqueness could not be ensured after joining.

Kind regards
Geo

Hi @Geo,

Following the discussion, I tend to agree that this does appear to make the option to “keep row keys” almost superfluous. Given this, could a component assist by attempting to reset the ROWID following the join? :slight_smile:

image

image
image

or taking from the Right…
image

image

2 Likes

Hi takbb,

Thank you for the elegant workaround suggestion. My current workaround is to save the RowID into a column before joining, then to join, thereafter to set the RowID again:
RowID: creates a rowid column
→ Joiner
→ RowID: consumes the rowid column and checks for duplicates

Kind regards
Geo

1 Like

Hello,

This being said, a workaround is not a viable solution to the actual issue. Right now, it appears that the user has the choice to keep the row keys. However, this choice is de facto inapplicable in most cases because there is a risk that the uniqueness check could impact performance.

Therefore my suggestion: give the choice back to the user.

Would it not be more useful to throw a warning instead informing the user that this choice (i.e. “keep row keys” with matching on anything else than row keys) could negatively impact performance? This would then lead to an error after execution if uniqueness could not be ensured after joining.

Kind regards
Geo

Hi @Geo,

The convention with KNIME is that a node that is not configured properly cannot be executed, that’s what we followed here.

I remember we discussed this whole topic when implementing the feature and we opted for the more restrictive way since it’s easier for the power users to work around an inconvenience than for the less savvy users to deal with failures of a seemingly properly configured joiner.

For instance, if we allow to enable keeping row keys in uncertain situations, we could display a warning message on the node that the computation might fail if the join attributes don’t form a foreign key. Although I clearly see how this would be convenient for our power users, it is less user friendly for the less savvy users.

Thanks again for your feedback! Going forward, I’d rather turn this feedback into a discussion on integrity constraints, knowledge about column domains, etc. that would allow the Joiner to figure out the correct answer, rather than adding a quick fix to the Joiner.

Kind regards,
Carl

4 Likes

Hi @CarlWitt

I fully agree with your points and the suggested way forward. :+1:

Kind regards
Geo