[bug or feature ?] Row Filter - Specs remain of original, not filtered range

docminus2 · October 14, 2020, 12:02pm

I noticed that the spec of a column didn’t change as I expected after a Row Filter on, e.g., a number value.
E.g.
A table with an Int column with values, let’s say, 1 - 3000. Filter rows by max value 250.
You (well, I at least) would expect the new spec to be 1 - 250. But it still is 1 - 3000.

I vaguely remember that Knime uses wrappers and all kinds of stuff where I imagine stuff to be present that I don’t see. Though in this case it doesn’t feel intuitive.

I tested the Cache node, no difference. I had to duplicate the column, then it was OK. The use case for me was the Color Manager that follows, which would otherwise give me an unwanted color range.
So, my question is: is this intentional behavior?

See the workflow example I made as a demo, in case my explanation isn’t clear enough?
https://hub.knime.com/docminus2/spaces/Public/latest/row%20filter%20bug%20or%20feature

s.roughley · October 14, 2020, 12:13pm

The column domain (which is what you are referring to) isn’t recalculated on every node - I assume because it takes time/processor power/memory to do. Most of the time that’s not much of an issue, but for filters/splitters it is annoying. You can force the recalculation using the Domain Calculator node:
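As a rough illustration of the behaviour described here, the following is a toy model in plain Python. It is not KNIME API: `Column`, `row_filter` and `domain_calculator` are hypothetical names made up for this sketch, and the domain is reduced to a simple (lower, upper) pair.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Column:
    """Toy stand-in for a table column: values plus (lower, upper) domain bounds."""
    values: List[int]
    domain: Tuple[int, int]


def row_filter(col: Column, max_value: int) -> Column:
    # Drops rows above max_value but passes the domain through unchanged,
    # mirroring the behaviour discussed in this thread.
    kept = [v for v in col.values if v <= max_value]
    return Column(kept, col.domain)


def domain_calculator(col: Column) -> Column:
    # Scans the data and resets the bounds to the actual min/max.
    # Assumes the column is non-empty.
    return Column(col.values, (min(col.values), max(col.values)))


table = Column(list(range(1, 3001)), (1, 3000))
filtered = row_filter(table, 250)
recalculated = domain_calculator(filtered)

print(filtered.domain)      # (1, 3000)
print(recalculated.domain)  # (1, 250)
```

The filter only touches the rows, so the bounds stay at 1 - 3000 until something scans the data again.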

Incidentally, if you find the default 60 values for String column domains frustrating then there is a fix here:

Steve

Luca_Italy · October 14, 2020, 12:37pm

I’ve converted 3000 integers to strings and, using a Domain Calculator (with the limit of 60 values unchecked), I can get all of them in a Color Manager without using the fix.

s.roughley · October 14, 2020, 12:53pm

Yes, that works well as an alternative for specific instances.

Steve

ipazin · October 14, 2020, 2:14pm

Poll: Bug or Feature? (0 voters)

docminus2 · October 14, 2020, 4:15pm

Interesting vote; before I vote, is there a reason for this to be a feature that I don’t see? Then there is perhaps no need to change this.
There are work-arounds as described above (I, for example, duplicate the column), so is it more a question of being sufficiently aware of this?

Also, thanks for the answers so far.

Luca_Italy · October 15, 2020, 7:24am

Well, think about it like a “data lineage”: you can “see” the original range of values before any kind of manipulation, which should be useful in some cases. Whenever you need to “see” the actual range, no problem, use a Domain Calculator.

But… rethinking it… maybe this implementation is counterintuitive. I mean, I want to see the actual range of values independently of the original ones, without being forced to “refresh” the range with a Domain Calculator… mmmm… the default behaviour should be the actual range, with the possibility of a pseudo Domain Calculator to see the original range.

Luca

wiswedel · October 15, 2020, 8:46am

It was a design decision to not change the domain of a column when you filter out rows. So the contract is: There is no value in the column that violates the domain information. However, the domain may be too ‘wide’ and may contain more values than present in the column or have bounds that are smaller/larger than the minimum/maximum found in the column.

One of the use cases is… a predictor node spits out class probabilities and sets the range in the column domain to [0, 1], though the actual probabilities for the (current) data are never 0 or 1. In a downstream scatterplot node we would use the domain value to initialize the axis range and plot the data over the possible domain, rather than the actual range of the data (e.g. [0.3, 0.7]). If passing the data through a filter node would break the domain we would not be able to support this case.

Put differently: It’s very easy to determine a strict domain (see above, Domain Calculator) but it’s hard to restore the original domain.
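To make the contract concrete, here is a minimal sketch in plain Python, not KNIME code. The helper names `satisfies_contract` and `strict_domain` and the sample probabilities are assumptions for illustration only; the [0, 1] and [0.3, 0.7] ranges are the ones from the example above.

```python
from typing import Sequence, Tuple

Domain = Tuple[float, float]


def satisfies_contract(values: Sequence[float], domain: Domain) -> bool:
    # The contract: no value lies outside the declared bounds.
    # The bounds are allowed to be wider than the data.
    lower, upper = domain
    return all(lower <= v <= upper for v in values)


def strict_domain(values: Sequence[float]) -> Domain:
    # The tightest bounds that still satisfy the contract.
    # Always recoverable from the data alone (non-empty input assumed).
    return (min(values), max(values))


# Hypothetical class probabilities from a predictor.
probabilities = [0.3, 0.45, 0.7]
declared = (0.0, 1.0)  # set by the predictor, not derived from the data

assert satisfies_contract(probabilities, declared)

axis_range = declared                 # a plot can use the possible range
tight = strict_domain(probabilities)  # or the observed one, computed on demand

print(axis_range)  # (0.0, 1.0)
print(tight)       # (0.3, 0.7)
```

Going from `declared` to `tight` only needs the data. Going the other way is not possible, because nothing in the values says the range was meant to be [0, 1].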

Hope this helps!
– Bernd

docminus2 · October 15, 2020, 12:51pm

I guess this summarizes as the solution at the same time.
Thanks.

For the way I work I would still consider this as counter-intuitive if there is nothing to point this out in an obvious way.

s.roughley · October 15, 2020, 4:48pm

Thanks Bernd - that makes a lot of sense (or to put it another way, “Now, why didn’t I think of that?”!)

How about an option in the node dialog to recalculate the domain?

Steve

system · October 22, 2020, 4:48pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.