How to filter columns using flow variables?

gcincilla · December 2, 2014, 11:12am

Hi guys,

I'm facing a problem while doing something straightforward in theory. I want to filter out from a certain data table one column which name is defined in a flow variable using the column filter node.

To do this I'm manually selecting a dummy column to exclude, in the column filter tab of the node, and then I select the flow variable with the name of the real column to exclude in the excluded_name → 0 drop down menu of the flow variables tab. Anyway this does not filter out any column from my data table. I'm quite sure I'm doing something wrong. Can you please point me out what it is? An example workflow is attached.

Thanks,

Gio

test_colfilt_flowvar.zip

Marlin · December 2, 2014, 11:28am

I never got filtering by flow variable to work this way. Instead, I use the Wildcard/Regex Selection option where I use a flow variable as the pattern in the field name_pattern/pattern. To select multiple Columns select Regular Expression and concat the names with a "|" in between (GroupBy can do that for you), and to invert the selection use a Splitter instead.

Marlin · December 2, 2014, 1:04pm

I wasn't happy with not being able to get it to work without regular expressions. After some testing it seems both the content of the flow variable and the name of the flow variable have to be the name of the selected column.

I'd say it's a bug. Not a major one, though, because even without, this way to filter columns is just awkward.

gcincilla · December 2, 2014, 3:40pm

Hi Marlin,

Thank you for you suggestion. I already managed to solve the problem using a regex in a column splitter node, as you suggested in your first answer. Anyway it seems strange to me that it is not possible doing that in the original way I meant to do it. Can it be a BUG?

To what deals with the fact that “the flow variable and the name of the flow variable have to be the name of the selected column” I cannot manage make it work in that way. Moreover that would be useless as normally you want to use flow variables to generalize part of the workflow.

Thank you anyway,

Gio

Marlin · December 3, 2014, 7:33am

Well I just played around yesterday and that seemed to be the pattern. I just tried it again and now it seems I may "only" have triggered some strange interaction between this and the Enforce inclusion/Enforce exclusion.

My newest guess is that flow variables as inclusion seem to work as an "also, include ..." if exclusion is forced, and as a replacement if inclusion is forced, while variables as exclusion seem to do nothing if inclusion is forced and to push the column they replaced to the inclusion list while not affecting the column they name. But I may be wrong again. In short: It's a mess.

From what I gather from the forum (this question seems to be a recurring one), the reason is simply that flow variables, like flow control nodes, are a later addition with some inherent flaws. On of them is this inability to work with arrays in a sensible way. The need to seed the initial column lists is the first major cue. Of course now Workflows may depend on this feature with its flaws, so removing it or changing it may come with a cost... and there are bigger issues to solve first, as replacing by pattern works perfectly.

gcincilla · December 3, 2014, 10:02am

OK Marlin, thank you so much for your effort in trying understand the node behavior. Now it's a bit more clear.

I agree with you: this is not a big issue as it can be circumvented with the workaround you suggested in your first reply. Anyway it would be good solve it in the future.

Cheers

thor · December 3, 2014, 10:04am

I don't think it's a mess, controlling the columns works exactly as expected. If you set "Enforce inclusion" only the list of included columns is used during execute. Therefore, changes to the exclude list don't have any effect. If you override one of the included columns then the output changes accordingly. The opposite is true for "Enforce exclusion": the include list is completely ignored, only the exclude list matters. All columns not present in the exclude list will end up in the output table.

Marlin · December 3, 2014, 10:32am

If what I think the rules might be (see above) actually are - which I am not certain of - you are only partly right, thor. Inclusion and exclusion are still handled differently, which, at least for me, is absolutely unexpected. The outcomes are heavily biased towards inclusion of columns, and that might even be the case without flow variables.

Is that a good choice? Yes, because throwing away data can easily be a mistake. And no, if your other nodes depend on your specs.

aborg · December 3, 2014, 10:53am

I also think that this is consistent and see no huge problems there, but it is a bit inconvinient to make the other (not forced side) empty during configuration to avoid the warning about columns both in the included and excluded list.

thor · December 3, 2014, 11:36am

I don't see why they are handled differently. For "Enforce inclusion" all columns not in the include list are excluded. For "Enforce exclusion" all columns not in the exclude list are included. This is absolutely symmetric.

Marlin · December 3, 2014, 12:04pm

Here's my testing flow. If both lists were handled symmetrically, both output lists should have the same number of columns. But at least in my instance, with Force inclusion you can change the current include list with flow variables, but never filter more or less elements than you originally chose. With Force exclusion, almost anything will push columns to be included, so you can never exclude more, but easily filter less columns than originally. Which, in my eyes, is a bit counterintuitive. Forcing exclusion so you can include...

colfilterbyflowvar.zip

SOH979 · June 3, 2015, 4:28pm

I'm also finding it troublesome to get this to work when putting flowvariables into the excluded_names array and having the enforce exclusions checked.

Cheers,

Steve.

ematton · July 2, 2015, 11:56am

^2

Working a couple of months with Knime, following the example workflow above I cannot even get mine to work.

Basically, I have a table (let's say table 1) column containing some names of columns of another table (let's say table 2). I want to exlude all the columns of table 2 that are listed in the column of table 1. I prefer not to do this manually, because the values in and size of the table 1 column are dynamic (depending on earlier calculations).

This is not such an extravagant thing to ask from an analysis platform, right?

So what I've done is extracted the values of table 1 column into flow variables (where the actual "columns to exclude" names are in the flow variable values, the variable names are set to the rowIDs - from the comments above I can see this could be the problem but I have no clue how to alter the names). At that point I expected to be able to import these variables in the column filter node to exclude the columns in table 2 dynamically, which apparently doesn't simply work that way. That's where I lost all sense of logic in this matter and don't know how to proceed:

- how to set the exlude/include lists and enforcing options

- where to enter the flow variables (in the "exlude names" dropdown doesn't give the expected results)

- do I need to specify the number of columns to exclude (is this "array size"?) in yet another variable (as this is also dynamic)?

I'm getting the feeling I'm either missing a vital point here; otherwise this is exactly the kind of trouble that would bother me in continuing to use KNIME. In R this whole issue would be a one-liner...

ematton · July 2, 2015, 5:14pm

A classic case of overcomplicating things, I'm afraid: no need for flow variables after all in my example.

A few filters, rowID replacement, transposing and using a reference column filter do the trick.

Takeaway of the day: if you can get column filter specifications in data format instead of flow variables; your life will improve :)

petermeissner · June 22, 2020, 1:05pm

Exclusion by FlowVariable seems to work just fine … except its all but clear why I have to use the second entry aka entry-“1” to make it work.