Filterring out columns

Hello KNIMErs,

I’m facing a challenge that I hoped shouldn’t be a challenge at all :blush:

I have two input tables:
The first (sample) table with header row and data below. Header might contain ‘ #’ string or not
image

The second (sample) table with reference header row. No ‘ #’ string here
image

I would like to filter out some columns and keep those columns from 1st table which have headers ‘similar’ to reference header row, like below
image

Of course, the full list of columns is longer then in the above example, and the set of reference columns is dynamic (it changes each time the workflow is executed).

How to do this?

I have tried several ways, but nothing that works for me so far.

Happy KNIMEing,

Kaz

Hi @Kazimierz

Did you try the Reference Column Filter ?
gr. Hans

1 Like

Hi @HansS
Yes, I have tried Reference Column Filter.
The think is that Reference Column Filter (and Reference Row Filter, too) finds exact matches. Thus, it matches ‘column’ columns, but doesn’t match ‘header’ columns. In case of ‘header’ the real match refers to ‘header*’ with the wildcard and I don’t know how to apply wildcard in any reference node.
Thank you,
Kaz

Hello @Kazimierz,

you are right. Reference filters are for exact matching. What you can do is prepare regular expression based on your column names and feed it as flow variable to Column Filter node where you can filter based on Wildcard/Regex Selection.

For regEx preparation you can use multi node approach or Column Expressions as a one node solution. Here is workflow example as well.

SpecialHeaderFiltering.knwf (38.3 KB)

Hope this helps!

Br,
Ivan

2 Likes

We are close :blush: @ipazin

I have executed your workflow on sample data and it worked like a charm.

Then, I have used real data and the workflow still worked, however I have discovered that variety of column names is bigger then expected. There are the following reference headers in your workflow:

  • column
  • header
  • abc
    and regular expressions refer to ‘column*’, ‘header*’ and ‘abc*’.

Unfortunately, there are columns named ‘column’, columns named ‘column #…’, and columns named ‘column something’. The expected matches are ‘column’ and ‘column #*’, while ‘column something’ should be omitted.

So, how to modify your workflow to get regular expressions that identify:
column or column #* where * stands for wildcard,
header or header #* where * stands for wildcard,
abc or abc #* where * stands for wildcard?

Thank you @ipazin for your support so far,
Kaz

1 Like

Hello @Kazimierz,

try with this expression in String Manipulation node:
join("(",$Column Header$,")","( #|$).*")

Br,
Ivan

2 Likes

Perfect! That’s it @ipazin

Could you please explain the meaning of particular elements of this expression? Or am I asking too much?

Thank you,
Kaz

1 Like

Hello @Kazimierz,

a bit too much as I’m not good at explaining regular expressions :sweat_smile:

In short it does what you are looking for. List of words in OR expression followed by either space and # char or end of string/line ($). You can check this web page for playing with and learning regular expressions:

Br,
Ivan

2 Likes

Thank yuo @ipazin
Have a good week ahead,
Kaz

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.