Select data based on reference data

Hello KNIMErs,

My today’s challenge refers to selecting subset of data. The simplified example could be described as below.

My sample dataset looks as follows:
image

I need to select subset of data based on combination of:

  • List of reference row unique identifiers which is limited to ID_2 and ID_3
    image

  • Reference list of strings that is somehow reflected in the names of columns. For the purpose of the example, reference string ‘A’ is reflected in the column names ‘A*’ where * is a wildcard and reference string ‘B’ is reflected in the column names ‘B*’ where * is a wildcard.

Sample data could be found in the attached workflow (two tables: data set and reference IDs and strings)
Data_selection.knwf (74.3 KB)

As a result, I would like to keep data for ID_2 row / all ‘A*’ columns and ID_3 row / all ‘B*’ columns and present this as a list:
image

The real dataset contains thousands of rows and hundreds of columns, and reference list could contain thousands of rows. Thus probably loops will be quite slow in execution.
Any suggestions for solution?

Happy KNIMEing!

Although I do not quite understand the part of renaming the columns, I think the other part of the requirements could be solved like this without use of loops.

https://hub.knime.com/s/f-1vBMaXeErQAsOU

4 Likes

First appearance in this forum and such a masterpiece! Welcome @haozi04

1 Like

Hi @ActionAndi Thank you! You flatter me! I guess I should retire at the top of my game. :rofl:

I love how people are willing to share experience and help each other here. Such a great community.

2 Likes

Welcome to the Forum @haozi04

You proposal is great - thank you.
However… :slight_smile:

  1. Renaming column means that reference string ‘A’ stands for any column name that starts with ‘A’ and then column names vary. Please thing about ‘Index’ as reference string and ‘Index’, ‘Index - West’, and ‘Index - East’. Thus, what you have proposed is very correct.

  2. I have adapted your solution to real dataset and faced two issues:

  • The outcome should have 5 columns in total. However if uniqe identifier refers to both reference strings (‘A’ and ‘B’)
    image

    then the number of outcome columns increases.

  • If dataset contains empty data field, then it is not recognized for the outcome. Thus, the total number of columns in the sample outcome is 8 (not 9). But let me repeat: the outcome should contains 5 columns only. Unfortunatelly, there is not much on the Forum on replacing elements of collection :frowning:

Attaching proposed workflow with updated sample data for potential further modifications
Data_selection_for_Forum.knwf (88.6 KB)

Hello! This dissociation between the reference row info and the outcome showed made it complicated. In the reference, there wasn’t any indication to associate ID_2 with B*, and ID_3 with A*, but in the outcome, the column names are not heading that direction.

Any chance you’d want to compromise on the outcome’s column names to rename them to anything? Otherwise, you’re gonna need additional info from the reference table.

2 Likes

I think we still have some unclear relations here regarding the renaming and column counts, like @badger101 has pointed out.

I read the post again and again, maybe better if you could verify my guess below:

  1. The dataset is exactly 5 columns: 1x “ID” column and rest 4x are columns that start with a certain string to be used. There are EXACTLY 2 such strings, in this case “A” and “B”.
  2. In case that the ID “ID_3” have a column reference “B” - we should first take the corresponding cell in “B*”, namely “wewgg546y” and “453ht”
  3. And then we should also take the column name of the above matched cells, namely “B_1” and “B_2”
  4. This makes a total captures of 4 strings. And together with the ID, that’s how we should always have 5 columns

If you could check if this is exactly what you need, then maybe we could find something interesting to work😉

Regarding the missing value, this could be simply solved by add a Missing Value node to replace all missing values with a string, e.g. “empty”, and then after all the work, replace them back with missing values by using a String Manipulation (Multi Column) node:

toNull(replace($$CURRENTCOLUMN$$, “EMPTY”, “”))

3 Likes

@badger101 & @haozi04 modified sample data could be found in the uploaded ‘Data_selection_for_Forum.knwf’ workflow.

Let me confirm that I would like to get the outcome containing 5 columns.
Also please give me some time to move forward with ideas @haozi04 already shared.
Thank you.

1 Like

@badger101 & @haozi04 Thank you again for your support.
Finally, I have used part of haozi04’s proposal, then step back, and re-approached the challenge from the different angle:) Thanks to this, I have resolved the challenge and completed my workflow, which generates what I expect.

Part of my workflow that overcomes the challenge looks like this:

haozi04’s proposal was direct inspiration for the upper branch.
The lower part is a result of settping back and re-approaching.
I’m marking haozi04’s proposal as the solution to reward him in a special way.

Sorry for uploading graphic only, but the workflow contains company data.

Happy KNIMEing!

2 Likes

So glad that you’ve managed this complicated challenge and I’m happy to be helpful. Great job @Kazimierz !

1 Like