Have a given value in one column based on another columns value.

I have a dataframe that looks something like this.

Var1    Var2
123     ADH6
123     ADH6
123     ADH6
123     ?
56756   CYP4V2
56756   ?
56756   ?
rs98    NUP62
rs98    ?
rs98    NUP62
rs98    NUP62

Var1 and var2 should be consistent 1 to 1. So whenever you have a value in var1 you should have the correspondant value in Var2.
It’s more clear if you visualize.

Var1    Var2
123     ADH6
123     ADH6
123     ADH6
123     ADH6
56756   CYP4V2
56756   CYP4V2
56756   CYP4V2
rs98    NUP62
rs98    NUP62
rs98    NUP62
rs98    NUP62

The problem is that I have thousands of different values.

So I can’t create a rule engine for these specific cases, I need a rule that says something like: if Var2 is missing, then fill with the value of Var2 when same Var1.

Hi @RoyBatty296

Since the empty rows can be anyhere and a next/previous row logic wouldn’t not be very solid, I would opt to go for a group loop based on Var1. Then you can, per group, supplement the nulls with the most frequent value of Var2.

It looks like this:

Initiate the loop based on Var1

Supplement the missing values:

Loop end:

Hope this provides some inspiration!

2 Likes

@RoyBatty296 ,
[Edit: Just saw @ArjenEX 's post. This is basically what is mentioned as the alternative)

If you always have at least one correct value for each item, and you are able to order them by item, with the “present” value appearing first in each case then you could just use the Missing Value node by itself and specify “previous value” as the population method

That should be able to do it in a single hit, but as I say, you will need the ordering correct.
image
image
image

2 Likes

Great approach as well. Indeed falls with the proper sorting to make sure that all nulls are listed below the available values per group in case the first value of the group is null.

1 Like

Hello there,

I would go for dictionary table creation (using GroupBy or Duplicate Row Filter node) followed by Cell Replacer to bring values in new column and finish it with Column Filter node.

Approach with loop and missing value could be slow in case of large data volume and sort plus missing value approach assumes there is no group with all missing values.

Br,
Ivan