Join that avoids "Missing" data cells

Hi Folks,

I am trying to join two tables, one of which has many empty cells while the other is fully populated.

I would like to set up one join that says to “match as many cells as are available”.

I.e., if I selected “match all columns”, it would fail to match anything for the first two rows because they are blank in column 2 or 3.

I could break out the table into every possible combination of columns and create separate joins for each scenario, but I would rather not do that, if possible.

Please check if left or right join works for you.

Hi,

Thanks for your reply. Left or right join isn’t relevant - that only determines which table is the master. I need a variable-column join: one that only joins on the columns where data is not missing.

Hi @Ilya2004,

If I understand you correctly, you want to join based on cells and not on columns. Or should both the cell position (column, row) and the cell content be considered?

Have you thought about first unpivoting the data? That gives you the option to join the data

a) based on the cell value or
b) based on the cell value and column name or
c) based on the cell value, column name and row number

Afterwards you can pivot your final result to see only the data which matches all your conditions.
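
Just to illustrate the mechanics, here is a minimal pandas sketch of option b), outside of KNIME; the tables and column names col1-col3 are made up for the example:

import pandas as pd

# Tiny hypothetical tables, just to show the mechanics.
left = pd.DataFrame({"col1": ["A", "A"], "col2": [None, "B"], "col3": [None, "C"]})
right = pd.DataFrame({"col1": ["A", "A"], "col2": ["B", "B"], "col3": ["C", "D"]})

def unpivot(df):
    # One row per (row id, column name, cell value), missing cells dropped.
    long = df.reset_index().rename(columns={"index": "row_id"})
    long = long.melt(id_vars="row_id", var_name="column", value_name="value")
    return long.dropna(subset=["value"])

# Option b) above: join on cell value and column name.
matched = unpivot(left).merge(unpivot(right), on=["column", "value"],
                              suffixes=("_left", "_right"))

# Pivot back so each (left row, right row) pair shows which columns matched.
result = matched.pivot_table(index=["row_id_left", "row_id_right"],
                             columns="column", values="value", aggfunc="first")
print(result)

In KNIME the same steps should be possible with Unpivoting, Joiner and Pivoting nodes, or inside a single Python Script node.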

In general your example does not make sense to me, because the result would in the end just be your first table with the blank values. Do you have additional information that you want to join, and what is the logic behind it? Maybe it would be helpful if you could provide a more illustrative example with input and expected output. Ideally, upload the example files to make it easier for the rest of the community to come up with suggestions or solutions.

BR

How are the missing values distributed? Your example implies column 1 is never empty. Is that really the case, or can column 1 be empty while columns 2 and 3 are not?

If the missing cells are ordered from left to right as your example implies, you can probably build a recursive loop that joins on one column at a time, where rows that have the current column empty are sent to the other output port.

If the order of missing values is “random”, then it gets more difficult. What about row order? Your example implies the two tables are presorted correctly. Is that the case? I assume not.

If the answer to both questions is no, you will have to write custom code in a Python snippet and, for each row in table A, iterate over table B and search for matches. But that won’t be fast or easy. What if you have AB in table one and both AB and ABC in table two? Should both rows join, or only the one with the most columns in common?
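
To make that row-by-row idea concrete, here is a rough Python sketch with plain pandas and hypothetical country/state/city key columns; a reference row is rejected if any field present on both sides disagrees, and otherwise scored by how many fields agree:

import pandas as pd

KEYS = ["country", "state", "city"]   # hypothetical key columns

def best_match(person, reference, keys=KEYS):
    # Return the reference row agreeing on the most fields present on both
    # sides; reject rows where a field present on both sides contradicts.
    best_row, best_score = None, -1
    for _, ref_row in reference.iterrows():
        score, compatible = 0, True
        for k in keys:
            if pd.isna(person[k]) or pd.isna(ref_row[k]):
                continue                  # missing on either side: skip this field
            if person[k] == ref_row[k]:
                score += 1                # present on both sides and equal
            else:
                compatible = False        # present on both sides but different
                break
        if compatible and score > best_score:
            best_row, best_score = ref_row, score   # ties keep the first candidate
    return best_row

pop = pd.DataFrame({"name": ["Ann", "Bob"],
                    "country": ["US", "US"],
                    "state": ["CA", None],
                    "city": [None, None]})
ref = pd.DataFrame({"country": ["US", "US"],
                    "state": ["CA", None],
                    "city": [None, None],
                    "income": [80000, 60000]})

for _, person in pop.iterrows():
    match = best_match(person, ref)
    print(person["name"], None if match is None else match["income"])

The tie question raised above (AB in one table, AB and ABC in the other) still needs a rule; this sketch simply keeps the first best-scoring candidate.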

Merge all columns into one and then do the join.
BR

I had that idea too, but “A” + missing value won’t join with “AB”.

Let me try to give a more specific example, maybe that will make it more clear:

Let’s say I have two tables. One has a list of individuals and some information about where they live. For some individuals I don’t know their full location, only the state or only the country. The other table has average incomes by country, state and city. I would like to estimate the income of each individual, and to be as accurate as possible I want to use all of the information that is available - so if their city is available, it will give me a better estimate than if I use country data. My goal is to do a join that considers as many fields as are available.

In my real data, there are 8 or more variables and they are not hierarchical like city and state, so it would not be practical to do individual joins for each combination of criteria.

Table 1: Population Being Considered

Table 2: Reference of Average Income by Geography

Hi @Ilya2004
I recommend splitting your tables into 3 parts:

part1) country, state and city available
part2) country and state available
part3) country available

Join the tables based on their key information and finally append them into one table.
This is in my opinion the simplest way.
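
For what it’s worth, a small pandas sketch of this split-join-append idea, using made-up country/state/city tables; each slice of the population is joined against the reference rows of matching granularity, and the three results are concatenated at the end:

import pandas as pd

# Hypothetical stand-ins for the two tables in the screenshots.
pop = pd.DataFrame({"name":    ["Ann", "Bob", "Cat"],
                    "country": ["US", "US", "US"],
                    "state":   ["CA", "CA", None],
                    "city":    ["LA", None, None]})
ref = pd.DataFrame({"country": ["US", "US", "US"],
                    "state":   ["CA", "CA", None],
                    "city":    ["LA", None, None],
                    "income":  [90000, 80000, 60000]})

parts = []

# part 1) country, state and city available
full = pop[pop[["country", "state", "city"]].notna().all(axis=1)]
parts.append(full.merge(ref, on=["country", "state", "city"], how="left"))

# part 2) country and state available, city missing
cs = pop[pop["city"].isna() & pop[["country", "state"]].notna().all(axis=1)]
ref_cs = ref[ref["city"].isna()].drop(columns="city")
parts.append(cs.merge(ref_cs, on=["country", "state"], how="left"))

# part 3) only country available
c = pop[pop["state"].isna() & pop["country"].notna()]
ref_c = ref[ref["state"].isna()].drop(columns=["state", "city"])
parts.append(c.merge(ref_c, on="country", how="left"))

# Append the three parts back into one table.
result = pd.concat(parts, ignore_index=True)
print(result[["name", "country", "state", "city", "income"]])

With a three-level geography this stays manageable; with 8 non-hierarchical columns the number of slices grows quickly, as noted below.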

BR

Yes, but the real table has 8 columns, and there is no pattern as to which data is available, so there could be up to 2^8 - 1 = 255 different combinations of available columns, potentially.

OK, I understand.
Did you check whether replacing missing values in your key columns with a keyword, e.g. ‘unknown’, is applicable? In that case the Joiner node should be able to handle all combinations.

How would that work?

For each table, create an ID column like ID = country + state + city + … and then join by ID.
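
A short pandas sketch of that ID idea, with hypothetical tables and "|" as the separator; missing key cells are first replaced with the placeholder "unknown" in both tables so that the concatenated IDs line up:

import pandas as pd

KEYS = ["country", "state", "city"]   # hypothetical key columns

pop = pd.DataFrame({"name": ["Ann", "Bob"],
                    "country": ["US", "US"],
                    "state": ["CA", None],
                    "city": [None, None]})
ref = pd.DataFrame({"country": ["US", "US"],
                    "state": ["CA", None],
                    "city": [None, None],
                    "income": [80000, 60000]})

def add_id(df, keys=KEYS, placeholder="unknown"):
    # Replace missing key cells with the placeholder, then glue them into one ID.
    df = df.copy()
    df[keys] = df[keys].fillna(placeholder)
    df["ID"] = df[keys].agg("|".join, axis=1)
    return df

joined = add_id(pop).merge(add_id(ref)[["ID", "income"]], on="ID", how="left")
print(joined[["name", "ID", "income"]])
# Ann -> US|CA|unknown -> 80000, Bob -> US|unknown|unknown -> 60000

This only lines up if the reference table marks its less specific rows with the same placeholder in the same positions; a value present on one side but missing on the other will still not match, as pointed out above.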

Hello:
" I would like to estimate the income of each individual, and I would want to be the most accurate to use all of the information that is available - so if their city is available, it will give me a better estimate than if I use country data. "

I interpret the above as meaning: please give a higher confidence level when more geographical information is available for a person. So, if we have country + state + city, that would have a higher confidence level than if we just had country + state, or just country.

Assuming that this is what you mean (?), I added a confidence level column to each table (called “conf”). The tables are sorted by country + state + city, as shown in the two attached spreadsheets, one for population and one for geography (the latter includes income).

Now as to which type of join to use, I used what I call a left join. If you want, you can have one result for the highest confidence level (conf = 3: country + state + city), another for conf = 2, and another for conf = 1.

I do not know KNIME well enough to do this in KNIME, but here are results from PostgreSQL and the basic code:

create table public.conf1 as
SELECT
r1.name as mname,
r1.conf as conf1, r1.country as co1, r1.state as st1, r1.city as ci1,
r2.conf, r2.country, r2.state, r2.city, r2.inc
FROM pop r1
LEFT JOIN geog2 r2
ON r1.conf || r1.country || r1.state || r1.city
= r2.conf || r2.country || r2.state || r2.city
-- where r1.conf = '3'
order by r1.conf desc;

Here is a result, for conf level = 3, exported from PostgreSQL into Excel. For clarity, I have included the duplicative fields from the left and the right tables, although a briefer, more-to-the-point report would leave out these fields.

conf1.xlsx (13.2 KB)

More culling could be done, so that one only gets the highest confidence levels. But this really depends on what the end user wants.

A more complete starting point would be for the end user to not only provide the two initial tables, as they have done, but to also provide a hypothetical table showing exactly what the hoped-for result table(s) would look like.

Perhaps someone else will show how to do the above PostgreSQL in KNIME?

Thank you,
Steve Elster

pop.xlsx (9.9 KB)

geog.xlsx (12.8 KB)

Hi,
I think that the best solution is to create a unique ID by concatenating all the variables; where they are null, you can replace them with a placeholder like “N/A”.

Missing join.knwf (702.9 KB)

Luca
