All possible combinations of columns

Hi dear KNIMErs,

I have a problem and maybe you can give me some tips to solve it.
I have 7 columns: A, B, C, D, E, F and a Cluster column.
The cluster is always present, but the other columns contain a lot of missing values.
What I want to know is which combination of filled columns (i.e., columns with no missing values) is primarily represented in each cluster. Extra points if I can get an actual number, or the 3 best combinations that are primarily filled.

To make an example
[image: example table]

Here the primary combination would be A, F, and this combination represents 100% of the rows in cluster_1.

If we adjust the example
[image: adjusted example table]

the primary combination would be A, D, F, and this combination represents 57% of the rows in cluster_1.
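
In pandas terms, the combination of a row is just the tuple of column names that are not missing. A minimal sketch with one hypothetical row (the actual data is in the images above):

import pandas as pd

# Hypothetical row: A, D and F are filled, the rest are missing,
# so this row's combination is ('A', 'D', 'F')
row = pd.Series({'A': 1.0, 'B': None, 'C': None, 'D': 2.0, 'E': None, 'F': 3.0})
combination = tuple(col for col in row.index if pd.notna(row[col]))
print(combination)  # ('A', 'D', 'F')

Counting how often each such tuple occurs within a cluster and dividing by the cluster's row count gives the percentages above.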

I'm currently working with 2 million rows, but this is a one-time thing, so performance is not that big of an issue. It shouldn't take days, though…

Best regards,

Paul

Hello @goodvirus
I'm just guessing, but aiming to keep it simple as an approach…
What about duplicating the Cluster column and using a 'GroupBy' node, grouping by all available columns, with $Unique count(duplicate_cluster)$ as the aggregation?

Then you can calculate the combination percentage per cluster; after that, a 'Top k Selector' and so on…
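
In pandas, that percentage-plus-top-k step would look roughly like this (a sketch; the combos table with columns Cluster, filled_columns and count is a hypothetical intermediate, e.g. the output of the grouping step):

import pandas as pd

# Hypothetical intermediate: one row per (Cluster, filled_columns) pair with its count
combos = pd.DataFrame({
    'Cluster': ['cluster_1', 'cluster_1', 'cluster_2'],
    'filled_columns': [('A', 'D', 'F'), ('A', 'F'), ('A', 'F')],
    'count': [4, 3, 10],
})

# Percentage of each combination within its cluster
combos['percentage'] = combos['count'] / combos.groupby('Cluster')['count'].transform('sum') * 100

# Equivalent of the 'Top k Selector': the 3 most frequent combinations per cluster
top3 = combos.sort_values('count', ascending=False).groupby('Cluster').head(3)
print(top3)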

I hope this brainstorming can help in your challenge.
BR

Hi gonhaddock,

to be honest I didn't get it to work in KNIME, so I just used a Python script to do it.
If someone has the same problem:

import pandas as pd

df = pd.read_csv('file.csv', sep=";")

def get_filled_columns(row):
    # The combination of a row: the sorted tuple of non-missing column names
    return tuple(sorted(col for col in row.index if pd.notna(row[col])))

# Exclude the 'Cluster' column before computing the fill pattern
df_without_cluster = df.drop('Cluster', axis=1)
df_without_cluster['filled_columns'] = df_without_cluster.apply(get_filled_columns, axis=1)

# Count how often each combination occurs per cluster
cluster_combinations = df[['Cluster']].join(df_without_cluster['filled_columns']).groupby(['Cluster', 'filled_columns']).size().reset_index(name='count')

# Keep the 3 most frequent combinations per cluster
top_combinations = cluster_combinations.groupby('Cluster').apply(lambda x: x.nlargest(3, 'count')).reset_index(drop=True)

# Express each count as a percentage of the cluster's total row count
total_cluster_counts = df['Cluster'].value_counts().to_dict()
top_combinations['percentage'] = top_combinations.apply(lambda x: x['count'] / total_cluster_counts[x['Cluster']] * 100, axis=1)


And it works fast enough (about 30 seconds).
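
If the row-wise apply ever became a bottleneck, the pattern computation can also be vectorized (a sketch using the common notna()/dot() string-join idiom; same file name and separator as above):

import pandas as pd

df = pd.read_csv('file.csv', sep=";")
feature_cols = [c for c in df.columns if c != 'Cluster']

# Boolean mask of filled cells; dot() with the column names concatenates
# the names of the filled (True) columns into one pattern string per row
mask = df[feature_cols].notna()
df['filled_columns'] = mask.dot(mask.columns + ',').str.rstrip(',')

counts = df.groupby(['Cluster', 'filled_columns']).size().reset_index(name='count')
counts['percentage'] = counts['count'] / counts.groupby('Cluster')['count'].transform('sum') * 100
top3 = counts.sort_values('count', ascending=False).groupby('Cluster').head(3)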
