Porting the KNIME One to Many node to Python?

Background: I’m working on some Kaggle challenges.

Problem: The One to Many node in KNIME labels columns differently than the one-hot encoder in Kaggle’s Python/Jupyter notebooks.

As a result, the models I built and trained in KNIME don’t work correctly there.

What’s the best way to address this issue? I’m slowly picking up Python, so the KNIME environment is easier for me to work in.

Hi,
are you sure that there’s a difference? I just checked with a simple example and it’s very similar (KNIME uses integer instead of double values):
Table Creator:
image

import knime.scripting.io as knio
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Get data from KNIME
df = knio.input_tables[0].to_pandas()
df = df.reset_index(drop=True)
cols = df.columns

# EDIT HERE!
cols_to_convert = ['Color', 'Size']
other_cols = [c for c in cols if c not in cols_to_convert]  # keeps the original column order

# initialize OneHotEncoder (sparse_output requires scikit-learn >= 1.2)
# drop='first' omits the first category of each column
encoder = OneHotEncoder(sparse_output=False, drop='first')

# transform Categoricals
encoded_data = encoder.fit_transform(df[cols_to_convert])

# Write result in dataframe
encoded_df = pd.DataFrame(encoded_data, columns=encoder.get_feature_names_out(cols_to_convert))

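# append the encoded columns to the original table
# (the original categorical columns are kept; use df[other_cols] instead of df to drop them)
final_df = pd.concat([df, encoded_df], axis=1)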

# output
knio.output_tables[0] = knio.Table.from_pandas(final_df)

image

I used the “Math Formula (Multi Column)” node to convert the integer columns to double.

Results:
Python:

KNIME One to many:

PS: ignore the red sign at my Python Script, I was testing something else…

Thanks for the quick reply!

As you can see from your screenshots of Python and the KNIME One to Many node, the column names are still different.

For example: Color_green vs. green, Size_M vs. M.

The KNIME-trained models expect exactly matching column names, so when I use the Python encoder the different column names throw everything off.
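
A quick way to see exactly which names differ is to compare the encoder output against the list of columns the model expects. This is only a sketch: the function `report_column_mismatch` and the example column names are made up for illustration, and the expected list would have to come from your own KNIME table.

```python
import pandas as pd


def report_column_mismatch(expected_cols, df):
    """Return (missing, unexpected) column names relative to the expected list."""
    actual = list(df.columns)
    missing = [c for c in expected_cols if c not in actual]
    unexpected = [c for c in actual if c not in expected_cols]
    return missing, unexpected


# Hypothetical example: names a KNIME-trained model might expect ...
expected = ["green", "red", "M", "S"]
# ... versus names in the style produced by the scikit-learn encoder
encoded = pd.DataFrame(
    [[1.0, 0.0, 1.0, 0.0]],
    columns=["Color_green", "Color_red", "Size_M", "Size_S"],
)

missing, unexpected = report_column_mismatch(expected, encoded)
print("missing:", missing)
print("unexpected:", unexpected)
```

If both lists are empty, the names match and only the column order might still need fixing.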

How about renaming them?


Hoping for a reusable solution. :grinning:

My use case has 60-something columns, and I’m trying to avoid renaming them by hand in every future case where I need one-hot encoding.

@Durkweed it seems you can just add the column name with an underscore. You would have to construct a loop.
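
On the Python side, such a loop could look roughly like the sketch below. It goes in the opposite direction (removing the `<column>_` prefix that scikit-learn adds, so the names become the bare values), and it assumes the default `_` separator. The helper name `strip_column_prefix` is made up for this example.

```python
import pandas as pd


def strip_column_prefix(encoded_df, source_cols, sep="_"):
    """Rename 'Color_green' -> 'green' for every source column in source_cols."""
    rename_map = {}
    for col in source_cols:
        prefix = f"{col}{sep}"
        for name in encoded_df.columns:
            if name.startswith(prefix):
                rename_map[name] = name[len(prefix):]
    return encoded_df.rename(columns=rename_map)


# Hypothetical encoder output with scikit-learn style names
encoded_df = pd.DataFrame(
    [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]],
    columns=["Color_green", "Color_red", "Size_M"],
)

renamed = strip_column_prefix(encoded_df, ["Color", "Size"])
print(list(renamed.columns))
```

This works for any number of columns, but it will produce duplicate names if two source columns share a value, so it is worth checking `renamed.columns.is_unique` afterwards.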

Hi,

yeah you are right, that’s a bit tricky for 60 columns.
Here is a solution that might help.

In addition, I found some strange behaviour in the “One to Many” node.
By default it does not add the source column name to the new column names and just uses the values, e.g. “red” instead of “color_red” or “red_color”.
If two columns share a value, it adds the column name automatically to avoid duplicate column names. For example:
image

Leads to columns with

But:
image

to:

I will add this topic to the “Bugs” section.


@mlauber71 @ActionAndi

Thanks for the suggestions.

To be more precise here’s a screenshot of the issue.

The left column is the original. The middle column is what I’m getting with the one hot knime node. The right column is what python outputs.

Looking at it with fresh eyes, it seems that KNIME appends the column name to the end instead of the beginning and, as ActionAndi found, only if there are similar names? I’m not sure of KNIME’s exact behavior.
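
For reference, the naming described so far in this thread could be imitated in pandas roughly as follows. This is a sketch under assumptions taken only from the posts above: a new column gets the bare value as its name, and gets `<value>_<column>` when the same value occurs in more than one of the selected columns. Whether KNIME renames all colliding columns or only some of them has not been verified here, and `one_to_many_like` is a made-up helper name.

```python
from collections import Counter

import pandas as pd


def one_to_many_like(df, cols):
    """One-hot encode cols, naming new columns as described in this thread (assumed rule)."""
    # Count in how many of the selected columns each distinct value occurs
    counts = Counter(v for c in cols for v in df[c].dropna().unique())
    parts = []
    for c in cols:
        dummies = pd.get_dummies(df[c], dtype=int)  # columns are the bare values
        dummies.columns = [
            f"{v}_{c}" if counts[v] > 1 else str(v) for v in dummies.columns
        ]
        parts.append(dummies)
    # Keep the original columns and append the indicator columns
    return pd.concat([df] + parts, axis=1)


# Hypothetical table in which "red" occurs in both columns
table = pd.DataFrame({"Shirt": ["red", "blue"], "Hat": ["red", "green"]})
result = one_to_many_like(table, ["Shirt", "Hat"])
print(list(result.columns))
```

Note that `pd.get_dummies` sorts the values alphabetically, so the column order may differ from what KNIME produces even when the names agree.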

Yeah.
Look at the workflow I’ve shared. I’ve tweaked the naming so it matches the Python names.
If you need other naming conventions, just change the “String Manipulation” node within the loop.


Hello @Durkweed and welcome to the KNIME community

You can take a look at the following post: the component in the Community Hub workflow handles the encoding of the columns, which should be helpful for your use case.

Be aware, as explained in the post, that the workflow drops one column from each set in order to avoid the “dummy variable trap”.
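
To illustrate why one column per set is dropped, here is a small pandas sketch (not the component itself): with the full set of indicator columns every row sums to 1, so the columns are redundant alongside a model intercept, and dropping one category removes that redundancy. The example data is made up.

```python
import pandas as pd

colors = pd.Series(["red", "green", "blue", "green"], name="Color")

# Full encoding: one indicator column per category
full = pd.get_dummies(colors, dtype=int)

# Reduced encoding: the first category (alphabetically) is dropped
reduced = pd.get_dummies(colors, drop_first=True, dtype=int)

print(full.sum(axis=1).tolist())  # every row sums to 1
print(list(full.columns))
print(list(reduced.columns))
```

A row with zeros in all remaining columns then stands for the dropped category.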

BR

Agree, we plan to tackle this when transitioning the node to the new dialog (internal ticket reference UIEXT-1904).
