How to classify a highly unbalanced dataset? Is there a KNIME node for "Tomek links" under-sampling?

Hi,

I am facing the challenge of classifying a highly unbalanced dataset, where the False class makes up only around 1% of the total dataset. I have tried SMOTE over-sampling, but it didn’t solve the problem.
Now I would like to try under-sampling with Tomek links, and I am wondering whether there is a node in KNIME for Tomek links under-sampling.
Also, for this kind of highly unbalanced dataset, are there any other methods to improve the classification accuracy?

Thanks in advance for your helpful answers!

BR
Mei

Hi @Meihong. I’ve used the SMOTE node. It works well but heavily increases the computation time.

What does your workflow look like? Is it built in Python or entirely in KNIME?

@Meihong you could take a look at this debate

It is built entirely with KNIME nodes.

The purpose of this task is to find the key features that cause the failures.

Is the problem that my SMOTE settings are not suitable? These are the settings I am currently using:
image

Does that mean there is no Tomek links node in KNIME, and we need to use Python to build it?

No, those are reasonable values for the parameters. The issue is more likely that the signal is weak, as explained by @mlauber71 in the thread.

Hello @Meihong,

To my knowledge, there is no dedicated node in KNIME for under-sampling with the Tomek links method. Python is probably the way to go.
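For illustration, here is a minimal pure-Python sketch of what Tomek links under-sampling does. It is not taken from this thread or from KNIME: the function names (`nearest_neighbor`, `tomek_link_indices`, `undersample_tomek`) and the toy data are hypothetical, and it assumes numeric features compared with Euclidean distance. A Tomek link is a pair of rows from different classes that are each other's nearest neighbor; the sketch drops the majority-class row of each such pair. For real data, the imbalanced-learn package provides a `TomekLinks` under-sampler that would likely be a better choice than this quadratic-time loop.

```python
import math


def nearest_neighbor(i, X):
    """Return the index of the row closest to X[i] (Euclidean), or None."""
    best_j, best_d = None, math.inf
    for j, row in enumerate(X):
        if j == i:
            continue
        d = math.dist(X[i], row)
        if d < best_d:
            best_j, best_d = j, d
    return best_j


def tomek_link_indices(X, y):
    """Return pairs (i, j), i < j, that are mutual nearest neighbors
    with different class labels."""
    nn = [nearest_neighbor(i, X) for i in range(len(X))]
    links = []
    for i, j in enumerate(nn):
        if j is not None and i < j and nn[j] == i and y[i] != y[j]:
            links.append((i, j))
    return links


def undersample_tomek(X, y, majority_label):
    """Drop the majority-class member of every Tomek link."""
    to_drop = set()
    for i, j in tomek_link_indices(X, y):
        to_drop.add(i if y[i] == majority_label else j)
    keep = [k for k in range(len(X)) if k not in to_drop]
    return [X[k] for k in keep], [y[k] for k in keep]


# Toy data (made up): True is the majority class, False the minority.
X = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.05, 1.0), (3.0, 3.0), (3.2, 3.0)]
y = [True, True, True, False, True, True]

X_res, y_res = undersample_tomek(X, y, majority_label=True)
print(tomek_link_indices(X, y))  # the one cross-class mutual-neighbor pair
print(len(X), "->", len(X_res), "rows")
```

Note that Tomek links only removes majority rows that sit right next to minority rows, so it cleans the class boundary rather than rebalancing the classes; with a 1% minority class the dataset would stay heavily unbalanced afterwards. Wiring this into a KNIME Python Script node (reading the input table and writing the result back) is not shown here.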

Here is a link to the KNIME Python integration guide, in case you don’t have it configured or are not aware of it:
https://docs.knime.com/latest/python_installation_guide/index.html

And a link to a new blog post with steps on how to configure it quickly and easily:

Br,
Ivan

