Filter out categorical fields

I’m trying to filter out columns that are of no use for an analysis.
However, I’m struggling with the following:

  • The dimensions of the table are 2m rows x 800 columns
  • Some of the fields may have > 30k unique categories
  • Some may be empty or have just 1 value

First I tried Data Explorer (JS), but it wasn’t much help, because sorting by the missing-value or unique-value count appears to compare the counts as text, character by character from left to right, rather than as numbers. That is, I get an order like 1, 10, 100, 1000, 11, 110, 111, etc. It also doesn’t support multiple selection (e.g. mark, press Shift, mark again), so going through all the fields and picking them manually isn’t really efficient.
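The ordering described above is what you get when counts are sorted as strings instead of numbers. A minimal Python sketch of the difference follows; the sample values are made up for illustration, and it says nothing about how the Data Explorer node is actually implemented.

```python
# Hypothetical unique-value counts, held as text the way a table view might hold them.
counts_as_text = ["1000", "2", "11", "1", "110", "20", "10", "111", "100"]

# Plain string sort: compares character by character, so "1000" sorts before "11".
text_order = sorted(counts_as_text)

# Numeric sort: convert to int for the comparison only.
numeric_order = sorted(counts_as_text, key=int)

print(text_order)
print(numeric_order)
```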
Second, I tried GroupBy with Unique Count for each column. I was surprised by the result (attached pic).
Oh, and feature selection is not an option: I waited over 12 hours on a tree-building loop that showed no sign of finishing.
So, please, review this issue with the two nodes and point me to a more automated way of doing this than going through each field’s stats and comparing the result to the table before deciding what to keep.

Cheers!

Hi @deicide_bg

As a start, you can remove the columns with little or no information. Take a look at the Missing Value Column Filter node and the Low Variance Filter node.
gr. Hans
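
For readers who want to prototype the same idea outside the KNIME nodes, here is a rough sketch in Python with pandas. It is not how the Missing Value Column Filter or Low Variance Filter nodes work internally. The function names `profile_columns` and `columns_to_drop` and the thresholds (95% missing, 30,000 distinct values) are assumptions chosen for illustration, and the whole table is assumed to fit in memory.

```python
import pandas as pd


def profile_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: fraction of missing cells and distinct-value count."""
    return pd.DataFrame(
        {
            "missing_frac": df.isna().mean(),
            "n_unique": df.nunique(dropna=True),
        }
    )


def columns_to_drop(
    profile: pd.DataFrame,
    max_missing_frac: float = 0.95,  # assumed threshold, tune for your data
    max_unique: int = 30_000,        # assumed threshold, tune for your data
) -> list:
    """List columns that are mostly empty, constant, or have too many categories."""
    mostly_missing = profile["missing_frac"] >= max_missing_frac
    constant = profile["n_unique"] <= 1  # 0 distinct values means entirely empty
    too_many = profile["n_unique"] > max_unique
    return profile.index[mostly_missing | constant | too_many].tolist()


# Tiny made-up table standing in for the real 2m x 800 one.
df = pd.DataFrame(
    {
        "id": ["a", "b", "c", "d"],
        "empty": [None, None, None, None],
        "constant": ["x", "x", "x", "x"],
        "useful": ["p", "q", "p", "q"],
    }
)

profile = profile_columns(df)
drop = columns_to_drop(profile, max_unique=3)
filtered = df.drop(columns=drop)
print(profile)
print(list(filtered.columns))
```

The profile table can be sorted numerically on either column, so the candidates can be reviewed in one place before anything is dropped.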

Hi @deicide_bg -

HansS has some good suggestions to get started. You may also want to check this KNIME blog post from a few years back, along with its associated workflow, for a few additional ideas.
