I’m trying to filter out columns that may be irrelevant for the purpose of an analysis.
However, I’m finding this really difficult because of the following:
- The table has 2 million rows x 800 columns
- Some of the fields may have > 30k unique categories
- Some may be empty or have just 1 value
First, I tried Data Explorer (JS), but it wasn’t much help, because sorting on the missing or unique value counts seems to compare the numbers character by character from left to right instead of by value. That is, I get an order like 1, 10, 100, 1000, 11, 110, 111, etc. It also doesn’t support multiple selection (e.g. mark, press Shift, mark again). This makes it inefficient to go through all the fields and pick them manually.
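To illustrate what I think is happening with the sort order (this is only my guess at the cause, not a confirmed diagnosis of the node), here is a small plain-Python sketch. Sorting the counts as text reproduces the order I see, while sorting them as numbers gives the order I would expect:

```python
# Unique/missing counts as they might appear in the statistics table.
counts = [1000, 11, 1, 111, 10, 110, 100]

# Compared as text, values are ordered character by character.
as_text = sorted(str(c) for c in counts)

# Compared as numbers, values are ordered by magnitude.
as_numbers = sorted(counts)

print(as_text)     # ['1', '10', '100', '1000', '11', '110', '111']
print(as_numbers)  # [1, 10, 11, 100, 110, 111, 1000]
```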
Second, I tried GroupBy with Unique Count for each column. I was surprised by the result (see the attached picture).
Oh, and feature selection is not an option: I waited more than 12 hours on a tree-building loop, and it showed no sign of finishing.
So, please review this issue with the two nodes and point me to a more automated way of doing this than going through each field’s statistics and comparing the result to the table before deciding what to keep.
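To make concrete what I mean by "automated", here is a minimal sketch in plain Python rather than as a workflow. The function names `profile_columns` and `columns_to_drop`, the `max_unique` threshold, and the sample rows are all hypothetical and only for illustration. It counts unique and missing values per column in one pass, then flags columns that are empty, constant, or have too many categories:

```python
def profile_columns(rows, missing=(None, "")):
    """Return {column: (unique_count, missing_count)} from one pass over rows.

    `rows` is any iterable of dicts mapping column name -> hashable value.
    """
    uniques = {}
    missing_counts = {}
    for row in rows:
        for col, value in row.items():
            seen = uniques.setdefault(col, set())  # registers empty columns too
            if value in missing:
                missing_counts[col] = missing_counts.get(col, 0) + 1
            else:
                seen.add(value)
    return {col: (len(vals), missing_counts.get(col, 0))
            for col, vals in uniques.items()}


def columns_to_drop(profile, max_unique=30_000):
    """Flag columns that are empty, constant, or exceed max_unique categories."""
    drop = [col for col, (n_unique, _n_missing) in profile.items()
            if n_unique <= 1 or n_unique > max_unique]
    return sorted(drop)


# Tiny made-up sample standing in for the real table.
sample_rows = [
    {"id": "a", "const": "x", "empty": "", "cat": "p"},
    {"id": "b", "const": "x", "empty": None, "cat": "q"},
    {"id": "c", "const": "x", "empty": "", "cat": "p"},
]
sample_profile = profile_columns(sample_rows)
sample_drop = columns_to_drop(sample_profile, max_unique=2)
print(sample_drop)  # ['const', 'empty', 'id']
```

On the real table the sets of unique values could get large for the high-cardinality fields, so this is only meant to show the kind of rule I would like to apply to all 800 columns at once.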