How to eliminate unused columns

Hi all… I’m not sure where to start with this.

I have a table with 2000+ columns, but most of them are probably never used, so I want to eliminate them. However, I don’t know in advance which ones can be eliminated, and moreover, the list of which ones can be eliminated is not constant. This is for time-series forecasting, so I can’t peek into the future, so to speak.

Example:
With a table of 2000+ columns and 10000 rows, I’m using a Window Loop to step through the table in windows of 1000 rows, applying dimensional reduction within each window to remove unnecessary columns. The next window steps forward 1 row and analyzes the next 1000 rows to identify unnecessary columns (999 of which were already analyzed in the previous window). There is very little variation from one step to the next, but over time the drift in the columns retained can be significant.
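In case it helps to see the shape of what I’m doing outside of KNIME, here is a rough Python/pandas sketch of the sliding-window selection. `reduce_columns` is just a stand-in for the real per-window reduction, and the names are made up for illustration:

```python
import pandas as pd

def reduce_columns(window: pd.DataFrame) -> set:
    # Stand-in for the real per-window dimensional reduction; here it just
    # drops columns that are constant within the window.
    return {c for c in window.columns if window[c].nunique(dropna=True) > 1}

def sliding_column_selection(df: pd.DataFrame, window_size: int = 1000, step: int = 1) -> list:
    # Slide a window of `window_size` rows forward `step` rows at a time and
    # record which columns each window position would retain.
    retained_per_window = []
    for start in range(0, len(df) - window_size + 1, step):
        window = df.iloc[start:start + window_size]
        retained_per_window.append(reduce_columns(window))
    return retained_per_window
```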

This is taking a lot of time. As a pre-processing step, I want to identify the “core” set of columns that will be used in any of the later processing steps. I would strip down my dimensional reduction process to speed it up significantly for this pre-processing step; it just needs to produce a ballpark list of columns. I can use step sizes of 10 (or whatever), but will loosen the criteria significantly so as not to eliminate columns that might otherwise be used. It is critical that this pre-processing not eliminate any columns that would otherwise have been included.
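What I have in mind for the pre-pass, roughly, is to take the union of the columns retained across the coarser windows, so nothing that any window might keep can be lost. A sketch of that idea, with a deliberately loosened variance floor standing in for the relaxed criteria (the function name and thresholds are placeholders, not my real settings):

```python
import pandas as pd

def core_columns(df: pd.DataFrame, window_size: int = 1000, step: int = 10,
                 variance_floor: float = 1e-4) -> set:
    # Coarse pre-pass: larger step, deliberately loose criteria, and a UNION
    # across windows, so no column that any later window might keep is dropped.
    core = set()
    for start in range(0, len(df) - window_size + 1, step):
        window = df.iloc[start:start + window_size]
        numeric = window.select_dtypes("number")
        variances = numeric.var()
        # keep anything showing even a hint of variation (much looser than the real cutoff)
        keep = set(variances[variances > variance_floor].index)
        # keep all non-numeric columns unconditionally in this sketch
        keep |= set(window.columns) - set(numeric.columns)
        core |= keep
    return core
```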

Later, using the “core” columns, I can then run the full dimensional reduction process and it ought to be much faster overall… at least that is my working theory.
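Putting the two sketches together, the two-stage flow I’m imagining looks something like this (assuming `df` is the full 2000+ column table):

```python
core = core_columns(df, window_size=1000, step=10)   # fast, loose pre-pass
slimmed = df[sorted(core)]                           # drop everything outside the core set
windows = sliding_column_selection(slimmed)          # full-strength pass on far fewer columns
```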

Any ideas how to implement this?

I’ve tried using the Window Loop, but I end up with a table that has a massive number of duplicate rows (although an efficient set of columns).
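In the Python sketches above, one way to avoid that would be to collect only the set of column names each window keeps, rather than the filtered rows themselves, and apply a single column filter at the very end. Something like:

```python
from collections import Counter

# Aggregate only the column names each window keeps, instead of concatenating
# the filtered windows themselves (which is what duplicates the rows).
retained_per_window = sliding_column_selection(df)
survival_counts = Counter(col for kept in retained_per_window for col in kept)
core = set(survival_counts)   # union: every column kept by at least one window
```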

Thanks in advance.

Hi @cybrkup

What technique are you using to reduce the dimensionality?

Low Variance Filter - 0.01
Missing Value Column Filter - 20% threshold
Constant Value Column Filter
Linear Correlation / Correlation Filter - 0.5 correlation threshold, run on each unique input file. I then join all those results and do another round at 0.8 on the joined table.

From the starting point of 2000+ columns, it typically gets down to fewer than 100 final columns for any given batch of input rows.
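For reference, here is roughly what that chain of filters does, written as a single-pass pandas sketch. The correlation step is a simplified stand-in (one pass at one threshold, dropping one column from each highly correlated pair), so this only illustrates the logic and does not reproduce the KNIME nodes exactly:

```python
import numpy as np
import pandas as pd

def filter_chain(df: pd.DataFrame,
                 var_threshold: float = 0.01,
                 missing_threshold: float = 0.20,
                 corr_threshold: float = 0.5) -> pd.DataFrame:
    out = df.copy()

    # Missing Value Column Filter: drop columns with more than 20% missing values
    out = out.loc[:, out.isna().mean() <= missing_threshold]

    # Constant Value Column Filter: drop columns with a single unique value
    out = out.loc[:, out.nunique(dropna=True) > 1]

    # Low Variance Filter: drop numeric columns with variance below the threshold
    numeric = out.select_dtypes("number")
    variances = numeric.var()
    out = out.drop(columns=variances[variances < var_threshold].index)

    # Correlation Filter (simplified): drop one column from each pair whose
    # absolute correlation exceeds the threshold
    numeric = out.select_dtypes("number")
    corr = numeric.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [c for c in upper.columns if (upper[c] > corr_threshold).any()]
    return out.drop(columns=to_drop)
```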
