Hi all… I’m not sure where to start with this.
I have a table with 2000+ columns, but most of them are probably never used, so I want to eliminate them. However, I don’t know in advance which ones can be eliminated, and moreover, the list of which ones can be eliminated is not constant. This is for time-series forecasting, so I can’t peek into the future, so to speak.
Example:
With a table of 2000+ columns and 10,000 rows, I’m using a Window Loop to step through the table 1,000 rows at a time, applying dimensionality reduction to remove unnecessary columns. Each step, the window advances by 1 row and analyzes the next 1,000 rows (999 of which were already analyzed) to identify unnecessary columns. There is very little variation from one step to the next, but over time the drift in which columns are retained can be significant.
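To make the setup concrete, here’s a rough Python sketch of what I’m doing now. The variance filter is just a stand-in for my actual (much heavier) reduction step, and all the names (`select_columns`, `sliding_selection`) are made up for illustration:

```python
import pandas as pd

def select_columns(window: pd.DataFrame, threshold: float = 1e-6) -> set:
    # Placeholder for the real dimensionality-reduction step:
    # keep only columns whose variance in this window exceeds a threshold.
    variances = window.var()
    return set(variances[variances > threshold].index)

def sliding_selection(df: pd.DataFrame, window_size: int = 1000, step: int = 1) -> list:
    # Slide a window over the table, recording which columns survive
    # the reduction step in each window.
    kept_per_window = []
    for start in range(0, len(df) - window_size + 1, step):
        window = df.iloc[start:start + window_size]
        kept_per_window.append(select_columns(window))
    return kept_per_window
```

With step=1 over 10,000 rows, that’s ~9,000 windows of 1,000 rows each, which is where all the time goes.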
This is taking a lot of time. As a pre-processing step, I want to identify the “core” set of columns that will be used in any of the later processing steps. I would strip down my dimensionality reduction process to speed it up significantly for this pre-processing step; it just needs to produce a ballpark list of columns. I can use step sizes of 10 (or whatever), but I will loosen the criteria significantly so as not to eliminate columns that might otherwise be used. It is critical that I not eliminate any columns that would have been included if I hadn’t done this pre-processing.
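The pre-pass I have in mind looks roughly like this (again a sketch, with a variance filter standing in for my stripped-down reduction step): a bigger stride, a much looser threshold, and a union across windows, so a column survives if any coarse window keeps it:

```python
import pandas as pd

def loose_select(window: pd.DataFrame, threshold: float) -> set:
    # Stand-in for a stripped-down version of the real reduction step.
    variances = window.var()
    return set(variances[variances > threshold].index)

def core_columns(df: pd.DataFrame, window_size: int = 1000,
                 coarse_step: int = 10, loose_threshold: float = 0.0) -> set:
    # Coarse pre-pass: stride by coarse_step instead of 1, and take the
    # UNION of kept columns across windows (never intersect), so nothing
    # a finer window might want gets thrown away prematurely.
    core: set = set()
    for start in range(0, len(df) - window_size + 1, coarse_step):
        window = df.iloc[start:start + window_size]
        core |= loose_select(window, loose_threshold)
    return core
```

The union plus the loosened threshold is what’s supposed to compensate for the windows the coarse stride skips, but I’m not certain that actually guarantees my “never eliminate a column the full pass would keep” requirement.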
Later, using only the “core” columns, I can then run through the full dimensionality reduction process, and it ought to be much faster overall… at least that is my working theory.
Any ideas how to implement this?
I’ve tried to use the Window Loop for this, but since each window re-emits 999 rows of the previous one, I end up with a table that has a massive number of duplicate rows (although an efficient set of columns).
Thanks in advance.