Filter large table

jeffreysdelong · February 27, 2019, 4:46am

I am building a deep learning network (using DL4J nodes) and am reading data from files with over 60K rows. I would like to only include every third row. I.e., I would like to filter out 2 out of every three rows. I transpose the table (so the rows become columns) so I could also filter on the columns if that was easier. Random selection of rows or columns would also be acceptable as long as I wind up with about 20K. I appreciate any suggestions on how to do this.

Andrew_Steel · February 27, 2019, 6:40am

Hi @jeffreysdelong,

take your table, append a new column with Math Formula and value $$ROWINDEX % 3 and Filter by 0,1 or 2 with a Row Filter.

Best regards
Andreas

s.roughley · February 27, 2019, 6:48am

You could also use the Partitioning node if random is ok

Steve

jeffreysdelong · February 27, 2019, 3:55pm

Great suggestions. I will check them both out.

Thanks, Jeff

beginner · February 28, 2019, 1:25pm

Or Row Sampling Node.

And the Sampling in these doesn’t need to be random. Just select linear sampling.

s.roughley · February 28, 2019, 8:45pm

@beginner - thanks - that looks like the Row Filter analogue of the Partitioning node, which also on close inspection has the same options.

I guess the final choice depends whether @jeffreysdelong wants the remainder of his dataset as a test set.

Steve

jeffreysdelong · March 1, 2019, 4:40am

Thanks for all the suggestions. I wound up using Andrew’s suggestion.

The input files have gene expression data in a column. I wanted to reduce the amount of data from 60K to 20K to make it more manageable. I then transpose the column to a row so I can add additional patients data.

beginner · March 1, 2019, 8:41am

Fine if it works for you but column expressions is very slow. Using partitioning or row sampling will be faster.