Dimensionality reduction in text mining


I have two datasets: a training set with 200k rows and a smaller one with 2k rows. For the smaller one I want to predict some product attributes from the product name. The attributes are already assigned in the larger dataset, so I want to apply the rules learned on the larger dataset to the product names in the smaller one. After a few preprocessing steps, I used a keyphrase extractor and a vectorizer, so I ended up with the terms as 0-1 columns, around 25k of them. I reduced that to 10k with TF-IDF, but the ML algorithm still takes too long to run. Any ideas on how to further reduce the dimensions so that processing is fast and I don't lose too much information along the way?

Best regards,


Hey Mateusz,

There are different ways to reduce dimensionality. For example, you could do a PCA (https://en.wikipedia.org/wiki/Principal_component_analysis).
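
As a rough illustration of the idea, here is a minimal Python sketch. It assumes scikit-learn and SciPy are available, which the thread does not mention. It also swaps plain PCA for TruncatedSVD: a closely related technique that skips the mean-centering step, so it can work directly on a sparse term matrix. The random matrices, their sizes, and n_components=50 are made-up placeholders, not values from your data.

```python
from scipy import sparse
from sklearn.decomposition import TruncatedSVD

# Placeholder data (assumption): sparse 0/1 term matrices standing in for
# the real vectorizer output. Rows are products, columns are terms.
X_train = sparse.random(2000, 500, density=0.01, format="csr", random_state=0)
X_train.data[:] = 1.0
X_small = sparse.random(200, 500, density=0.01, format="csr", random_state=1)
X_small.data[:] = 1.0

# Fit the projection on the large (labelled) set only, then apply the
# same projection to the small set so both live in the same reduced space.
svd = TruncatedSVD(n_components=50, random_state=0)
Z_train = svd.fit_transform(X_train)
Z_small = svd.transform(X_small)

print(Z_train.shape, Z_small.shape)
# Share of variance kept by the 50 components; useful for choosing n_components.
print(svd.explained_variance_ratio_.sum())
```

The reduced matrices are dense, so the number of components should stay small enough for 200k rows to fit in memory. The summed explained variance ratio gives a first hint of how much information is lost at a given number of components.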

There is also a dimensionality reduction workflow on our example server. It shows seven different ways to reduce dimensionality, so I would recommend having a look there:


If you need further advice, just ask. ;)

Best regards,