We are working with a Kaggle dataset of about 1 million headlines from ABC News.
This is a data file of news headlines, each 5-10 words long. We want to apply the LDA topic extractor to the 2019 headlines only, approximately 34,000.
After enrichment and preprocessing, we end up with a BoW that has a “term” column and a document column (document = the original headline). LDA topic extraction works well on this.
However, what we would prefer is to first merge the headlines by month (also as a basis for a by-month TF-IDF comparison) and then run the topic extractor. The BoW now has a term column identical to the one above (where the headlines are not merged), but the document column contains all the headlines for that month rather than a single headline.
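To make the merge step concrete, here is a minimal sketch of what we mean by merging by month. It is only an illustration in plain Python, not our actual workflow: the sample rows are made up, and the layout (a yyyyMMdd date string followed by the headline text) and the helper name merge_by_month are assumptions for the example.

```python
from collections import defaultdict

# Hypothetical sample rows; assumed layout: (date as yyyyMMdd, headline text).
rows = [
    ("20190103", "heatwave breaks records across south australia"),
    ("20190115", "bushfire warning issued for tasmania"),
    ("20190204", "floods hit townsville after record rain"),
]


def merge_by_month(rows):
    """Concatenate all headlines that share the same yyyyMM prefix."""
    merged = defaultdict(list)
    for publish_date, headline in rows:
        month_key = publish_date[:6]  # first six characters are yyyyMM
        merged[month_key].append(headline)
    # One long document per month, headlines separated by a space.
    return {month: " ".join(headlines) for month, headlines in merged.items()}


monthly_docs = merge_by_month(rows)
```

With 2019 data this yields 12 documents instead of roughly 34,000, each one several thousand headlines long.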
Now the topic extractor does not work; this seems to be a capacity problem.
I suspect that:
- A BoW always contains, for each term in the term column, the document from which the term originates in the document column.
- I.e. each term carries with it, in the document column, a full month of headlines.
- The same rationale applies if we merge the headlines on a weekly basis; then too, the topic extractor seems to have a capacity problem.
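A rough way to illustrate this suspicion is shown below. The sketch assumes that every (term, document) row keeps its own copy of the full document text; we do not know whether the tool really stores documents this way or only references them, so this is an illustration of the worry, not a confirmed diagnosis. The function name bow_text_size and the sample headlines are made up.

```python
def bow_text_size(docs):
    """Characters held if each (term, document) row copies the document text.

    ASSUMPTION: one row per unique term per document, each row carrying
    the full text of its document. This models the suspicion only.
    """
    total = 0
    for doc in docs:
        unique_terms = set(doc.split())
        total += len(unique_terms) * len(doc)
    return total


# Hypothetical headlines, first as separate documents, then merged into one.
headlines = ["police probe crash", "rain eases drought", "police seek witness"]
merged = [" ".join(headlines)]

per_headline = bow_text_size(headlines)
per_month = bow_text_size(merged)
```

Under this assumption the merged variant is already larger for three short headlines, because both the number of unique terms per document and the document length grow with every headline added. With thousands of headlines per month, the total would grow roughly with (unique terms per month) × (length of the month's text).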
How can we solve this problem? Increasing the heap space does not solve it.