topic extraction

Dear all,

We are working on a Kaggle dataset of 1 million headlines from ABC News.

This is a data file with news headlines, each 5-10 words long. We wish to apply the LDA topic extractor to the 2019 headlines only, approx. 34,000.

After enrichment and preprocessing, we end up with a BoW that has a “term” column and a document column (document = the original headline). LDA topic extraction works well on this.

However, what we would prefer is to first merge the headlines by month (also as a basis for a by-month TF-IDF comparison) and then run the topic extractor. The BoW now has a term column (identical to the one above, where headlines are not merged), while the document column contains all the headlines for that month (rather than a single headline).
Now the topic extractor does not work; this seems to be a capacity problem.
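
For illustration, the monthly merge described above can be sketched in plain Python, outside KNIME. This is a minimal sketch only: it assumes each row is a pair of a `yyyymmdd` date string and a headline, and the function name `merge_by_month` and the sample rows are hypothetical, not taken from the workflow.

```python
from collections import defaultdict


def merge_by_month(rows):
    """Group (publish_date, headline) pairs into one document per yyyymm key.

    Assumes publish_date is a string such as "20190115".
    """
    buckets = defaultdict(list)
    for publish_date, headline in rows:
        buckets[publish_date[:6]].append(headline)
    # One merged document per month, in chronological order.
    return {month: " ".join(headlines) for month, headlines in sorted(buckets.items())}


# Made-up sample rows, for illustration only.
rows = [
    ("20190101", "storm hits coast"),
    ("20190115", "council approves budget"),
    ("20190203", "storm damage bill rises"),
]
monthly_docs = merge_by_month(rows)
```

With 12 monthly documents instead of roughly 34,000 headlines, each document becomes very long, which matters for what follows.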

I suspect that:

  • A BoW always contains, for each term in the term column, the document from which the term originates in the document column.
  • I.e. each term carries with it, in the document column, a full month of headlines.
  • The same rationale applies if we merge the headlines on a weekly basis: the topic extractor again seems to have a capacity problem.

How can we solve this problem? Increasing the heap space does not help.

Regards,

Michiel

Hey @MvBreemen,

I had a look at the data set and tried to reproduce your problem. It works fine for me.
Could it be that you are using the Topic Extractor after the Bag Of Words node? In that case it does indeed look like a capacity problem. Applying the BoW Creator multiplies the rows, so that each row holds one term and its corresponding document. The document column then contains the same documents over and over again, and the Topic Extractor tries to run the extraction on every row, even though each document appears many times in the column.
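
To make the row multiplication concrete, here is a small Python sketch (not KNIME code). It assumes the BoW holds one row per distinct term per document; the helper names `bow_rows` and `unique_documents` and the sample documents are hypothetical.

```python
def bow_rows(documents):
    """Emit one (term, document) row per distinct term in each document.

    Every row carries the full document text, as in the BoW described above.
    """
    rows = []
    for doc in documents:
        for term in sorted(set(doc.split())):
            rows.append((term, doc))
    return rows


def unique_documents(rows):
    """Recover each document exactly once, preserving first-seen order."""
    return list(dict.fromkeys(doc for _term, doc in rows))


# Made-up merged documents, for illustration only.
documents = [
    "storm hits coast council approves budget",
    "storm damage bill rises",
]
rows = bow_rows(documents)
docs_for_topics = unique_documents(rows)
print(len(rows), len(docs_for_topics))
```

Two documents turn into ten rows here. With a full month of headlines per document, every one of those rows repeats the whole month, so feeding the BoW output to topic extraction repeats the same work many times over.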

I would go for two different branches: one doing the BoW and TF-IDF comparison, and one doing the topic extraction. You don’t need a BoW to use the Topic Extractor node.
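
As a rough Python illustration of the TF-IDF branch (again, not KNIME code): the sketch below scores terms per monthly document, while the topic extraction branch would take the same merged documents directly, with no BoW in between. It assumes one common TF-IDF variant, relative term frequency times `log(N / df)`; the KNIME TF and IDF nodes may use a different formula, and the function name `tfidf` and the sample data are hypothetical.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return {doc_key: {term: score}} using tf = count / len(doc), idf = log(N / df)."""
    tokenized = {key: text.split() for key, text in docs.items()}
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for tokens in tokenized.values():
        df.update(set(tokens))
    scores = {}
    for key, tokens in tokenized.items():
        counts = Counter(tokens)
        scores[key] = {
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        }
    return scores


# Made-up monthly documents, for illustration only.
monthly_docs = {
    "201901": "storm hits coast council approves budget",
    "201902": "storm damage bill rises",
}
scores = tfidf(monthly_docs)
```

A term that occurs in every month, such as "storm" here, scores zero, so the comparison highlights what is specific to each month.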

I hope this helps.

Best,

Julian
