Predictive Classification of Headlines

Hello,

Currently I am trying to classify news headlines into categories, but I have a few questions regarding this. I hope you can help me. :)

Say my input table consists of three columns: (i) a company ID (e.g., for 100 companies), (ii) the text (e.g., 1,000,000 news headlines), and (iii) the publication date. My goal is to sort the headlines into predefined categories based on the keywords they contain (e.g., six categories with five keywords each).

  1. Am I correct that I should manually categorize a fraction of the data so that I can train a model that then predicts the categories of the remaining data? For example, by using the Tree Ensemble Learner / Predictor?
  2. From what I have tested, the bag of words is extracted from the data rather than predefined. Is there a way to count only prespecified words to categorize headlines? Or is BOW > Document Vector > Column Filter efficient enough?

Thank you for your time.

Hi, I am also struggling with the same problem. If you find a solution, please let me know.

Hi Okki,

First of all, you need to convert your strings (titles) into documents using the Strings to Document node. You can then apply the nodes provided by the Textprocessing extension to those documents.

You can do this in two ways.

  1. This approach requires a predefined list of keywords for each category. Use those lists to tag the documents with the Dictionary Tagger, using a different tag value for each category. Then filter out all untagged terms and count the remaining tagged terms. Now identify which tag value occurs most often in each document, e.g. using the TF node and the GroupBy node, and assign a category based on that, e.g. via the Rule Engine node.

  2. This approach is based on machine learning and requires labeled data. Do basic preprocessing (e.g. stemming, filtering, …) and create a document vector for each document. Then use these vectors and the documents' labels (categories) to train a predictive model, e.g. a decision tree. Finally, apply the trained model to new, unseen data.
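
To make the logic of the first approach concrete, here is a minimal sketch in plain Python rather than KNIME nodes. The keyword lists, the category names, and the function name `classify_headline` are all made up for illustration; in the workflow these steps correspond roughly to tagging, filtering, counting, and the final rule.

```python
import re
from collections import Counter

# Hypothetical category -> keyword lists; replace these with your own dictionaries.
CATEGORY_KEYWORDS = {
    "earnings": {"profit", "revenue", "earnings", "loss", "dividend"},
    "legal": {"lawsuit", "court", "fine", "settlement", "probe"},
}


def classify_headline(headline, category_keywords=CATEGORY_KEYWORDS,
                      default="uncategorized"):
    """Assign the category whose keywords occur most often in the headline."""
    # Lowercase and split into alphabetic tokens (very basic preprocessing).
    tokens = re.findall(r"[a-z]+", headline.lower())

    # "Tag" each token with its category and count the tags; all other
    # tokens are ignored, which mirrors filtering out the untagged terms.
    counts = Counter()
    for token in tokens:
        for category, keywords in category_keywords.items():
            if token in keywords:
                counts[category] += 1

    if not counts:
        return default  # no keyword matched at all
    # On a tie, the category that was matched first in the headline wins.
    return counts.most_common(1)[0][0]


print(classify_headline("Acme profit and revenue rise despite lawsuit"))
```

Note that exact matching like this misses inflected forms such as "profits", so stemming the headlines and the keyword lists first would probably help.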

Here is an example of how to do predictive modeling on documents. This example can be easily extended to e.g. predict categories: https://www.knime.com/blog/sentiment-analysis
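
If you want to see the idea behind the second approach outside of KNIME, here is a rough sketch in plain Python. Note the assumptions: it uses a small multinomial naive Bayes classifier on bag-of-words counts in place of the decision tree mentioned above, and the function names and the four training headlines are invented purely for illustration. A real model needs far more labeled headlines.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def train(labeled_headlines):
    """Collect per-category term counts from (headline, category) pairs."""
    word_counts = defaultdict(Counter)  # category -> term frequencies
    doc_counts = Counter()              # category -> number of headlines
    vocabulary = set()
    for headline, category in labeled_headlines:
        tokens = tokenize(headline)
        word_counts[category].update(tokens)
        doc_counts[category] += 1
        vocabulary.update(tokens)
    return word_counts, doc_counts, vocabulary


def predict(headline, model):
    """Return the category with the highest log-probability."""
    word_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    best_category, best_score = None, -math.inf
    for category, n_docs in doc_counts.items():
        total_terms = sum(word_counts[category].values())
        score = math.log(n_docs / total_docs)  # class prior
        for token in tokenize(headline):
            if token not in vocabulary:
                continue  # skip terms never seen during training
            # Add-one (Laplace) smoothing avoids log(0) for unseen pairs.
            score += math.log((word_counts[category][token] + 1)
                              / (total_terms + len(vocabulary)))
        if score > best_score:
            best_category, best_score = category, score
    return best_category


# Hypothetical, manually labeled training data.
TRAINING = [
    ("Acme reports record quarterly profit", "earnings"),
    ("Globex revenue falls short of forecast", "earnings"),
    ("Initech faces lawsuit over patents", "legal"),
    ("Court fines Umbrella Corp in antitrust case", "legal"),
]
model = train(TRAINING)
print(predict("Hooli profit beats forecast", model))
```

The structure is the same as in the workflow: turn each labeled headline into term counts, learn from the labeled part, then predict the category of unseen headlines.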

Cheers, Kilian
