I have a set of documents, each labelled with one of two categories: relevant or not relevant.
I would like to find the best combination of keywords for predicting whether a new document is relevant or not relevant.
I am not sure what the best way is to reduce the vocabulary to the relevant keywords.
(The tagged set of test data is about 1,000 documents, producing roughly 500 thousand terms in the Bag of Words.)
Thanks for any hints or samples.
At first I thought of TF-IDF (https://en.wikipedia.org/wiki/Tf–idf) to identify the so-called relevant keywords, but then I realized that what you actually want is to classify the documents as relevant or irrelevant. For that, it should normally be fine to calculate TF and apply a lower threshold (chosen visually or, better, through cross-validation or optimisation nodes) and maybe an upper threshold as well. You could also take inspiration from the well-known "spam or ham" classification example, which is widely covered online.
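To make the thresholding idea concrete, here is a minimal sketch in plain Python. It is only an illustration: the tokenizer, the function name `filter_terms`, the parameters `min_count` and `max_doc_ratio`, and the four toy documents are all made up for this example, and the threshold values are arbitrary rather than tuned.

```python
from collections import Counter


def tokenize(text):
    # Deliberately naive tokenizer (assumption): lowercase, split on whitespace.
    return text.lower().split()


def filter_terms(docs, min_count=2, max_doc_ratio=0.9):
    """Keep terms that occur at least `min_count` times in total (lower
    threshold) and appear in at most `max_doc_ratio` of the documents
    (upper threshold)."""
    term_counts = Counter()  # total occurrences of each term
    doc_counts = Counter()   # number of documents containing each term
    for text in docs:
        tokens = tokenize(text)
        term_counts.update(tokens)
        doc_counts.update(set(tokens))
    n_docs = len(docs)
    return sorted(
        term
        for term, count in term_counts.items()
        if count >= min_count and doc_counts[term] / n_docs <= max_doc_ratio
    )


# Toy corpus, invented for illustration only.
docs = [
    "the invoice is overdue",
    "the invoice was paid",
    "the weather is nice",
    "the match was cancelled",
]

kept = filter_terms(docs, min_count=2, max_doc_ratio=0.75)
print(kept)  # ['invoice', 'is', 'was']
```

Here "the" is dropped by the upper threshold because it occurs in every document, and the one-off terms are dropped by the lower threshold. With 1,000 labelled documents you would pick both thresholds by cross-validating the downstream classifier rather than by hand.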
A note on TF-IDF: this approach boosts rarer terms rather than only frequent ones. Classifying with TF-IDF weights will not necessarily give better results than plain TF, because TF-IDF potentially filters out the terms that best identify documents labelled as irrelevant.
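A small sketch of why that can happen, assuming the common definition idf(t) = log(N / df(t)) (other variants add smoothing). The function name `idf` and the toy corpus are invented for illustration; the labels in the comments are hypothetical.

```python
import math
from collections import Counter


def idf(docs):
    """Return idf(t) = log(N / df(t)) for every term t in the corpus."""
    n_docs = len(docs)
    df = Counter()  # number of documents containing each term
    for text in docs:
        df.update(set(text.lower().split()))
    return {term: math.log(n_docs / count) for term, count in df.items()}


# Toy corpus (assumption): one relevant document, three irrelevant ones.
docs = [
    "invoice overdue",    # relevant
    "newsletter offer",   # irrelevant
    "newsletter sale",    # irrelevant
    "newsletter update",  # irrelevant
]

weights = idf(docs)
print(round(weights["invoice"], 3), round(weights["newsletter"], 3))
```

In this toy corpus "newsletter" occurs in every irrelevant document and in no relevant one, so it is a perfect marker of irrelevance, yet its IDF (log(4/3)) is much smaller than that of "invoice" (log(4)). A classifier fed raw TF would still see "newsletter" at full strength.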
Thanks for the fast reply, Geo.
I will give it a try.