03_Document_Classification

This workflow performs topic classification. Once the Documents have been converted into document vectors (one column per word), the task becomes a standard classification problem that can be solved with any supervised Machine Learning algorithm. We chose a decision tree, but any other classifier would work as well. The metanode "Limit # keywords" deliberately caps the number of extracted keywords, and with it the number of columns produced. Since the dataset used here is quite small, we want to avoid the poor generalization that comes from having too many columns relative to the number of rows in the training set. The Document Vector Applier node applies the vector columns built from the training set to the test set and removes all words that are present in the test set but not in the training set. The Category To Class node extracts the content of the category field of each Document and places it in a column named "class".
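
The vector steps described above can be sketched outside KNIME in a few lines of plain Python. This is only an illustrative analogue, not the implementation of the KNIME nodes: the helper names (`tokenize`, `build_vocabulary`, `to_row`), the toy documents, the whitespace tokenizer, and ranking keywords by document frequency are all assumptions made for the example. It shows the three ideas: capping the vocabulary, reusing the training vocabulary on the test set so that unseen words are dropped, and copying the category into a class label.

```python
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def build_vocabulary(train_docs, max_keywords):
    """Keep the max_keywords terms that occur in the most training documents.

    Ranking by document frequency is an assumption for this sketch; the
    "Limit # keywords" metanode may select keywords differently.
    """
    doc_freq = Counter()
    for doc in train_docs:
        doc_freq.update(set(tokenize(doc["text"])))
    # Sort by frequency (descending), then alphabetically so ties are stable.
    ranked = sorted(doc_freq.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:max_keywords]]


def to_row(doc, vocabulary):
    """Return (binary vector over the vocabulary, class label).

    Words outside the vocabulary are ignored, which mirrors dropping
    test-set words that never appeared in the training set.
    """
    present = set(tokenize(doc["text"]))
    vector = [int(term in present) for term in vocabulary]
    return vector, doc["category"]  # category field becomes the "class"


# Hypothetical toy data, not the dataset used in the workflow.
train_docs = [
    {"text": "stock market rises", "category": "finance"},
    {"text": "market crash fears", "category": "finance"},
    {"text": "team wins match", "category": "sports"},
    {"text": "match ends in draw", "category": "sports"},
]
test_docs = [
    {"text": "Market rally continues", "category": "finance"},
]

vocabulary = build_vocabulary(train_docs, max_keywords=3)
train_rows = [to_row(doc, vocabulary) for doc in train_docs]
test_rows = [to_row(doc, vocabulary) for doc in test_docs]

print(vocabulary)
print(test_rows)
```

The resulting rows have the same columns for training and test data, so they can be handed to any supervised learner, such as a decision tree.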


This is a companion discussion topic for the original entry at https://kni.me/w/YdUJ3g7iXzokQ7Gd