Document Classification: Model Training and Deployment

Hub · September 8, 2020, 12:41pm

The goal of this workflow is to classify YouTube comments as spam or ham (non-spam). The workflow starts with a data table of YouTube comments taken from the YouTube Spam Collection Data Set at the UCI ML Repository [1]. The data is available in the workflow directory. The two categories, spam and ham, are roughly equally represented. First, the comments are converted into documents, each carrying its category (spam or ham) as the class. The documents are then preprocessed by filtering and stemming. After that, the documents are transformed into a bag of words, which is filtered again: only terms that occur in at least 1% of the documents (at least 3 documents) are kept as features. The documents are then transformed into document vectors, a numerical representation of the documents that is used to train a support vector machine classifier. The lower part of the workflow contains the deployment workflow.
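For readers who want to see the same steps outside of KNIME, here is a rough Python sketch of the pipeline described above: preprocessing, bag of words, the document-frequency filter, document vectors, and a linear SVM. It is not what the workflow nodes do internally. The stop-word list, the suffix-stripping `crude_stem`, the binary vectors, the hand-rolled subgradient SVM, the example comments, and all function names are assumptions made for illustration. In particular, how the workflow rounds the 1% threshold is not stated, so the sketch simply rounds up.

```python
import math
import re
from collections import Counter

# Illustrative stop-word list (an assumption, not the workflow's actual list).
STOP_WORDS = {"a", "an", "and", "the", "is", "to", "of", "my", "this", "for", "out", "i", "it"}


def crude_stem(token):
    """Strip a few English suffixes; a stand-in for a real stemmer such as Porter."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(comment):
    """Lowercase, tokenize, drop stop words, and stem."""
    tokens = re.findall(r"[a-z]+", comment.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]


def select_features(docs, min_fraction=0.01):
    """Keep terms that occur in at least min_fraction of the documents."""
    min_docs = max(1, math.ceil(min_fraction * len(docs)))
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return sorted(term for term, df in doc_freq.items() if df >= min_docs)


def to_vector(doc, vocabulary):
    """Binary document vector: 1 if the term occurs in the document, else 0."""
    present = set(doc)
    return [1 if term in present else 0 for term in vocabulary]


def train_linear_svm(vectors, labels, epochs=50, lr=0.1, lam=0.01):
    """Subgradient descent on the L2-regularized hinge loss; labels are +1 or -1."""
    w, b = [0.0] * len(vectors[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(vectors, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            w = [wi * (1 - lr * lam) for wi in w]  # regularization shrink
            if margin < 1:  # inside the margin: step along the hinge subgradient
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b


def predict(x, w, b):
    """Return +1 (spam) or -1 (ham) from the sign of the decision function."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1


# Made-up comments for illustration; +1 = spam, -1 = ham.
comments = [
    ("Check out my channel for free gifts", 1),
    ("Subscribe to my channel and win free gifts", 1),
    ("I love the songs", -1),
    ("Amazing song, love it", -1),
]
docs = [preprocess(text) for text, _ in comments]
labels = [label for _, label in comments]

# 0.5 instead of 0.01 so the filter has a visible effect on four documents.
vocabulary = select_features(docs, min_fraction=0.5)
vectors = [to_vector(doc, vocabulary) for doc in docs]
w, b = train_linear_svm(vectors, labels)
predictions = [predict(x, w, b) for x in vectors]
```

On these four toy comments the filter keeps only the terms channel, free, gift, love and song, since every other term appears in a single document. On real data you would likely swap in a proper stemmer and an established SVM implementation, and evaluate on held-out comments rather than on the training set.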

This is a companion discussion topic for the original entry at https://kni.me/w/M6MfBHnVUypxQl1O

leoa69 · August 29, 2024, 9:08pm

How can I classify spam in YouTube comments, including those from YouTube Vanced, by preprocessing the data (filtering, stemming, bag of words) and using a support vector machine, while ensuring only terms appearing in at least 1% of the comments are used as features?