I’m developing a Naive Bayes classifier using the following dataset (https://www.kaggle.com/crowdflower/twitter-user-gender-classification/data).
What I'm trying to do is train a classifier that predicts the user's gender based on twitter text, twitter profile description and twitter profile side color.
Since the twitter text and profile description attributes are string columns, I need to preprocess the data before training the classifier. I saw that many examples use the Strings to Document node for this. The resulting Document column is then preprocessed with other nodes such as Number Filter, Case Converter and so on.
Since I want to use more than one attribute to train my classifier, what do I have to do? Should I convert both string attributes (twitter text and profile description) into documents?
Thanks in advance
What you are describing looks like a document classification task. To treat it as such, you need a labeled dataset in which every data row includes the gender. If this information is available, you can follow the steps below.
First, you would need to convert the strings into documents.
The documents need to be preprocessed by filtering and stemming. After that, you should transform the documents into a bag of words. Before applying the Naive Bayes classifier, the bag of words needs to be transformed into document vectors. The set of distinct words across all documents in the collection forms the vocabulary. Each document can then be represented by a vector of presence/absence (1/0) or frequency values, one per word in the vocabulary. The vocabulary of the collection thus defines the vector space model.
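To make the document vector idea more concrete, here is a minimal sketch in plain Python. It is not what the KNIME nodes do internally, only an illustration under some assumptions: the function names (tokenize, build_vocabulary, to_vector) and the two example texts are made up, the preprocessing is reduced to lowercasing and dropping non-alphabetic tokens, and stemming is left out.

```python
from collections import Counter

def tokenize(text):
    # Assumed, very rough preprocessing: lowercase, split on whitespace,
    # keep purely alphabetic tokens (this drops numbers and punctuation).
    return [tok for tok in text.lower().split() if tok.isalpha()]

def build_vocabulary(documents):
    # Vocabulary = sorted set of distinct words across the whole collection.
    return sorted({tok for doc in documents for tok in tokenize(doc)})

def to_vector(text, vocabulary, binary=True):
    # One entry per vocabulary word: presence/absence (1/0) or frequency.
    counts = Counter(tokenize(text))
    if binary:
        return [1 if word in counts else 0 for word in vocabulary]
    return [counts[word] for word in vocabulary]

# Made-up example texts, not taken from the dataset.
docs = ["I love football", "love love shopping"]
vocab = build_vocabulary(docs)
binary_vectors = [to_vector(d, vocab) for d in docs]
freq_vectors = [to_vector(d, vocab, binary=False) for d in docs]

print(vocab)
print(binary_vectors)
print(freq_vectors)
```

Each row of binary_vectors or freq_vectors has the same length as the vocabulary, which is the kind of table a classifier such as Naive Bayes can then be trained on together with the gender label.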
As a reference, you can take a look at the following example workflow available on the EXAMPLES Server: knime://EXAMPLES/08_Other_Analytics_Types/01_Text_Processing/02_Document_Classification
Hope that helps,