I’ve built a text classification model that uses 1- to 3-gram features extracted from a message box to classify each message as either ‘Yes’ or ‘No’. In development, my most successful model is an SVM, and I now want to deploy it. However, when I pass a new data set through, not all of the features used in the model are present in the new data, because the message box is free-form text. This causes the model to fail at execution with the message ‘column XYZ’ not found.
Is there an easy way to tell the model to skip any variables that are not present, or to treat them as 0? Or should I append the missing columns from a dictionary, filled with 0?
I haven’t worked much with text classification models, but the issue is familiar to me. If you know the full list of features, I’d indeed add the missing ones as empty columns. I can’t say whether a text value of 0 might cause trouble for your model, though.
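As a rough illustration of treating absent n-grams as 0, here is a minimal Python sketch. It assumes the features are plain n-gram counts, in which case 0 simply means "this n-gram did not occur in the message". The names `ngrams`, `vectorize` and `FEATURE_NAMES` are made up for the example and are not taken from any particular tool, and the whitespace tokenizer is a stand-in for whatever preprocessing was used in training.

```python
from collections import Counter


def ngrams(tokens, n_min=1, n_max=3):
    """Yield every n-gram (space-joined) with n_min <= n <= n_max."""
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])


def vectorize(message, feature_names):
    """Return n-gram counts in the order of the training columns.

    N-grams that do not occur in the message get 0; n-grams the model
    was never trained on are silently dropped.
    """
    counts = Counter(ngrams(message.lower().split()))
    return [counts.get(name, 0) for name in feature_names]


# Hypothetical feature list standing in for the columns the SVM was trained on.
FEATURE_NAMES = ["yes", "no", "not sure", "yes please", "no thank you"]

row = vectorize("Yes please call me", FEATURE_NAMES)
print(row)
```

Because the output is built from the training feature list rather than from whatever appears in the new message, every scored row has the same columns in the same order, so a "column not found" error cannot arise from a missing n-gram.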
If the list of features is dynamic, you might:
1. At training time, save all features (column names) to a file.
2. At deployment, read the file containing the list of features (column names) your model was trained against.
3. Extract the features (column names) from the new data.
4. Apply a reference column filter to identify the missing ones.
5. Re-add the missing columns.
The last step can be tricky because of “Non-native” data types. Hope that helps.
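Outside of a workflow tool, the same steps could look roughly like the Python sketch below. It is only an illustration under stated assumptions: the feature list is stored as JSON, each row of new data is a dict mapping column name to count, and the helper names (`save_feature_names`, `load_feature_names`, `align_rows`) and the file name are hypothetical. The `fill_value` parameter is where the data type issue shows up, since the filler should have the same type as the training columns.

```python
import json
import os
import tempfile


def save_feature_names(feature_names, path):
    """Training time: persist the column names the model was fitted on."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(list(feature_names), fh)


def load_feature_names(path):
    """Scoring time: read the saved column names back, in order."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def align_rows(rows, feature_names, fill_value=0):
    """Return (aligned_rows, missing).

    `missing` lists training columns absent from every new row. Each
    aligned row has exactly the training columns, in training order;
    absent ones are filled with `fill_value`, unknown ones are dropped.
    """
    present = set()
    for row in rows:
        present.update(row)
    missing = [name for name in feature_names if name not in present]
    aligned = [
        {name: row.get(name, fill_value) for name in feature_names}
        for row in rows
    ]
    return aligned, missing


# Training time (hypothetical file name and feature list).
path = os.path.join(tempfile.mkdtemp(), "svm_feature_names.json")
save_feature_names(["yes", "no", "not sure"], path)

# Scoring time: the new data lacks some columns and has an unseen one.
new_rows = [{"yes": 2, "maybe": 1}, {"yes": 1}]
aligned, missing = align_rows(new_rows, load_feature_names(path))
print(missing)
print(aligned)
```

Keeping the saved list as the single source of truth means the scoring side never depends on which n-grams happen to appear in a given batch of messages.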