Reusing my model classifier

Hi everybody:

I'm having trouble reusing my classifier model. I did everything needed to build the model, but the problem comes when I try to use it on new data. I apply the same pre-processing, but then I cannot use the model because the new data doesn't have the same DataTableSpec (the same columns generated by the document vector), simply because it is new data and contains different keywords.

Does anybody know how to give my new data the same format I produced during pre-processing, i.e. the same document vector columns with their respective tf-idf values?

Thanks.

Greetings.

Hi,

Model prediction will not work if the data fields (column names) of the test set differ from those of the training set. You can try renaming the columns of the test set with the Column Rename node before doing the predictions.

Evert

The best approach is to work with a training term dictionary.

That is, while training the model, extract the list of the most important trained terms into a flow variable. Use this list to build a reference table for processing your test / prediction set (after document pre-processing). The reference table ensures two things: every trained term is present as a column (impute 0 wherever missing values show up), and only the trained terms are present (filter out all other terms). This is the basic idea; you'll have to work out the details.
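
Outside of KNIME, the same idea can be sketched in a few lines of plain Python. This is only an illustration, not a KNIME workflow: the function name `align_to_vocabulary`, the sample terms and the tf-idf values are all made up, and each document is assumed to be a dict mapping term to its already computed tf-idf weight.

```python
def align_to_vocabulary(doc_vectors, vocabulary):
    """Give every document the same columns as the training vocabulary.

    doc_vectors: list of dicts, term -> tf-idf weight (assumed precomputed).
    vocabulary:  ordered list of the terms the model was trained on.
    """
    aligned = []
    for weights in doc_vectors:
        # Trained terms missing from the document are imputed to 0.0;
        # terms that were not in the training vocabulary are dropped.
        aligned.append([weights.get(term, 0.0) for term in vocabulary])
    return aligned


# Hypothetical training vocabulary and two new documents.
train_vocabulary = ["price", "quality", "service"]
new_docs = [
    {"price": 0.42, "delivery": 0.77},  # "delivery" was never seen in training
    {"refund": 0.91},                   # no trained term at all
]

aligned_docs = align_to_vocabulary(new_docs, train_vocabulary)
for row in aligned_docs:
    print(row)
```

Every row now has exactly one value per trained term, in the training order, so its layout matches what the model saw during training.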

In the worst case, a test observation will have 0 in every column, in which case your predictor will probably choose its default category, or you can handle such rows however you wish.
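
If you would rather handle the all-zero case explicitly than leave it to the predictor, one option is to check for it before predicting. A minimal sketch, assuming the row is a plain list of tf-idf weights; `predict_with_fallback`, `toy_predict` and the category names are hypothetical stand-ins for your real predictor and classes.

```python
def predict_with_fallback(vector, predict, default_category):
    """Return default_category when the row contains no trained term."""
    # any() is False when every weight is 0.0, i.e. no trained term occurred.
    if not any(vector):
        return default_category
    return predict(vector)


def toy_predict(vector):
    # Stand-in for a real model: decides on the first column only.
    return "positive" if vector[0] > 0.3 else "negative"


labels = [
    predict_with_fallback(row, toy_predict, "unknown")
    for row in [[0.42, 0.0, 0.0], [0.0, 0.0, 0.0]]
]
print(labels)
```

Here the second row never reaches the model and is labelled with the chosen default instead.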