While comparing ML methods on different data sets, I ran into error messages from the MLP and Random Forest Predictor nodes in a default process build. The predictors failed because a feature was not found in the input spec (input table specification).
The data sets have up to several thousand columns of term frequencies (doubles) plus a binary class column (string), with no missing values.
In the MLP predictor case (default process), it failed when a feature the Learner node had seen in the training data did not occur in the test data. I split the data manually into a (stratified) training set and test set. (I also tried using only the PMML Writer and Reader nodes to store the learned model and apply it to the manually created test data.)
This is surprising, as surely test data will often miss features or have new features!
My best guess is that too many features were missing for the learned model to be applied (which could well be the case), and the first missing feature is the one named in the error message.
But when I use the (randomly assigning) Partitioning node, the error does not occur even with many partition draws.
I checked this with related data sets, and the Predictor nodes seem to require a similar or identical feature space. Do I really have to create an empty data table containing all possible features for the predictor nodes to work?
Has anyone else encountered this failure, or do you have any suggestions?
Thanks in advance and best regards
I can confirm my assumption after some contortions:
By concatenating an empty table of all trained features with each test data table, different data spaces can be used for training and testing.
The predictor seems to expect all (or at least most of) the learned features, even if those feature columns are empty.
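Outside of KNIME, the same workaround can be sketched in a few lines of pandas: reindex the test table onto the trained feature columns, so missing features become empty (zero) columns and unseen features are dropped. The column names below are purely illustrative.

```python
import pandas as pd

# Hypothetical training and test term-frequency tables
train = pd.DataFrame({"apple": [0.5, 0.0], "banana": [0.1, 0.3], "class": ["a", "b"]})
test = pd.DataFrame({"banana": [0.2], "cherry": [0.7]})  # "apple" missing, "cherry" new

feature_cols = [c for c in train.columns if c != "class"]

# Align the test table to the trained feature space:
# missing features are filled with 0.0, unseen features are dropped.
test_aligned = test.reindex(columns=feature_cols, fill_value=0.0)

print(list(test_aligned.columns))  # -> ['apple', 'banana']
```

This is the table-level equivalent of concatenating an empty feature table in the workflow: the predictor then always sees the columns it was trained on.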
Curiously, using the Partitioning node avoids these difficulties, but this could be due to the random document selection. It is also possible that the Partitioning node passes a feature list along that is not shown in its output. When using manually created partitions, the error can occur.
Well, that is what I would expect of any self-respecting machine learning model: to demand that at least all columns are present when handling new data. If complete columns get 'lost' between training and testing due to missing values, you might consider removing them altogether, or check whether you could improve your data pre-processing.
With test data you would normally apply the same transformation steps (e.g. applying the trained vectorizer to the test data), so the columns the model sees are always the same. Do you know of any algorithms that behave differently and allow different inputs?
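The "trained vectorizer" idea can be shown with a minimal pure-Python sketch (the helper names are mine, not from any library): the vocabulary is fitted once on the training corpus and then reused on the test corpus, so every test row gets exactly the training feature space and unseen terms are silently dropped.

```python
# Minimal illustration of "fit the vectorizer on training data, reuse it on test data"
def fit_vocabulary(docs):
    # Learn a fixed term -> column-index mapping from the training corpus
    vocab = sorted({term for doc in docs for term in doc.split()})
    return {term: i for i, term in enumerate(vocab)}

def transform(docs, vocab):
    # Every row has exactly len(vocab) columns; unseen terms are ignored
    rows = []
    for doc in docs:
        row = [0] * len(vocab)
        for term in doc.split():
            if term in vocab:
                row[vocab[term]] += 1
        rows.append(row)
    return rows

vocab = fit_vocabulary(["apple banana", "banana cherry"])
X_test = transform(["banana durian"], vocab)  # "durian" is new and gets dropped
print(X_test)  # -> [[0, 1, 0]]
```

This is the transform-reuse discipline the post describes: the model never sees a column it was not trained on.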
I would say rather: Any self-respecting ML system. The behavior of an algorithm learning a model is a different issue.
When we implement an ML system, we typically want it to be fully controlled.
The underlying assumption is that the learned model captures the "rules" behind the data, which we assume to be rich enough that all needed features are sufficiently redundantly present. It is also typically assumed, as you say, that all occurring features are known.
Cross-validation may be used to check the consistency of the model, but it does not necessarily validate these assumptions; the data has to be studied in more depth.
Both assumptions above are very strong. Consider analyzing texts: the assumption that you can learn all features of the texts to be tested by studying a training corpus clearly fails. New words, terms etc. will occur in the new texts one wants to analyze. This is handled by comparing the existing feature list with the features of the texts to be tested.
To do this, you need to know what an algorithm requires.
An algorithm could, for example, simply add an empty feature when encountering new data, and emit a warning with a list of the new features. My workflow does something like this, after I understood the limitations of the implemented MLP Predictor node.
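That comparison step could look roughly like this outside of KNIME (a sketch with made-up names, not the poster's actual workflow): new features are reported with a warning, and features the model learned but the test data lacks are noted so they can be added as empty columns.

```python
import warnings

def align_features(test_columns, trained_features):
    """Compare a test table's columns against the learned feature list.

    Returns the aligned column order (exactly the trained feature space)
    and the list of trained features missing from the test data, which
    would be added as empty columns. New features trigger a warning.
    """
    new = [c for c in test_columns if c not in trained_features]
    missing = [c for c in trained_features if c not in test_columns]
    if new:
        warnings.warn(f"New features not seen in training: {new}")
    return list(trained_features), missing

aligned, missing = align_features(["banana", "cherry"], ["apple", "banana"])
print(aligned, missing)  # -> ['apple', 'banana'] ['apple']
```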
Obviously, many KNIME users will not be bothered by this. I ran into a question and answered it for myself. I would still be interested to learn whether the Partitioning node silently passes a feature list along, as I suspect.
In my case, I am looking at how well a model of a concept transfers to a different knowledge domain: the model is trained on a different text corpus than the test data.
To do this, I also implemented a Naive Bayes algorithm in KNIME that is indifferent to new features. As I found by twisting the NN workflow a little, I can also use the existing nodes. It would have been easier if I had known more about the data handling.
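A Naive Bayes classifier that is "indifferent to new features" can be sketched in plain Python (this is an illustrative toy, not the poster's KNIME implementation): terms absent from the training vocabulary are simply skipped during scoring instead of causing an error.

```python
import math
from collections import Counter, defaultdict

class TolerantNB:
    """Tiny multinomial Naive Bayes that ignores features never seen in training."""

    def fit(self, docs, labels):
        self.priors = Counter(labels)
        self.term_counts = defaultdict(Counter)
        for doc, y in zip(docs, labels):
            self.term_counts[y].update(doc.split())
        self.vocab = {t for counts in self.term_counts.values() for t in counts}
        return self

    def predict(self, doc):
        scores = {}
        n_docs = sum(self.priors.values())
        for y, prior in self.priors.items():
            total = sum(self.term_counts[y].values())
            score = math.log(prior / n_docs)
            for term in doc.split():
                if term not in self.vocab:
                    continue  # unseen feature: ignored, no error
                # Laplace smoothing over the known vocabulary
                p = (self.term_counts[y][term] + 1) / (total + len(self.vocab))
                score += math.log(p)
            scores[y] = score
        return max(scores, key=scores.get)

nb = TolerantNB().fit(["apple banana", "cherry cherry"], ["fruit_a", "fruit_c"])
print(nb.predict("cherry durian"))  # "durian" is new and simply skipped -> fruit_c
```

Because each class score only accumulates probabilities for known terms, a cross-domain test corpus full of new vocabulary still gets a prediction from the shared terms.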
This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.