Converting String to Nominal attributes

Hi everybody,
I’m trying to run the Weka NaiveBayes classifier node on a data set with string attributes, without success. If I run it in Weka it works, but in KNIME SDK 2.0.1 it doesn’t. The native KNIME NaiveBayes works on string attributes, but I need the Weka NaiveBayes implementation… Is there any way to convert String attributes to Nominal attributes?

I tried with the Rename node, changing attributes from StringValue to NominalValue, but the resulting data set attributes are not "Nominal"s, they are:

Non-Native [interface org.knime.core.data.NominalValue, interface org.knime.core.data.StringValue, interface org.knime.core.data.DataValue]

I hope you know how to solve this problem
Thank you again
Cheers
Carmelo

Have you tried the Domain Calculator node? It collects the nominal values of the String column and attaches them to the DataTableSpec.

I tried putting the Domain Calculator node both before and after the Rename node, but when I convert from StringValue to NominalValue I still get the same resulting Column Type:

Non-Native [interface org.knime.core.data.NominalValue, interface org.knime.core.data.StringValue, interface org.knime.core.data.DataValue]

How can I properly convert strings to nominals?

Thank you again
Cheers
Carmelo

Ciao Carmelo,

Nominal values don’t really exist as a separate type in KNIME. Strings (and some other types) can behave like nominals if they don’t have too many different values. It sounds as if in your case your StringCell-based column has too many values, so KNIME stops listing those values, which makes the column “non nominal”. You can adjust those settings in the Domain Calculator, though
(at the bottom: restrict number of possible values - try increasing the counter or uncheck the box).

You will not see this as a different type in the DataSpec but you will notice that KNIME now lists the possible values.
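
To make the idea concrete, here is a minimal sketch in plain Python (not KNIME code) of what the "restrict number of possible values" setting amounts to. The function name `possible_values` and the sample data are made up for illustration; the default limit of 60 is the one mentioned in this thread.

```python
from typing import Iterable, List, Optional


def possible_values(column: Iterable[str],
                    max_values: Optional[int] = 60) -> Optional[List[str]]:
    """Collect the distinct values of a column (hypothetical helper).

    Returns None when the column has more than max_values distinct values,
    which models a column that has no listed possible values.
    max_values=None models unchecking the restriction.
    """
    seen = set()
    for value in column:
        seen.add(value)
        if max_values is not None and len(seen) > max_values:
            return None  # too many values: no domain is listed
    return sorted(seen)


few = ["north", "south", "north", "east"]
many = [f"loc_{i}" for i in range(100)]  # 100 distinct values

print(possible_values(few))                         # ['east', 'north', 'south']
print(possible_values(many))                        # None
print(len(possible_values(many, max_values=None)))  # 100
```

With the restriction removed, the second column gets its full list of values, which is what the Weka nodes need in order to treat it as nominal.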

Let me know if it doesn’t work.
Cheers, Michael

Thank you for your answer Michael.
The Weka NaiveBayes classifier cannot handle String attributes. You said that Strings can behave like Nominals if they don’t have too many different values… I tried to uncheck the box on the Domain Calculator node, but after using the Rename node I always get the same attribute Column Type:

Non-Native [interface org.knime.core.data.NominalValue, interface org.knime.core.data.StringValue, interface org.knime.core.data.DataValue]

What can I do?
Thank you again for your help
Cheers
Carmelo

Michael, take a look at my blog, bye :slight_smile:

Sure, WEKA’s Naive Bayes node can use String attributes. The problem in your case seems to be that the columns do not have possible values assigned. As Michael and Fabian have already written, you need to use the Domain Calculator if there are more than 60 different values in a column, and the preceding node needs to be executed; otherwise no possible values will be available (and thus WEKA complains).

If I uncheck the “Restrict number of possible values” of the Domain Calculator node, the WEKA Naive Bayes node works properly. But now I get the following error with the Weka Predictor node:

“ERROR Weka Predictor Execute failed: Loc column has more possible values in test data than in the training data.”

I tried fixing the number of possible values to 60 (and to greater values) in both the Domain Calculator node for the training set and the Domain Calculator node for the test set, but then it says:

“ERROR NaiveBayes Execute failed: Weka classifier can not work with given data. Reason: NaiveBayes: Cannot handle string attributes!”

What would you suggest to do? Thank you for your help
Cheers
Carmelo

Well, that seems to be a problem with your data. If the test data has more attribute values than the training data, this cannot work. You could filter out all rows from the test data that have values not contained in the training data (though I have no concrete idea how to achieve this at the moment, maybe using the reference row filter or similar).
Fixing the number of possible values, even if Weka accepted this, wouldn’t help either, because the 60 values could be different in the two data sets.
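
The filtering idea above could look roughly like this in plain Python. This is only a sketch of the logic, not a KNIME workflow: the helper name `filter_known_rows` and the sample values are made up, and only the column name "Loc" comes from the error message.

```python
def filter_known_rows(train_rows, test_rows, column):
    """Split test rows by whether their value in `column` occurs in training.

    Hypothetical helper: returns (kept, dropped), where `dropped` holds the
    test rows whose value was never seen in the training data.
    """
    known = {row[column] for row in train_rows}
    kept = [row for row in test_rows if row[column] in known]
    dropped = [row for row in test_rows if row[column] not in known]
    return kept, dropped


# Made-up sample data for illustration.
train = [{"Loc": "Rome", "label": "a"}, {"Loc": "Milan", "label": "b"}]
test = [{"Loc": "Rome", "label": "a"}, {"Loc": "Turin", "label": "b"}]

kept, dropped = filter_known_rows(train, test, "Loc")
print(kept)     # [{'Loc': 'Rome', 'label': 'a'}]
print(dropped)  # [{'Loc': 'Turin', 'label': 'b'}]
```

Note that this throws away test rows, so any evaluation only covers the values the model has seen.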

I found that the problem is not the number of values contained in the test set; it is caused by values that are present in the test set but not in the training set…
It is a frequent problem, which I usually solve by creating the training and test sets from the same ARFF file and applying a filter afterwards to split them, so that both sets have the same possible values.
In KNIME I solved this problem by creating the training and test sets from the same Database Query node and splitting them in a later step with the Row Filter node.

The Partitioning node is provided for exactly this purpose :wink:

Yes Fabian, but in this case training and test sets cannot be created splitting data in that way.
Because for the training set I need data from January to February, and for the test set I need March data… so I need to run a SQL query to create these data sets…

You are right. Perfect. I assume that everything works now?

Yes it works… :slight_smile:
Thank you very much Fabian

http://carmelosaffioti.blogspot.com

Ciao Carmelo,

I'd like to ask you if you still have an example from this discussion. I am learning KNIME and was delighted to finally find someone who has tackled exactly what I need to do!

When you said

Yes Fabian, but in this case training and test sets cannot be created splitting data in that way.
Because for the training set I need data from January to February, and for the test set I need March data... so I need to run a SQL query to create these data sets...

I was hopeful for an answer in what followed, but Fabian's response seems to indicate that something else transpired or got left out.  At any rate, I don't understand what nodes I should use and in what order they should occur.  Could you help me out?

Many thanks in advance!

The general problem is that both the Weka Learner and the Predictor need to know the full domain (including all nominal values). From the previous example it seems that those datasets are drawn from different sources. To get the complete domain for both datasets, you need to combine them first and split them up again. I would suggest using a dummy column to flag the training and test sets and then combining the two datasets with the Concatenate node. The most important node for determining the full domain is the Domain Calculator, which can compute all nominal values (by default, values are only stored for nominal columns with 60 or fewer values; this limit can be changed in the node). As the last step, split the dataset again, for example with the Row Splitter, and provide the datasets to the Weka Learner and Predictor.
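
The flag, concatenate, compute domain, split sequence described above can be sketched in plain Python as follows. This only illustrates the logic and is not how the KNIME nodes are implemented; the helper name, the `_is_train` flag column and the sample rows are all made up.

```python
def split_with_shared_domain(train_rows, test_rows, nominal_columns):
    """Flag, concatenate, compute the full domain, then split again.

    Hypothetical helper: returns (train, test, domain), where `domain` maps
    each nominal column to the values found in EITHER data set.
    """
    # 1. Dummy column marks where each row came from, then concatenate.
    flagged = ([dict(row, _is_train=True) for row in train_rows] +
               [dict(row, _is_train=False) for row in test_rows])

    # 2. The domain is computed on the combined table, with no value limit.
    domain = {col: sorted({row[col] for row in flagged})
              for col in nominal_columns}

    # 3. Split on the dummy column and drop it again.
    def strip(row):
        return {k: v for k, v in row.items() if k != "_is_train"}

    train_out = [strip(row) for row in flagged if row["_is_train"]]
    test_out = [strip(row) for row in flagged if not row["_is_train"]]
    return train_out, test_out, domain


# Made-up sample data: "Turin" only occurs in the test set.
train = [{"Loc": "Rome"}, {"Loc": "Milan"}]
test = [{"Loc": "Rome"}, {"Loc": "Turin"}]

train_out, test_out, domain = split_with_shared_domain(train, test, ["Loc"])
print(domain)  # {'Loc': ['Milan', 'Rome', 'Turin']}
```

Both halves now share one list of possible values, so the learner already knows about values that only occur in the test set.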

Gabriel: Thanks! The addition of a dummy column is a great idea, since it appears that the Row Splitter node is not optimized for splitting on columns of type DateTime (which is the most natural choice for me).

I have created the dummy variable and successfully stuffed all the data into the Domain Calculator node.  I then use Row Splitter to pull apart training and test data.  The training data goes into the J48 node, where I get the error:

Execute failed: Weka classifier can not work with given data. Reason: weka.classifiers.trees.J48: Cannot handle string attributes!

Reviewing the Preliminary Attribute Check inside the Weka J48 node reveals that all my DateTime columns have a message in red text next to them:

weka.classifiers.trees.J48: Cannot handle string attributes!

Do I need to cast these DateTime variables into seconds since 1970 in my SQL code and then normalize them, or is there a more graceful fix?  As before, thanks in advance for any hints!  Bill
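
For reference, the "seconds since 1970, then normalize" option raised in this question could look like the sketch below in plain Python. It assumes timestamps formatted as `YYYY-MM-DD HH:MM:SS` and interpreted as UTC; the function names and sample dates are made up, and it does not settle whether KNIME offers a more graceful fix.

```python
from datetime import datetime, timezone


def to_epoch_seconds(stamp: str) -> float:
    """Parse 'YYYY-MM-DD HH:MM:SS' (assumed to be UTC) into epoch seconds."""
    parsed = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
    return parsed.replace(tzinfo=timezone.utc).timestamp()


def min_max_normalize(values):
    """Scale values linearly to [0, 1]; a constant column maps to 0.0."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


# Made-up sample timestamps.
stamps = ["2009-01-01 00:00:00", "2009-02-01 00:00:00", "2009-03-01 00:00:00"]
seconds = [to_epoch_seconds(s) for s in stamps]
scaled = min_max_normalize(seconds)
print(scaled[0], scaled[-1])  # 0.0 1.0
```

If the training and test sets cover different date ranges, the minimum and maximum would normally be taken from the training data and reused for the test data.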

Not sure if this was already answered above, but I faced the same problem: the column had too many values and KNIME couldn't identify them as nominal values.

So I first created a "Groupby" node on only the category/string column and connected it to the 2nd port of "Edit Nominal Domain (Dictionary)", with the original data connected to the first port. Works like a charm. (I used it to color-code a scatterplot, but the data set was quite huge and I think that's why KNIME wasn't able to identify this column as nominal values.)