Bag of Words (BoW) with multiple words in one Term

Hi!

I have an input file similar to the following table, representing recipes with an ID, the cuisine the recipe belongs to, and a list of the required ingredients:

ID   | Cuisine     | Ingredients
1111 | greek       | black pepper, water, large eggs, sugar, corn oil, feta cheese crumbles
2222 | southern_us | lemon, pesto, whipping cream, melted butter
3333 | indian      | coarse salt, urad dal, potatoes, white rice, vegetable oil

I would like to create a model that can predict the cuisine of new recipes based on the ingredients used.

Is there a way to create a "bag of ingredients" instead of a bag of words? I want each ingredient, for example "vegetable oil", to be treated as one term and not separately as "vegetable" and "oil". In addition, it should still be possible to use all the preprocessing nodes on the terms.

I already tried a lot of things, but so far nothing has done the trick. For example, I tried using the Dictionary Tagger to tag those terms, but that's quite complicated because there are thousands of recipes. Furthermore, the terms are often tagged wrongly: if a recipe contains the single words "vegetable" and "oil" as well, "vegetable oil" will be tagged as two separate words.

Anyone got some ideas for a nice solution?

Thanks in advance for your help,

carpa_jo

EDIT: Is it maybe possible to tell the BoW node to separate the terms at every comma instead of at every space?

You could split the column "ingredients" on the comma (",") (cf. Column Splitter Node).

Then:

  • either transform the obtained categorical / string variables into dummies (cf. One To Many Node), which would give you the document-term matrix. This shape is fair enough for Naive Bayes or k-NN (using binary distances);
  • or use Pivot / Unpivot Nodes to restructure the table to long format. That would give you a sort of bag of ingredients.

It all depends on the classifier type that you intend to use.
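To make the two shapes concrete outside of KNIME, here is a minimal Python sketch of the same idea: split on the comma, then build either the long format or the binary dummy matrix. The data is the toy table from the question; the helper name `split_ingredients` is made up for illustration and does not correspond to any KNIME node.

```python
# Toy data copied from the table in the question.
recipes = {
    1111: "black pepper, water, large eggs, sugar, corn oil, feta cheese crumbles",
    2222: "lemon, pesto, whipping cream, melted butter",
    3333: "coarse salt, urad dal, potatoes, white rice, vegetable oil",
}


def split_ingredients(text):
    """Split on commas so that multi-word ingredients stay one term."""
    return [part.strip() for part in text.split(",") if part.strip()]


# Long format: one (recipe ID, ingredient) pair per row, similar to unpivoting.
long_format = [
    (rid, ing) for rid, text in recipes.items() for ing in split_ingredients(text)
]

# Wide format: one binary column per ingredient, similar to One To Many.
vocabulary = sorted({ing for _, ing in long_format})
dummies = {
    rid: [1 if term in split_ingredients(text) else 0 for term in vocabulary]
    for rid, text in recipes.items()
}

print(vocabulary)
print(dummies[2222])
```

Because the split happens on the comma only, "vegetable oil" ends up as a single column rather than as "vegetable" and "oil".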

Hi Geo,

Thanks a lot for your input!

I guess you mean the Cell Splitter node instead of the Column Splitter Node, right?

The second way you suggested (Unpivot node) is one of the experiments I already did. It helps me to extract the terms as I wish, but I can't continue with preprocessing nodes like Case Converter, and calculating the TF-IDF values in particular seems to be a problem. First, the preprocessing steps don't seem to work on the terms, and second, a lot of the TF values are not calculated, and I have no idea why... Do I have to create a link between the terms and the documents? If so: how?

In the attachment you can see a workflow. At the top is the process using BoW (I know, the accuracy of the decision tree model sucks, but this is just a quick example using only part of the data, etc.), and at the bottom is the same process trying to implement the "bag of ingredients", including the problems I just named.

Any ideas how to fix these issues? Any help is very much appreciated.

Thanks in advance.

Here’s another idea: replace the white space with a special character, e.g. “_”, then replace the comma with white space. That should allow you to use BoW. Btw, you don’t need BoW for preprocessing, only for TF.

PS: Also try your text representation against simple BoW and bi- or trigram representations (on words, not characters; the Palladian text classifier implements this method seamlessly, no BoW needed). They are there for a reason. Not all ingredients will contribute equally to the cuisine. You even have a case for TF-IDF, as “rare” ingredients certainly help identify the cuisine.
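As a rough illustration of the replacement trick, here is a small Python sketch. The function name `to_bow_friendly` is hypothetical; in KNIME the same effect would come from string replacement nodes rather than from code like this.

```python
import re


def to_bow_friendly(ingredients):
    """Join the words of each ingredient with "_" and separate ingredients by spaces."""
    # Split on the comma first, so the blank after each comma is not turned into "_".
    terms = [term.strip() for term in ingredients.split(",") if term.strip()]
    return " ".join(re.sub(r"\s+", "_", term) for term in terms)


text = to_bow_friendly("coarse salt, urad dal, potatoes, white rice, vegetable oil")
print(text)
print(text.split())  # what a whitespace tokenizer would see
```

One detail to watch: replacing every blank with "_" before touching the commas would also convert the blank that follows each comma, leaving terms such as "_urad_dal". Splitting on the comma and trimming first avoids that.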

Thanks again for your input!

First of all: thanks for the Palladian hint. Nice KNIME extension, easy to implement. I tried different ways, bi- and trigrams etc., but sadly the accuracy was not as good as with a simple BoW.

Great idea to replace the white space with a special character. The problem is: if I do the replacing before preprocessing, nodes like a stemmer don't work properly anymore (the same goes for other preprocessing nodes). They only recognize the last word of a term if there is no space in between. An example:

original ingredient: roasted peanuts
replaced ingredient: roasted_peanuts
ingredient after stemming: roasted_peanut
what it should be like: roast_peanut
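The ordering effect in this example can be reproduced in a few lines of Python. Note that `crude_stem` is a deliberately naive, made-up stand-in that only strips a trailing "ed" or "s"; it is not the stemmer KNIME uses and is only meant to show why stemming has to happen per word before the words are joined.

```python
def crude_stem(word):
    """Toy stemmer (assumption, not a real algorithm): strip a trailing 'ed' or 's'."""
    for suffix in ("ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def stem_then_join(ingredient, stem=crude_stem):
    """Stem every word on its own, then glue the words together with '_'."""
    return "_".join(stem(word) for word in ingredient.lower().split())


def join_then_stem(ingredient, stem=crude_stem):
    """Glue first, then stem: the stemmer only sees the end of the joined term."""
    return stem("_".join(ingredient.lower().split()))


print(join_then_stem("roasted peanuts"))
print(stem_then_join("roasted peanuts"))
```

With this toy stemmer, joining first gives `roasted_peanut`, while stemming each word first gives `roast_peanut`.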

So I thought about doing the replacing after all preprocessing steps. This solves the problem I just described, but it creates a new one: the preprocessing nodes need the Document type. If I use String Manipulation on a document, its type is changed back to string, and BoW does not work on strings. Converting it back to the Document type after the manipulation produces this error:
"Configure failed (IllegalArgumentException): Table specs to join contain the duplicate column name "Document" at position 0 and 0."

I am very grateful for the help you already provided, but I'll have to ask again: any ideas how to fix this?

Thanks in advance!

There is another way to approach this: you can use latent semantic methods. There is a node that performs LDA.

You must try it.

Have you tried the Replacer preprocessing node instead of String Manipulation? It may be worth a shot.

Regarding the “failure” of n-grams, that merits more attention. Did you apply the Palladian text classifier on a string column or on a document column? I’ve noticed that it performs not so well on string columns, don’t know why…

EDIT: Regarding the error message you’ve got from the back and forth document transformation, I think that’s worth reporting to Kilian.

Hi,

the text processing nodes might not be suitable here since the data is not really completely unstructured text (is it?). To me it seems that you can do all you need without text processing nodes.

  1. Splitting strings at "," should work to get the ingredients
  2. Unpivoting to get all ingredients in one column and create a "bag of ingredients"
  3. String Manipulation to do case conversion on strings (stemming cannot be applied)
  4. Grouping to count ingredients per recipe and in how many recipes each ingredient is used
    1. "IDF" can be computed from that
    2. Note that TF is not reasonable here. Each ingredient is contained only once in each recipe
  5. Use Pivoting node to create vectors
    1. Group by ID
    2. Use ingredients as pivots
  6. Replace missing values by 0
  7. Train and score model
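
For readers who want to see the data flow of steps 1 to 6 outside of KNIME, here is a minimal Python sketch on the toy table from the question. It is not the attached workflow. The IDF formula log(N / df) is an assumption (one common definition; the post does not specify one), and step 7 is left out.

```python
import math
from collections import Counter

# Toy rows from the question: (ID, cuisine, ingredients).
rows = [
    (1111, "greek", "black pepper, water, large eggs, sugar, corn oil, feta cheese crumbles"),
    (2222, "southern_us", "lemon, pesto, whipping cream, melted butter"),
    (3333, "indian", "coarse salt, urad dal, potatoes, white rice, vegetable oil"),
]

# Steps 1-3: split at ",", unpivot to (ID, ingredient) pairs, convert to lower case.
# The set removes duplicates, so each ingredient counts once per recipe.
pairs = sorted(
    {
        (rid, ing.strip().lower())
        for rid, _, text in rows
        for ing in text.split(",")
        if ing.strip()
    }
)

# Step 4: ingredients per recipe, and number of recipes each ingredient occurs in.
per_recipe = Counter(rid for rid, _ in pairs)
doc_freq = Counter(ing for _, ing in pairs)

# Step 4.1: IDF under the assumed definition log(N / df).
idf = {ing: math.log(len(rows) / df) for ing, df in doc_freq.items()}

# Steps 5-6: pivot to one row per ID and one column per ingredient, missing -> 0.
columns = sorted(doc_freq)
present = set(pairs)
vectors = {
    rid: [1 if (rid, ing) in present else 0 for ing in columns]
    for rid, _, _ in rows
}

print(per_recipe)
print(vectors[3333])
```

The resulting `vectors` rows, together with the cuisine column as the class label, are what a learner would be trained on in step 7.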

Attached is an example workflow.

Cheers, Kilian

Hi Kilian,

I tried it this way, but it doesn't give me an accurate classification: all records in the test set are assigned the same class.

What should I do?