[food for thought] Should NGram Creator or Bag of Words Creator be more generic?

NGram Creator creates a bag of n-grams. Its n-gram output is a string column named `ngram`. However, this column cannot be used directly with the TF node: even after conversion via String To Term, the TF node produces weights of 0. Another limitation is that NGram Creator does not offer a minimum and a maximum N, which strongly limits the usefulness of this node, in particular in a parameter tuning context.

The current workaround for the first issue (described in an example) involves using NGram Creator twice, once to create the bag of n-grams and once to obtain the n-gram statistics, then merging the two. This works, but it is neither very intuitive nor consistent with what can be done with a BoW.

For the second issue (having a low and high N), the workaround is to generate the desired combinations of Bags of n-grams and then to concatenate them, which is not always user-friendly.
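To make the request concrete, here is a minimal Python sketch (plain standard library, outside KNIME) of what a combined N minimum / N maximum setting would produce in one pass. The function name `ngrams_in_range` and its parameters `n_min` and `n_max` are hypothetical and are not part of any KNIME node.

```python
from typing import List


def ngrams_in_range(tokens: List[str], n_min: int, n_max: int) -> List[str]:
    """Return all word n-grams with n_min <= n <= n_max, grouped by n."""
    if n_min < 1 or n_max < n_min:
        raise ValueError("need 1 <= n_min <= n_max")
    result = []
    for n in range(n_min, n_max + 1):
        # Slide a window of width n over the token list.
        for i in range(len(tokens) - n + 1):
            result.append(" ".join(tokens[i:i + n]))
    return result


tokens = "the quick brown fox".split()
# n_min=1 includes the unigrams, i.e. the plain bag of words.
print(ngrams_in_range(tokens, 1, 2))
```

With `n_min=1` the unigrams are included, so the same call covers the bag-of-words case as well as the higher-order n-grams.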

Maybe NGram Creator (or Bag of Words) should be more generic, allowing N to take the value 1, thus corresponding to a bag of words or unigrams. TermCells would be the default output (instead of the string column).

Hi Geo,

The n-gram output is a string, not a term. The n-grams have not been tagged as terms internally in the documents, which means they cannot be counted with the TF node. You can use the same node to create a frequency output, which contains the frequencies of the n-grams found. This output can be joined to the bag of words output.
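As a rough illustration of the join described above, the following Python sketch builds a bag of bigrams and a separate frequency table, then joins them on the n-gram string. The sample documents, the helper `bigrams`, and the choice of a corpus-wide count as the statistic are assumptions for illustration; the columns the node actually emits may differ.

```python
from collections import Counter

# Hypothetical toy corpus, already tokenised.
docs = {
    "doc1": "new york is big".split(),
    "doc2": "new york new york".split(),
}


def bigrams(tokens):
    """Return the sliding word bigrams of a token list."""
    return [" ".join(tokens[i:i + 2]) for i in range(len(tokens) - 1)]


# "Bag" table: one (document, n-gram) row per distinct n-gram in a document.
bag = sorted({(doc_id, g) for doc_id, toks in docs.items() for g in bigrams(toks)})

# "Frequency" table: how often each n-gram occurs over the whole corpus.
corpus_freq = Counter(g for toks in docs.values() for g in bigrams(toks))

# Join the two tables on the n-gram string.
joined = [(doc_id, g, corpus_freq[g]) for doc_id, g in bag]
for row in joined:
    print(row)
```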

I see that this is not the most intuitive way. An alternative would be to tag the n-grams as terms, which could be done with a dictionary tagger. However, this would change the terms inside the documents, which is not always wanted.

Adding a "from" / "to" setting makes sense to avoid the use of multiple nodes and concatenation. Creating a term column instead of strings would not solve the counting problem, since the n-grams are not tagged as terms inside the documents.

Cheers, Kilian

Hi Kilian,

Thank you for your feedback! 

I agree with you that generalising the n-gram creator node with a `from` and `to` range is probably the best compromise. I had indeed forgotten that an n-gram is not a term, in the sense that one term can be present in several n-grams. I assume tagging would not allow overlapping and would thus be problematic ...

As you can see in my post above, I've also noticed the presence of the two outputs for the n-gram creator node ;-)

Hi Geo,

Yes, good point about the overlapping. Tagging would not be sensible if there are overlapping n-grams in the dictionary, since each new tag would overwrite the previously tagged terms.
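The overlap problem can be seen in a small Python sketch. It assumes a simplified tagger in which each token may belong to at most one tagged term, and it uses a keep-the-first policy rather than the overwrite behaviour described above; the function names are hypothetical. Either way, only non-overlapping n-grams can survive as terms.

```python
def sliding_bigrams(tokens):
    """Return (start, end) token spans of all sliding bigrams."""
    return [(i, i + 2) for i in range(len(tokens) - 1)]


def tag_without_overlap(spans):
    """Keep spans left to right, skipping any span that reuses a token."""
    kept, next_free = [], 0
    for start, end in spans:
        if start >= next_free:
            kept.append((start, end))
            next_free = end
    return kept


tokens = "new york city hall".split()
spans = sliding_bigrams(tokens)
kept = tag_without_overlap(spans)

# All sliding bigrams share tokens with their neighbours.
print([" ".join(tokens[s:e]) for s, e in spans])
# Only a non-overlapping subset can be tagged as terms.
print([" ".join(tokens[s:e]) for s, e in kept])
```

Here the bigram "york city" cannot be tagged once "new york" has claimed the token "york", so a TF count over tagged terms would miss it.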

I will put the "from" "to" option on the list. Thank you for pointing this out.

Cheers, Kilian

Thank you, Kilian!
