[food for thought] Should NGram Creator or Bag of Words Creator be more generic?

NGram Creator creates a bag of n-grams. Its n-gram output is a string column named `ngram`. However, this column cannot be used directly with the TF node: even after conversion via String To Term, the TF node produces weights of 0. Another limitation is that NGram Creator does not accept a minimum and a maximum N, which strongly limits the usefulness of this node, particularly in a parameter tuning context.

The current workaround for the first issue (described in an example) involves using NGram Creator twice, once for the creation of bag of n-grams and once for the n-gram statistics, then merging the two. This works but it is neither very intuitive nor consistent with what can be done with a BoW.

For the second issue (having a low and a high N), the workaround is to generate a bag of n-grams for each desired N and then concatenate them, which is not always user-friendly.
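To make the request concrete, here is a minimal Python sketch of what a low/high N setting would compute. It is plain Python, not KNIME code; the function names `ngrams` and `ngram_range` and the whitespace tokenisation are assumptions made for illustration only.

```python
from typing import List


def ngrams(tokens: List[str], n: int) -> List[str]:
    """Return all contiguous word n-grams of length n, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_range(tokens: List[str], n_min: int, n_max: int) -> List[str]:
    """Concatenate the n-grams for every n from n_min to n_max (inclusive)."""
    if n_min < 1 or n_max < n_min:
        raise ValueError("expected 1 <= n_min <= n_max")
    result: List[str] = []
    for n in range(n_min, n_max + 1):
        result.extend(ngrams(tokens, n))
    return result


# Hypothetical input: a naive whitespace split stands in for real tokenisation.
tokens = "the quick brown fox".split()
bag = ngram_range(tokens, 1, 2)
print(bag)
```

With `n_min=1` the result also contains the unigrams, i.e. the plain bag of words, so a single call covers what currently takes several nodes plus a concatenation.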

Maybe NGram Creator (or Bag of Words) should be more generic, allowing N to take the value 1, thus corresponding to a bag of words or unigrams. TermCells would be the default output (instead of the string column).

The n-gram output is a string, not a term. The n-grams have not been tagged as terms internally in the documents, which means they cannot be counted with the TF node. You can use the same node to create a frequency output, which contains the frequencies of the n-grams found. This output can then be joined to the bag of words output.
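As a rough illustration of that join, the sketch below builds the two tables in plain Python and combines them on the n-gram string. It is not KNIME code: the table layouts, the `bigrams` helper, the sample documents, and the assumption that the frequency table holds one corpus-wide count per n-gram are all made up for the example.

```python
from collections import Counter
from typing import Dict, List


def bigrams(text: str) -> List[str]:
    """Return the word bigrams of a text (naive whitespace tokenisation)."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + 2]) for i in range(len(tokens) - 1)]


# Hypothetical corpus.
docs: Dict[str, str] = {"doc1": "a b a b", "doc2": "b a c"}

# Table 1: bag of n-grams, one row per distinct (document, n-gram) pair.
bag = sorted({(doc_id, g) for doc_id, text in docs.items() for g in bigrams(text)})

# Table 2: frequency table, assumed here to be one corpus-wide count per n-gram.
freq = Counter(g for text in docs.values() for g in bigrams(text))

# Join on the n-gram string: every bag row picks up its count.
joined = [(doc_id, g, freq[g]) for doc_id, g in bag]
for row in joined:
    print(row)
```

The join key is the n-gram string itself, which is why the string output is enough for this workaround even though it cannot feed the TF node.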

I see that this is not the most intuitive approach. An alternative would be to tag the n-grams as terms, which could be done with a dictionary tagger. However, this would change the terms inside the documents, which is not always wanted.

Adding a "from"/"to" setting makes sense, as it avoids using multiple nodes and concatenation. Creating a term column instead of strings would not solve the counting problem, though, since the n-grams are not tagged as terms inside the documents.

I agree with you that generalising the n-gram creator node with a `from` and `to` range is probably the best compromise. I had indeed forgotten that an n-gram is not a term, in the sense that one term can be present in several n-grams. I assume tagging would not allow overlapping and would thus be problematic ...

As you can see in my post above, I've also noticed the presence of the two outputs for the n-gram creator node ;-)

Yes, good point about the overlapping. Tagging would not be sensible if there are overlapping n-grams in the dictionary. This would overwrite the last tagged terms.
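A small Python sketch shows why overlapping n-grams and tagging do not mix. It uses a deliberately simplified greedy left-to-right model in which each token can belong to only one term; this is an assumption for illustration and not how the dictionary tagger is implemented. The function name `tag_non_overlapping` and the sample data are hypothetical.

```python
from typing import List, Set


def tag_non_overlapping(tokens: List[str], dictionary: Set[str]) -> List[str]:
    """Greedy left-to-right tagging where each token belongs to at most one term."""
    tagged: List[str] = []
    i = 0
    while i < len(tokens):
        candidate = " ".join(tokens[i:i + 2])
        if i + 1 < len(tokens) and candidate in dictionary:
            tagged.append(candidate)  # both tokens are consumed by this bigram
            i += 2
        else:
            tagged.append(tokens[i])  # token stays a single-word term
            i += 1
    return tagged


tokens = "new york city".split()
dictionary = {"new york", "york city"}  # two bigrams sharing the token "york"

tagged = tag_non_overlapping(tokens, dictionary)
all_bigrams = [" ".join(tokens[i:i + 2]) for i in range(len(tokens) - 1)]
lost = [g for g in all_bigrams if g not in tagged]

print(tagged)  # terms after tagging
print(lost)    # bigrams that could not be represented as terms
```

Whichever way the conflict is resolved, the token shared by both bigrams can only end up in one term, so one of the overlapping n-grams cannot be counted.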

I will put the "from" "to" option on the list. Thank you for pointing this out.

