NGram Creator creates a bag of n-grams. Its term output is a string column named `ngram`. However, this column cannot be used directly with the TF node: even after conversion via String To Term, the TF node produces weights of 0. Another limitation is that NGram Creator does not offer a minimum and a maximum N, which strongly limits the usefulness of this node, particularly in a parameter tuning context.
The current workaround for the first issue (described in an example) involves using NGram Creator twice, once for the creation of bag of n-grams and once for the n-gram statistics, then merging the two. This works but it is neither very intuitive nor consistent with what can be done with a BoW.
For the second issue (having a low and a high N), the workaround is to generate a bag of n-grams for each desired N and then concatenate them, which is not always user-friendly.
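To illustrate what a minimum/maximum N setting would amount to, here is a minimal Python sketch of the concatenation workaround. It is not KNIME code: the function names `ngrams` and `ngram_range` and the whitespace tokenisation are assumptions made purely for illustration.

```python
def ngrams(tokens, n):
    """Return all contiguous word n-grams of length n as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_range(tokens, n_min, n_max):
    """Concatenate the n-gram lists for every n from n_min to n_max inclusive."""
    result = []
    for n in range(n_min, n_max + 1):
        result.extend(ngrams(tokens, n))
    return result


# Hypothetical example sentence, split on whitespace.
tokens = "the quick brown fox".split()

# n_min=1 covers plain unigrams (a bag of words); n_max=2 adds the bigrams.
bag = ngram_range(tokens, 1, 2)
print(bag)
```

With `n_min=1` the result also contains the unigrams, which is the sense in which a more generic node would cover the bag of words case.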
Maybe NGram Creator (or Bag of Words) should be more generic, allowing N to take the value 1, thus corresponding to a bag of words or unigrams. TermCells would be the default output (instead of the string column).
The n-gram output is a string, not a term. The n-grams have not been tagged as terms internally in the documents, which means they cannot be counted with the TF node. You can use the same node to create a frequency output, which contains the frequencies of the n-grams found. This output can then be joined to the bag of words output.
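As a rough picture of that join, here is a small Python sketch. It is only an illustration under assumptions: the two toy documents, the choice of bigrams, the corpus-wide count, and the use of the n-gram string as the join key are all hypothetical and do not reproduce the node's actual output columns.

```python
from collections import Counter

# Two hypothetical documents, tokenised on whitespace.
docs = ["a b a b".split(), "a b c".split()]


def bigrams(tokens):
    """Return all contiguous word bigrams as space-joined strings."""
    return [" ".join(tokens[i:i + 2]) for i in range(len(tokens) - 1)]


# "Bag" table: one row per (document index, distinct n-gram).
bag = [(doc_id, g)
       for doc_id, toks in enumerate(docs)
       for g in sorted(set(bigrams(toks)))]

# "Frequency" table: corpus-wide count per n-gram string.
freq = Counter(g for toks in docs for g in bigrams(toks))

# Join the two tables on the n-gram string.
joined = [(doc_id, g, freq[g]) for doc_id, g in bag]
for row in joined:
    print(row)
```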
I see that this is not the most intuitive way. An alternative would be to tag the n-grams as terms, which could be done with a dictionary tagger. However, this would change the terms inside the documents, which is not always desirable.
Adding a "from"/"to" setting makes sense, to avoid using multiple nodes and concatenating their outputs. Creating a term column instead of strings would not solve the counting problem, since the n-grams are not tagged as terms inside the documents.
Thank you for your feedback!
I agree with you that generalising the n-gram creator node with a `from` and `to` range is probably the best compromise. I had indeed forgotten that an n-gram is not a term, in the sense that one term can be present in several n-grams. I assume tagging would not allow overlapping and would thus be problematic ...
As you can see in my post above, I've also noticed the presence of the two outputs for the n-gram creator node ;-)
Yes, good point about the overlapping. Tagging would not be sensible if there are overlapping n-grams in the dictionary: each new tag would overwrite the previously tagged terms.
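A toy Python model makes the overwrite problem concrete. It assumes that each token can carry only one tag at a time; this is a simplification for illustration, not a description of how the dictionary tagger is implemented, and the sentence and dictionary entries are made up.

```python
# Hypothetical sentence and a dictionary holding two overlapping bigrams.
tokens = ["new", "york", "city"]
dictionary = ["new york", "york city"]

# Assumption: one tag per token position, so a later match replaces an earlier one.
tag_of = {}
for entry in dictionary:
    words = entry.split()
    n = len(words)
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == words:
            for j in range(i, i + n):
                tag_of[j] = entry  # overwrites any tag set by a previous entry

print(tag_of)
```

After both entries are applied, the token "york" belongs only to "york city", so "new york" is no longer tagged in full and could not be counted correctly.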
I will put the "from" "to" option on the list. Thank you for pointing this out.