It seems that the Ngram Creator does not respect any upstream efforts to fuse together multiple tokens into a single term. This is not a big deal for multi-part terms such as place names, as typically I filter these out prior to creating ngrams anyway. But it is annoying if the tagged terms are contractions, such as "don’t" or "can’t". For some jobs, I like to keep these as single terms instead of breaking them up into separate parts like "do" and "nt". I want to see plain-English ngrams like "can’t wait" or "don’t run". But even if I tag the contracted terms upstream (using the wildcard tagger), the Ngram Creator splits them up again, instead giving me ngrams such as "ca nt" or "do nt". Perhaps this is intended behaviour, but for me it is very frustrating, and not how I would expect the Ngram Creator to behave. Would it be too much to ask for the Ngram Creator to at least provide the option to honour multi-part tags?
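
As a rough illustration of the behaviour being asked for (this is not how the Ngram Creator is implemented, just a plain-Python sketch), the snippet below rejoins split tokens that match a tagged multi-part term before building ngrams. The function names `merge_tagged_terms` and `ngrams`, and the `tagged` lookup table, are hypothetical and exist only for this example.

```python
from typing import Dict, List, Tuple


def merge_tagged_terms(tokens: List[str], tagged: Dict[Tuple[str, ...], str]) -> List[str]:
    """Rejoin adjacent tokens that together form a tagged multi-part term.

    `tagged` maps a tuple of token parts to the single term it should become,
    e.g. ("ca", "nt") -> "can't". This table is an assumption for the sketch.
    """
    max_len = max((len(parts) for parts in tagged), default=1)
    merged: List[str] = []
    i = 0
    while i < len(tokens):
        # Try the longest possible tagged term first, then shorter ones.
        for span in range(max_len, 0, -1):
            if i + span <= len(tokens) and tuple(tokens[i:i + span]) in tagged:
                merged.append(tagged[tuple(tokens[i:i + span])])
                i += span
                break
        else:
            # No tagged term starts here, so keep the token unchanged.
            merged.append(tokens[i])
            i += 1
    return merged


def ngrams(terms: List[str], n: int) -> List[str]:
    """Return all contiguous n-term sequences, joined by a space."""
    return [" ".join(terms[i:i + n]) for i in range(len(terms) - n + 1)]


tokens = ["I", "ca", "nt", "wait"]
tagged = {("ca", "nt"): "can't", ("do", "nt"): "don't"}

print(ngrams(tokens, 2))                              # ['I ca', 'ca nt', 'nt wait']
print(ngrams(merge_tagged_terms(tokens, tagged), 2))  # ["I can't", "can't wait"]
```

The first call shows the output I am getting now; the second shows what I would expect if multi-part tags were honoured.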

Ngram Creator splits tagged terms

izaychik63 May 16, 2021, 12:53pm 5

I can share the issues I’ve got