It seems that the Ngram Creator does not respect any upstream efforts to fuse together multiple tokens into a single term. This is not a big deal for multi-part terms such as place names, as typically I filter these out prior to creating ngrams anyway. But it is annoying if the tagged terms are contractions, such as don’t or can’t. For some jobs, I like to keep these as single terms instead of breaking them up into separate parts like do and nt. I want to see plain-English ngrams like can’t wait or don’t run. But even if I tag the contracted terms upstream (using the wildcard tagger), the Ngram Creator splits them up again, instead giving me ngrams such as ca nt or do nt.
Perhaps this is intended behaviour, but for me it is very frustrating, and not how I would expect the Ngram Creator to behave. Would it be too much to ask for the Ngram Creator to at least provide the option to honour multi-part tags?
Thanks @izaychik63, that node might indeed be useful in situations like this. Ideally I would like to extract ngrams from my tagged documents rather than from the raw strings, but I suppose I could use the string versions of the documents to find the ngrams, and then tag them in the tokenised documents.
So that might pass as a workaround. I’ll report back if it is successful. Still though, I’d love to see the Ngram Creator (and some other text processing nodes) have more flexibility in how they manage tagged terms!
Alas, it appears that the N-Gram Extractor node does not offer a solution to this issue!
I just gave it a try, and found that it splits the string don’t into not two but three separate tokens, so you’d need a whole 3-gram (don ’ t) to capture this one term!
I have managed to find the ngrams that I want by retokenising the documents before sending them to the Ngram Creator. That is, I extract the plain text from the pre-processed documents using the Document Data Extractor, then retokenise the documents with the Strings to Document node, this time selecting the Whitespace Tokenizer. The Ngram Creator then separates words at spaces only, so contractions such as don’t stay intact.
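For anyone who wants to see the idea outside of KNIME, here is a minimal Python sketch of what whitespace-only tokenisation followed by word ngrams amounts to. It is not how the KNIME nodes are implemented; the function names `whitespace_tokenize` and `word_ngrams` are made up for illustration.

```python
def whitespace_tokenize(text):
    """Split on whitespace only, so contractions stay in one piece."""
    return text.split()


def word_ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical example sentence, not taken from any real corpus.
tokens = whitespace_tokenize("I can’t wait so don’t run")
bigrams = word_ngrams(tokens, 2)
print(bigrams)
```

Because the apostrophe never acts as a split point here, bigrams such as can’t wait and don’t run come out whole.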
However, when I then tag these ngrams in my original documents, the Dictionary Tagger inserts a space into the contracted words. So don’t worry is tagged as do n’t worry. This behaviour reflects an issue that I reported about 8 months ago (at least for the Dictionary Replacer), but which evidently has not yet been fixed.
In the meantime, it looks like I can use the Dictionary Replacer to remove the unwanted spaces, as long as I configure it to use the Whitespace Tokenizer. This does make me nervous, however, as the documents are not tokenised this way. Is this mismatch likely to cause problems? I’ll report back if I find any.
The problem with the NGram Creator seems to be that you can only choose Word or Character as the unit for creating NGrams; terms are not available as an option. When you combine two (single-word) terms using the Dictionary Tagger, the new term will consist of two words. So in this case it does not really matter whether there is one term or several, as the NGram Creator will only look at words.
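To illustrate the difference between the two units, here is a rough Python sketch. It assumes a simplified model in which a term is just a list of one or more words; this is only an approximation of the explanation above, not KNIME's actual data structures, and `word_level_ngrams` and `term_level_ngrams` are hypothetical names.

```python
def sliding_windows(units, n):
    """Return every run of n consecutive units as a list."""
    return [units[i:i + n] for i in range(len(units) - n + 1)]


def word_level_ngrams(terms, n):
    """Ignore term boundaries: flatten to words first, then build ngrams."""
    words = [word for term in terms for word in term]
    return [" ".join(window) for window in sliding_windows(words, n)]


def term_level_ngrams(terms, n):
    """Treat each (possibly multi-word) term as one unit."""
    units = [" ".join(term) for term in terms]
    return [" ".join(window) for window in sliding_windows(units, n)]


# Assumed document: "New York" was tagged as one two-word term upstream.
document = [["New", "York"], ["is"], ["big"]]

print(word_level_ngrams(document, 2))  # the tagged term is split across ngrams
print(term_level_ngrams(document, 2))  # the tagged term stays together
```

In this toy model the word-level version produces York is as a bigram, while the term-level version keeps New York as a single unit, which is the behaviour a term-based option would presumably offer.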
That’s a useful explanation, thanks @julian.bunzel. I forget sometimes that words are still separate entities even when they have been tagged together, and much of my confusion around the behaviour of some nodes has probably stemmed from this. I don’t suppose a future version of the Ngram Creator could have a term-based option?
I hadn’t thought of using the Term Neighbourhood Extractor in this context. I’ll look into it, but it is probably a less efficient solution than I was aiming for.