Odd tokenisation and term replacement behaviour

I am using the Dict Replacer node to convert certain all-caps terms to lowercase versions. Some of the all-caps terms are tokenised such that they end in periods, e.g. “HELLO.” (I suspect this is a quirk of the OpenNLP English Word Tokenizer). When I change the capitalisation, I want to retain these trailing periods, as I do not want to change the text in any other way at this stage. But the Dict Replacer turns them into lowercase terms followed by a space AND a period, e.g. the above example becomes “hello .” instead of “hello.”.
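To make the expected behaviour concrete, here is a minimal Python sketch of the replacement I have in mind. It is not how the Dict Replacer works internally; the dictionary `REPLACEMENTS` and the helper `replace_term` are hypothetical names made up for illustration, and the set of punctuation characters treated as "trailing" is an assumption.

```python
import re

# Hypothetical dictionary mapping all-caps terms to their lowercase versions.
REPLACEMENTS = {"HELLO": "hello", "WORLD": "world"}

def replace_term(term, replacements):
    """Replace the core of a term and re-attach any trailing punctuation."""
    # Split the term into a core and a (possibly empty) run of trailing punctuation.
    match = re.fullmatch(r"(.*?)([.!?,;:]*)", term, flags=re.DOTALL)
    core, trailing = match.group(1), match.group(2)
    # Terms that are not in the dictionary are returned unchanged.
    return replacements.get(core, core) + trailing

terms = ["HELLO.", "WORLD", "Other."]
print([replace_term(t, REPLACEMENTS) for t in terms])
# ['hello.', 'world', 'Other.']
```

The point is only that “HELLO.” should come out as “hello.”, with no space inserted before the period.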

The trailing periods in tokenised terms only appear after terms in all caps, so I assume it’s just a quirk of the OpenNLP tokeniser. I can generate terms without the trailing periods in the first place by using the OpenNLP Simple Tokenizer, but in other respects that tokeniser is not sophisticated enough for my needs. I’d try the Stanford NLP Tokenizer, but for some reason it just gives me the error “Execute failed: String index out of range: -1”.

Other than searching for workarounds, is there anything I can do to make this work more smoothly? Is there any explicable reason why the Dict Replacer is inserting the space before the period? Is there a likely cause for the error I am getting with the StanfordNLP tokenizer?

Thanks!

Hi sugna,

Would it be possible for you to share the example workflow?

Thanks,
Vincenzo

Tokenisation_test.knwf (825.6 KB)
This is an excerpt from the workflow. It should be self-explanatory, but let me know if not.

Hi @sugna,

Thanks. I’ll look into it.

Cheers,
Vincenzo

Hey sugna,

do you know what kind of character encoding is used for your files? The StanfordNLP PTBTokenizer is currently only available for UTF-8 encoding and it seems that you are using something else because the string representation after applying the Document Data Extractor looks really, really weird (at least on Ubuntu). It seems that the tokenizer cannot handle that, but I will have a closer look.
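If you want to check the encoding outside of KNIME, a rough Python sketch like the following tests whether some bytes are valid UTF-8. The helper name `is_valid_utf8` is made up for this example. Note that passing the check does not prove the file was written as UTF-8 (plain ASCII passes too), but failing it shows the file uses some other encoding.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes can be decoded as UTF-8 without errors."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# The same text encoded two ways: only the first is valid UTF-8.
sample_utf8 = "café".encode("utf-8")
sample_latin1 = "café".encode("latin-1")

print(is_valid_utf8(sample_utf8))    # True
print(is_valid_utf8(sample_latin1))  # False
```

For a real file you would pass in its raw bytes, e.g. the result of opening it in binary mode and calling `read()`.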

The Dict Replacer problem seems to be a bug. I tried it with an example of my own, and the behavior you described also occurs for me. I will have a closer look at this as well and create a bug ticket if needed.

EDIT: The Dict Replacer problem is definitely a bug that has to be fixed. I will create a ticket for it. Thanks for the hint, @sugna.

Cheers,
Julian

Hey again,

I was able to identify the problem with the PTBTokenizer. It has nothing to do with encoding in this case, but with the properties of the PTBTokenizer itself. There are some normalization settings that we have set internally, but some don’t fit perfectly with our framework. For example, for the term “U.S.”, the tokenizer creates two terms, “U.S.” and “.”, but since the term “.” cannot be found in the sentence after the term “U.S.”, the framework throws an exception. I will create a ticket and see whether the PTBTokenizer can be configured to fit our tokenization framework better.
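To illustrate the kind of failure, here is a simplified Python sketch. It is not our actual framework code; the function `align_tokens` and the example sentence are invented for this post. It assumes the tokens are located in the sentence one after another, so an extra “.” token that has no text left to match cannot be found.

```python
def align_tokens(sentence, tokens):
    """Locate each token in the sentence, searching left to right."""
    spans = []
    cursor = 0
    for token in tokens:
        # Search only in the part of the sentence not yet consumed.
        start = sentence.find(token, cursor)
        if start == -1:
            raise ValueError(f"token {token!r} not found after offset {cursor}")
        spans.append((start, start + len(token)))
        cursor = start + len(token)
    return spans

sentence = "He lives in the U.S."

# Tokens that match the text exactly can be aligned.
print(align_tokens(sentence, ["He", "lives", "in", "the", "U.S."]))
# [(0, 2), (3, 8), (9, 11), (12, 15), (16, 20)]

# An extra "." after "U.S." has nothing left to match, so alignment fails.
try:
    align_tokens(sentence, ["He", "lives", "in", "the", "U.S.", "."])
except ValueError as err:
    print(err)
```

In Java, a failed `indexOf` search returns -1, and using that value as a string index would plausibly produce a message like “String index out of range: -1”, although that last step is an inference rather than something taken from the stack trace.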

Thanks again @sugna for your example workflow, it is proving really helpful for discovering some flaws.

Cheers,

Julian

Thanks Julian, I’m glad it helped, and even gladder to know that I’m not going crazy :)

Regarding the encoding, I’m not actually sure what encoding I used to load that data, as the encoding setting in the CSV reader is simply set to ‘default’, as was the setting in the CSV writer that I used earlier to create the dataset. Can you tell me how to determine what the default setting is?
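For what it’s worth, if ‘default’ simply means the operating system’s default encoding (that is an assumption on my part, I have not confirmed it for the CSV nodes), then something like this Python sketch would show what that encoding is on a given machine. The helper name `platform_default_encoding` is made up for illustration.

```python
import codecs
import locale

def platform_default_encoding() -> str:
    """Return the encoding the operating system prefers for text data."""
    # False means: do not change the process locale, just report the preference.
    return locale.getpreferredencoding(False)

name = platform_default_encoding()
# codecs.lookup() normalises aliases to a canonical codec name.
print(name, "->", codecs.lookup(name).name)
```

The result depends on the machine, which would explain why the same file can look fine on one system and garbled on another.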

Dict-replacer-test.knwf (574.2 KB)
While you’re looking at the Dict Replacer, here’s another possible bug. See the attached workflow. When I try to run the replacements on all five documents, the Dict Replacer crashes. When I remove the term “)” or the two documents containing that term, it works.

This is an error that I can work around, and perhaps one that is only likely to occur if you are working with data as messy as mine. But I thought you might want to look at it anyhow.
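One guess at the cause, and it is only a guess since I have not seen the node’s code: if the dictionary terms are interpreted as regular expressions, a lone “)” is not a valid pattern and would need escaping. The Python sketch below shows that general effect; `build_pattern` is a hypothetical helper, not anything from KNIME.

```python
import re

def build_pattern(term: str, escape: bool) -> re.Pattern:
    """Compile a search pattern for a term, optionally escaping regex metacharacters."""
    source = re.escape(term) if escape else term
    return re.compile(source)

# Used as a raw regular expression, ")" is rejected by the regex compiler.
try:
    build_pattern(")", escape=False)
except re.error as err:
    print("unescaped:", err)

# Escaped, it is treated as a literal character and can be replaced normally.
print(build_pattern(")", escape=True).sub("]", "(messy) text"))
# (messy] text
```

If that is what is happening, it would also explain why removing the term “)” makes the crash go away.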

(Oh, btw, the workflow contains a metanode for shrinking the file size of the table containing documents. I am finding that the document table often remains huge even after most rows have been filtered out. It only shrinks to a normal size when I convert the documents to strings and back again. The table in this workflow was 18 MB until I ran this process, which reduced it to just 14 KB. Is this a known issue?)