Wildcard Tagger Error

#1

Hi,
When I try to tag a corpus with the following settings, I keep getting these error messages:

ERROR Wildcard Tagger 2:35 Execute failed: String index out of range: 33
ERROR Wildcard Tagger 2:35 Execute failed: String index out of range: 72
ERROR Wildcard Tagger 2:35 Execute failed: String index out of range: 63
ERROR Wildcard Tagger 2:35 Execute failed: String index out of range: 63
ERROR Wildcard Tagger 2:35 Execute failed: String index out of range: 72
image

My corpus is about 1M documents. When I run the same workflow on 100K rows, it works perfectly. What could be the issue here?

0 Likes

#2

Hey @caceter,

I don’t think it’s a problem related to the number of documents. I guess there might be a document that can’t be processed by the tagger. This can (but shouldn’t) happen if there are encoding issues or special characters.

It would be quite helpful if you could identify the document, or the subset of documents, that leads to this error by tagging only part of the corpus at a time: maybe 100k rows, then another 100k rows, and so on. If you can find the document, I can have a closer look to see what the actual issue is.
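The narrowing-down described above can be sketched outside KNIME as a simple bisection. This is only an illustration under assumptions: `process_batch` is a hypothetical stand-in for "run the tagger on these rows" (here replaced by a fake that fails on a NUL character), and each failure is assumed to be caused by a single document on its own.

```python
from typing import Callable, List, Sequence


def find_failing_docs(
    docs: Sequence[str],
    process_batch: Callable[[Sequence[str]], None],
    offset: int = 0,
) -> List[int]:
    """Return the indices of documents that make process_batch raise.

    The batch is split in half repeatedly until each failing batch
    contains exactly one document.
    """
    if not docs:
        return []
    try:
        process_batch(docs)
        return []  # the whole batch went through, nothing to report
    except Exception:
        if len(docs) == 1:
            return [offset]  # narrowed down to a single document
        mid = len(docs) // 2
        left = find_failing_docs(docs[:mid], process_batch, offset)
        right = find_failing_docs(docs[mid:], process_batch, offset + mid)
        return left + right


def fake_tagger(batch: Sequence[str]) -> None:
    """Hypothetical stand-in for the real tagger: fails on a NUL character."""
    for doc in batch:
        if "\u0000" in doc:
            raise IndexError("String index out of range")


corpus = ["some clean text"] * 10
corpus[3] = "broken\u0000text"
corpus[7] = "also\u0000broken"
failing = find_failing_docs(corpus, fake_tagger)
```

In a workflow the same idea can be applied by hand: split the table into halves, rerun the node on the failing half, and repeat until only a few rows remain.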

Cheers,

Julian

0 Likes

#3

Try running something like this in the String Manipulation node over the String version of your documents:

regexReplace($DocString$,"\p{C}"," " )

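Convert the result to Document and try again. This replaces invisible control, format, and other non-printable characters (Unicode category C) with a space; ordinary non-ASCII letters such as accented characters are left alone. Solved some problems for me.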
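For checking the same cleanup outside KNIME, here is a rough Python equivalent. It is a sketch, not the node's implementation: Python's built-in `re` module does not support `\p{C}`, so it tests each character's Unicode general category with `unicodedata` instead. The function name `replace_other_chars` is made up for this example, and results for unassigned code points may differ slightly from Java because the two runtimes can ship different Unicode versions.

```python
import unicodedata


def replace_other_chars(text: str, replacement: str = " ") -> str:
    """Replace every character whose Unicode general category starts with
    "C" (control, format, surrogate, private use, unassigned)."""
    return "".join(
        replacement if unicodedata.category(ch).startswith("C") else ch
        for ch in text
    )


# NUL is a control character (Cc), U+200B ZERO WIDTH SPACE is a format
# character (Cf); the accented letter is a normal letter and is kept.
cleaned = replace_other_chars("caf\u00e9\u0000 au\u200blait")
```

Note that tabs and line breaks are control characters too, so they are also replaced by a space.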

2 Likes