StanfordNLP NE Learner crash

#1

Hi,

I’m getting the following error when using the StanfordNLP NE Learner:

Execute failed: Got NaN for prob in CRFLogConditionalObjectiveFunction.calculate() - this may well indicate numeric underflow due to overly long documents.

I’m using a slightly modified version of the ‘NER Tagger Model Training’ example to change the input and the dictionary.

Any idea what represents an ‘overly long document’ in this context?

I will try removing the longer documents and trying again but it would help to know if there is a hard limit on document length and what it is.

Cheers,
Andrew

Edit:

Here is another one:

ERROR StanfordNLP NE Learner 2:8 Execute failed: Argument array lengths differ: [class edu.stanford.nlp.ling.CoreAnnotations$TextAnnotation, class edu.stanford.nlp.ling.CoreAnnotations$AnswerAnnotation] vs. [Red]

What I’m trying to do is create a model to extract names from technical documents, so the language is not particularly natural. I’ve extracted a bunch of names using the out-of-the-box Stanford taggers, and now I’m looping these back to train a model on the source documents to see if I can improve the yield.

Edit 2:

The second error looks like it could be triggered by non-ASCII characters in the input documents. Running a regex replace to convert them to a single space seems to resolve the issue.
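
For illustration only, a minimal Python sketch of that kind of replacement. The exact expression used in the workflow is not given above, so the pattern and the function name `replace_non_ascii` are assumptions; here each non-ASCII character becomes one space.

```python
import re

# Assumed pattern: anything outside the 7-bit ASCII range.
NON_ASCII = re.compile(r"[^\x00-\x7F]")


def replace_non_ascii(text: str) -> str:
    """Replace every non-ASCII character with a single space."""
    return NON_ASCII.sub(" ", text)


if __name__ == "__main__":
    # \u00fc is a precomposed "u with diaeresis"
    print(replace_non_ascii("M\u00fcller"))
```

Note that this also discards legitimate accented letters in names, so it is a workaround rather than a fix.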

The first error seems to be driven by content rather than document length. Restricting length does not reliably remove the issue.


#2

Hey Andrew,

We discovered the same problem (the first error) for some documents. The document length does not seem to be the issue, although it’s mentioned in the error message. The problem also occurred with quite short documents.

Do you have any example documents that would help us investigate the issue further?
For both errors, if possible.

Thank you,

Julian


#3

Hi Julian,

I’m on the road so it will be the middle of next week before I can do anything about this. I’ll certainly see if I can isolate documents that trigger the problem.

Does the document length reported in the log belong to the document causing the issue? Knowing this would help me isolate candidates.

cheers,
Andrew


#4

Hey Andrew,

Thank you. That would be really helpful, but don’t spend too much time on this, since the problem might not be reliably reproducible. It does recur, but with different documents in the same workflow. So far, I couldn’t isolate a document where the problem always occurs. Maybe you will have better luck.
However, the document and its length are not reported in the log (at least not for the issues I have).

Cheers,

Julian


#5

Hey again,

I had some time to take a deeper look at the first issue and found the cause. The training file that is created from the documents is missing the line breaks between documents. So right now, all documents are treated as one big document, which causes the problem.
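
To make the difference concrete, here is a rough Python sketch of a training file with a separator between documents. It assumes the usual Stanford column layout of one token and its label per line, tab-separated, with a blank line between documents. The node's actual file writer is not shown in this thread, and the name `to_training_text` is hypothetical.

```python
from typing import Iterable, List, Tuple

Token = Tuple[str, str]  # (word, label), e.g. ("Andrew", "PERSON")


def to_training_text(documents: Iterable[List[Token]]) -> str:
    """Serialize documents as token<TAB>label lines, one blank line between documents."""
    blocks = []
    for doc in documents:
        if not doc:
            continue  # skip empty documents so no stray blank lines are written
        blocks.append("\n".join(f"{word}\t{label}" for word, label in doc))
    # Joining with "\n\n" inserts the blank separator line; joining with "\n"
    # instead would merge everything into one long document.
    return "\n\n".join(blocks) + "\n"


if __name__ == "__main__":
    docs = [
        [("Andrew", "PERSON"), ("wrote", "O")],
        [("Julian", "PERSON"), ("replied", "O")],
    ]
    print(to_training_text(docs), end="")
```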

Thanks for bringing this up again, it will be fixed in the next release.

An example document for the second problem would still be highly appreciated.

Cheers,

Julian
