Bug: Strings To Document crashes

The error is as follows:

WARN      TermDocumentDeSerializationUtil     Serialization error: Term could not be serialized!
WARN      TermDocumentDeSerializationUtil     Serialization error: Document could not be serialized!
ERROR     DocumentBufferedFileStoreDataCellFactory     Could not create DocumentBufferedFileStoreCell for document: [...]
WARN      RearrangeColumnsTable$ConcurrentNewColCalculator     Unhandled exception in processFinished
ERROR     Strings To Document                Execute failed: java.util.concurrent.ExecutionException: java.lang.NullPointerException: Cell at index 0 is null!

which is... not very helpful. From the circumstances I can only deduce that something must have gone wrong while parsing the content, as all other fields are unchanged from several thousand working iterations. The content strings come from xml processors, and as these crash easily, the strings should be quite "clean". I can't see anything obvious like missing values, either. It might be something unexpected like an overly long sentence producing an overflow, as I strip a lot of punctuation in preprocessing (not the ideal usage of this node, but it gives me a little bit more control).

I can live with one document less, so it's not a major problem for me, but this looks like something bad. I don't know if the initial error is justified, but if yes, the handling might be improved. Maybe give us missing values?

Hi,

can you reproduce the error, or is it possible to send me the data set you are working with, so that I can try to reproduce it?

The full stack trace would be really helpful here. You can adjust the log level in File->Preferences->KNIME->KNIME GUI. If you set the log level to info or debug, the stack trace will be shown in the console. Alternatively, the stack trace should be in the knime.log file. You can find that file in your knime workspace directory/.metadata/knime/knime.log

Cheers, Kilian

I fiddled around a little bit and I think I narrowed down the edge case I ran into. Here's the string that produces errors in my case, along with the log. At first glance, the problem seems obvious, but I couldn't replicate the error with synthetic strings yet. I guess my synthetic test cases are optimized internally, but the bad string somehow is not.

Btw, the string looks this way because it's extracted from a (badly designed) website. I'm not sure how often extreme cases like this happen, but I've run into a lot of excessive whitespace in this project. I don't know if you consider websites as "proper" documents and to what extent this possibility was considered while working on the text processing nodes, so this is just a hint in case I found a gap in the underlying assumptions.
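
For anyone who wants to rule out the whitespace before the node sees the text, here is a minimal sketch of a cleanup step in plain Java. The class and method names (`WhitespaceCleanup`, `normalize`) are made up for illustration, and it is only an assumption that collapsing whitespace helps with this particular error; it has not been confirmed as a fix.

```java
public class WhitespaceCleanup {

    /**
     * Collapses every run of whitespace into a single space and trims the ends.
     * Non-breaking spaces (U+00A0) are common in web pages and are not matched
     * by \s in Java's default regex mode, so they are replaced first.
     */
    public static String normalize(String raw) {
        if (raw == null) {
            return "";
        }
        return raw.replace('\u00A0', ' ')
                  .replaceAll("\\s+", " ")
                  .trim();
    }

    public static void main(String[] args) {
        // Hypothetical input resembling text scraped from a web page.
        String scraped = "  Hello \n\n\t world\u00A0 ";
        System.out.println("[" + normalize(scraped) + "]");
    }
}
```

The same logic could be applied to the content column before it is passed to Strings To Document, for example in a Java-based preprocessing step.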

Hi,

Is there a limit on the size of the cells on which we can perform a "Strings To Document"?

I get this message with a "big" document:

ERROR     DocumentBufferedFileStoreDataCellFactory     Could not create DocumentBufferedFileStoreCell 

Thanks a lot for your help!!

all the best 

Marie-Luce Viaud

Thanks for the log and the .table file. I can reproduce the problem with that data. The data seems really "dirty" and needs some cleanup before creating documents. However, the node should be able to handle that. I will check what is happening here.

Marie-Luce Viaud, how many rows do you have and want to convert to documents?

Cheers, Kilian

Hi,

I am using the "Strings to Document" node when reading an ARFF file, so that I can do text processing. I have noticed that when the ARFF file contains even one document of more than approximately 8.5K words, "Strings to Document" fails to execute with the following message:

WARN      TermDocumentDeSerializationUtil     Serialization error: Document could not be serialized!
ERROR     DocumentBufferedFileStoreDataCellFactory     Could not create DocumentBufferedFileStoreCell for document: 9fe42ecb-1939-447b-9fd9-9342bc6d3525
WARN      RearrangeColumnsTable$ConcurrentNewColCalculator     Unhandled exception in processFinished
ERROR     Strings To Document                Execute failed: java.util.concurrent.ExecutionException: java.lang.NullPointerException: Cell at index 0 is null!

I have increased the heap space and I have tried all memory policies in the "Strings to Document" node, but nothing seems to work. Any help would be highly appreciated.

Kind regards,

Niki.

 

Hi pavlopoulo,

how long is the text in the column you are using as the title column in the Strings To Document node? Document titles have an upper limit (64 KB), which is due to Java restrictions. I guess that you are using a title column with strings that are too long. You can try using the RowID or similar as the title.
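
To check whether a title string would hit such a limit, here is a minimal sketch. It assumes the restriction is the one in Java's `DataOutputStream.writeUTF`, which stores the encoded length in two bytes and throws a `UTFDataFormatException` when a string needs more than 65535 bytes; whether the node's serializer uses exactly this call is an assumption. The class and method names (`TitleLengthCheck`, `modifiedUtf8Length`, `fitsWriteUtf`) are made up for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UTFDataFormatException;
import java.util.Arrays;

public class TitleLengthCheck {

    // writeUTF writes a 2-byte length prefix, so at most 65535 encoded bytes fit.
    public static final int MAX_UTF_BYTES = 65535;

    /** Number of bytes writeUTF needs for s (modified UTF-8, counted per char). */
    public static long modifiedUtf8Length(String s) {
        long bytes = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c >= 0x0001 && c <= 0x007F) {
                bytes += 1;          // plain ASCII except U+0000
            } else if (c <= 0x07FF) {
                bytes += 2;          // U+0000 and U+0080..U+07FF
            } else {
                bytes += 3;          // everything else, including surrogates
            }
        }
        return bytes;
    }

    /** True if writeUTF would accept the string. */
    public static boolean fitsWriteUtf(String s) {
        return modifiedUtf8Length(s) <= MAX_UTF_BYTES;
    }

    public static void main(String[] args) throws IOException {
        char[] buf = new char[70_000];
        Arrays.fill(buf, 'x');
        String longTitle = new String(buf);   // stands in for a full text used as title

        System.out.println("Row0 fits: " + fitsWriteUtf("Row0"));
        System.out.println("long title fits: " + fitsWriteUtf(longTitle));

        // Demonstrates the underlying Java behaviour directly.
        try (DataOutputStream out = new DataOutputStream(new ByteArrayOutputStream())) {
            out.writeUTF(longTitle);
        } catch (UTFDataFormatException e) {
            System.out.println("writeUTF rejected it: " + e.getMessage());
        }
    }
}
```

Note that the limit counts encoded bytes, not characters, so text with many non-ASCII characters reaches it with fewer characters than plain ASCII text.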

Cheers, Kilian

Hi Kilian,

Thank you for your answer. I followed your advice and it is all sorted now.

Kind regards,

Niki.

Hi,

I keep getting the "Execute failed: Cell at index 0 is null!" error on the console when using the Strings To Document node. I got the ‘encoded string too long’ message. How can I avoid this? log copy 2.xml (18.2 KB)

Hi there, I am encountering the same issue again and again while using the Strings To Document node. Is there any way to overcome this?
I am using a Tika Parser before this to read a series of PDF documents.

Hi @Saivinod,

can you provide the error message and/or an example workflow to reproduce the issue?
Which version of KNIME are you using?

Cheers,
Julian

Hi,

I am also facing the same issue while using the ‘Strings To Document’ node. I need to convert large PDF files, and I am using the Tika Parser to do so before executing the ‘Strings To Document’ node.
Below is the error message:
WARN Strings To Document 0:242:294 Serialization error: Document could not be serialized!
ERROR Strings To Document 0:242:294 Could not store document in cell: values
ERROR Strings To Document 0:242:294 Execution failed in Try-Catch block: Cell at index 0 is null!
Could anyone please help me here?

Hi @SaranTvivek,

is it possible for you to share the PDF that is causing these issues?
Either directly here in the forum or via PM if needed.

Best,
Julian

Hi @julian.bunzel Thank you for the response.
This issue is fixed now. :slightly_smiling_face:

Hello @SaranTvivek
Could you please share your solution? I am having a similar issue with the table of documents.