Bug: Strings To Document crashes

Marlin · January 7, 2015, 11:25am

The error is as follows:

WARN      TermDocumentDeSerializationUtil     Serialization error: Term could not be serialized!
WARN      TermDocumentDeSerializationUtil     Serialization error: Document could not be serialized!
ERROR     DocumentBufferedFileStoreDataCellFactory     Could not create DocumentBufferedFileStoreCell for document: [...]
WARN      RearrangeColumnsTable$ConcurrentNewColCalculator     Unhandled exception in processFinished
ERROR     Strings To Document                Execute failed: java.util.concurrent.ExecutionException: java.lang.NullPointerException: Cell at index 0 is null!

which is... not very helpful. From the circumstances I can only deduce that something must have gone wrong while parsing the content, as all other fields are unchanged from several thousand working iterations. The content strings come from xml processors, and as these crash easily, the strings should be quite "clean". I can't see anything obvious like missing values, either. It might be something unexpected like an overly long sentence producing an overflow, as I strip a lot of punctuation in preprocessing (not the ideal usage of this node, but it gives me a little bit more control).

I can live with one document less, so it's not a major problem for me, but this looks like something bad. I don't know if the initial error is justified, but if yes, the handling might be improved. Maybe give us missing values?

kilian.thiel · January 7, 2015, 4:42pm

Hi,

can you reproduce the error, or is it possible to send me the data set you are working with so that I can try to reproduce it?

The full stack trace would be really helpful here. You can adjust the log level in File->Preferences->KNIME->KNIME GUI. If you set the log level to info or debug, the stack trace will be shown in the console. Alternatively, the stack trace should be in the knime.log file. You can find that file in your knime workspace directory/.metadata/knime/knime.log

Cheers, Kilian

Marlin · January 8, 2015, 8:40am

I fiddled around a little bit and I think I narrowed down the edge case I ran into. Here's the string that produces errors in my case, along with the log. At first glance, the problem seems obvious, but I couldn't replicate the error with synthetic strings yet. I guess my synthetic test cases are optimized internally, but the bad string somehow is not.

Btw, the string looks this way because it's extracted from a (badly designed) website. I'm not sure how often extreme cases like this happen, but I've run into a lot of excessive whitespace in this project. I don't know if you consider websites as "proper" documents and to what extent this possibility was considered while working on the text processing nodes, so this is just a hint in case I found a gap in the underlying assumptions.
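For excessive whitespace of the kind described above, a minimal Java sketch of a cleanup step is shown here. It is only an illustration: the class and method names are hypothetical, and nothing in this thread confirms that collapsing whitespace avoids the serialization error. It treats no-break spaces (common in text scraped from websites) as whitespace too.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of KNIME: collapses whitespace in scraped text.
class WhitespaceCleanupSketch {

    // With UNICODE_CHARACTER_CLASS, \s also matches Unicode whitespace
    // such as the no-break space U+00A0.
    private static final Pattern WHITESPACE_RUN =
            Pattern.compile("\\s+", Pattern.UNICODE_CHARACTER_CLASS);

    // Replaces every run of whitespace with one space and trims both ends.
    static String collapseWhitespace(String raw) {
        return WHITESPACE_RUN.matcher(raw).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        String scraped = "  Title \n\n\t  of   the\u00A0\u00A0page  ";
        System.out.println("[" + collapseWhitespace(scraped) + "]"); // prints [Title of the page]
    }
}
```

Inside a workflow, the same idea could be applied to the content column before it reaches the Strings To Document node, for example with a string manipulation step.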

mlviaud · January 8, 2015, 4:43pm

Hi,

Is there a limit on the size of the cells on which we can perform a "Strings To Document"?

I get this message with a "big" document:

ERROR DocumentBufferedFileStoreDataCellFactory Could not create DocumentBufferedFileStoreCell

Thanks a lot for your help!!

all the best

Marie-Luce Viaud

kilian.thiel · January 12, 2015, 7:03pm

Thanks for the log and the .table file. I can reproduce the problem with that data. The data seems really "dirty" and needs some cleanup before creating documents. However, the node should be able to handle that. I will check what is happening here.

Marie-Luce Viaud, how many rows do you have and want to convert to documents?

Cheers, Kilian

pavlopoulo · February 25, 2015, 1:11pm

Hi,

I am using the "Strings to Document" node when reading an ARFF file, so that I can do text processing. I have noticed that when the ARFF file contains even one document of more than roughly 8.5K words, "Strings to Document" fails to execute with the following message:

WARN    TermDocumentDeSerializationUtil   Serialization error: Document could not be serialized!
ERROR   DocumentBufferedFileStoreDataCellFactory   Could not create DocumentBufferedFileStoreCell for document: 9fe42ecb-1939-447b-9fd9-9342bc6d3525
WARN    RearrangeColumnsTable$ConcurrentNewColCalculator   Unhandled exception in processFinished
ERROR   Strings To Document              Execute failed: java.util.concurrent.ExecutionException: java.lang.NullPointerException: Cell at index 0 is null!

I have increased the heap space and tried all memory policies in the "Strings to Document" node, but nothing seems to work. Any help would be highly appreciated.

Kind regards,

Niki.

kilian.thiel · March 2, 2015, 5:10pm

Hi pavlopoulo,

how long is the text in the column you are using as the title column in the Strings To Document node? Document titles have an upper limit (64 KB), which is due to Java restrictions. I guess that you are using a title column with strings that are too long. You can try to use a RowId or similar as the title.
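To illustrate the kind of Java restriction meant here, the following is a minimal sketch, assuming the title is written with java.io.DataOutputStream.writeUTF (an assumption; the thread does not show the node's serialization code). writeUTF rejects any string whose modified UTF-8 encoding exceeds 65535 bytes and throws a UTFDataFormatException. The class and helper names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UTFDataFormatException;
import java.util.Arrays;

// Hypothetical demo class, not part of KNIME.
class TitleLimitSketch {

    // Builds a string of n copies of c.
    static String repeat(char c, int n) {
        char[] chars = new char[n];
        Arrays.fill(chars, c);
        return new String(chars);
    }

    // True if writeUTF accepts s, i.e. its modified UTF-8 encoding
    // is at most 65535 bytes.
    static boolean fitsWriteUtf(String s) {
        try (DataOutputStream out = new DataOutputStream(new ByteArrayOutputStream())) {
            out.writeUTF(s);
            return true;
        } catch (UTFDataFormatException e) {
            return false; // encoded form is longer than 65535 bytes
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    // Hypothetical helper: keep the title if it fits, otherwise fall back
    // to a short identifier such as the row ID.
    static String safeTitle(String title, String rowId) {
        return fitsWriteUtf(title) ? title : rowId;
    }

    public static void main(String[] args) {
        System.out.println(fitsWriteUtf(repeat('a', 65_535)));      // prints true
        System.out.println(fitsWriteUtf(repeat('a', 65_536)));      // prints false
        System.out.println(safeTitle(repeat('a', 70_000), "Row0")); // prints Row0
    }
}
```

Note that the limit counts encoded bytes, not characters, so text with many non-ASCII characters reaches it sooner.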

Cheers, Kilian

pavlopoulo · March 4, 2015, 12:17pm

Hi Kilian,

Thank you for your answer. I followed your advice and it is all sorted now.

Kind regards,

Niki.

alkopop79 · May 25, 2018, 7:44pm

Hi,

I keep getting the "Execute failed: Cell at index 0 is null!" error in the console when using the Strings To Document node. I also got the ‘encoded string too long’ message. How can I avoid this? log copy 2.xml (18.2 KB)

Saivinod · March 24, 2021, 5:59pm

Hi there, I am encountering the same issue again and again while using the Strings To Document node. Is there any way to overcome this?
I am using a Tika Parser before this to read a series of PDF documents.

julian.bunzel · March 25, 2021, 11:06am

Hi @Saivinod,

can you provide the error message and/or an example workflow to reproduce the issue?
Which version of KNIME are you using?

Cheers,
Julian

SaranTvivek · April 21, 2021, 1:24pm

Hi,

I am facing the same issue while using the ‘strings to document’ node. I need to convert large PDF files, and I am using a Tika Parser to do so before executing the ‘strings to document’ node.
Below is the error message:
WARN Strings To Document 0:242:294 Serialization error: Document could not be serialized!
ERROR Strings To Document 0:242:294 Could not store document in cell: values
ERROR Strings To Document 0:242:294 Execution failed in Try-Catch block: Cell at index 0 is null!
Could anyone please help me here?

julian.bunzel · April 22, 2021, 8:26am

Hi @SaranTvivek,

is it possible for you to share the pdf that is causing these issues?
Either directly here in the forum or via PM if needed.

Best,
Julian

SaranTvivek · April 22, 2021, 8:30am

Hi @julian.bunzel Thank you for the response.
This issue is fixed now.

Artem · May 2, 2021, 9:12pm

Hello @SaranTvivek
Could you please share your solution? I am having a similar issue with the table of documents.
