WARN TermDocumentDeSerializationUtil Serialization error: Term could not be serialized!
WARN TermDocumentDeSerializationUtil Serialization error: Document could not be serialized!
ERROR DocumentBufferedFileStoreDataCellFactory Could not create DocumentBufferedFileStoreCell for document: [...]
WARN RearrangeColumnsTable$ConcurrentNewColCalculator Unhandled exception in processFinished
ERROR Strings To Document Execute failed: java.util.concurrent.ExecutionException: java.lang.NullPointerException: Cell at index 0 is null!
which is... not very helpful. From the circumstances I can only deduce that something must have gone wrong while parsing the content, as all other fields are unchanged from several thousand working iterations. The content strings come from xml processors, and as these crash easily, the strings should be quite "clean". I can't see anything obvious like missing values, either. It might be something unexpected like an overly long sentence producing an overflow, as I strip a lot of punctuation in preprocessing (not the ideal usage of this node, but it gives me a little bit more control).
I can live with one document less, so it's not a major problem for me, but this looks like something bad. I don't know if the initial error is justified, but if yes, the handling might be improved. Maybe give us missing values?
Can you reproduce the error, or is it possible to send me the data set you are working with, so that I can try to reproduce it?
The full stack trace would be really helpful here. You can adjust the log level in File->Preferences->KNIME->KNIME GUI. If you set the log level to info or debug, the stack trace will be shown in the console. Alternatively, the stack trace should be in the knime.log file. You can find that file in your knime workspace directory/.metadata/knime/knime.log
I fiddled around a little bit and I think I narrowed down the edge case I ran into. Here's the string that produces errors in my case, along with the log. At first glance, the problem seems obvious, but I couldn't replicate the error with synthetic strings yet. I guess my synthetic test cases are optimized internally, but the bad string somehow is not.
Btw, the string looks this way because it's extracted from a (badly designed) website. I'm not sure how often extreme cases like this happen, but I've run into a lot of excessive whitespace in this project. I don't know if you consider websites as "proper" documents and to what extent this possibility was considered while working on the text processing nodes, so this is just a hint in case I found a gap in the underlying assumptions.
Thanks for the log and the .table file. I can reproduce the problem with that data. The data seems really "dirty" and needs some cleanup before creating documents. However, the node should be able to handle that. I will check what is happening here.
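For anyone who wants to do that cleanup before the Strings to Document node, here is a minimal sketch of collapsing excessive whitespace in plain Java (for example inside a Java Snippet node). The class and method names are made up for illustration, and it is only an assumption that whitespace is what trips up the node in this case, so treat it as a preprocessing idea rather than a confirmed fix.

```java
// Hypothetical helper, not part of KNIME: collapses runs of whitespace
// (spaces, tabs, line breaks, non-breaking spaces) into single spaces.
class WhitespaceCleanup {

    static String collapseWhitespace(String raw) {
        if (raw == null) {
            return null; // leave missing values to the caller
        }
        return raw
            .replace('\u00A0', ' ')   // non-breaking space is not matched by \s
            .replaceAll("\\s+", " ")  // any run of whitespace becomes one space
            .trim();                  // drop leading and trailing whitespace
    }

    public static void main(String[] args) {
        String scraped = "  Some   title \n\n\n\t text\u00A0\u00A0from a   website  ";
        System.out.println("[" + collapseWhitespace(scraped) + "]");
    }
}
```

Note that this also removes line breaks, so it is only suitable if paragraph structure does not matter for the later text processing steps.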
Marie-Luce Viaud, how many rows do you have and want to convert to documents?
I am using the "Strings to Document" node when reading an ARFF file, so that I can do text processing. I have noticed that when the ARFF file contains even one document of more than roughly 8.5K words, "Strings to Document" fails to execute with the following message:
WARN TermDocumentDeSerializationUtil Serialization error: Document could not be serialized!
ERROR DocumentBufferedFileStoreDataCellFactory Could not create DocumentBufferedFileStoreCell for document: 9fe42ecb-1939-447b-9fd9-9342bc6d3525
WARN RearrangeColumnsTable$ConcurrentNewColCalculator Unhandled exception in processFinished
ERROR Strings To Document Execute failed: java.util.concurrent.ExecutionException: java.lang.NullPointerException: Cell at index 0 is null!
I have increased the heap space and I have tried all memory policies in "Strings to Document" node, but nothing seems to work. Any help would be highly appreciated.
How long is the text in the column you are using as the title column in the Strings to Document node? Document titles have an upper limit (64 KB), which is due to Java restrictions. I guess that you are using a title column with strings that are too long. You can try to use a RowID or similar as the title.
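To illustrate the kind of Java restriction meant here: `java.io.DataOutputStream.writeUTF` rejects strings whose encoded form exceeds 65535 bytes and throws a `UTFDataFormatException` whose message starts with "encoded string too long". Whether the node serializes titles through exactly this call is an assumption on my side, but the limit and the message match. The sketch below shows the limit in plain Java; the class name and the `safeTitle` helper are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.io.UTFDataFormatException;
import java.util.Arrays;

// Illustration only; none of these names come from KNIME.
class TitleLengthCheck {

    /** Returns true if writeUTF accepts the string, false if it is too long. */
    static boolean fitsInWriteUtf(String s) {
        try (DataOutputStream out = new DataOutputStream(new ByteArrayOutputStream())) {
            out.writeUTF(s); // length is stored in 2 bytes, so at most 65535 encoded bytes
            return true;
        } catch (UTFDataFormatException e) {
            return false;    // message starts with "encoded string too long"
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Hypothetical workaround: cut a title to at most maxChars characters. */
    static String safeTitle(String title, int maxChars) {
        // Note: a plain substring can split a surrogate pair at the cut point.
        return title.length() <= maxChars ? title : title.substring(0, maxChars);
    }

    /** Builds a string of n copies of c (works on older Java versions too). */
    static String repeat(char c, int n) {
        char[] chars = new char[n];
        Arrays.fill(chars, c);
        return new String(chars);
    }

    public static void main(String[] args) {
        String longTitle = repeat('a', 70000);
        System.out.println(fitsInWriteUtf(repeat('a', 1000)));        // true
        System.out.println(fitsInWriteUtf(longTitle));                // false
        System.out.println(fitsInWriteUtf(safeTitle(longTitle, 200))); // true
    }
}
```

The limit counts encoded bytes, not characters, so text with many non-ASCII characters hits it earlier than 65535 characters.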
I keep getting the "Execute failed: Cell at index 0 is null!" error on the console when using the Strings to Document node. I also got the ‘encoded string too long’ message. How can I avoid this? log copy 2.xml (18.2 KB)
Hi there, I am encountering the same issue again and again while using the Strings to Document node. Is there any way to overcome this?
I am using a Tika Parser before this to read a series of PDF documents.
I am also facing the same issue while using the ‘Strings to Document’ node. I need to convert large PDF files, and I am using the Tika Parser to do so before executing the ‘Strings to Document’ node.
Below is the error message:
WARN Strings To Document 0:242:294 Serialization error: Document could not be serialized!
ERROR Strings To Document 0:242:294 Could not store document in cell: values
ERROR Strings To Document 0:242:294 Execution failed in Try-Catch block: Cell at index 0 is null!
Could anyone please help me here?