BoW error when processing large data

Hi,

I built a simple workflow using a fairly large data set of around 150,000 rows.
The workflow consists of: 
- CSV reader 
- Strings to Document
- BoW creator 

I did not find any problems when processing the data in the CSV reader and Strings to Document nodes, but when processing in the BoW creator, KNIME always generates an error:

"AbstractDocumentFileStoreCell ERROR Could not read document. 
ERROR BoW creator Execute failed: ("NullPointerException"): null" 

Please help,

regards,

Akas

Hi,

Which KNIME version are you using? Could you please send me the stack trace? You can do this by setting the log level for the KNIME GUI to debug (File->Preferences->KNIME->KNIME GUI) and copying the stack trace from the console, or by copying it from the log file (..path to your workspace/.metadata/.log). Thank you.
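
If the log file is long, the stack traces can also be pulled out with a small script. The following is only a minimal sketch in plain Python, not a KNIME tool. It assumes that every log record starts with a level name followed by whitespace, as in the excerpts posted in this thread (DEBUG, INFO, ERROR; WARN and FATAL are an assumption), and that the lines after a record which do not start with a level name belong to its stack trace. The function name `extract_stack_traces` is made up for illustration.

```python
import re

# Start of a log record, e.g. "ERROR\t BoW creator ...".
# DEBUG, INFO and ERROR appear in this thread; WARN and FATAL are assumed.
RECORD_START = re.compile(r"^(DEBUG|INFO|WARN|ERROR|FATAL)\s")


def extract_stack_traces(lines):
    """Return every log record that is followed by stack-trace lines.

    Each result is a list: the record line first, then its trace lines.
    """
    blocks, current = [], []
    for raw in lines:
        line = raw.rstrip("\n")
        if RECORD_START.match(line):
            # A new record starts; keep the previous one only if it had a trace.
            if len(current) > 1:
                blocks.append(current)
            current = [line]
        elif current and line.strip():
            # Continuation line: exception message or "at ..." frame.
            current.append(line)
    if len(current) > 1:
        blocks.append(current)
    return blocks
```

To use it, open the log file from your own workspace and pass the file object to the function, for example `extract_stack_traces(open(path_to_log, encoding="utf-8", errors="replace"))`, where `path_to_log` is a placeholder for the log file location mentioned above.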

Cheers, Kilian

Hi Kilian,

I'm using KNIME version 2.9.1

Below are the logs

DEBUG	 DocumentBufferedFileStoreDataCellFactory	 Creating file store: a8bf7fc0-b546-47f0-890d-c496bae656a8
DEBUG	 DocumentBufferedFileStoreDataCellFactory	 Creating file store: eeebd7b3-b02c-4205-866b-45f38b942eac
DEBUG	 Buffer                        	 Closing input stream on "/tmp/knime_Twitter47174/knime_container_20140310_8705330730156812379.bin.gz", 0 remaining
DEBUG	 Buffer                        	 Buffer file (/tmp/knime_Twitter47174/knime_container_20140310_5516789420242213631.bin.gz) is 5.461MB in size
DEBUG	 LRUDataCellCache              	 Closing lru data cell cache.
INFO 	 LocalNodeExecutionJob         	 Strings To Document 0:3 End execute (1 min, 6 secs)
DEBUG	 WorkflowManager               	 Strings To Document 0:3 doBeforePostExecution
DEBUG	 NodeContainer                 	 Strings To Document 0:3 has new state: POSTEXECUTE
DEBUG	 WorkflowManager               	 Strings To Document 0:3 doAfterExecute - success
DEBUG	 NodeContainer                 	 Strings To Document 0:3 has new state: EXECUTED
DEBUG	 BoW creator                   	 Configure succeeded. (BoW creator)
DEBUG	 NodeContainer                 	 BoW creator 0:2 has new state: CONFIGURED_QUEUED
DEBUG	 WorkflowManager               	 BoW creator 0:2 doBeforePreExecution
DEBUG	 NodeContainer                 	 BoW creator 0:2 has new state: PREEXECUTE
DEBUG	 WorkflowManager               	 BoW creator 0:2 doBeforeExecution
DEBUG	 NodeContainer                 	 BoW creator 0:2 has new state: EXECUTING
DEBUG	 LocalNodeExecutionJob         	 BoW creator 0:2 Start execute
DEBUG	 WorkflowFileStoreHandlerRepository	 Adding handler 9df95260-aee3-4a30-af7c-2a6f2ee5eb86 (BoW creator 0:2: <no directory>) - 3 in total
DEBUG	 Buffer                        	 Opening input stream on file "/tmp/knime_Twitter47174/knime_container_20140310_5516789420242213631.bin.gz", 1 open streams
DEBUG	 Buffer                        	 Opening input stream on file "/tmp/knime_Twitter47174/knime_container_20140310_8705330730156812379.bin.gz", 1 open streams
ERROR	 AbstractDocumentFileStoreCell 	 Could not read document.
DEBUG	 AbstractDocumentFileStoreCell 	 Could not read document.
java.io.FileNotFoundException: /tmp/knime_Twitter47174/knime_fs-Strings_To_Document_0_3-47176/000/000/adca5b7a-38c4-4db9-b43f-631185231225 (No such file or directory)
	at java.io.FileInputStream.open(Native Method)
	at java.io.FileInputStream.<init>(FileInputStream.java:138)
	at org.knime.ext.textprocessing.data.filestore.AbstractDocumentFileStoreCell.readDocumentData(AbstractDocumentFileStoreCell.java:236)
	at org.knime.ext.textprocessing.data.filestore.AbstractDocumentFileStoreCell.getDocument(AbstractDocumentFileStoreCell.java:183)
	at org.knime.ext.textprocessing.data.filestore.DocumentBufferedFileStoreCell.getDocument(DocumentBufferedFileStoreCell.java:1)
	at org.knime.ext.textprocessing.nodes.transformation.bow.BagOfWordsNodeModel.execute(BagOfWordsNodeModel.java:173)
	at org.knime.core.node.NodeModel.execute(NodeModel.java:713)
	at org.knime.core.node.NodeModel.executeModel(NodeModel.java:556)
	at org.knime.core.node.Node.invokeNodeModelExecute(Node.java:1069)
	at org.knime.core.node.Node.execute(Node.java:924)
	at org.knime.core.node.workflow.NativeNodeContainer.performExecuteNode(NativeNodeContainer.java:418)
	at org.knime.core.node.exec.LocalNodeExecutionJob.mainExecute(LocalNodeExecutionJob.java:98)
	at org.knime.core.node.workflow.NodeExecutionJob.internalRun(NodeExecutionJob.java:182)
	at org.knime.core.node.workflow.NodeExecutionJob.run(NodeExecutionJob.java:113)
	at org.knime.core.util.ThreadUtils$RunnableWithContextImpl.runWithContext(ThreadUtils.java:331)
	at org.knime.core.util.ThreadUtils$RunnableWithContext.run(ThreadUtils.java:207)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
	at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
	at java.util.concurrent.FutureTask.run(FutureTask.java:166)
	at org.knime.core.util.ThreadPool$MyFuture.run(ThreadPool.java:123)
	at org.knime.core.util.ThreadPool$Worker.run(ThreadPool.java:238)
DEBUG	 BoW creator                   	 reset
ERROR	 BoW creator                   	 Execute failed: ("NullPointerException"): null
DEBUG	 BoW creator                   	 Execute failed: ("NullPointerException"): null
java.lang.NullPointerException
	at org.knime.ext.textprocessing.nodes.transformation.bow.BagOfWordsNodeModel.execute(BagOfWordsNodeModel.java:180)
	at org.knime.core.node.NodeModel.execute(NodeModel.java:713)
	at org.knime.core.node.NodeModel.executeModel(NodeModel.java:556)
	at org.knime.core.node.Node.invokeNodeModelExecute(Node.java:1069)
	at org.knime.core.node.Node.execute(Node.java:924)
	at org.knime.core.node.workflow.NativeNodeContainer.performExecuteNode(NativeNodeContainer.java:418)
	at org.knime.core.node.exec.LocalNodeExecutionJob.mainExecute(LocalNodeExecutionJob.java:98)
	at org.knime.core.node.workflow.NodeExecutionJob.internalRun(NodeExecutionJob.java:182)
	at org.knime.core.node.workflow.NodeExecutionJob.run(NodeExecutionJob.java:113)
	at org.knime.core.util.ThreadUtils$RunnableWithContextImpl.runWithContext(ThreadUtils.java:331)
	at org.knime.core.util.ThreadUtils$RunnableWithContext.run(ThreadUtils.java:207)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
	at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
	at java.util.concurrent.FutureTask.run(FutureTask.java:166)
	at org.knime.core.util.ThreadPool$MyFuture.run(ThreadPool.java:123)
	at org.knime.core.util.ThreadPool$Worker.run(ThreadPool.java:238)
DEBUG	 WorkflowManager               	 BoW creator 0:2 doBeforePostExecution
DEBUG	 NodeContainer                 	 BoW creator 0:2 has new state: POSTEXECUTE
DEBUG	 WorkflowManager               	 BoW creator 0:2 doAfterExecute - failure
DEBUG	 BoW creator                   	 reset
DEBUG	 BoW creator                   	 clean output ports.
DEBUG	 WorkflowFileStoreHandlerRepository	 Removing handler 9df95260-aee3-4a30-af7c-2a6f2ee5eb86 (BoW creator 0:2: <no directory>) - 2 remaining

 

Thanks a lot! That helped to find the problem. The bad news is that it is a bug. The good news is that there is a workaround you can use to avoid the error until it has been fixed: set the file store chunk size to a number greater than the number of documents you are processing. For example, if you are processing 200,000 documents, set the file store chunk size to 210,000 or so. You can do this in the Textprocessing preferences: File->Preferences->KNIME->Textprocessing->Storage->File store chunk size
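
To pick a value, you only need the number of documents plus some headroom. Here is a minimal sketch of that arithmetic in plain Python (it does not touch KNIME itself). It assumes that every data row of the CSV becomes exactly one document, and the 5% margin is just the "210,000 for 200,000" example from above; the function names are made up for illustration.

```python
import csv


def count_rows(lines, has_header=True):
    """Count the data rows of a CSV given as an iterable of lines (e.g. an open file)."""
    rows = sum(1 for _ in csv.reader(lines))
    if has_header and rows > 0:
        rows -= 1
    return rows


def suggest_chunk_size(n_documents, margin_percent=5):
    """Return a chunk size strictly greater than the number of documents.

    Assumption: one CSV row becomes one document. The margin is arbitrary
    headroom; integer arithmetic keeps the result exact.
    """
    return n_documents + max(1, n_documents * margin_percent // 100)
```

For 200,000 documents this gives 210,000, which you would then enter by hand in the preference page mentioned above.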

 

By the way, in your example you apply the BoW node right after the Strings to Document node. Since Textprocessing v2.9 there is the "direct preprocessing" feature, which allows documents to be preprocessed directly (no bag of words required). Direct preprocessing is much faster than preprocessing a bag of words. See:

http://tech.knime.org/blog/textprocessing-version-v29-released

http://tech.knime.org/blog/knime-ugm-2014-text-mining-workshop

 

Cheers, Kilian

Hi Kilian,

Has this bug been fixed in 2.9.2?

regards,
Akas

No, it came up after the code freeze. You need to use the workaround described above. Sorry about that.

Cheers, Kilian

Hi,

I am a beginner with KNIME. I am doing a proof of concept for my project, where I need to get data from a website and process it, but I don't see the "Strings to Document" node in 2.9.2. Could you please help me?

Thanks in advance :)

Hi,

do you have the Textprocessing extension installed? See: http://tech.knime.org/installation for how to install KNIME labs extensions.

Cheers, Kilian

Hi,

I am encountering the same problem in my workflow, going from a "Strings to Document" node (output ~ 58'000 rows) to a "BoW creator". The BoW creator node fails at around 15%. I increased the file store chunk size from 10'000 to 60'000, 100'000 and even more, but I'm still stuck.

Does anybody have an idea?

Cheers, dan

My log:

DEBUG	 NodeContainerEditPart         	 Strings To Document 0:252 (EXECUTED)
DEBUG	 NodeContainerEditPart         	 BoW creator 0:51 (CONFIGURED)
DEBUG	 ExecuteAction                 	 Creating execution job for 1 node(s)...
DEBUG	 NodeContainer                 	 BoW creator 0:51 has new state: CONFIGURED_MARKEDFOREXEC
DEBUG	 NodeContainer                 	 BoW creator 0:51 has new state: CONFIGURED_QUEUED
DEBUG	 NodeContainer                 	 OpenDataCrowDestinationNetwork 0 has new state: EXECUTING
DEBUG	 WorkflowManager               	 BoW creator 0:51 doBeforePreExecution
DEBUG	 NodeContainer                 	 ROOT  has new state: EXECUTING
DEBUG	 NodeContainer                 	 BoW creator 0:51 has new state: PREEXECUTE
DEBUG	 WorkflowManager               	 BoW creator 0:51 doBeforeExecution
DEBUG	 NodeContainer                 	 BoW creator 0:51 has new state: EXECUTING
DEBUG	 LocalNodeExecutionJob         	 BoW creator 0:51 Start execute
DEBUG	 WorkflowFileStoreHandlerRepository	 Adding handler 23ba747b-0429-424b-9265-641d852bdebf (BoW creator 0:51: <no directory>) - 2 in total
DEBUG	 Buffer                        	 Opening input stream on file "C:\Users\me\AppData\Local\Temp\knime_OpenDataCrowDes5985\knime_container_20140527_7390534739315107272.bin.gz", 5 open streams
DEBUG	 Buffer                        	 Opening input stream on file "C:\Users\me\AppData\Local\Temp\knime_OpenDataCrowDes5985\knime_container_20140527_7797326438246340630.bin.gz", 5 open streams
DEBUG	 Buffer                        	 Opening input stream on file "C:\Users\me\AppData\Local\Temp\knime_OpenDataCrowDes5985\knime_container_20140527_2233798473229896038.bin.gz", 5 open streams
DEBUG	 Buffer                        	 Opening input stream on file "C:\Users\me\AppData\Local\Temp\knime_OpenDataCrowDes5985\knime_container_20140527_1863638998605099458.bin.gz", 5 open streams
DEBUG	 Buffer                        	 Opening input stream on file "C:\Users\me\AppData\Local\Temp\knime_OpenDataCrowDes5985\knime_container_20140527_6580932695294294838.bin.gz", 5 open streams
DEBUG	 MemoryObjectTracker           	 Adding org.knime.core.data.container.Buffer$BufferMemoryReleasable (2 in total)
ERROR	 AbstractDocumentFileStoreCell 	 Could not read document.
DEBUG	 AbstractDocumentFileStoreCell 	 Could not read document.
java.io.FileNotFoundException: C:\Users\me\AppData\Local\Temp\knime_OpenDataCrowDes5985\knime_fs-Strings_To_Document-5987\000\000\b323d37b-ce22-44e9-a88b-4cc2f858280c (The system cannot find the file specified)
	at java.io.FileInputStream.open(Native Method)
	at java.io.FileInputStream.<init>(FileInputStream.java:138)
	at org.knime.ext.textprocessing.data.filestore.AbstractDocumentFileStoreCell.readDocumentData(AbstractDocumentFileStoreCell.java:236)
	at org.knime.ext.textprocessing.data.filestore.AbstractDocumentFileStoreCell.getDocument(AbstractDocumentFileStoreCell.java:183)
	at org.knime.ext.textprocessing.data.filestore.DocumentBufferedFileStoreCell.getDocument(DocumentBufferedFileStoreCell.java:1)
	at org.knime.ext.textprocessing.nodes.transformation.bow.BagOfWordsNodeModel.execute(BagOfWordsNodeModel.java:173)
	at org.knime.core.node.NodeModel.execute(NodeModel.java:713)
	at org.knime.core.node.NodeModel.executeModel(NodeModel.java:556)
	at org.knime.core.node.Node.invokeNodeModelExecute(Node.java:1069)
	at org.knime.core.node.Node.execute(Node.java:924)
	at org.knime.core.node.workflow.NativeNodeContainer.performExecuteNode(NativeNodeContainer.java:418)
	at org.knime.core.node.exec.LocalNodeExecutionJob.mainExecute(LocalNodeExecutionJob.java:98)
	at org.knime.core.node.workflow.NodeExecutionJob.internalRun(NodeExecutionJob.java:182)
	at org.knime.core.node.workflow.NodeExecutionJob.run(NodeExecutionJob.java:113)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
	at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
	at java.util.concurrent.FutureTask.run(FutureTask.java:166)
	at org.knime.core.util.ThreadPool$MyFuture.run(ThreadPool.java:123)
	at org.knime.core.util.ThreadPool$Worker.run(ThreadPool.java:238)
DEBUG	 BoW creator                   	 reset
ERROR	 BoW creator                   	 Execute failed: ("NullPointerException"): null
DEBUG	 BoW creator                   	 Execute failed: ("NullPointerException"): null
java.lang.NullPointerException
DEBUG	 WorkflowManager               	 BoW creator 0:51 doBeforePostExecution
DEBUG	 NodeContainer                 	 BoW creator 0:51 has new state: POSTEXECUTE
DEBUG	 WorkflowManager               	 BoW creator 0:51 doAfterExecute - failure
DEBUG	 BoW creator                   	 reset
DEBUG	 BoW creator                   	 clean output ports.
DEBUG	 WorkflowFileStoreHandlerRepository	 Removing handler 23ba747b-0429-424b-9265-641d852bdebf (BoW creator 0:51: <no directory>) - 1 remaining
DEBUG	 NodeContainer                 	 BoW creator 0:51 has new state: IDLE
DEBUG	 BoW creator                   	 Configure succeeded. (BoW creator)
DEBUG	 NodeContainer                 	 BoW creator 0:51 has new state: CONFIGURED
DEBUG	 NodeContainer                 	 OpenDataCrowDestinationNetwork 0 has new state: IDLE
DEBUG	 NodeContainer                 	 ROOT  has new state: IDLE

 

Dear Dr. Kilian Thiel,

Hi

I'm using KNIME 2.11.0. Is it necessary to have a BoW node after the Strings to Document node (before the preprocessing nodes) or not?

I currently have one, but I receive an error message:

ERROR Bag of Words Creator Execute failed: Runtime class of object "abstract" (index 2) in row "Row1" is StringCell and does not comply with its supposed superclass DocumentBufferedFileStoreCell.

Best regards!

Hi,

Since 2.9 you can apply preprocessing nodes directly to a column containing document cells. Creating a BoW beforehand is no longer necessary. The BoW Creator requires a column containing DocumentCells; it cannot be applied to StringCells. A possible workflow could look like:

Strings to Document->Stop word filter->Case converter-> .... ->BoW Creator->...
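
To illustrate the order of the steps, here is a conceptual sketch in plain Python. It is not KNIME code and does not use any KNIME API: the stop word list, the function names and the sample documents are all made up, and the "bag of words" is simply modelled as one (document, term) pair per distinct term. The point is only that the filters run on the documents first and the bag of words is built at the end.

```python
# Tiny illustrative stop word list, not the one shipped with KNIME.
STOP_WORDS = {"the", "a", "is", "of"}


def stop_word_filter(terms):
    """Drop terms that appear in the stop word list (case-insensitive)."""
    return [t for t in terms if t.lower() not in STOP_WORDS]


def case_converter(terms):
    """Convert all terms to lower case."""
    return [t.lower() for t in terms]


def bag_of_words(documents):
    """Return (document index, term) pairs, one per distinct term per document."""
    rows = []
    for doc_id, terms in enumerate(documents):
        for term in sorted(set(terms)):
            rows.append((doc_id, term))
    return rows


# Hypothetical, already tokenized documents.
docs = [["The", "cat", "is", "black"], ["A", "black", "Cat"]]

# Preprocess the documents directly, then build the bag of words last.
preprocessed = [case_converter(stop_word_filter(d)) for d in docs]
bow = bag_of_words(preprocessed)
```

Because the filters work on whole documents, each document is processed once, instead of once per (document, term) row as it would be when filtering an existing bag of words.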

Cheers, Kilian