How to access the text output of "Flat File Document Parser" from a Python node

Please help.

I am new to KNIME and Python. I would like to use the output of the "Flat File Document Parser" node in a "Python Script (1=>1)" node for subsequent processing.

Objective: I don't know how to access the various text files from "Flat File Document Parser" with "Python Script (1=>1)".

Here are the workflow details: the "Flat File Document Parser" node is connected to the "Python Script (1=>1)" node.

The "Flat File Document Parser" node refers to a folder with several text documents.

The "Python Script (1=>1)" node receives a table with only one column, named "Document". I would like to read the document content with the following script:

import re, sys
y=input_table['Document']  # "Document" is the column in Python Script (1=>1)
text1=''.join(str(x) for x in y)
print(text1)
f = open(text1,'r')

 

However, print(text1) only prints the file names, not the content of the text files.

The following error was displayed:

Traceback (most recent call last):
  File "C:\Program Files\KNIME\plugins\org.knime.python_3.3.0.v201611242050\py\PythonKernel.py", line 282, in execute
    exec(source_code, _exec_env, _exec_env)
  File "<string>", line 5, in <module>
IOError: [Errno 22] invalid mode ('r') or filename: ...

I would appreciate any help with this.

Hi Lawson,

The output of the Flat File Document Parser node consists of document cells. These cells contain tokenized documents including meta information, such as author, title, etc. These documents are complex objects, not just strings. The Python nodes cannot make use of those complex objects. They can only process "simple" data types, such as integers, doubles, strings, etc.

This means that you cannot use the Python nodes on document cells. To process documents you need to use the nodes provided by the Text Processing extension. However, you can convert the textual data included in documents into strings by using the Document Data Extractor node. These strings can then be used by the Python nodes.
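Once the text arrives as a string column, the script no longer needs open(), because the content is already in the table cells. Here is a minimal sketch of that idea. The column name "Text" and the helper combine_text are assumptions for illustration; the actual column name depends on how the Document Data Extractor node is configured. A plain dict stands in for input_table so the sketch also runs outside KNIME.

```python
def combine_text(table, column="Text"):
    """Join every cell of a string column into one string, one document per line.

    `column` is a hypothetical name; use whatever the extractor node outputs.
    """
    return "\n".join(str(cell) for cell in table[column])


# Stand-in for KNIME's input_table so the sketch runs outside KNIME.
# Inside the Python Script (1=>1) node, pass input_table instead.
sample_table = {"Text": ["First document body.", "Second document body."]}

text1 = combine_text(sample_table)
print(text1)

# Inside KNIME the node also expects an output table, for example:
# output_table = input_table.copy()
```

The same function should work on the pandas DataFrame that the Python node provides, since indexing by column name and iterating over the cells behave the same way there.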

Alternatively (and most likely the better way), you can use the Tika node instead of the Flat File Document Parser node. This node reads text files and outputs the content as string cells.
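With string cells, each row can be processed on its own, for example with the re module. The following is only a sketch: the column name "Content" and the helper count_words are hypothetical, so check the actual output column of the node in your workflow. As before, a plain dict stands in for input_table.

```python
import re


def count_words(table, column="Content"):
    """Return the number of word tokens in each cell of a string column.

    `column` is a hypothetical name; adjust it to the node's output column.
    """
    return [len(re.findall(r"\w+", str(cell))) for cell in table[column]]


# Stand-in for KNIME's input_table so the sketch runs outside KNIME.
sample_table = {"Content": ["KNIME reads text files.", "Tika outputs strings"]}

word_counts = count_words(sample_table)
print(word_counts)

# Inside the Python Script (1=>1) node this could become:
# output_table = input_table.copy()
# output_table["Word count"] = count_words(input_table)
```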

I hope this helps.

Cheers, Kilian

Noted with thanks
