How to read common text formats (Word, PDF, RTF, Excel)


Is it possible to use the KNIME Text Processing plug-in to process common text formats (Microsoft Word, PDF, RTF, Excel, etc.)?

Dear simulyant,
there are several tools that can extract text from a specific data format.
However, this is not always easy, since not all authors allow the content to be extracted …
for example, there are PDFs where copying the text is not allowed (as decided by the authors),
or where there is no text at all because the pages are “scanned images” (this happens a lot in the patent domain, but not only there).

More generally, I would suggest coupling external text processing capabilities, as available e.g. in the UIMA framework, with the powerful workflow and data mining capabilities that KNIME offers. That way you do not “reinvent the wheel” of NER and relation finding, but leverage the best available tools…

Hope this helps you (and the KNIME team).

It is not possible to read MS Word, PDF, or similar files with the parser nodes of the plugin. You can use an external tool to convert these files into txt files and read them via the File Parser, or convert them into sdml and use the sdml parser. The benefit of converting into sdml is that the structure can be kept or transformed as well, whereas it is lost when converting into flat txt files.
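As a rough illustration of the "external tool" route, here is a minimal sketch in Python that turns a folder of Word files into flat txt files. It rests on several assumptions that are not from this thread: the files are in the newer .docx format (a zip archive containing word/document.xml), not legacy .doc; the function names docx_to_text and convert_folder are made up for this example; and only the Python standard library is used. As noted above, the document structure is lost in the flat txt output.

```python
import zipfile
import xml.etree.ElementTree as ET
from pathlib import Path

# WordprocessingML namespace used inside word/document.xml
# (assumption: input files are .docx, not legacy .doc).
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(docx_path):
    """Return the plain text of a .docx file, one line per paragraph.

    Hypothetical helper for illustration; formatting and structure are dropped.
    """
    with zipfile.ZipFile(docx_path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's text is split across <w:t> runs; join them back.
        runs = [node.text or "" for node in para.iter(W_NS + "t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)


def convert_folder(src_dir, dst_dir):
    """Write one .txt file into dst_dir for every .docx file in src_dir."""
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    written = []
    for docx_path in sorted(Path(src_dir).glob("*.docx")):
        txt_path = dst / (docx_path.stem + ".txt")
        txt_path.write_text(docx_to_text(docx_path), encoding="utf-8")
        written.append(txt_path)
    return written
```

After running something like convert_folder("word_files", "txt_files") (both folder names are placeholders), the resulting txt files could be read as described above. PDF, RTF, and Excel files would each need a different converter, and scanned or copy-protected PDFs may yield no text at all.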

The UIMA framework sounds promising for other issues as well. We will check it out and see how we could couple it with the plugin.

Thanks, Kilian
