New Document Parsers

richards99 · February 6, 2011, 12:45pm

Is it possible to have a larger array of Document Parsers? I'm finding the DML/SDML and plain ASCII parsers very limited. I cannot even find a way to convert documents into DML/SDML.

Ideally it would be good to have a PDF parser. I appreciate that not all PDFs are text-readable, but such a parser would be great for those that are, and PDF is increasingly becoming the common format.

Additionally, a Rich Text Format (RTF) parser would be good, since RTF retains more features than a flat ASCII file.

It's a shame that KNIME's text processing facilities are so powerful but are let down a little by the limited set of Document Parsers.

richards99 · February 27, 2011, 12:03pm

Some other Document Parsers that would be really handy are ones that read text out of MS Word (.doc, .docx), MS PowerPoint (.ppt, .pptx) and MS Excel (.xls, .xlsx) files. These would make it quicker to gather key points and terms from presentations, reports and the like.

kilian.thiel · March 2, 2011, 4:02pm

Sorry for the inconvenience with the currently provided parsers. The Textprocessing plugin is so far still a labs project, but it is growing. More parsers will come, and we have already thought about PDF, Word, and RTF parsers.

nuaaer · February 27, 2012, 7:49pm

I think it would be ideal to convert all types of files into the same internal type, like DML, and then process them.

kilian.thiel · March 8, 2012, 2:13pm

Btw, PDF and MS Word parsers are now available in the new Text Processing version.
