Integrating a full citation manager database (Zotero) into KNIME text processing

Hi,

My full Zotero database (citation manager software) is around 10 GB: 5,000 PDFs, with documents ranging from 20 to 1,000 pages.

Currently, everything is inside a Windows folder with several subfolders, but the hierarchy is only two levels deep: Database => document container folders.

a) Is there any way to point KNIME at the main Database folder so that it can access the PDFs in the subfolders and convert them into documents that KNIME recognizes?

b) For lengthy documents (they range from 20 to 1,000 pages), does KNIME import the full content?

My intention is to run deeper text searches that citation managers can't handle but KNIME can.

Many thanks,

Cadu

Hi,

About a): The PDF Parser node can search and parse directories recursively. If you have one main folder containing subfolders that hold the PDFs, you can use that node to parse them.
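Before pointing the node at the folder, it can help to check how many PDFs a recursive search will actually pick up and how large they are in total. The following is only a minimal sketch in plain Python, run outside KNIME; the function names `find_pdfs` and `summarize` and the example path are made up for illustration and are not part of KNIME or Zotero.

```python
from pathlib import Path


def find_pdfs(root):
    """Return every PDF file under `root`, at any depth, sorted by path."""
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() == ".pdf"  # also matches ".PDF"
    )


def summarize(root):
    """Return (number of PDFs, total size in bytes) for the tree under `root`."""
    pdfs = find_pdfs(root)
    total_bytes = sum(p.stat().st_size for p in pdfs)
    return len(pdfs), total_bytes


# Example call (hypothetical path, adjust to your own Database folder):
# count, size = summarize(r"C:\Zotero\Database")
# print(count, "PDFs,", round(size / 1024**3, 2), "GB")
```

If the count is close to the expected 5,000, the recursive option of the PDF Parser node should see the same files.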

About b): The PDF Parser node internally uses the Apache PDFBox library to parse the documents. Text is extracted as long as it is stored as real text and not as an image (for example, a scanned page). Tables, images, etc. will not be recognized. Documents with 1,000 pages will be a challenge, I guess; I have never tried to parse documents of that size. Make sure to start KNIME with enough heap memory by setting the -Xmx option in the knime.ini file, which is located in the KNIME installation directory.
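In knime.ini, the heap limit is a single line such as `-Xmx2048m` placed after the `-vmargs` line, and you can simply edit it by hand. For anyone who prefers to script the change, here is a rough sketch; the helper name `set_max_heap` and the `8g` value are assumptions for illustration, and a sensible value depends on how much RAM your machine has.

```python
def set_max_heap(ini_text, heap):
    """Return knime.ini text with the -Xmx line set to `heap` (e.g. "8g")."""
    lines = ini_text.splitlines()
    new_line = f"-Xmx{heap}"
    for i, line in enumerate(lines):
        if line.strip().startswith("-Xmx"):
            lines[i] = new_line  # replace the existing heap setting
            break
    else:
        # No -Xmx line yet: JVM options must come after -vmargs,
        # so add that marker first if it is missing.
        if not any(line.strip() == "-vmargs" for line in lines):
            lines.append("-vmargs")
        lines.append(new_line)
    return "\n".join(lines) + "\n"


# Example (hypothetical usage; back up knime.ini before overwriting it):
# from pathlib import Path
# ini = Path("knime.ini")
# ini.write_text(set_max_heap(ini.read_text(), "8g"))
```

KNIME has to be restarted for the new setting to take effect.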

Cheers, Kilian
