My full Zotero (citation manager software) database is around 10 GB and holds about 5,000 PDFs, with documents ranging from 20 to 1,000 pages.
Currently, everything is inside one Windows folder with several subfolders, but the hierarchy is only two levels deep. I mean: Database => document container folders.
a) Is there any way to point KNIME at the main Database folder so that it can access the PDFs in the subfolders and convert them into documents KNIME can work with?
b) For lengthy documents (they range from 20 to 1,000 pages), does KNIME import the full content?
My intention is to run deeper text searches that citation managers can't handle but KNIME can.
About a) The PDF Parser node can search and parse directories recursively. If you have one main folder containing subfolders that contain the PDFs, you can point that node at the main folder and it will parse them all.
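Before pointing the node at the folder, it can help to check what a recursive search will actually pick up. Below is a minimal Python sketch that runs outside KNIME and is not part of the PDF Parser node. The function names `find_pdfs` and `summarize` are made up for illustration.

```python
from pathlib import Path


def find_pdfs(root):
    """Return every PDF below root, at any folder depth, sorted by path."""
    root = Path(root)
    # rglob("*") descends into all subfolders; the suffix check is
    # case-insensitive so that "report.PDF" is found as well.
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


def summarize(root):
    """Return (number of PDFs, total size in bytes) for everything under root."""
    pdfs = find_pdfs(root)
    total_bytes = sum(p.stat().st_size for p in pdfs)
    return len(pdfs), total_bytes
```

Calling `summarize` on the main Database folder gives the file count and total size, which you can compare against the number of rows the PDF Parser node produces.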
About b) The PDF Parser node internally uses the Apache PDFBox library to parse the documents. Text will be recognized as long as it is actual text and not an image (scanned pages without a text layer will come out empty). Tables, images, etc. will not be recognized. Documents with 1,000 pages will be a challenge, I guess; I have never tried to parse documents of that size. Make sure to start KNIME with enough heap memory (the -Xmx setting). You can specify this in the knime.ini file.
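For illustration, the heap limit is the -Xmx line that follows -vmargs in knime.ini, which sits in the KNIME installation folder. The 8g below is only an assumed example value; choose one that fits the RAM of your machine and leave some for the operating system.

```
-Xmx8g
```

Edit the existing -Xmx line rather than adding a second one, and restart KNIME for the change to take effect.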