Table of URLs pointing to PDF files: how to extract the text

I have a small KNIME workflow that gives me a table of URLs pointing directly to PDF files online. I want to process the PDF files further, but I'm not sure how to download the actual files to a folder using KNIME nodes.

Any pointers?

Hi mobmsc,

To download the PDFs, you first need to install the KNIME File Handling Nodes extension if you don't have it yet.

Next, in addition to your source URLs, you need a column containing the local target path to which each document should be downloaded. After that, you will need two String to URI nodes to convert the values in the source and target columns to URIs.

Finally, you can use the Download / Upload from List node to download the files. To parse the PDFs, you can then connect a Tika Parser URL Input node.
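For readers who want to see the idea outside of KNIME, here is a minimal Python sketch of what a "download from list" step does: it walks over (source URL, target path) pairs and writes each document to its target. This is only an illustration under assumptions, not how the KNIME node is implemented; the function name `download_pairs` is made up for this example.

```python
import shutil
import urllib.request
from pathlib import Path


def download_pairs(pairs):
    """Fetch each source URL to its local target path; return the written paths.

    `pairs` is an iterable of (source_url, target_path) tuples, mirroring a
    table with one source column and one target column. Hypothetical helper.
    """
    written = []
    for source_url, target_path in pairs:
        target = Path(target_path)
        # Create the destination folder if it does not exist yet.
        target.parent.mkdir(parents=True, exist_ok=True)
        # Stream the response body straight to disk.
        with urllib.request.urlopen(source_url) as response, open(target, "wb") as out:
            shutil.copyfileobj(response, out)
        written.append(target)
    return written
```

Each row needs its own full target path, which is why the workflow requires a target column rather than just a folder.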

Hope that helps!

Cheers,

Roland


Is it possible to just give a target folder where I want the PDFs to be downloaded to, and keep the PDF file names the same as those online?

Is there an easy way to build a destination column with a value for each file? 

I tried the Tika Parser URL Input node, and also tried normalising and resolving the URL beforehand, but it fails with this error:

ERROR Tika Parser URL Input 0:36       Execute failed: Not a file or knime URL: '[MyValidURL'

Hi mobmsc,

No, it is not possible to give only a folder. However, you can create your destination column with a String Manipulation node: extract the file name from each URL and join it to the path of the folder you want the PDFs in.
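The logic for building the destination value can be sketched in Python as follows. This is an illustration of the string operation only, not a String Manipulation expression; the function name `target_path` and the example folder are assumptions, and the folder is assumed to use forward slashes.

```python
from pathlib import PurePosixPath
from urllib.parse import unquote, urlsplit


def target_path(url, folder):
    """Join the file name found at the end of `url` onto `folder`."""
    # Take only the path component so query strings and fragments are dropped.
    url_path = urlsplit(url).path
    # The last path segment is the online file name; decode %20 and similar.
    filename = unquote(PurePosixPath(url_path).name)
    return str(PurePosixPath(folder) / filename)


# Example with a made-up URL and folder.
print(target_path("https://example.com/docs/report.pdf", "/home/user/pdfs"))
```

Applied to every row, this gives one target path per URL while keeping the online file names.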

The Tika Parser URL Input node needs the URI of the downloaded local file as input; it cannot download online files itself, which is why it rejects your web URL.
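To show what such a local file URI looks like, here is a small Python sketch that turns a path on disk into a `file://` URI. It only illustrates the format; the helper name `local_file_uri` is hypothetical, and inside KNIME the String to URI node does this conversion for you.

```python
from pathlib import Path


def local_file_uri(path):
    """Return the file:// URI for a local path."""
    # as_uri() requires an absolute path, so resolve relative ones first.
    return Path(path).resolve().as_uri()


# Spaces and other special characters are percent-encoded in the result.
print(local_file_uri("my file.pdf"))
```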

Cheers,

Roland

