Download PDFs from Website

mr2mike · November 22, 2021, 9:21pm

I have no idea where to start; my KNIME work has all been local, with DB or XLS file manipulation or joining.
I'm trying to build a workflow that can get the following public data:

Have a table of Application Numbers
Workflow would connect to the website
https://dds.aer.ca/iar_query/ApplicationAttachments.aspx?AppNumber=1828506
Download all the files (more than one per application). Added the underscore so you can see the web address.
_Application Error Encountered
Save the files locally, named after the Attachment Description in the website table. Most are PDF, DOC, or XLS.

First I want to prove I can download a single application and all the attachments before I build in the table of applications.
I am not averse to putting in a Python script if necessary, but some nodes ARE limited by my company.

Here’s the site layout.

Andrew_Steel · November 22, 2021, 11:09pm

Hi @mr2mike,

here is the beginning of a short solution:

I separated the host data from the request data (Table Creator and String Manipulation), because we need the host data again later.
The first GET Request fetches the application data. The Regex extracts all document numbers.

To fetch the documents we need a document query, built from the host data and our extracted data, with the application query as the Referer (Request Headers tab).

At the end we have the filename, which we read from the Content-Disposition entry (Response Headers tab), and the binary data of the document. The rest should be simple.
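Since a Python script was mentioned as an option, here is a rough sketch of the same steps outside KNIME: fetch the application page, pull the document numbers out with a regex, request each document with the application query as Referer, and take the filename from Content-Disposition. The document link pattern (`DocumentId=`) and the `GetDocument.aspx` query are placeholders I made up for illustration; the real ones have to be read from the page source or from the workflow. Only the host and the application query come from the thread.

```python
import re
import urllib.request
from email.message import Message
from pathlib import Path
from urllib.parse import urljoin

HOST = "https://dds.aer.ca/iar_query/"
APP_QUERY = "ApplicationAttachments.aspx?AppNumber={app}"
# ASSUMPTION: hypothetical pattern and query, not taken from the real site.
DOC_ID_RE = re.compile(r"DocumentId=(\d+)")
DOC_QUERY = "GetDocument.aspx?DocumentId={doc}"


def extract_document_ids(html):
    """Return the unique document numbers found in the page, in page order."""
    seen = []
    for doc_id in DOC_ID_RE.findall(html):
        if doc_id not in seen:
            seen.append(doc_id)
    return seen


def filename_from_header(header_value, fallback="attachment.bin"):
    """Read the filename parameter of a Content-Disposition header value."""
    if not header_value:
        return fallback
    msg = Message()
    msg["Content-Disposition"] = header_value
    return msg.get_filename() or fallback


def download_attachments(app_number, out_dir="."):
    """Download every attachment of one application; returns the saved paths."""
    app_url = urljoin(HOST, APP_QUERY.format(app=app_number))
    with urllib.request.urlopen(app_url, timeout=30) as resp:
        html = resp.read().decode("utf-8", errors="replace")

    saved = []
    for doc_id in extract_document_ids(html):
        doc_url = urljoin(HOST, DOC_QUERY.format(doc=doc_id))
        # The application query is sent as Referer, as in the workflow.
        req = urllib.request.Request(doc_url, headers={"Referer": app_url})
        with urllib.request.urlopen(req, timeout=30) as resp:
            name = filename_from_header(
                resp.headers.get("Content-Disposition"),
                fallback=f"{doc_id}.bin",
            )
            # Keep only the base name so a header cannot write outside out_dir.
            path = Path(out_dir) / Path(name).name
            path.write_bytes(resp.read())
            saved.append(path)
    return saved
```

Calling `download_attachments(1828506, "downloads")` would be the single-application test (the `downloads` folder has to exist already); looping over a table of application numbers can come afterwards.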

I hope it helps.

Best regards
Andrew

request_test.knwf (1.9 MB)

mr2mike · November 24, 2021, 9:00pm

Thank you! You definitely got me started on it. I had a bit of an issue in the beginning: your workflow is built on KNIME 4.4 and I was using KNIME 4.3.
The difference was that the RegEx Extractor node couldn’t handle the list (basically an array) that the GET Request was creating. Once I figured that out, I was able to use the Ungroup node to turn it back into a string rather than an array, and it began to work.

It also appears I need to open and search for one application on the website before running. If this doesn’t happen the GET Request is blocked or not received and retries endlessly.
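For anyone scripting this instead, a small sketch of how the endless retry could be capped. It does not explain why the site blocks the first request (I don't know the cause); it only stops after a fixed number of attempts. The helper name and its parameters are my own, not part of KNIME or the site.

```python
import time


def fetch_with_retries(fetch, attempts=3, delay_seconds=2.0, sleep=time.sleep):
    """Call fetch() until it succeeds or the attempt limit is reached."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except OSError as exc:
            # urllib.error.URLError and socket timeouts are OSError subclasses.
            last_error = exc
            if attempt < attempts:
                sleep(delay_seconds)
    raise RuntimeError(f"gave up after {attempts} attempts") from last_error
```

The `fetch` argument is any zero-argument callable, for example `lambda: urllib.request.urlopen(url, timeout=30).read()`, so the timeout and the retry limit together keep a blocked request from hanging forever.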

system · June 2, 2023, 9:09pm

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.