I am looking for a way to identify all PDF files hosted on a given domain (and subsequently download them). Is there a best practice for this?
Right now I’m using Selenium nodes to open Google and search for “filetype:pdf site:https://www.apple.com”, then open and download every result from the first five pages.
As you might imagine, this process is not very reliable (there are redundancies, some results are not even PDFs, the order is random and relevant documents might not be downloaded…) and therefore I’m looking for a better way to do this.
I’m fairly new to KNIME, so any guidance is highly appreciated.
Hi @zzzZZZzzz and welcome to the KNIME community forum,
You can extract the link information with the Selenium nodes, then filter for the unique download links that point to a PDF (by checking whether the link ends in .pdf). After that you can easily download the files.
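Just to spell out the filtering logic, here is a minimal sketch in plain Python; the links list is only a placeholder for the link column coming out of the Selenium nodes, not anything KNIME produces literally:

```python
# Placeholder for the link column extracted by the Selenium nodes.
links = [
    "https://www.apple.com/some/report.pdf",
    "https://www.apple.com/some/report.pdf",   # duplicate
    "https://www.apple.com/contact",           # not a PDF
]

# Keep only unique links that actually end in .pdf (ignoring query strings).
pdf_links = sorted({
    link for link in links
    if link.split("?")[0].lower().endswith(".pdf")
})
print(pdf_links)  # ['https://www.apple.com/some/report.pdf']
```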
If you still have problems doing so, let me know and I will build an example workflow for you.
I already filtered for the .pdf extension; I just thought there might be a more elegant way than using the Google search. Maybe by pulling a full index of the website?
Right now I have a table containing all the relevant URLs and use “Navigate” and “Send keys” in a loop to cycle through and save them. Here as well, there must be a more elegant way.
If you have already extracted the URLs by using the Selenium nodes, then just use the HTTPS Connection node and the Download node to download the files.
Use “apple.com” as the host in the HTTPS Connection node and the URLs in the Download node. It’s that easy.
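If it helps to see the same idea outside KNIME, here is a rough Python sketch (using the requests library purely for illustration; the URLs and folder name are placeholders, and this is not meant to describe how the Download node works internally):

```python
import os
import requests

# Stand-in for the Download node's target folder.
target_folder = "downloads"
os.makedirs(target_folder, exist_ok=True)

# Placeholder URLs; in the workflow these come from the filtered link table.
pdf_urls = [
    "https://www.apple.com/path/one.pdf",
    "https://www.apple.com/path/two.pdf",
]

for url in pdf_urls:
    filename = os.path.join(target_folder, url.rsplit("/", 1)[-1])
    response = requests.get(url, timeout=30)
    response.raise_for_status()              # fail loudly on 404s etc.
    with open(filename, "wb") as f:
        f.write(response.content)            # save the PDF under its own name
```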
Alright, so that’s how you handle the HTTPS Connection node.
After executing this, the target folder contains the right number of PDF files with the right names. However, they are all identical in size (31.1 KB) and can’t be opened (not supported or corrupted). Any idea what might be causing this?
Additional info: although I end up with multiple PDF files, the File List table of the Download node contains only one row with one of the file paths.
The problem is that you mentioned the Apple website, but you are actually trying to download files from lnvtechnology.com. So you have to enter that host instead of apple.com in the HTTPS Connection node.
Riiiight, I had already forgotten that I mentioned apple.com in my first post.
I assumed the host was just required to point the HTTPS Connector to the www (think DNS server) and would not be used further. Thank you very much for your support!
Edit: would you mind removing the specific URL from the example? Forgot to sanitize it in the files.
Additional question: is there a more robust way to screen for PDFs on a website? I stumbled upon one website where there are 94 links to PDFs in the first 100 Google results, but only 5 of them point to actual files.
The remaining links lead to dead ends on the website.
Use a search engine or scrape through all pages in a domain and find all the URLs ending in .pdf.
I think the search engine method that you are already following is better. Just use this format for searching:
e.g.
site:lnvtechnology.com filetype:pdf
or
site:apple.com/education filetype:pdf
After downloading the files, you can use the File Meta Info node and filter the files (Row Filter) based on the file size, then delete very small files with the Delete Files node.
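Roughly the same cleanup expressed in plain Python, for illustration only (the 10 kB threshold is an assumption; pick whatever cutoff separates real PDFs from error pages in your results):

```python
import os

target_folder = "downloads"          # same target folder as in the earlier sketch
min_size_bytes = 10 * 1024           # assumed threshold, adjust to your data

for name in os.listdir(target_folder):
    path = os.path.join(target_folder, name)
    if name.lower().endswith(".pdf") and os.path.getsize(path) < min_size_bytes:
        os.remove(path)              # drop files too small to be real PDFs
```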
One follow-up on the Download node:
For some URLs I get the error “Execute failed: Read timed out”. How can I tell the node or workflow to just skip the current URL when this happens and carry on with the next one?
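For illustration, this is the behaviour I’m after, sketched in plain Python (the requests call, URLs, and timeout value are just placeholders, not what the Download node does internally):

```python
import requests

pdf_urls = ["https://example.com/a.pdf", "https://example.com/b.pdf"]  # placeholders

for url in pdf_urls:
    try:
        response = requests.get(url, timeout=30)
        response.raise_for_status()
        # ... write response.content to disk as before ...
    except requests.exceptions.RequestException as err:
        # e.g. a read timeout: note it and carry on with the next URL
        print(f"Skipping {url}: {err}")
        continue
```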