I am looking for a way to identify all PDF files hosted on a given domain (and subsequently download them). Is there a best practice for this?
Right now I’m using Selenium nodes to open Google and search for “filetype:pdf site:https://www.apple.com”, then open and download every result from the first five pages.
As you might imagine, this process is not very reliable (there are redundancies, some results are not even PDFs, the order is random and relevant documents might not be downloaded…) and therefore I’m looking for a better way to do this.
I’m fairly new to KNIME, so any guidance is highly appreciated.
Hi @zzzZZZzzz and welcome to the KNIME community forum,
You can extract the link information with the Selenium nodes, then filter for the unique download links that point to a PDF (by checking whether the link ends in .pdf). After that you can easily download the files.
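Just to spell out the filtering logic, here is a minimal sketch in plain Python; the links list is only a placeholder for the link column coming out of the Selenium nodes, not anything KNIME produces literally:

```python
# Placeholder for the link column extracted by the Selenium nodes.
links = [
    "https://www.apple.com/some/report.pdf",
    "https://www.apple.com/some/report.pdf",   # duplicate
    "https://www.apple.com/contact",           # not a PDF
]

# Keep only unique links that actually end in .pdf (ignoring query strings).
pdf_links = sorted({
    link for link in links
    if link.split("?")[0].lower().endswith(".pdf")
})
print(pdf_links)  # ['https://www.apple.com/some/report.pdf']
```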
If you still have problems doing so, let me know and I will build an example workflow for you.
I already filtered for the .pdf extension; I just thought there might be a more elegant way than using the Google search. Maybe by pulling a full index of the website?
Right now I have a table containing all the relevant URLs and use “Navigate” and “Send keys” in a loop to cycle through and save them. Here as well, there must be a more elegant way.
If you have already extracted the URLs by using the Selenium nodes, then just use the HTTPS Connection node and the Download node to download the files.
Use “apple.com” as the host in the HTTPS Connection node and the URLs in the Download node. It’s that easy.
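If it helps to see the same idea outside KNIME, here is a rough Python sketch (using the requests library purely for illustration; the URLs and folder name are placeholders, and this is not meant to describe how the Download node works internally):

```python
import os
import requests

# Stand-in for the Download node's target folder.
target_folder = "downloads"
os.makedirs(target_folder, exist_ok=True)

# Placeholder URLs; in the workflow these come from the filtered link table.
pdf_urls = [
    "https://www.apple.com/path/one.pdf",
    "https://www.apple.com/path/two.pdf",
]

for url in pdf_urls:
    filename = os.path.join(target_folder, url.rsplit("/", 1)[-1])
    response = requests.get(url, timeout=30)
    response.raise_for_status()              # fail loudly on 404s etc.
    with open(filename, "wb") as f:
        f.write(response.content)            # save the PDF under its own name
```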
Alright, so that’s how you handle the HTTPS Connection node.
After executing this, the target folder contains the right number of PDF files with the right names. However, they are all identical in size (31.1 KB) and can’t be opened (not supported or corrupted). Any idea what might be causing this?
Additional info: although I end up with multiple PDF files, the File List table of the Download node contains only one row with one of the file paths.
The problem is that you mentioned the Apple website, but you are actually trying to download files from lnvtechnology.com. So you have to enter that host instead of apple.com in the HTTPS Connection node.
Riiiight, I had already forgotten that I mentioned apple.com in my first post.
I assumed the host was just required to point the HTTPS Connector to the www (think DNS server) and would not be used further. Thank you very much for your support!
Edit: would you mind removing the specific URL from the example? Forgot to sanitize it in the files.
Additional question: is there a more robust way to screen for PDFs on a website? I stumbled upon one website where there are 94 links to PDFs in the first 100 Google results, but only 5 of them point to actual files.
The remaining links lead to dead ends on the website.
Use a search engine or scrape through all pages in a domain and find all the URLs ending in .pdf.
I think the search engine method that you are already following is better. Just use this format for searching:
e.g.
site:lnvtechnology.com filetype:pdf
or
site:apple.com/education filetype:pdf
After downloading the files, you can use the File Meta Info node and filter the files (Row Filter) based on the file size, then delete very small files with the Delete Files node.
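Roughly the same cleanup expressed in plain Python, for illustration only (the 10 kB threshold is an assumption; pick whatever cutoff separates real PDFs from error pages in your results):

```python
import os

target_folder = "downloads"          # same target folder as in the earlier sketch
min_size_bytes = 10 * 1024           # assumed threshold, adjust to your data

for name in os.listdir(target_folder):
    path = os.path.join(target_folder, name)
    if name.lower().endswith(".pdf") and os.path.getsize(path) < min_size_bytes:
        os.remove(path)              # drop files too small to be real PDFs
```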
One follow-up on the Download node:
For some URLs I get the error “Execute failed: Read timed out”. How can I tell the node or workflow to just skip the current URL when this happens and carry on with the next one?
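For illustration, this is the behaviour I’m after, sketched in plain Python (the requests call, URLs, and timeout value are just placeholders, not what the Download node does internally):

```python
import requests

pdf_urls = ["https://example.com/a.pdf", "https://example.com/b.pdf"]  # placeholders

for url in pdf_urls:
    try:
        response = requests.get(url, timeout=30)
        response.raise_for_status()
        # ... write response.content to disk as before ...
    except requests.exceptions.RequestException as err:
        # e.g. a read timeout: note it and carry on with the next URL
        print(f"Skipping {url}: {err}")
        continue
```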