Search website for specific files, e.g. PDF

Hi there,

I am looking for a way to identify all PDF files hosted on a given domain (and subsequently download them). Is there a best practice for this?
Right now I’m using Selenium nodes to open Google and search for “filetype:pdf site:https://www.apple.com”, then open and download every result from the first five pages.
As you might imagine, this process is not very reliable (there are redundancies, some results are not even PDFs, the order is random and relevant documents might not be downloaded…) and therefore I’m looking for a better way to do this.

I’m fairly new to KNIME, so any guidance is highly appreciated :slight_smile:

Hi @zzzZZZzzz and welcome to the KNIME community forum,

You can extract the links with the Selenium nodes, then keep only the unique links that point to a PDF (by checking whether the link ends in .pdf). After that you can easily download the files.
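Outside of KNIME, the filtering step could look roughly like the Python sketch below. It is only an illustration of the idea, not a KNIME node configuration; the function name `unique_pdf_links` and the example.com URLs are made up for this example. It checks the path part of the URL, so a link with a query string after `.pdf` still counts.

```python
from urllib.parse import urlparse


def unique_pdf_links(links):
    """Return links whose URL path ends in .pdf, without duplicates, in first-seen order."""
    seen = set()
    result = []
    for link in links:
        # Look only at the path so that "?download=1" and similar suffixes do not matter.
        path = urlparse(link).path
        if path.lower().endswith(".pdf") and link not in seen:
            seen.add(link)
            result.append(link)
    return result


# Hypothetical input, e.g. the href values collected by the browser automation.
links = [
    "https://example.com/a.pdf",
    "https://example.com/a.pdf",
    "https://example.com/b.PDF?x=1",
    "https://example.com/page.html",
]
print(unique_pdf_links(links))
```

Note that a link ending in .pdf is only a hint; the server can still return something else for it.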

If you still have problems doing so, let me know and I will build an example workflow for you.

:blush:

Thanks for your reply!

I already filtered for the .pdf extension, I just thought there might be a more elegant way than using Google search. Maybe pulling a full index of a website?

Right now I have a table containing all the relevant URLs and use “Navigate” and “Send keys” in a loop to cycle through and save them. Here as well, there must be a more elegant way :wink:

If you have already extracted the URLs by using the Selenium nodes, then just use the HTTPS Connection node and the Download node to download the files.

Use “apple.com” as the host in the HTTPS Connection node and pass the URLs to the Download node. It is that easy.

Alright, so that’s how you handle the HTTPS Connection node :smiley:

After executing this, there are the right number of PDF files with the right names in the target folder. However, they are all identical in size (31.1KB) and can’t be opened (not supported or corrupted). Any idea what might be causing this?

Additional info: although I end up with multiple PDF files, the Filelist Table of the Download Node contains only one row with one of the file paths.

Please share your workflow and I will modify it.

That is very kind of you. I’m not sure what files you need so I zipped the whole project:

The problem is that you mentioned the Apple website, but you are actually trying to download files from lnvtechnology.com. So you have to enter this host instead of apple.com in the HTTPS Connection node.
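If the URL table mixes several sites, it can help to check which host each URL actually belongs to before configuring the connection. Here is a minimal Python sketch of that check, assuming plain absolute URLs; the helper name `group_by_host` and the example.com / example.org addresses are hypothetical and only for illustration.

```python
from collections import defaultdict
from urllib.parse import urlparse


def group_by_host(urls):
    """Group absolute URLs by the host name they point to."""
    groups = defaultdict(list)
    for url in urls:
        # hostname is the part a connection has to be opened to, e.g. "www.example.com".
        host = urlparse(url).hostname
        groups[host].append(url)
    return dict(groups)


# Hypothetical URL list; each key is a host that needs its own connection setting.
urls = [
    "https://www.example.com/docs/a.pdf",
    "https://example.org/files/b.pdf",
    "https://www.example.com/docs/c.pdf",
]
for host, members in group_by_host(urls).items():
    print(host, len(members))
```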

:blush:

Riiiight, I had already forgotten that I mentioned apple.com in my first post :grin:
I assumed the host was just required to point the HTTPS Connector to the www (think DNS server) and would not be used further. Thank you very much for your support!

Edit: would you mind removing the specific URL from the example? Forgot to sanitize it in the files.

Additional question: is there a more robust way to screen for PDFs on a website? I stumbled upon one website where there are 94 links to PDFs in the first 100 Google results, but only 5 of them point to actual files.
The remaining links lead to dead ends on the website.

I think you have two options:

Use a search engine or scrape through all pages in a domain and find all the URLs ending in .pdf.
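For the scraping option, the per-page step (collect every link on a page and keep those ending in .pdf) could be sketched in Python as below. This is only a rough illustration using the standard library; it handles a single page's HTML and leaves out the actual crawling, rate limiting and robots.txt handling. The names `PdfLinkParser` and `pdf_links_in_page` and the example.com address are made up for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class PdfLinkParser(HTMLParser):
    """Collect absolute URLs of <a href=...> links whose path ends in .pdf."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.pdf_links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links against the page they were found on.
        absolute = urljoin(self.base_url, href)
        if urlparse(absolute).path.lower().endswith(".pdf"):
            self.pdf_links.append(absolute)


def pdf_links_in_page(html, base_url):
    parser = PdfLinkParser(base_url)
    parser.feed(html)
    return parser.pdf_links


# Hypothetical page content; in a real crawl this would be the downloaded HTML.
html = '<a href="/docs/report.pdf">Report</a><a href="about.html">About</a><a>no href</a>'
print(pdf_links_in_page(html, "https://example.com/start/"))
```

A full crawl would repeat this for every internal link it finds, which is why it is usually slower than a search engine query.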

I think the search engine method that you are already following is better. Just use this format for searching, e.g.:

site:lnvtechnology.com filetype:pdf

or

site:apple.com/education filetype:pdf

After downloading the files, you can use the File Meta Info node and filter the files (Row Filter) based on file size, then delete the very small files with the Delete Files node.
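The same clean-up idea can be sketched in Python. Besides the size filter described above, this sketch adds one extra check that is my own assumption and not part of the KNIME nodes mentioned: a real PDF starts with the bytes `%PDF-`, so files that are actually HTML error pages can be spotted even if they are not tiny. The function names and the 1024-byte threshold are arbitrary choices for illustration.

```python
from pathlib import Path


def looks_like_pdf(path):
    """Return True if the file starts with the PDF header bytes."""
    with open(path, "rb") as handle:
        return handle.read(5) == b"%PDF-"


def split_downloads(folder, min_bytes=1024):
    """Split *.pdf files in a folder into (keep, discard) lists.

    A file is kept only if it is at least min_bytes large and has a PDF header.
    Nothing is deleted here; the caller decides what to do with the discard list.
    """
    keep, discard = [], []
    for path in sorted(Path(folder).glob("*.pdf")):
        if path.stat().st_size >= min_bytes and looks_like_pdf(path):
            keep.append(path)
        else:
            discard.append(path)
    return keep, discard
```

Calling `split_downloads("downloads")` on a hypothetical target folder returns both lists, so the discarded files can be reviewed before anything is removed.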

:blush:

One follow-up on the Download node:
For some URLs I get the error “Execute failed: Read timed out”. How can I tell the node or workflow to just skip the current URL when this happens and carry on with the next one?

Hi there @zzzZZZzzz,

You should use the Try/Catch nodes from the Error Handling sub-category. For a workflow example, check out this topic.

Br,
Ivan

Thanks, I had already stumbled upon the Error Handling nodes. But I am not sure how to implement the “on error, move on to next flow variable” part.

Hi there @zzzZZZzzz ,

That is done automatically. You can check the KNIME Hub for more examples on these nodes. Maybe this example workflow can help.
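For anyone more familiar with scripting, the pattern is the same as a try/except inside a loop: the failing item is caught and the loop simply goes on to the next one. Below is a minimal Python sketch of that pattern, not of the KNIME nodes themselves. The function `download_all` and its `fetch` parameter are hypothetical; `fetch` stands for whatever actually downloads one URL.

```python
def download_all(urls, fetch):
    """Try every URL; collect results and remember the ones that failed.

    fetch is a caller-supplied function taking a URL and returning bytes.
    Timeouts and other network errors are subclasses of OSError in Python,
    so catching OSError lets the loop skip the URL and continue.
    """
    downloaded = {}
    skipped = []
    for url in urls:
        try:
            downloaded[url] = fetch(url)
        except OSError as error:
            # Record the failure and move on to the next URL.
            skipped.append((url, str(error)))
    return downloaded, skipped
```

One possible `fetch` would be `lambda url: urllib.request.urlopen(url, timeout=30).read()`, where the 30 seconds are an arbitrary choice. The `skipped` list plays the role of the error branch: it tells you afterwards which URLs to retry.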

Br,
Ivan
