Scanning web page for URLs

Yush · August 20, 2019, 3:37pm

Hello,

I have this web page: https://iri.jrc.ec.europa.eu/scoreboard18.html

On this page, there are 2 xlsx files, which I want KNIME to be able to detect, so that I can use the relevant reader node to download the data.

Thanks,

Kind regards,
Yush

umutcankurt · August 20, 2019, 7:13pm

I don’t understand exactly what you want, but I’ll take a guess. There are many ways to do this; I think the following will give you an idea.

Create a simple workflow like the one in the picture below.
Paste the XPath from the image into the XPath node, and always put “dns:” at the beginning. The XPath query returns the path to the file.
You get the file download link.

Yush · August 21, 2019, 8:53am

Thanks for your quick response umutcankurt,

Apologies, I should have been clearer. I was looking for something like

input:
https://iri.jrc.ec.europa.eu/scoreboard18.html

output: https://iri.jrc.ec.europa.eu/documents/10180/1771724/R%26D%20ranking%20of%20the%20world%20top%202500%20companies.xlsx

https://iri.jrc.ec.europa.eu/documents/10180/1771724/R%26D%20ranking%20of%20EU%20top%201000%20companies

I tried to replicate what you suggested, but my href column comes out blank.

Scanning web page for URLs.knwf (13.4 KB)

qqilihq · August 21, 2019, 9:20am

Hi Yush,

the trick is to get the XPath expression right (this can be a bit fiddly with the XPath node, as it involves some trial-and-error). Anyways, here’s one possible solution:

I’m using the following XPath expression to grab the links to MS Excel files:

//dns:a[dns:img[contains(@class,"xls_icon_mini")]]/@href

Explanation in prose:

Get the href attribute of all <a> tags that contain an <img> tag whose class attribute contains xls_icon_mini.

(the dns: prefixes are necessary because XPath is very strict with regard to namespaces, and dns: represents the http://www.w3.org/1999/xhtml namespace; see the “Namespace” tab in the XPath dialog)
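
For anyone who wants to check the logic outside KNIME, here is a minimal Python sketch of the same selection using only the standard library. The sample markup, file paths, and the function name excel_links are made up for illustration and are not taken from the real page. Python's ElementTree supports only a subset of XPath (no contains() and no /@href step), so the class test and the attribute lookup are done in plain Python instead.

```python
import xml.etree.ElementTree as ET
from typing import List

# Same namespace that the KNIME XPath node binds to the "dns" prefix.
NS = {"dns": "http://www.w3.org/1999/xhtml"}

# Hypothetical stand-in for the page; the real markup may differ.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <a href="/documents/ranking.xlsx"><img class="icon xls_icon_mini" src="x.png"/></a>
    <a href="/documents/report.pdf"><img class="pdf_icon_mini" src="p.png"/></a>
    <a href="/about.html">About</a>
  </body>
</html>"""


def excel_links(xhtml: str) -> List[str]:
    """Return href values of <a> elements with a matching <img> child."""
    root = ET.fromstring(xhtml)
    links = []
    # Equivalent of //dns:a: every <a> in the XHTML namespace.
    for a in root.findall(".//dns:a", NS):
        # Equivalent of [dns:img[contains(@class, "xls_icon_mini")]].
        imgs = a.findall("dns:img", NS)
        if any("xls_icon_mini" in img.get("class", "") for img in imgs):
            href = a.get("href")  # equivalent of /@href
            if href is not None:
                links.append(href)
    return links


print(excel_links(SAMPLE))
```

Without the namespace mapping, findall(".//a") would match nothing in this document, which is the same reason the dns: prefix is needed in the KNIME node.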

Does this help?

– Philipp

Yush · August 21, 2019, 10:33am

Thanks qqilihq,

This is perfect, I can now loop through multiple pages, and extract all the Excel files from them.

Thanks to umutcankurt also.

Kind regards,
Yush

Yush · August 21, 2019, 2:30pm

Sorry to essentially re-open this, but it does not seem to work with the 2015 version of this page. Please see the attached workflow.

Scanning web page for URLs.knwf (13.5 KB)

qqilihq · August 21, 2019, 2:50pm

On this page, the structure is slightly different (someone should tell those IRI guys to get a proper CMS). The <img> tag is not within the <a>, so you’ll need to modify your XPath.

For the second case, the XPath would look as follows:

//*[dns:img[contains(@class,"xls_icon_mini")]]/dns:a/@href

In case you want to combine both queries to make them more generic (so that they work on both pages), you should be able to simply combine them with a |:

//dns:a[dns:img[contains(@class,"xls_icon_mini")]]/@href | //*[dns:img[contains(@class,"xls_icon_mini")]]/dns:a/@href
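
To illustrate how the two branches of the combined query behave, here is a rough Python sketch using only the standard library. The sample markup, file paths, and function names are assumptions made up for this example, not the real pages. ElementTree has no union operator or contains(), so each branch is emulated in plain Python, and duplicates are removed by URL.

```python
import xml.etree.ElementTree as ET
from typing import List

XHTML = "http://www.w3.org/1999/xhtml"
NS = {"dns": XHTML}

# Hypothetical markup mixing both layouts; the real pages may differ.
SAMPLE_BOTH = """<html xmlns="http://www.w3.org/1999/xhtml"><body>
  <p><a href="/2018/world.xlsx"><img class="xls_icon_mini" src="x.png"/></a></p>
  <p><img class="xls_icon_mini" src="x.png"/><a href="/2015/eu.xlsx">EU</a></p>
  <p><img class="pdf_icon_mini" src="p.png"/><a href="/2015/report.pdf">PDF</a></p>
</body></html>"""


def has_xls_icon(element: ET.Element) -> bool:
    """True if the element has an <img> child whose class contains xls_icon_mini."""
    return any(
        "xls_icon_mini" in img.get("class", "")
        for img in element.findall("dns:img", NS)
    )


def excel_links_both(xhtml: str) -> List[str]:
    """Collect hrefs matched by either branch of the combined query."""
    root = ET.fromstring(xhtml)
    links = []
    for element in root.iter():
        if not has_xls_icon(element):
            continue
        # Branch 1: the <img> sits inside the <a> itself.
        if element.tag == "{%s}a" % XHTML and element.get("href"):
            links.append(element.get("href"))
        # Branch 2: the <img> and the <a> are siblings under the same parent.
        for a in element.findall("dns:a", NS):
            if a.get("href"):
                links.append(a.get("href"))
    # Drop duplicate URLs while keeping the order in which they were found.
    return list(dict.fromkeys(links))


print(excel_links_both(SAMPLE_BOTH))
```

Note that the second branch picks up every <a> that shares a parent with a matching icon, so on a page where several links sit next to one Excel icon it may return more than the Excel files.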

– Philipp

Yush · August 21, 2019, 2:56pm

Amazing! Thank you again.

Kind regards,
Yush

armingrudd · August 21, 2019, 11:46pm

Hi @Yush,

Following the great solutions by @umutcankurt and @qqilihq, and since @qqilihq mentioned that getting the XPath expression right can be fiddly, I’d like to share this blog post, which explains how to easily find the XPath for an item on a web page:
https://blog.statinfer.com/how-to-get-the-content-of-a-web-page-in-knime/
