Scanning web page for URLs

Hello,

I have this web page: https://iri.jrc.ec.europa.eu/scoreboard18.html

On this page, there are 2 xlsx files, which I want KNIME to be able to detect, so that I can use the relevant reader node to download the data.

Thanks,

Kind regards,
Yush


I'm not sure I understand exactly what you want, but here is a guess. There are many ways to do this; the following should give you an idea.

  1. Create a simple workflow like the picture below
  2. Paste the XPath shown in the image into the XPath node, and always put “dns:” in front of each element name. The XPath gives you the path to the file.
  3. You get the file download link.


image
image
image


Thanks for your quick response umutcankurt,

Apologies, I should have been clearer. I was looking for something like

input:
https://iri.jrc.ec.europa.eu/scoreboard18.html

output: https://iri.jrc.ec.europa.eu/documents/10180/1771724/R%26D%20ranking%20of%20the%20world%20top%202500%20companies.xlsx

https://iri.jrc.ec.europa.eu/documents/10180/1771724/R%26D%20ranking%20of%20EU%20top%201000%20companies

I tried to replicate what you suggested, but my href column comes out blank.

Scanning web page for URLs.knwf (13.4 KB)

Hi Yush,

the trick is to get the XPath expression right (this can be a bit fiddly with the XPath node, as it involves some trial-and-error). Anyways, here’s one possible solution:

I’m using the following XPath expression to grab the links to MS Excel files:

//dns:a[dns:img[contains(@class,"xls_icon_mini")]]/@href

Explanation in prose:

Get the href attribute of all <a> tags which contain an <img> tag which has a class attribute which contains xls_icon_mini:

image

(The dns: prefixes are necessary because XPath is very strict about namespaces, and dns: represents the http://www.w3.org/1999/xhtml namespace; see the “Namespace” tab in the XPath dialog.)
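
For anyone who wants to see the same logic outside KNIME, here is a minimal Python sketch using only the standard library. The markup in SAMPLE is made up for illustration (it is not copied from the IRI page), and excel_links is a hypothetical helper name. ElementTree does not support contains(), so the class check is done in plain Python; the point is mainly to show why the namespace matters.

```python
import xml.etree.ElementTree as ET

# The namespace that the dns: prefix stands for in the XPath node.
XHTML_NS = "http://www.w3.org/1999/xhtml"

# Hypothetical XHTML fragment for illustration only.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml"><body>
<a href="/documents/ranking.xlsx"><img class="icon xls_icon_mini" src="x.png"/></a>
<a href="/about.html">About</a>
</body></html>"""


def excel_links(xhtml: str) -> list:
    """Return the href of every <a> with an <img> child whose class contains xls_icon_mini."""
    root = ET.fromstring(xhtml)
    links = []
    # Element names must be namespace-qualified, just like dns:a in the XPath.
    for a in root.iter(f"{{{XHTML_NS}}}a"):
        for img in a.findall(f"{{{XHTML_NS}}}img"):
            if "xls_icon_mini" in img.get("class", "") and a.get("href") is not None:
                links.append(a.get("href"))
                break  # one matching icon is enough for this <a>
    return links


print(excel_links(SAMPLE))                     # ['/documents/ranking.xlsx']
print(ET.fromstring(SAMPLE).findall(".//a"))   # [] because the namespace is missing
```

The second print shows the same symptom as a blank href column: an unqualified name matches nothing in a namespaced document.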

Does this help?

– Philipp


Thanks qqilihq,

This is perfect, I can now loop through multiple pages, and extract all the Excel files from them.

Thanks to umutcankurt also.

Kind regards,
Yush


Sorry to essentially re-open this, but it seems not to work with the 2015 version of this page. Please see workflow.

Scanning web page for URLs.knwf (13.5 KB)

On this page, the structure is slightly different (someone should tell those IRI guys to get a proper CMS). The <img> tag is not within the <a>, so you’ll need to modify your XPath.

For the second case, the XPath would look as follows:

//*[dns:img[contains(@class,"xls_icon_mini")]]/dns:a/@href

In case you want to combine both queries to make them more generic (so that they work on both pages), you should be able to simply combine them with a |:

//dns:a[dns:img[contains(@class,"xls_icon_mini")]]/@href | //*[dns:img[contains(@class,"xls_icon_mini")]]/dns:a/@href
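
As a rough illustration of what the combined query does, here is a small standard-library Python sketch. The SAMPLE markup and the function names are assumptions made up for this example, not taken from either IRI page. It also removes duplicate href values, which is a simplification: the XPath union removes duplicate nodes, not duplicate strings.

```python
import xml.etree.ElementTree as ET

NS = "{http://www.w3.org/1999/xhtml}"  # what dns: stands for

# Hypothetical markup covering both page layouts.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml"><body>
<p><a href="/a.xlsx"><img class="xls_icon_mini" src="i.png"/></a></p>
<p><img class="xls_icon_mini" src="i.png"/><a href="/b.xlsx">Data</a></p>
<p><a href="/c.html">Other</a></p>
</body></html>"""


def has_excel_icon(element) -> bool:
    """True if the element has a direct <img> child whose class contains xls_icon_mini."""
    return any(
        "xls_icon_mini" in img.get("class", "")
        for img in element.findall(NS + "img")
    )


def excel_links_both_layouts(xhtml: str) -> list:
    root = ET.fromstring(xhtml)
    links = []
    for element in root.iter():
        if not has_excel_icon(element):
            continue
        # Layout 1: the icon sits inside the <a> itself.
        matches = [element] if element.tag == NS + "a" else []
        # Layout 2: the icon and the <a> are siblings under the same parent.
        matches += element.findall(NS + "a")
        for a in matches:
            href = a.get("href")
            if href is not None and href not in links:
                links.append(href)
    return links


print(excel_links_both_layouts(SAMPLE))  # ['/a.xlsx', '/b.xlsx']
```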

– Philipp


Amazing! Thank you again.

Kind regards,
Yush

Hi @Yush,

Following the great solutions by @umutcankurt and @qqilihq, and since @qqilihq mentioned that getting the XPath expression right can be fiddly, I'd like to share this blog post, which explains how to easily find the XPath for an item on a web page:
https://blog.statinfer.com/how-to-get-the-content-of-a-web-page-in-knime/

:blush:

