download data from a URL link

Hi everyone,

 

Is there a node in KNIME to extract data from a URL, like the "Download" tool in Alteryx?

I have a case where I need to pass a few URLs and get the data.

Please help me.

Thank you.

This depends on what type of file you want to download and what you want to do with it. If it is a file that can be read with one of the various reader nodes (e.g. File Reader), then you can simply use the URL directly in the reader. If it is a "binary" file, then you can have a look at the "Download Files" node.
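For illustration, the same idea outside KNIME (e.g. in a Python Script node) would look roughly like the sketch below; the URLs are only placeholders, not real endpoints.

```python
# Rough sketch only; the URLs are placeholders, not real endpoints.
import urllib.request

import pandas as pd

# A text-based file (CSV, table, ...) can usually be read straight from the URL,
# just like pointing the File Reader at the URL in KNIME.
df = pd.read_csv("https://example.com/data/report.csv")
print(df.head())

# A "binary" file is better downloaded first (what the Download Files node does),
# then read from the local copy with a suitable reader.
urllib.request.urlretrieve("https://example.com/data/report.xlsx", "report.xlsx")
```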

Hi thor,

 Thanks for the quick reply,

Here's my problem,

One of the columns in my table contains a Project ID, and I need to extract the project details for that particular Project ID from the website using a link,

Say http://www.*****************/Pages/Dashboard.aspx?ProjectID=0007159

Here the last 7 digits in the link represent a Project ID, and each ID opens a different page.

For example:

If my Proj IDs are 0007158, 0007157, 0007156 and 0007155

then to get the details I need to use these URLs

http://www.*****************/Pages/Dashboard.aspx?ProjectID=0007158

http://www.*****************/Pages/Dashboard.aspx?ProjectID=0007157

http://www.*****************/Pages/Dashboard.aspx?ProjectID=0007156

and http://www.*****************/Pages/Dashboard.aspx?ProjectID=0007155
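In other words, the pattern is just the base URL plus the Project ID, roughly like this sketch (the host is masked as above; in KNIME I build the string with a node, the Python is only to illustrate the idea):

```python
# Illustration only: join the masked base URL with each Project ID.
base = "http://www.*****************/Pages/Dashboard.aspx?ProjectID="
project_ids = ["0007158", "0007157", "0007156", "0007155"]

urls = [base + pid for pid in project_ids]
for url in urls:
    print(url)
```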

I tried to solve it, here are the steps I followed:

1. I joined the Project ID with the URL using the Column Aggregator node (Concat).

2. Used HTML Parser.

3. Used XML to JSON.

4. Used JSON to Table, and

5. Used Column Filter to extract the details I need.

The problem is that the HTML Parser and JSON to Table nodes take a lot of time. Please help me solve this in a better way.

 

Thanks in advance.

Really sorry for such a big explanation.

 

So the URLs point to HTML pages and you want to extract data from these HTML pages into new tables? Then I don't see how this can be made any faster. You could skip the XML to JSON step and use the XPath node directly on the parsed HTML cells, but that doesn't make the HTML Parser itself any faster.
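For illustration, extracting a value with an XPath directly from the downloaded page would look roughly like this outside KNIME; the host is the masked one from above and the XPath expression is made up, so you would adapt it to the real page structure:

```python
# Sketch only: the host is masked and the XPath is a placeholder,
# the real expression depends on the markup of the dashboard page.
import requests
from lxml import html

url = "http://www.*****************/Pages/Dashboard.aspx?ProjectID=0007158"
page = html.fromstring(requests.get(url, timeout=30).content)

details = page.xpath("//div[@id='projectDetails']//text()")
print(" ".join(t.strip() for t in details if t.strip()))
```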

 

Thanks a lot Thor,

A quick question: does the speed of the internet connection have anything to do with the performance here?

 

 

Sure, if it's slow the HTML Parser will take longer to download the file before it starts interpreting the contents.

Awesome thor, thank you.

An additional side note:

As far as I can tell, you're inputting URLs into the HTML Parser node (and I blatantly assume you're using the Palladian parser):

This node should show you a warning when you're using URLs as input. We rather recommend using an HttpRetriever to download the URL and then using the result column with the downloaded data as input for the HtmlParser. Besides additional configuration options, this will show you very clearly whether it is actually the downloading (which is performed by the HttpRetriever) or the parsing (performed by the HtmlParser) that takes a long time.
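As a rough illustration of that split (outside of Palladian), timing the download and the parse separately makes it obvious which one is the bottleneck; the URL is again the masked placeholder from above:

```python
# Sketch: measure the download (HttpRetriever's job) and the parse
# (HtmlParser's job) separately. The URL is a masked placeholder.
import time

import requests
from lxml import html

url = "http://www.*****************/Pages/Dashboard.aspx?ProjectID=0007158"

t0 = time.perf_counter()
raw = requests.get(url, timeout=30).content   # download step
t1 = time.perf_counter()
tree = html.fromstring(raw)                   # parse step
t2 = time.perf_counter()

print(f"download: {t1 - t0:.2f} s, parse: {t2 - t1:.2f} s")
```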

-- Philipp

Hi,

 

How can I extract the content from a web page given its URL?

Say I need to extract the content from this article at this URL:

http://feedproxy.google.com/~r/greenbuzz/~3/Jvc4pFxmL6U/how-your-company-can-get-serious-about-responsible-palm-oil

Please help me,

The HtmlParser node is not working.

What does 'not working' mean?

The HtmlParser, as the name already states, is purely a parser. For extracting content from the parsed page you can use XPaths (the manual way) or try the Palladian ContentExtractor node (a heuristics-based node for extracting a "main content" block from a page).
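As a toy illustration of the heuristic idea (this is not the ContentExtractor's actual algorithm), picking the element with the most text often approximates the "main content" block; the URL is the one from your post:

```python
# Toy heuristic only, not the Palladian ContentExtractor's algorithm:
# take the <article>/<div> with the largest amount of text as "main content".
import requests
from lxml import html

url = "http://feedproxy.google.com/~r/greenbuzz/~3/Jvc4pFxmL6U/how-your-company-can-get-serious-about-responsible-palm-oil"
tree = html.fromstring(requests.get(url, timeout=30).content)

candidates = tree.xpath("//article | //div")
best = max(candidates, key=lambda el: len(el.text_content()), default=None)
if best is not None:
    print(best.text_content().strip()[:500])
```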

-- Philipp

Hi,

The HtmlParser is not returning any XML document.

Earlier, when I used it for another purpose, it was fine. I don't know, maybe it's because the URL (https://www.greenbiz.com/article/how-your-company-can-get-serious-about-responsible-palm-oil?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+greenbuzz+%28GreenBiz%29) has some problem.

I will try to use the ContentExtractor node.

 

Thanks

@qqilihq

Can you please tell me if there is any problem with this link:

http://feedproxy.google.com/~r/greenbuzz/~3/Jvc4pFxmL6U/how-your-company-can-get-serious-about-responsible-palm-oil

I see no issues. It's redirecting multiple times, but the HttpRetriever handles that transparently.
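For reference, this is what the redirect chain looks like with a plain HTTP client (a sketch, not Palladian code); requests follows the redirects automatically, just like the HttpRetriever does:

```python
# Sketch: follow the feedproxy redirects and show where the request ends up.
import requests

url = "http://feedproxy.google.com/~r/greenbuzz/~3/Jvc4pFxmL6U/how-your-company-can-get-serious-about-responsible-palm-oil"
r = requests.get(url, timeout=30)

print("final URL:", r.url)
print("redirect hops:", [h.status_code for h in r.history])
```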