download data from a URL link

Hi everyone,

 

Is there a node in KNIME to extract data from a URL, like the "Download" tool in Alteryx?

I have a case where I need to pass a few URLs and get the data.

Please help me.

Thank you.

This depends on what type of file you want to download and what you want to do with it. If it is a file that can be read with one of the various reader nodes (e.g. File Reader), then you can simply use the URL directly in the reader. If it is a "binary" file, then you can have a look at the "Download Files" node.
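For illustration, the same idea outside KNIME (e.g. in a Python Script node) would look roughly like the sketch below; the URLs are only placeholders, not real endpoints.

```python
# Rough sketch only; the URLs are placeholders, not real endpoints.
import urllib.request

import pandas as pd

# A text-based file (CSV, table, ...) can usually be read straight from the URL,
# just like pointing the File Reader at the URL in KNIME.
df = pd.read_csv("https://example.com/data/report.csv")
print(df.head())

# A "binary" file is better downloaded first (what the Download Files node does),
# then read from the local copy with a suitable reader.
urllib.request.urlretrieve("https://example.com/data/report.xlsx", "report.xlsx")
```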

Hi thor,

 Thanks for the quick reply,

Here's my problem,

One of the columns in my table contains a Project ID, and I need to extract the project details for that particular Project ID from the website using a link,

Say http://www.*****************/Pages/Dashboard.aspx?ProjectID=0007159

Here the last 7 digits in the link represent a Project ID, and each ID opens a different page.

For example:

If my Proj IDs are 0007158, 0007157, 0007156 and 0007155

then to get the details I need to use these URLs

http://www.*****************/Pages/Dashboard.aspx?ProjectID=0007158

http://www.*****************/Pages/Dashboard.aspx?ProjectID=0007157

http://www.*****************/Pages/Dashboard.aspx?ProjectID=0007156

and http://www.*****************/Pages/Dashboard.aspx?ProjectID=0007155
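In other words, the pattern is just the base URL plus the Project ID, roughly like this sketch (the host is masked as above; in KNIME I build the string with a node, the Python is only to illustrate the idea):

```python
# Illustration only: join the masked base URL with each Project ID.
base = "http://www.*****************/Pages/Dashboard.aspx?ProjectID="
project_ids = ["0007158", "0007157", "0007156", "0007155"]

urls = [base + pid for pid in project_ids]
for url in urls:
    print(url)
```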

I tried to solve it, here are the steps I followed:

1. I joined the Project ID with the URL using the Column Aggregator node (Concat).

2. Used HTML Parser.

3. Used XML to JSON.

4. Used JSON to Table, and

5. Used Column Filter to extract the details I need.

The problem is that the HTML Parser and JSON to Table nodes take a lot of time. Please help me solve this in a better way.

 

Thanks in advance.

Really sorry for such a big explanation.

 

So the URLs point to HTML pages and you want to extract data from these HTML pages into new tables? Then I don't see how this can be made any faster. You could skip the XML to JSON step and use the XPath node directly on the parsed HTML cells, but that doesn't make the HTML Parser itself any faster.
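For illustration, extracting a value with an XPath directly from the downloaded page would look roughly like this outside KNIME; the host is the masked one from above and the XPath expression is made up, so you would adapt it to the real page structure:

```python
# Sketch only: the host is masked and the XPath is a placeholder,
# the real expression depends on the markup of the dashboard page.
import requests
from lxml import html

url = "http://www.*****************/Pages/Dashboard.aspx?ProjectID=0007158"
page = html.fromstring(requests.get(url, timeout=30).content)

details = page.xpath("//div[@id='projectDetails']//text()")
print(" ".join(t.strip() for t in details if t.strip()))
```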

 

Thanks a lot Thor,

A quick question: does the speed of the internet connection have anything to do with the performance here?

 

 

Sure, if it's slow the HTML Parser will take longer to download the file before it starts interpreting the contents.

Awesome thor, thank you.

An additional side note:

As far as I can tell, you're inputting URLs into the HTML Parser node (and I blatantly assume you're using the Palladian parser):

This node should show you a warning when you're using URLs as input. We rather recommend using an HttpRetriever to download the URL and then using the result column with the downloaded data as input for the HtmlParser. Besides additional configuration options, this will show you very clearly whether it is actually the downloading (which is performed by the HttpRetriever) or the parsing (performed by the HtmlParser) that takes a long time.
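As a rough illustration of that split (outside of Palladian), timing the download and the parse separately makes it obvious which one is the bottleneck; the URL is again the masked placeholder from above:

```python
# Sketch: measure the download (HttpRetriever's job) and the parse
# (HtmlParser's job) separately. The URL is a masked placeholder.
import time

import requests
from lxml import html

url = "http://www.*****************/Pages/Dashboard.aspx?ProjectID=0007158"

t0 = time.perf_counter()
raw = requests.get(url, timeout=30).content   # download step
t1 = time.perf_counter()
tree = html.fromstring(raw)                   # parse step
t2 = time.perf_counter()

print(f"download: {t1 - t0:.2f} s, parse: {t2 - t1:.2f} s")
```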

-- Philipp

Hi,

 

How can I extract the content from a web page given its URL?

Say I need to extract the content from this article at this URL:

http://feedproxy.google.com/~r/greenbuzz/~3/Jvc4pFxmL6U/how-your-company-can-get-serious-about-responsible-palm-oil

Please help me,

The HtmlParser node is not working.

What does 'not working' mean?

The HtmlParser, as the name already states, is purely a parser. For extracting content from the parsed page you can use XPaths (the manual way) or try the Palladian ContentExtractor node (a heuristics-based node for extracting a "main content" block from a page).
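As a toy illustration of the heuristic idea (this is not the ContentExtractor's actual algorithm), picking the element with the most text often approximates the "main content" block; the URL is the one from your post:

```python
# Toy heuristic only, not the Palladian ContentExtractor's algorithm:
# take the <article>/<div> with the largest amount of text as "main content".
import requests
from lxml import html

url = "http://feedproxy.google.com/~r/greenbuzz/~3/Jvc4pFxmL6U/how-your-company-can-get-serious-about-responsible-palm-oil"
tree = html.fromstring(requests.get(url, timeout=30).content)

candidates = tree.xpath("//article | //div")
best = max(candidates, key=lambda el: len(el.text_content()), default=None)
if best is not None:
    print(best.text_content().strip()[:500])
```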

-- Philipp

Hi,

The HtmlParser is not returning any XML document.

Earlier, when I used it for another purpose, it was fine. I don't know, maybe it's because the URL (https://www.greenbiz.com/article/how-your-company-can-get-serious-about-responsible-palm-oil?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+greenbuzz+%28GreenBiz%29) has some problem.

I will try to use the ContentExtractor node.

 

Thanks

@qqilihq

Can you please tell me if there is any problem with this link:

http://feedproxy.google.com/~r/greenbuzz/~3/Jvc4pFxmL6U/how-your-company-can-get-serious-about-responsible-palm-oil

I see no issues. It's redirecting multiple times, but the HttpRetriever handles that transparently.
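For reference, this is what the redirect chain looks like with a plain HTTP client (a sketch, not Palladian code); requests follows the redirects automatically, just like the HttpRetriever does:

```python
# Sketch: follow the feedproxy redirects and show where the request ends up.
import requests

url = "http://feedproxy.google.com/~r/greenbuzz/~3/Jvc4pFxmL6U/how-your-company-can-get-serious-about-responsible-palm-oil"
r = requests.get(url, timeout=30)

print("final URL:", r.url)
print("redirect hops:", [h.status_code for h in r.history])
```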