Extract links on a Certain Web Page

sophiexiaozhe · November 15, 2017, 1:17am

Hi,

I am totally new to KNIME and am trying to perform some web analytics tasks with it. I want to extract all the articles that have appeared on Reddit about one specific topic, for example Facebook.

Basically, what I tried to do is extract all the links (URLs) on the page "https://www.reddit.com/r/facebook/", including the ones on the following pages that you reach by clicking "next" until the end. Then I want to use the content extractor to extract the content of each article. I found an example workflow to work from, but when I tried to execute the loop that fetches the pages, it didn't work properly. I am not sure which part I should change for my needs. I have attached the workflow I have been working on.

Any help would be highly appreciated! Thank you!

Best,

Sophie

Reddit_FB.knar.knwf

amartin · November 20, 2017, 3:24pm

Hi Sophie,

You might want to check the XPath syntax. Also, please note that when you reference an element node in an XPath node, you have to prefix it with the namespace name specified in the Namespace tab of the configuration dialog ("namespace:element_node_name").
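To illustrate why the prefix matters, here is a minimal sketch in Python rather than KNIME, using only the standard library. The sample markup and the prefix name `dns` are assumptions made for the example; in KNIME the prefix is whatever is listed in the Namespace tab. The same query finds nothing without the prefix and finds the links once the prefix is bound to the document's namespace.

```python
import xml.etree.ElementTree as ET

# Namespace declared on the root element of XHTML documents.
XHTML_NS = "http://www.w3.org/1999/xhtml"

# Hypothetical sample page, made up for this illustration.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <a href="/r/facebook/comments/1">First article</a>
    <a href="/r/facebook/comments/2">Second article</a>
  </body>
</html>"""

root = ET.fromstring(SAMPLE)

# Without a prefix, the query looks for <a> elements in no namespace
# and therefore matches nothing in a namespaced document.
unprefixed = root.findall(".//a")

# "dns" is an arbitrary prefix chosen here; it only has to be bound
# to the namespace the document actually uses.
ns = {"dns": XHTML_NS}
prefixed = [a.get("href") for a in root.findall(".//dns:a", ns)]

print(len(unprefixed))  # 0
print(prefixed)         # ['/r/facebook/comments/1', '/r/facebook/comments/2']
```

If an XPath query returns an empty result on a page that clearly contains the elements, a missing namespace prefix is one of the first things worth checking.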

Please find attached a sample workflow where I extract all the links to the articles on the page as well as the link to the next page.
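The overall logic of collecting the article links and then following the next-page link until there is none can be sketched in Python as follows. This is not the attached workflow, only a rough illustration of the loop. The class `LinkParser`, the function `collect_links`, the in-memory `PAGES`, and the assumption that the next-page anchor carries `rel="next"` are all hypothetical; the real page markup would need to be inspected.

```python
from html.parser import HTMLParser


class LinkParser(HTMLParser):
    """Collects article links and the next-page link from one page."""

    def __init__(self):
        super().__init__()
        self.article_links = []
        self.next_link = None

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        href = attr.get("href")
        if href is None:
            return
        # Assumption: the next-page anchor is marked with rel="next".
        if attr.get("rel") == "next":
            self.next_link = href
        else:
            self.article_links.append(href)


def collect_links(start_url, fetch, max_pages=50):
    """Follow next-page links from start_url and gather all article links.

    fetch is any callable that maps a URL to the page's HTML text.
    """
    links, url, seen = [], start_url, set()
    # Stop when there is no next page, when a page repeats, or at max_pages.
    while url is not None and url not in seen and len(seen) < max_pages:
        seen.add(url)
        parser = LinkParser()
        parser.feed(fetch(url))
        links.extend(parser.article_links)
        url = parser.next_link
    return links


# Two made-up pages standing in for real HTTP responses.
PAGES = {
    "page1": '<a href="/a1">A1</a><a href="/a2">A2</a>'
             '<a rel="next" href="page2">next</a>',
    "page2": '<a href="/a3">A3</a>',
}

print(collect_links("page1", PAGES.__getitem__))  # ['/a1', '/a2', '/a3']
```

Passing the fetch step in as a function keeps the loop testable without network access; for live pages it could be replaced with a real HTTP request.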

Best,

Anna

Reddit_FB_edited.knar.knwf