Web Scraping with Text Analytics in KNIME

Hi,

I want to create a workflow in KNIME that lets me search the web for news articles related to a particular topic. For example, I want to find all the news articles about KNIME published on the web in the last year.

Please help regarding the same. Many thanks in advance…

Regards
Anurag

Hi Anurag,

I think this highly depends on whether you want to use an RSS feed or web crawling to retrieve such information.

If you want to get news through an RSS feed, you can retrieve it within KNIME Analytics Platform using the RSS Feed Reader node. The RSS Feed Reader node takes an RSS feed URL at the input port, connects to the RSS endpoint, and then downloads and parses all available documents. You can find a related example workflow here: https://www.knime.com/nodeguide/other-analytics-types/text-processing/rss-feed-reader.
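
If it helps to see the idea outside of KNIME, here is a minimal Python sketch of what reading an RSS feed and keeping only last year's articles on a topic could look like. It is not how the RSS Feed Reader node is implemented; it only uses the Python standard library, assumes a plain RSS 2.0 feed with `title`, `link`, and `pubDate` elements, and the function names (`fetch_feed`, `parse_rss_items`, `recent_items`) are made up for illustration.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime
from urllib.request import urlopen


def fetch_feed(url, timeout=10):
    """Download the raw feed bytes (hypothetical helper, needs network access)."""
    with urlopen(url, timeout=timeout) as resp:
        return resp.read()


def parse_rss_items(rss_xml):
    """Return a list of dicts (title, link, published) from RSS 2.0 XML."""
    root = ET.fromstring(rss_xml)
    items = []
    for item in root.iter("item"):
        pub = item.findtext("pubDate")
        published = parsedate_to_datetime(pub) if pub else None
        # Treat dates without a timezone as UTC so comparisons do not fail.
        if published is not None and published.tzinfo is None:
            published = published.replace(tzinfo=timezone.utc)
        items.append({
            "title": (item.findtext("title") or "").strip(),
            "link": (item.findtext("link") or "").strip(),
            "published": published,
        })
    return items


def recent_items(items, keyword, now=None, days=365):
    """Keep items whose title contains the keyword and that are at most `days` old."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=days)
    return [
        i for i in items
        if i["published"] is not None
        and i["published"] >= cutoff
        and keyword.lower() in i["title"].lower()
    ]
```

With a real feed URL of your choice, `recent_items(parse_rss_items(fetch_feed(url)), "knime")` would give the matching articles from roughly the last year. Items without a publication date are dropped here, which is a design choice you may want to change.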

Another way to retrieve text data from the web is to run a web crawler. One of the KNIME Community Extensions, provided by Palladian, offers a large number of nodes for web search, web crawling, geo-location, RSS feeds, and more. For this purpose, you may want to use the Http Retriever node, the Html Parser node, the Content Extractor node, or the XPath node. You can find an example workflow on the EXAMPLES Server at the following path: knime://EXAMPLES/50_Applications/07_Forum_Analysis_of_the_KNIME_Forum. The related whitepaper is available here: https://files.knime.com/sites/default/files/inline-images/knime_web_knowledge_extraction.pdf
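
To illustrate the retrieve, parse, and extract steps of the crawling approach, here is a rough Python sketch using only the standard library. It is not the Palladian implementation and does not reproduce what those nodes do internally; the names `fetch_html`, `LinkExtractor`, and `extract_links`, as well as the User-Agent string, are assumptions made for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import Request, urlopen


def fetch_html(url, timeout=10):
    """Retrieve a page as text (hypothetical helper, needs network access)."""
    req = Request(url, headers={"User-Agent": "example-crawler/0.1"})
    with urlopen(req, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


class LinkExtractor(HTMLParser):
    """Collect (link text, absolute URL) pairs from <a href=...> elements."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self._href = None   # URL of the <a> element currently open, if any
        self._text = []     # text fragments seen inside that element

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace in the collected link text.
            text = " ".join("".join(self._text).split())
            self.links.append((text, self._href))
            self._href = None


def extract_links(html, base_url):
    """Parse an HTML string and return all links found in it."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.links
```

A crawler would call `extract_links(fetch_html(url), url)` on a start page, filter the links by keyword, and then repeat the same steps on the pages it wants to follow. Please check a site's terms and robots.txt before crawling it.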

Hope that helps,
Best,
Vincenzo

Hi Philipp,

Thank you for the help, that solution for the zip file worked…

I have some issues with the Selenium nodes; attached is my example workflow to help explain them:

Issue 1: In the workflow I have shared here, when I change the text in the "Send Keys" node (for example, from NREGA to NHRM), I get an error (see Snapshot 1…). However, if I build a new workflow with the same changed text in the Send Keys node, it works.

Snapshot 1…

Issue 2: I have not explicitly configured the workflow to extract news articles. When I search in Google (for example, for NREGA), I get the following result (see Snapshot 2…), but the workflow output looks different (file attached). Why am I getting only news articles in my result when I have not configured the workflow to extract them? I also tried changing the search text and still got the same issue.

Snapshot 2 …

loationname.xlsx (10.2 KB)

Regards
AJ