I am extremely new to KNIME and I’m trying to take articles from a website search on a particular topic and store them in a data table for text processing.
I’ve been trying to use the Forum Analysis workflow from the Applications examples, but I’m having trouble adapting it to my own problem.
Any guidance on completing this task would be greatly appreciated.
Thank you in advance,
I think the easiest way to automate the search and collect the content from the web is to use the Selenium nodes.
Here is a quick tutorial for Selenium nodes:
Or maybe you can do this without the Selenium nodes. In that case, you have to gather the search results (links), then loop over them and extract the content you need from each web page.
Here is a brief tutorial to get the content of a web page in KNIME (Palladian nodes are used in this method):
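The gather-links-then-loop idea above can be sketched outside KNIME as well. Below is a minimal Python illustration of the same two steps (collect the result links, then build one row per article to process further). It uses only the standard library and a made-up HTML snippet; the URLs, the `LinkCollector` class, and the sample page are all assumptions for the demo, not part of the actual workflow:

```python
from html.parser import HTMLParser

# Stand-in for a fetched search-results page (in the real workflow this
# comes from the retriever node; here it is hard-coded for the demo).
SEARCH_HTML = """
<html><body>
  <div class="result"><a href="https://example.com/article1">First</a></div>
  <div class="result"><a href="https://example.com/article2">Second</a></div>
</body></html>
"""

class LinkCollector(HTMLParser):
    """Collects every href attribute, mimicking an XPath like //a/@href."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

# Step 1: gather the result links.
collector = LinkCollector()
collector.feed(SEARCH_HTML)

# Step 2: loop over each collected URL. In a real run each URL would be
# fetched and parsed with its own XPaths for title, date, and content;
# here we just build the table of rows to loop over.
rows = [{"url": url} for url in collector.links]
print(rows)
```

In KNIME the same split shows up as two stages: one retriever/parser/XPath set that produces the list of links, and a loop that feeds each link into a second retriever/parser/XPath set.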
Thank you so much, Armin!!!
I will take a look at that.
The page you linked was very helpful. However, I’m not getting the output I desire.
I want to refer to the search results of a news page, such as this: https://www.nytimes.com/search?endDate=20190326&query=eric%20garner%20&sort=best&startDate=20140717
Then, I want to loop through the results and store the article title, date, and contents of each one in corresponding cells so that I can process the text in KNIME.
I’m unsure whether this tutorial covers looping through the individual entries of a results page, or what I need to do differently to get the article contents into a table for analysis.
I used the second tutorial you linked, without using Selenium nodes. I went through the part where you inspect the element of the page you want to retrieve.
Should I use the element path from the search-results page, or open the link to the article and use an element path from the article page itself?
Is there a way I can get KNIME to loop over each individual search result, extract title, date, and article body, and store the information in a table so that I have all of this data for every search result it returns?
Thank you for your help!!!
Attached is an example workflow in which I have gathered the titles, dates, and main contents of each search result (except for a few with different page structures; see below).
The web pages don’t all have the same structure, so it’s not possible to use a single XPath for all of them.
I think the workflow I have provided will help you to do the job. The rest of the task is on you.
web_scraping.knwf (1.6 MB)
Thank you so much for doing that. It’s very helpful, however, could you clarify some details for me?
In the workflow you attached, there are two sets of retriever, parser, and XPath nodes.
I’ve decided to use the results from a site-specific google search rather than an on-site NYtimes search in order to get the max amount of article entries.
So please tell me if I am getting this right.
First XPath: Do I need the XPath of the URL entry for the first result on the search-results page, or the XPath of the entire entry?
Second XPath: I need the XPaths of the title, date, and content from the actual article. So, I visit the first article link from my search engine and gather those XPaths there?
Please, let me know as soon as you can!
In the first set you get the search results and the XPath node collects the URLs to the articles.
The second set goes through each web page (the URL of each search result), and its XPath node extracts the titles, dates, and contents.
When I put in my URL location, I receive the following error:
ERROR XPath 7:23 Configure failed (IllegalArgumentException): There are empty namespace prefixes. Please provide a prefix for every namespace.
Here is my XPath:
Here is my query entry, with what I thought were the namespaces:
I am relatively unfamiliar with XPath namespace syntax and have probably done something wrong. I am currently trying to resolve this using online materials, but I thought I would ask you in the meantime.
When you begin your query with an id, you have to continue the path from that element. So you should input what I suggested above, or check which element has the id “rso” and input the elements that come after it.
For example if the “div” element after the body in your query (/html/body/div) has the id=“rso” then your query should be like this:
I hope this is helpful to you.
P.S. Personally, I prefer to get the XPath using the Firefox browser. Just follow the steps I mentioned in that blog post to make sure your path is the right one.
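For anyone hitting the same “empty namespace prefixes” error: it appears because the parsed page declares a default XML namespace (XHTML), so every step in the query needs a prefix bound to that namespace. A minimal Python sketch of the same rule, using the standard library purely as an illustration (the prefix name `dns`, the `id="rso"` element, and the XHTML snippet are assumptions for the demo, not the real page):

```python
import xml.etree.ElementTree as ET

# A tiny XHTML fragment with a default namespace, like the parsed
# pages an XPath query runs against.
XHTML = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body><div id="rso"><a href="/article">link</a></div></body>
</html>"""

root = ET.fromstring(XHTML)

# Without a prefix, elements in the default namespace are not found:
assert root.find(".//div") is None

# Declaring a prefix and using it on every step resolves the lookup:
ns = {"dns": "http://www.w3.org/1999/xhtml"}
div = root.find(".//dns:div", ns)
print(div.get("id"))  # prints: rso
```

The KNIME XPath node works the same way: declare a prefix for the document’s default namespace in the node’s namespace table, then use that prefix on every element in the query (e.g. `/dns:html/dns:body/...` instead of `/html/body/...`).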
Thanks Armin, you are the best!!! So helpful!! A million blessings to you and your family
As far as I can tell, I’ve built the workflow as you described; however, I’m not getting my data to print in a table. Could you please look over my workflow and let me know what you think is wrong? I’ve attached it. The workflow in question is the third in this document; the second is the one you sent me.
KNIME_project2.knwf (70.6 KB)
Try this for the first XPath node (the last flow which had an error):
And check “Multiple rows”.
This gathers all the links in your search result.
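The difference the “Multiple rows” option makes can be shown with another small standard-library sketch: a single-value query returns only the first match, while the multiple-rows behaviour returns one row per matched link, which is what the loop over the articles needs. The page content and URLs below are invented for the demo:

```python
import xml.etree.ElementTree as ET

# Stand-in for a parsed search-results page (the parser node in the
# workflow produces XHTML like this; the URLs are made up).
PAGE = """<html xmlns="http://www.w3.org/1999/xhtml"><body>
  <a href="https://example.com/a1">one</a>
  <a href="https://example.com/a2">two</a>
  <a href="https://example.com/a3">three</a>
</body></html>"""

ns = {"dns": "http://www.w3.org/1999/xhtml"}
root = ET.fromstring(PAGE)

# Single-value behaviour: only the first matching link is returned.
first = root.find(".//dns:a", ns).get("href")

# "Multiple rows" behaviour: one result per matched link.
all_links = [a.get("href") for a in root.findall(".//dns:a", ns)]
print(first)
print(all_links)
```

If the XPath node is left in single-cell mode, only one article ever reaches the downstream nodes, which matches the symptom of getting data for just one result.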
I’m currently troubleshooting this myself but thought I’d upload it as well to see if it saves me some time.
I’ve managed to get article title, date, and contents from my workspace! Hooray!
The bad news is that I can only get all three for one result from my gathered URLs. Is there something I can change in my paths to get the data from all the URLs I pulled? Have I done something wrong?
I uploaded my workspace again for your review. KNIME_project2.knwf (70.6 KB)
Thanks again, you are making all my dreams come true
I’m talking about the first workflow
I’m now constructing further workflows for searches on other websites. Getting my XPath right is really the issue! I’m using Firefox but still having difficulty mapping to the correct items on my pages, perhaps due to my limited understanding of the language. Do you have any tips for making sure my XPaths are correct? Thanks!
Would you please send a single workflow so I know which one you mean?
I also keep receiving an error on some of my HTTP Retriever nodes. This workflow has one, and I’m curious as to why that is: news max.knwf (947 Bytes)
Just open one of those links that has a missing title and you’ll see that the page is blank (page title: “Page Not Found”).
The workflow is empty.