I am looking for a solution to extract all links/URLs from websites like WSJ, FT, Reuters, ZH etc., together (if possible) with the titles of the articles.
-I would like to get all the URLs of the articles listed on a given page into a single table column. I have experimented with the RSS Feed Reader, which works fine for RSS feeds (indeed :-) ) but not for webpages with multiple articles listed one after another. I know that a number of nodes (Webpage Retriever, GET Request etc.) can retrieve an HTML page, but I don't know how to scrape only the URLs on a given page. Can anybody advise how I should approach that?
-I would like to get all the titles in a second column. Here I guess I could convert the URL into a string, then split and manipulate the newly formed string to get the title. I could probably also use the Tika URL parser node. However, if anyone has a better/simpler idea, I am interested.
Hi @nba , you can check what I did in my Google Search Component:
It basically submits a string to be searched in Google, just like you would on the Google page, and then extracts whatever Google returns, just as if you had submitted the search on the Google page yourself.
Of course, I wrote it so that it extracts based on how Google returns the information. It's specific to Google, as it relies on the HTML code from Google. But it gives you an idea of how to do the same for other sites such as Reuters, etc.
Maybe if you can share some URLs and show what results you want for them, we can help you.
hi @nba ,
you probably don't want to use the same (generic) strategy (i.e. the same XPath) on each source's site. Compare the results of scraping FT's home page using a generic XPath and a customized one:
extract ALL links and related text => 560 links, mostly images and scripts
extract only links identified by Xpath
//*[@data-trackable='heading-link']/@href => 92 links/titles
(not necessarily exhaustive, just an example)
So, the pattern of the workflow can probably stay the same for each source, but the XPath node must be customized in order to capture only the links and titles related to articles.
EDIT: the example XPath retrieves the links; a second query retrieves the corresponding heading texts.
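Outside KNIME, the difference between the generic and the customized query can be sketched in a few lines of Python. The `data-trackable="heading-link"` attribute is the one from the XPath above; the rest of the markup is invented as a simplified stand-in for the FT home page (a real page would need an HTML-tolerant parser rather than the strict stdlib one):

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for the FT home page: two article headings mixed
# with an asset link and a legal link
page = ET.fromstring("""
<html><body>
  <a href="/static/logo.png"><img src="logo.png" alt="logo"/></a>
  <a data-trackable="heading-link" href="/content/abc123">Markets rally</a>
  <a data-trackable="heading-link" href="/content/def456">Rates on hold</a>
  <a href="/legal/cookies">Cookie policy</a>
</body></html>
""")

# Generic query: every link on the page, navigation and assets included
all_links = [a.get("href") for a in page.iter("a")]

# Customized query: only the elements FT marks as article headings
headings = page.findall(".//*[@data-trackable='heading-link']")
article_links = [h.get("href") for h in headings]
article_titles = [h.text for h in headings]

print(len(all_links), len(article_links))  # 4 2
```

On the real page the same contrast plays out at a larger scale: 560 generic matches versus 92 targeted ones.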
I have checked your workflow and I realize that it will take me some time to understand what's going on in there, as there are nodes/functions I have never encountered in my 2+ years using KNIME :-). Nonetheless, this is seriously cool, thanks for that.
To give you an idea of what I am looking for, here is the URL of one of the websites I am looking to screen:
https://www.zerohedge.com/. However, it would be the same for WSJ, Reuters etc.
I would like the output of my workflow to be a Knime table with:
Column A: containing all the links of the articles
Column B: containing the title of the articles
This would allow me to parse the titles, discover the topic/field of each article (identification of named entities and relations, tagging), and then filter only the articles matching my interests (politics, geopolitics, capital markets etc.) to extract their content and categorize them in the right folders/buckets (columns C, D etc.).
Hopefully it makes sense. Thanks again.
You have an excellent point. The customization of the XPath node seems to make a huge difference in the output's quality. Also good to note that I will need two different XPath queries to retrieve both the link and the title. I imagine that I will need a third to extract the article's text, and probably another one for the graphs/images, am I right?
By any chance, could you post/send an image of the workflow you showed? That would save me lots of time in understanding the purpose/functioning of these nodes and in putting the two XPath queries you posted in context.
Thanks A LOT for your input.
@nba Here's a tentative workflow. I've added the XPaths for the WSJ homepage. Consider it just an example: modify the XPaths in order to identify precisely the items you need, or add new sub-workflows.
20220423_get_news.knwf (38.8 KB)
To be honest, I think you won’t be able to get the full text of the articles unless you have a subscription to FT and WSJ.
The WF pretty much rocks! I went to XPath Syntax to try to understand the XPath queries you've made. Seems pretty complex, but I'll spend some time on it. I have noticed that for the link extraction within the two XPath nodes, the "//*" and "//@href" parts of the query are the same, so I imagine that the part in between is to be changed to adapt to other sources…
Concerning the article’s full text, I already pay all the subscriptions, the project is also a way to improve the return on these subscriptions
Once again, thanks a lot for your help
Hi @nba, my component parses the HTML code manually and fetches the exact content based on the HTML code.
I’ve put something similar for retrieving URLs/ Links and Titles from https://www.zerohedge.com/
It will not blindly look for
<a href="">; it will look only for clearly identified links. For example, it will not retrieve ad links.
Here’s what my workflow retrieved:
Just as my Component for Google works only for Google (because it looks for specific HTML code from Google, more precisely for specific CSS classes), the same goes for my workflow for Zerohedge. Each site has its own HTML structure and class names.
If I don't look for this specific code, it will retrieve all
<a href=""> links, which includes links from the navigation menu, shortcuts to other websites, etc. For example, the https://www.zerohedge.com/ page has 117
<a href=""> links, but only 20 of them are relevant to what you are looking for.
Unless you are reading from RSS feeds, there is no standard solution for reading directly from different websites.
Here’s my workflow: Retrieve URLs and Titles from Zerohedge.knwf (26.8 KB)
EDIT: I am sure you can apply the same logic using XPath. This is an alternative if you are not familiar with XPath, as it reads the content of the website as text and then parses the HTML text. It would most probably not be as direct in XPath either: you would most probably have to go through multiple levels, and possibly multiple paths.
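A minimal sketch of that text-parsing approach in Python. The markup below is invented to mimic Zerohedge-style HTML (the `Article_title…` class name mirrors what the real page uses but is an assumption here), and the regex stands in for the string splitting and row filtering done in the workflow:

```python
import re

# Invented Zerohedge-style markup: article titles sit inside an element
# with a distinctive CSS class, while menu and partner links do not
html_text = """
<nav><a href="/contact">Contact</a></nav>
<h2 class="Article_title__Pn_Ov"><a href="/markets/story-1">Story one</a></h2>
<h2 class="Article_title__Pn_Ov"><a href="/geopolitics/story-2">Story two</a></h2>
<footer><a href="https://example.com">Partner site</a></footer>
"""

# Keep only anchors inside an element carrying the article-title class,
# instead of matching every <a href=""> on the page
pattern = re.compile(r'class="Article_title[^"]*"><a href="([^"]+)">([^<]+)</a>')
rows = pattern.findall(html_text)

links = [href for href, _ in rows]
titles = [title for _, title in rows]
print(links)   # ['/markets/story-1', '/geopolitics/story-2']
print(titles)  # ['Story one', 'Story two']
```

The navigation and footer links never match, which is exactly why anchoring on the site-specific class pays off.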
@nba you're right: the queries must be customized in order to pick exactly the items you need and not, as @bruno29a said, every href on the page. My XPath queries extract only the nodes containing the headline text and the link to an article.
It's better to split the XPath task into two nodes:
- the first selects the nodes containing the articles and saves them to a column. Each cell of the column contains a node, i.e. a complete reference to an article. This query varies a lot from one source to another
- the second extracts the headline text and link from each node. The XPath query to accomplish this task will not vary much between sources
This method is more robust than the one used in my previous workflow, because text and link are extracted from the same node and not independently from one another, which could cause misalignments.
Here’s the new workflow. I’ve added Reuters
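The two-step idea above can be sketched outside KNIME as well. In this Python illustration (stdlib only, invented markup), step 1 selects complete article nodes with a source-specific query, and step 2 reads link and title relative to each node, so the pairs can never get out of step:

```python
import xml.etree.ElementTree as ET

# Invented home-page markup: two articles and one ad block
doc = ET.fromstring("""
<body>
  <div class="story"><a href="/a1">Title one</a><span>teaser</span></div>
  <div class="ad"><a href="/buy">Ad</a></div>
  <div class="story"><a href="/a2">Title two</a></div>
</body>
""")

rows = []
# Step 1: source-specific query selecting whole article nodes
for story in doc.findall(".//div[@class='story']"):
    # Step 2: near-universal relative query inside each node, so the
    # link and the title always come from the same article
    link = story.find("a")
    rows.append((link.get("href"), link.text))

print(rows)  # [('/a1', 'Title one'), ('/a2', 'Title two')]
```

Extracting links and titles with two independent page-wide queries could instead return lists of different lengths (the ad link has no matching headline), which is the misalignment the two-step split avoids.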
Hello @bruno29a and @duristef,
Though you have chosen different roads, both of your workflows achieve excellent results in their own elegant way. When I posted my question a few days ago, I did not expect such comprehensive help. I would like to sincerely thank both of you.
@bruno29a, studying your workflow, I have understood your last comment: it is because you specify the class of the rows we are interested in that you manage to filter only the relevant article links (using h2 class="Article_title__Pn_Ov*" in the Row Filter node rather than just <a href="">). I also understand that it is crucial to identify the right HTML tags (in the Cell Splitter) for each website to start with.
@duristef, it took me some time to understand your last comment on the misalignment and the necessity of dividing the XPath queries across two distinct KNIME nodes, identifying the article nodes first and only then the text and links. I get it now. You even went the extra mile by adding Reuters!
What else can I add except that you two are pure awesomeness! I got myself a nice course in web processing/XML.
Many many thanks to both of you.