How to exclude links from the "Comments" section when scraping or analyzing scraped pages

stevelp · May 12, 2020, 8:03pm

I’m currently using the following nodes to extract links from websites:

HTTP Retriever
HTTP Result Data Extractor
HTML Parser
XPath

I’m using the XPath query “//dns:a/@href”, but this pulls every link from a page, including those in the “Comments” section. Is anyone aware of a way to select only links that appear in the main body of a page, and exclude links that are pasted in the “Comments” section?

Another option would be to exclude “nofollow” links, as most links in comments sections are nofollow. However, some websites make all their links “nofollow”. Even if I did go down this route, I’m not sure how to extract link attributes using XPath, although I’m sure there’s a way to do that.

Some sites use tools such as Disqus, where the comments are not embedded in the main HTML of the page, so they aren’t the issue. However, here’s a sample page that includes comments in the main HTML:

Any ideas on how to filter out the links in the comments section of that page?

abockstiegel · May 13, 2020, 5:33am

Hello @stevelp,

I’m not sure I understood your use case correctly. I assume you have blog-style pages where the main content is followed by a block of comments from other users. If so, you could make the XPath expression more specific, for example //dns:article//dns:a/@href if the main content is enclosed in an article element, or //dns:div[@id='main']//dns:a/@href if it is enclosed in a div with the id main. Note the double slash before dns:a: a single slash would match only links that are direct children of the container, and links are usually nested deeper, e.g. inside paragraphs.
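To see how scoping the query to a container behaves, here is a minimal Python sketch using only the standard library. It is not what the KNIME XPath node runs internally. The sample page, the URLs and the helper name links_under are made up for illustration, and the dns prefix is mapped by hand to the XHTML namespace. ElementTree supports only a subset of XPath, so the path selects the a elements and the href is read in Python.

```python
import xml.etree.ElementTree as ET

# Map the "dns" prefix to the XHTML namespace (assumed to match the parsed page).
XHTML_NS = {"dns": "http://www.w3.org/1999/xhtml"}

# Hypothetical page: main content in <article>, comments in a separate <div>.
PAGE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <article>
      <p>See <a href="https://example.com/main">the main link</a>.</p>
    </article>
    <div id="comments">
      <p><a href="https://example.com/spam" rel="nofollow">comment link</a></p>
    </div>
  </body>
</html>"""


def links_under(xhtml, path):
    """Return the href values of all <a> elements matched by `path`."""
    root = ET.fromstring(xhtml)
    return [a.get("href") for a in root.findall(path, XHTML_NS)]


# Descendant axis: finds links nested anywhere inside <article>.
print(links_under(PAGE, ".//dns:article//dns:a[@href]"))
# Child axis: finds only links that are direct children of <article>.
print(links_under(PAGE, ".//dns:article/dns:a[@href]"))
```

With this sample, the descendant query returns only the link inside the article, while the child query returns nothing because the link is wrapped in a paragraph.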
If you prefer to identify links by nofollow, keep in mind that nofollow is a value of the rel attribute rather than an attribute of its own, so you can negate a test on rel, like //dns:a[not(contains(@rel, 'nofollow'))]/@href, or combine that with other criteria.
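As a rough illustration of the nofollow filter outside KNIME, the sketch below uses Python's standard library. Since nofollow normally appears as one of several space-separated tokens in the rel attribute, the check is done on the token list. ElementTree does not support the XPath functions not() and contains(), so the filtering happens in Python instead. The sample markup and the function name followed_links are hypothetical.

```python
import xml.etree.ElementTree as ET

# Map the "dns" prefix to the XHTML namespace (assumed to match the parsed page).
XHTML_NS = {"dns": "http://www.w3.org/1999/xhtml"}

# Hypothetical page with one regular link and two nofollow links.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <p><a href="https://example.com/article">article link</a></p>
    <p><a href="https://example.com/c1" rel="nofollow">comment link</a></p>
    <p><a href="https://example.com/c2" rel="ugc NoFollow">comment link</a></p>
  </body>
</html>"""


def followed_links(xhtml):
    """Return href values of links whose rel attribute lacks the nofollow token."""
    root = ET.fromstring(xhtml)
    links = []
    for a in root.findall(".//dns:a[@href]", XHTML_NS):
        # rel is a space-separated token list; compare case-insensitively.
        rel_tokens = (a.get("rel") or "").lower().split()
        if "nofollow" not in rel_tokens:
            links.append(a.get("href"))
    return links


print(followed_links(SAMPLE))
```

Comparing whole tokens avoids accidental matches on other rel values that merely contain the same substring, at the cost of being slightly stricter than a plain contains() test.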

qqilihq · May 13, 2020, 5:58am

Hi there,

besides @abockstiegel’s advice, you could have a look at the Web Page Content Extractor node. It lets you extract the “main” content of a webpage automatically using several heuristics (that is, it excludes headers, navigation, etc.). Two strategies are included: one optimized for getting just the content (sans comments), and one that is supposed to extract the comments as well.

As this works automatically, it is well suited to processing a list of heterogeneous URLs, where it makes no sense to manually define a matching XPath for each potential page structure.

Please keep in mind, though, that this will not give you 100% accuracy, but I think it’s still worth a try before fine-tuning your XPaths manually.

Hope this helps!
Philipp
