I’m currently using the following nodes to extract links from websites:
- HTTP Retriever
- HTTP Result Data Extractor
- HTML Parser
I’m using the XPath query “//dns:a/@href”, but this pulls every link on the page, including those in the “Comments” section. Is anyone aware of a way to select only the links in the main body of a page and exclude links that are posted in the “Comments” section?
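One possible approach, if the comments sit inside a container with a recognisable id or class, is to skip every link that has that container as an ancestor. In XPath 1.0 that could look like `//dns:a[not(ancestor::dns:*[@id='comments'])]/@href`. The sketch below shows the same idea in plain Python so it can be tested outside the workflow. The id `comments`, the sample markup, the example.com URLs and the function name `body_links` are all assumptions made for illustration, as is the `dns` prefix being bound to the XHTML namespace. The real container name has to be read from the page source.

```python
import xml.etree.ElementTree as ET

# Assumption: the parsed page is XHTML, with "dns" bound to this namespace.
XHTML_NS = "http://www.w3.org/1999/xhtml"
XHTML_A = "{" + XHTML_NS + "}a"

# Hypothetical page; the id "comments" is an assumption, not taken from a real site.
SAMPLE_PAGE = """<html xmlns="http://www.w3.org/1999/xhtml"><body>
<div id="content"><p><a href="https://example.com/article-link">in body</a></p></div>
<div id="comments"><p><a href="https://example.com/comment-link">in comment</a></p></div>
</body></html>"""


def body_links(xhtml, comments_id="comments"):
    """Return hrefs of <a> elements that are not inside the comments container."""
    root = ET.fromstring(xhtml)

    # Collect every anchor nested under an element whose id matches comments_id.
    in_comments = set()
    for container in root.findall(f".//*[@id='{comments_id}']"):
        in_comments.update(container.iter(XHTML_A))

    # Keep the remaining anchors, in document order, if they carry an href.
    return [
        a.get("href")
        for a in root.iter(XHTML_A)
        if a not in in_comments and a.get("href")
    ]


print(body_links(SAMPLE_PAGE))
```

If a site marks its comments with a class rather than an id, the same pattern should work with the predicate changed accordingly.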
Another option would be to exclude “nofollow” links, since most links in comments sections are nofollow. However, some websites make all of their links “nofollow”, so this wouldn’t work everywhere. Even if I did go down this route, I’m not really sure how to extract link attributes (such as rel) using XPath, although I’m sure there’s a way to do that.
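For reference, attributes other than href are selected the same way, so `//dns:a/@rel` returns the rel values. A filter along the lines of `//dns:a[not(contains(concat(' ', normalize-space(@rel), ' '), ' nofollow '))]/@href` should keep only links without a nofollow token, although it is case-sensitive. Below is a minimal Python sketch of that filter. The sample markup, the example.com URLs and the function name `followed_links` are hypothetical, and it assumes the page is well-formed XHTML.

```python
import xml.etree.ElementTree as ET

# Assumption: anchors live in the XHTML namespace.
XHTML_A = "{http://www.w3.org/1999/xhtml}a"

# Hypothetical markup with one plain link and two nofollow links.
SAMPLE_LINKS = """<html xmlns="http://www.w3.org/1999/xhtml"><body>
<a href="https://example.com/followed">followed</a>
<a href="https://example.com/spam" rel="nofollow">nofollow</a>
<a href="https://example.com/ugc" rel="external NoFollow ugc">mixed rel</a>
</body></html>"""


def followed_links(xhtml):
    """Return hrefs of <a> elements whose rel attribute has no 'nofollow' token."""
    root = ET.fromstring(xhtml)
    links = []
    for a in root.iter(XHTML_A):
        href = a.get("href")
        # rel is a space-separated token list; compare tokens case-insensitively.
        rel_tokens = a.get("rel", "").lower().split()
        if href and "nofollow" not in rel_tokens:
            links.append(href)
    return links


print(followed_links(SAMPLE_LINKS))
```

On sites that mark every link as nofollow, this filter would return nothing, so it is probably best treated as a fallback.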
Some sites use tools such as Disqus, where the comments are not embedded in the main HTML of the page, so those aren’t an issue. However, here’s a sample page that includes comments in the main HTML:
Any ideas on how to filter out the links in the comments section of that page?