Hi, I’m using XPath to extract links from XML source code that I’ve scraped from web pages. Here’s the expression I’m using to find all of the links:
//dns:a/@href
However, I’ve found that certain websites have large sections of JavaScript, and XPath is unable to extract the links embedded in those sections (they are jumbled together in one large JavaScript block).
Does anyone have ideas for how I could extract links from JavaScript content on web pages as well? Here are some examples of how the links I’m trying to extract appear:
Oh, I just noticed the JavaScript tags didn’t show up. These are some examples of the kinds of tags that came before the text blocks that included the links:
Now I see. These are placed in the image captions and only appear when paging through the image carousel?
That makes it more complicated. One option is to use Selenium: cycle through the image sequence and extract the links after each click.
Alternatively, just try to additionally extract links with a regex. XPath will not help here, as the links are not placed within an <a> tag, so you’d need to process the plain HTML text. As a starting point, the Regex Extractor node has a regex template for extracting URLs, which might work here.
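If you want to prototype the regex approach outside of KNIME first, a minimal sketch in Python might look like the following. Note that the URL pattern shown here is an illustrative assumption, not the exact template shipped with the Regex Extractor node; you may need to tighten it for your data:

```python
import re

# Illustrative URL pattern: scheme followed by any run of characters that
# typically cannot appear inside a URL literal (whitespace, quotes, brackets).
# Because this works on the raw page text, it also finds URLs inside
# <script> blocks where there is no <a> tag for XPath to match.
URL_RE = re.compile(r"https?://[^\s\"'<>)]+")

def extract_links(html: str) -> list[str]:
    """Return all URL-like substrings found anywhere in the page source."""
    return URL_RE.findall(html)

# Hypothetical example resembling a JavaScript image-carousel definition.
html = """
<script>
  var images = [
    {src: "https://example.com/img/1.jpg", link: "https://example.com/page/1"}
  ];
</script>
"""
print(extract_links(html))
```

The same pattern can be pasted into the Regex Extractor node; the point of the character class is to stop the match at the quote or bracket that delimits the URL inside the JavaScript source.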