Hi, I’m using XPath to extract links from XML source code that I’ve scraped from web pages. Here’s the expression I’m using to find all of the links:
//dns:a/@href
However, I’ve found that certain websites have large sections of JavaScript, and XPath is unable to extract the links embedded in those sections (they are jumbled together in one large JavaScript block).
Does anyone have ideas for how I could extract links from JavaScript content on web pages as well? Here are some examples of how the links I’m trying to extract appear:
Oh, I just noticed the JavaScript tags didn’t show up. These are some examples of the kinds of tags that came before the text blocks that included the links:
Now I see. These are placed in the image captions and only appear when paging through the image carousel?
That makes it more complicated. One option is to use Selenium: cycle through the image sequence and extract the links after each click.
Alternatively, just try to additionally extract links with a regex. XPath will not help here, as the links are not placed within an <a> tag, so you’d need to process the plain HTML text. As a starting point, the Regex Extractor node has a regex template for extracting URLs, which might work here.
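If you want to prototype the regex approach outside of KNIME first, a minimal sketch in Python might look like the following. Note that the URL pattern shown here is an illustrative assumption, not the exact template shipped with the Regex Extractor node; you may need to tighten it for your data:

```python
import re

# Illustrative URL pattern: scheme followed by any run of characters that
# typically cannot appear inside a URL literal (whitespace, quotes, brackets).
# Because this works on the raw page text, it also finds URLs inside
# <script> blocks where there is no <a> tag for XPath to match.
URL_RE = re.compile(r"https?://[^\s\"'<>)]+")

def extract_links(html: str) -> list[str]:
    """Return all URL-like substrings found anywhere in the page source."""
    return URL_RE.findall(html)

# Hypothetical example resembling a JavaScript image-carousel definition.
html = """
<script>
  var images = [
    {src: "https://example.com/img/1.jpg", link: "https://example.com/page/1"}
  ];
</script>
"""
print(extract_links(html))
```

The same pattern can be pasted into the Regex Extractor node; the point of the character class is to stop the match at the quote or bracket that delimits the URL inside the JavaScript source.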