Selenium nodes: Find and open subpages of a known web URL via term search

frank · March 22, 2016, 6:21pm

Hi

I know how to open a webpage like http://crispr-congress.com/ with the "Start WebDriver" node and how to get its source code via the "Page Source" node.

But how do I open a subpage like http://crispr-congress.com/about/speakers/ starting from the known URL above?

What I do not know is how to open those subpages of known conference URLs that deal with the speakers of a conference. I expect the URLs of such subpages to include the term "speaker" and to appear in the source code of the main conference webpage. The original URLs were extracted via a Google search with Selenium nodes.

The background of my question: I want to extract speakers from conference webpages based on an existing list of author names.

Any ideas?

Greetings, Frank

qqilihq · March 22, 2016, 10:49pm

Hi,

the usual workflow for navigating and interacting is like this:

1. Open a start URL (using the "Start WebDriver" or "Navigate" node)
2. Extract an element (in your case a link) using the "Find Elements" node
3. Perform an interaction; in your case this would be a "Click" node

For step 2 you need to specify how to locate the element to retrieve. For your scenario, you can use an XPath expression such as the following, which selects links that contain 'speaker' in their target URL:

  //a[contains(@href, 'speaker')]

However, using a "Click" node will not work for that specific page, because the link is hidden in a submenu that is only visible when hovering with the mouse cursor (you will receive an "Element is not currently visible" exception). Instead, you can extract the actual href attribute value using an "Extract Attribute" node and then pass the target via a flow variable into a subsequent "Navigate" node. The resulting workflow looks like this:

I'm attaching the workflow to this post.
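
For readers who want to try the same idea outside KNIME, here is a minimal Python sketch using only the standard library. It is not the attached workflow. It only mimics the "Find Elements" plus "Extract Attribute" steps: it collects the href of every link whose target contains 'speaker' (hidden or not, since no click is involved) and resolves relative links against the start URL, so the result could be handed to a navigation step. The sample HTML and the names LinkCollector and find_links are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect href values of <a> tags whose target contains a term."""

    def __init__(self, base_url, term):
        super().__init__()
        self.base_url = base_url
        self.term = term
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Case-sensitive substring match, like XPath contains(@href, 'speaker')
        if href and self.term in href:
            absolute = urljoin(self.base_url, href)  # resolve relative links
            if absolute not in self.links:
                self.links.append(absolute)


def find_links(page_source, base_url, term="speaker"):
    """Return absolute URLs of all links whose href contains `term`."""
    parser = LinkCollector(base_url, term)
    parser.feed(page_source)
    return parser.links


# Hypothetical page source with the speaker link hidden in a submenu
SAMPLE = """
<ul>
  <li><a href="/about/">About</a>
    <ul style="display:none">
      <li><a href="/about/speakers/">Speakers</a></li>
    </ul>
  </li>
  <li><a href="/contact/">Contact</a></li>
</ul>
"""

speaker_pages = find_links(SAMPLE, "http://crispr-congress.com/")
print(speaker_pages)
```

Because only the attribute value is read, it does not matter that the link is invisible until the menu is hovered; that is the same reason the "Extract Attribute" route works where "Click" fails.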

Let me know if that works for you.

Philipp

frank · March 25, 2016, 9:14pm

Hi Philipp

Your example helped me with the workflow. Thanks!

I attached my workflow to this post. It is an interesting use case for your Selenium nodes. The idea is to find conferences about the CRISPR-Cas technology (gene editing) with a high number of speakers who belong to the top scientists in this area.

This workflow uses your Selenium nodes as well as the text processing plugin. The list of top scientists was generated in a separate workflow - in this example I show how this works in principle with 4 top scientists.

(The only thing that I do not like is the last three loops that I use. I use these because I browse to the main pages of the conferences and then try to find subpages with the term "speaker" if available. If I only use one loop and such a subpage is not available, I will also lose the main page in this loop. Maybe there is a better solution?)
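
One possible way to keep the main page when no "speaker" subpage exists is a simple fallback rule, sketched here in plain Python rather than as KNIME nodes. It assumes the speaker links have already been collected per main page; the function name pages_to_scan and the second URL are hypothetical and only serve as an illustration.

```python
def pages_to_scan(main_url, speaker_links):
    """Return the speaker subpages if any were found, else the main page."""
    return list(speaker_links) if speaker_links else [main_url]


# Hypothetical result of the link search: main page -> speaker subpages found
found = {
    "http://crispr-congress.com/": ["http://crispr-congress.com/about/speakers/"],
    "http://example.org/conference/": [],  # no subpage with "speaker"
}

# A single pass over all conferences; no main page is dropped
targets = [
    url
    for main_url, links in found.items()
    for url in pages_to_scan(main_url, links)
]
print(targets)
```

With this rule every conference contributes at least one page to the later text search, so one loop over `targets` would be enough in this sketch.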

Best,

Frank

webcrawling_crispr-cas_conferences.zip

qqilihq · March 30, 2016, 12:46pm

Wow, Frank, that looks really cool, thank you for sharing!! I'll get back to you in the next few days, either via the forum or by email.

Best,
Philipp
