Iterate through WebElements fails after first elements

ArjenEX · October 16, 2021, 3:53pm

Hello,

I’d like to ask for some help on Selenium.

Goal:
I want to navigate through a website (EZT-Online) that has a hierarchical table listing to extract information from the items which are at the lowest level of the hierarchy.

Steps taken:

After some initial navigation, I end up on the main page. Here I want to start navigating by expanding the first folder by clicking on the plus icon (works as intended).

Here, each main folder contains an unknown number of subfolders, which in turn also contain an unknown number of new items. In this case, there are 5 subfolders.

As such, I use FindElements to again find all elements where nodeExpand=true since this indicates that there are subitems still to follow.

Based on the nodeID of the plus icons, I filter those that are new (Kap 01 - Kap 05).

Next, I want to iterate through the list and click each item to get to the next level in the tree.

Issue:
The first item in the list is clicked, but it fails afterwards due to:

Execute failed: stale element reference: element is not attached to the page document

However, when I check the source code after the first click, the nodeID has not changed and is still the same as the one identified during the earlier FindElements.

WF: zoll_extractor_clean.knwf (39.4 KB)

I’m happy to receive some tips and tricks on how I can approach this with Selenium: how to use the proper loops, where to position the Find Elements, how to iterate through the results, etc.

Thanks!

AnotherFraudUser · October 17, 2021, 12:23am

Hi @ArjenEX,

could you give me a hint how to navigate to your shown page from the starting page?
In the end you just want to extract the lowest level of certain data points right?

Then I would check if I can give a suggestion with the knime get-request nodes.

But maybe @qqilihq has any suggestions regarding the problem with the selenium nodes

qqilihq · October 17, 2021, 7:11am

Hi Arjen,

I’m currently in a hotel room with just a laptop and slow connection, thus I cannot provide a full solution for now, but at least give an explanation for this one:

Execute failed: stale element reference: element is not attached to the page document

When you click on an item in the list, the entire page is completely reloaded (you can easily observe this in the browser window; you’ll see that the browser shows a global loading indicator).

Thus, the WebElement references which you extracted before the page reloaded will lose their “connection” to the page (even though they are still there from a visual perspective, they will have new internal identifiers and can no longer be accessed by Selenium).

A simple fix would be as follows:

1. In each iteration, only expand one section.
2. After that, re-run the Find Elements node, extract one further link, and click it.
3. Continue at (1) if needed.

In other words: Only extract and process a single WebElement per loop iteration (there’s even a setting in the Find Elements or Click node: “[x] Extract first match only”).
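For readers who think in code rather than nodes, the same idea can be sketched in Python. This is only an illustration of the "look up again, click one, repeat" pattern, not the KNIME workflow itself; the names `expand_until_done`, `find_first_collapsed` and `max_clicks` are made up for this sketch, and the Selenium wiring in the trailing comment is an assumption about how you would connect it to a real driver.

```python
from typing import Callable, Optional, Protocol


class Clickable(Protocol):
    """Anything with a click() method, e.g. a Selenium WebElement."""

    def click(self) -> None: ...


def expand_until_done(
    find_first_collapsed: Callable[[], Optional[Clickable]],
    max_clicks: int = 500,
) -> int:
    """Click one freshly located element per iteration; return the click count."""
    clicks = 0
    while clicks < max_clicks:
        # Look the element up again on every pass: references obtained before
        # the previous click belong to the old page and would be stale.
        element = find_first_collapsed()
        if element is None:
            break  # nothing left to expand
        element.click()  # on this site, a click reloads the whole page
        clicks += 1
    return clicks


# Hypothetical wiring with the Selenium Python bindings (assumes an existing
# `driver` and a selector of your own):
#   from selenium.webdriver.common.by import By
#   def find_first_collapsed():
#       hits = driver.find_elements(By.XPATH, "<your XPath>")
#       return hits[0] if hits else None
#   expand_until_done(find_first_collapsed)
```

The `max_clicks` cap is there only as a safety net so that the loop cannot run forever if new collapsed items keep appearing.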

Hope this helps to get you on the right track to implement this with the Selenium Nodes. Feel free to get back if you have any further question or require more detailed input!

Best regards,
Philipp

ArjenEX · October 18, 2021, 7:30am

could you give me a hint how to navigate to your shown page from the starting page?

Sure, the website is a bit scrappy because it went to the correct page when I posted it but with a fresh session it doesn’t.

In the end you just want to extract the lowest level of certain data points right?

Correct.

Thanks for responding!

ArjenEX · October 18, 2021, 7:35am

Thanks Philipp for the response!

Thus, the WebElement references which you extracted before the page reloaded will lose their “connection” to the page (even though they are still there from a visual perspective, they will have new internal identifiers and can no longer be accessed by Selenium).

I had this in the back of my mind as well but it’s good to know for sure now

In other words: Only extract and process a single WebElement per loop iteration (there’s even a setting in the [Find Elements]

Will take a look at this and redesign the process a bit.

Thanks!

ArjenEX · October 29, 2021, 3:08pm

Hi qqilihq

I’m getting closer with this, but still not quite there due to a loop issue.

To note: a general issue that I encounter is that on this particular website I have no direct way of telling which nodes have been expanded and which are still collapsed.

To my knowledge, I have to analyse the results of the query on all icons (Find Elements: /html/body/table/tbody/tr[*]/td/table/tbody/tr[1]/td[1]/a) to make that determination because it returns whether the element contains either

<img src="images/collapsedMidNode.gif" border="0"> or
<img src="images/expandedMidNode.gif" border="0">

I have now:
Retrieve the first icon for chapter 1 (for testing hardcoded to 1)

loop start
Click on expand
Do another find elements to again get all icons.
Filter on those icons that still have collapsedMidNode as source
Filter on the next result
loop end.

The issue is that it still crashes with the second iteration of the loop due to Execute failed: stale element reference: element is not attached to the page document

When I manually put a new click after the end of the loop to simulate the second click iteration, it works, so the right selector is available.

Your advice is again appreciated

WF: zoll_extractor_cleanv2.knwf (47.7 KB)

qqilihq · October 29, 2021, 6:03pm

Hi Arjen,

I just tried your workflow locally.

Generally, you can modify your query in Find Elements to the following XPath:

//img[@src='images/collapsedMidNode.gif']/..

This will search for all collapsed folder icons and get you the parent link. Then simply click this link and continue the iteration until there are no more collapsed folders found. I did this using an Extract Table Dimension node which serves as the condition for a Variable Condition Loop End (end iterating when no more results are found).

I have modified the workflow like this and have had it running for ten minutes, and it seems to work okay.
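To see what that XPath selects without opening KNIME, here is a minimal sketch using Python's built-in ElementTree. The HTML fragment and the `href` values are invented for illustration and only mimic the icon markup quoted earlier in this thread; ElementTree supports just a subset of XPath, so the query is written in the relative `.//` form.

```python
import xml.etree.ElementTree as ET

# Made-up fragment that mimics the icon markup quoted in this thread.
SNIPPET = """
<table>
  <tr><td><a href="#kap01"><img src="images/expandedMidNode.gif" border="0" /></a></td></tr>
  <tr><td><a href="#kap02"><img src="images/collapsedMidNode.gif" border="0" /></a></td></tr>
  <tr><td><a href="#kap03"><img src="images/collapsedMidNode.gif" border="0" /></a></td></tr>
</table>
"""

root = ET.fromstring(SNIPPET)

# Match every collapsed icon, then step up to its parent <a> with "/..".
links = root.findall(".//img[@src='images/collapsedMidNode.gif']/..")
hrefs = [a.get("href") for a in links]
print(hrefs)  # ['#kap02', '#kap03']
```

The already expanded entry is skipped, so an empty result is a natural stop condition for the loop.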

There are the following gotchas:

With each expanded folder, new unexpanded folders will pop up. This seems like the (Germish) “barrel without floor”, thus the workflow will be running for quite a long time
As the page size will increase with each folder expanded, the loading time per request will grow. It will start rather quickly, but with each request it will get a bit slower (as more data needs to be transferred)
Interestingly the server doesn’t seem to put any limits to this

Still, I think that you’ll need to think of some approach to cut this into several smaller pieces so that it will scale (and eventually finish). Probably it makes sense to process all main categories one after the other?

Anyways, I’ll attach my current version here – hope this helps to bring you further!

–Philipp

ArjenEX · November 1, 2021, 8:23am

Thanks Philipp for the assistance!

Based on other sources similar to this one, there are about 14,000 items hidden in all those folders, so I knew from the start that I would eventually need to scope it, since it’s just impossible to do in one go.
