XPath node parses only first 100 rows from XML

supersharp · February 3, 2020, 4:28pm

Hi
Trying to parse an XML to get the information from multiple rows, however when running the XPath on the XML it only brings back the first 100. There should more than 400 rows - so I know something it is cutting it off. Any way to bring through all the rows? MY XPath settings are below (node, the result count here shows 0 but this was part of a loop - the other one showed 400 resultcount)
Thanks!

armingrudd · February 3, 2020, 8:20pm

Hi @supersharp,

Actually, I think it is not that easy to find out what the problem is, regarding the provided information.
Here I just try to answer regarding what I sense, if I’m getting it wrong, please provide more info or an example.
If it is a web page content in which it is said 400 items has been found, maybe the reason for getting only 100 items is that the other results in the page are not available on page load. For example they appear by scrolling down or clicking some “show more” link.

If this is the case, let me know and I will provide you with the solution.

supersharp · February 3, 2020, 8:33pm

Good catch, it appears to span multiple pages!

armingrudd · February 3, 2020, 8:37pm

So you need to get each page separately. Try it and feel free to come back with any other questions.

Siemru · May 12, 2020, 8:08am

Hi eveyone,

I actually have a similar issue. For me there is a button that says “show more” and if clicked adds more entries to the page and increases the URL by 1 “https://jobs.philips.com/jobs/#1”. So when I originally got info from the page without the increasing URL ID, I would always get the first 30 entries. I then changed it so that I provide the URLs with IDs from 1 to 10, just to check if that works. However, still each URL gets the first 30 results.Web_Extraction_Test.knwf (17.9 KB)

Is there a way I can fix that?

The workflow is attached.
Thank you!

armingrudd · May 12, 2020, 10:02am

Hi @Siemru and welcome to the KNIME forum,

In your case, pagination is event based so you cannot catch all results using HTTP or Webpage Retriever nodes (or the Get Request node).
You can get 60 results at most. First 30 results appear as “direct” class and the next 30 results exist as “hidden” class but you can catch them (as shown in the example workflow).
To be able to get all results (1044), you need Selenium nodes (as demonstrated in the workflow).

Web_Extraction_Test.knwf (2.6 MB)

Siemru · May 12, 2020, 11:56am

Thanks for the quick reply @armingrudd!

Works fine

system · November 11, 2020, 12:04am

This topic was automatically closed 182 days after the last reply. New replies are no longer allowed.