Web Scraping with a Loop

Hello everyone,

I’m working on a scraping project with KNIME for my studies and have had great success using the Selenium nodes for this. My only issue right now is the number of “Find Elements” and “Extract Text” nodes I have to use, which keeps growing the more data I want to scrape.

So if anyone has an idea: is there a way to implement a loop to make my workflow more efficient? I’ve tried several approaches, but nothing has really worked.

Scraping-exctract.knwf (115.8 KB)

Thank you in advance!

Hello Caro_Albers,

  1. What do you consider “efficient”, or, more specifically, why do you consider the current version inefficient?

  2. Generally, the “Find Elements” functionality is built into many extraction nodes, such as Extract Text, so a separate Find Elements node is often not needed.

-Philipp

Hi @Caro_Albers,

I believe I understand what you want to accomplish: for each URL, you’d like to extract certain information in the most efficient way, which means avoiding having to maintain a large number of Find Elements nodes.

I see two approaches here:

  1. Define your “selector” and “selector type” (e.g. an XPath) in a table
  2. Loop per URL as you already do, but add an inner loop over the selector rows from step 1
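To make the idea concrete, here is a minimal Python/Selenium sketch of that first approach; the same logic maps onto KNIME’s Selenium nodes and loop nodes. The URLs and selectors below are hypothetical placeholders:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

# Step 1: the selector "table" (field name, selector type, selector).
selectors = [
    ("title", By.CSS_SELECTOR, "h1.product-title"),
    ("price", By.XPATH, "//span[@class='price']"),
]

# Hypothetical URLs to loop over.
urls = ["https://example.com/product/1", "https://example.com/product/2"]

driver = webdriver.Chrome()
try:
    for url in urls:                               # outer loop: once per URL
        driver.get(url)
        row = {"url": url}
        for field, by, selector in selectors:      # inner loop: once per selector row
            found = driver.find_elements(by, selector)
            row[field] = found[0].text if found else None
        print(row)
finally:
    driver.quit()
```

Adding a new field then only means adding a row to the selector table, not another pair of nodes.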

Or:

  1. Extract the page source, which you also do already
  2. Convert it to XML (which, depending on the tags used on the website, can cause trouble; self-closing tags are one example)
  3. Use the good old XPath node and define all your XPaths in a single node
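If it helps, here is a minimal Python sketch of that second approach, using lxml in place of KNIME’s XPath node (my substitution, not part of the workflow; lxml’s HTML parser also tolerates quirks like self-closing tags). The page source and XPaths are made-up examples:

```python
from lxml import html

# Hypothetical page source; in practice this comes from the
# "extract page source" step.
page_source = """
<html><body>
  <h1 class="product-title">Demo product</h1>
  <span class="price">9.99</span>
</body></html>
"""

tree = html.fromstring(page_source)

# All XPaths defined in one place, analogous to a single XPath node.
xpaths = {
    "title": "//h1[@class='product-title']/text()",
    "price": "//span[@class='price']/text()",
}

record = {name: (tree.xpath(xp) or [None])[0] for name, xp in xpaths.items()}
print(record)  # {'title': 'Demo product', 'price': '9.99'}
```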

Both methods have advantages and disadvantages, e.g. the XPath node from KNIME not coping well with changing XML schemas, and the inability to sort your XPaths.

The first approach would be nice in case you have to use different selectors per product category, since the template can vary. You might also want to inspect potential third-party scripts used for split testing, which could interfere with your scrape, and block them.

I hope that helps.

PS: It just occurred to me that I forgot to mention that, since the Network Dump node was introduced, there is a very convenient way of retrieving already-structured data:

Most of the time, and this is especially true for e-commerce or other highly optimized systems, content is fetched using XHRs. By checking the network tab in the browser console, I could easily spot them:

[Screenshot: browser DevTools network tab showing the XHR requests]

You just need to convert the dump data from binary to string and then from string to JSON. I’d also recommend converting from JSON to XML, as it’s easier and more flexible to work with.
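As a rough illustration of that conversion chain in Python (the payload is a made-up stand-in for what a network dump of such an XHR might contain):

```python
import json

# Hypothetical stand-in for the binary payload of a captured XHR.
dump_bytes = b'{"products": [{"name": "Demo product", "price": 9.99}]}'

text = dump_bytes.decode("utf-8")  # binary -> string
data = json.loads(text)           # string -> JSON (a Python dict)

for product in data["products"]:
    print(product["name"], product["price"])
```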

Best
Mike

