I’m working on a scraping project with KNIME for my studies and have had great success using the Selenium nodes for it. My only issue right now is the number of “Find Elements” and “Extract Text” nodes I have to use, which keeps growing the more data I want to scrape.
So, if anyone has an idea: is there a way to implement a loop to make my workflow more efficient? I’ve tried several approaches, but nothing has really worked.
I believe I understand what you want to accomplish. For each URL you’d like to extract certain information in the most efficient way, which means you want to avoid maintaining a large number of Find Elements nodes.
I see two approaches here:
1. Define your XPath, i.e. the “selector” and “selector type”, in a table.
2. Loop per URL as you already do, but inside that loop iterate over the rows from step 1 (see the sketch right below).
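To illustrate the idea outside of KNIME, here is a minimal Python/Selenium sketch of that nested loop. The URL list, the selector table and its column names are made up for the example; in KNIME the same pattern would roughly correspond to controlling a single Find Elements node via flow variables inside a nested loop.

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

# Hypothetical selector table: one row per piece of data to extract,
# mirroring the "selector" / "selector type" columns suggested above.
selectors = [
    {"name": "title", "type": By.CSS_SELECTOR, "selector": "h1.product-title"},
    {"name": "price", "type": By.XPATH, "selector": "//span[@class='price']"},
]

urls = ["https://example.com/product/1", "https://example.com/product/2"]  # assumed input

driver = webdriver.Chrome()
rows = []
for url in urls:                      # outer loop: one iteration per URL
    driver.get(url)
    row = {"url": url}
    for sel in selectors:             # inner loop: one iteration per selector row
        elements = driver.find_elements(sel["type"], sel["selector"])
        row[sel["name"]] = elements[0].text if elements else None
    rows.append(row)
driver.quit()
```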
Or:
1. Extract the page source, which you also do already.
2. Convert it to XML (which, depending on the tags used on the website, can cause trouble; self-closing tags are one example).
3. Use the good old XPath node and define all your XPaths in that single node (see the sketch below).
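As a rough illustration of what that single XPath step does, here is a small Python sketch using lxml; the page_source string and the XPath expressions are placeholders I made up for the example:

```python
from lxml import html

# page_source stands in for the string coming out of the page-source extraction step
page_source = "<html><body><h1>Demo product</h1><span class='price'>9.99</span></body></html>"

tree = html.fromstring(page_source)   # lenient HTML parsing, tolerates messy markup

# All XPath expressions live in one place, analogous to defining them in one XPath node
xpaths = {
    "title": "//h1/text()",
    "price": "//span[@class='price']/text()",
}

record = {name: (tree.xpath(xp) or [None])[0] for name, xp in xpaths.items()}
print(record)   # {'title': 'Demo product', 'price': '9.99'}
```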
Both methods have advantages and disadvantages; for example, the XPath node from KNIME does not cope well with changing XML schemas, and it does not let you reorder the XPaths you define in it.
The first approach is nice if you have to use different selectors depending on the product category, since the page template can vary. You might also want to check whether 3rd-party split-testing scripts could interfere with your scrape and block them (a sketch of one way to do that follows).
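If you drive a Chromium-based browser, one way to block such 3rd-party scripts is via the DevTools protocol; this is a sketch outside of KNIME, and the blocked URL patterns below are purely illustrative:

```python
from selenium import webdriver

driver = webdriver.Chrome()

# Hypothetical 3rd-party hosts identified as split-testing / tracking scripts
blocked = ["*optimizely*", "*googletagmanager*"]

# Chrome-only: block matching requests through the DevTools protocol
driver.execute_cdp_cmd("Network.enable", {})
driver.execute_cdp_cmd("Network.setBlockedURLs", {"urls": blocked})

driver.get("https://example.com/product/1")  # assumed target page
```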
I hope that helps.
PS: It just occurred to me that I forgot to mention that, since the Network Dump node was introduced, there is a very convenient way of retrieving already structured data:
Most of the time, especially with e-commerce or other highly optimized systems, content is fetched via XHRs. By checking the network tab in the browser’s developer tools I could easily spot them:
You just need to convert the dump data from binary to string and then from string to JSON. I’d also recommend converting from JSON to XML, as it’s easier and more flexible to work with (a small sketch follows).
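For illustration, here is a small Python sketch of that conversion chain; the raw_body bytes and field names are invented for the example, and the JSON-to-XML helper is just one simple way of doing it:

```python
import json
import xml.etree.ElementTree as ET

# Pretend this is one captured XHR body from the network dump (binary)
raw_body = b'{"product": {"title": "Demo", "price": 9.99}}'

# binary -> string -> JSON
text = raw_body.decode("utf-8")
data = json.loads(text)

# JSON -> XML: a naive recursive conversion, enough to show the idea
def to_xml(name, value):
    element = ET.Element(name)
    if isinstance(value, dict):
        for key, val in value.items():
            element.append(to_xml(key, val))
    elif isinstance(value, list):
        for item in value:
            element.append(to_xml("item", item))
    else:
        element.text = str(value)
    return element

root = to_xml("root", data)
print(ET.tostring(root, encoding="unicode"))
# <root><product><title>Demo</title><price>9.99</price></product></root>
```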