Trouble with XPath expression part 2

This post is the continuation of a previous one: Trouble with XPath expression - extracting information from webpage

I now have a similar task, only with a longer product list and different companies (i.e. different webpages).
I still find it difficult to break down the structure of an XML document to obtain the correct XPath query, and this gets even more complicated when every webpage has a different structure.
Does anyone have tips on that? Would it be possible to have an automated (or at least half-automated) workflow to lessen the workload of doing it manually?

Also, this got me pretty interested in web scraping. Can someone point me to where I can learn the basics systematically?

Many thanks in advance

Hello @HSCH,

A good starting point on web scraping would be this blog post:

For semi-automated extraction, I believe you would need to use relative paths / wildcards.

Also see some more documentation on wildcard usage and similar:

Since you mention that these are different websites, fully automating this would be pretty difficult, but you could try something like:

//a[contains(@class, 'article')]

Or similar patterns that recur across the websites. From there you can try to filter or clean the output.
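
To illustrate the idea, here is a rough Python sketch that runs one generic pattern over two differently structured pages and collects the matches for later cleaning. Everything in it is an assumption for illustration: the two page fragments, the class names, and the helper extract_links are made up. It uses the standard-library ElementTree, whose XPath subset has no contains(), so the relative path .//a[@class] finds the candidate links and the substring check is done in Python instead. Real pages are often not well-formed XML, so in practice you would likely need an HTML-tolerant parser first.

```python
import xml.etree.ElementTree as ET

# Two hypothetical, well-formed page fragments with different layouts.
PAGE_A = """<html><body>
<div id="list"><a class="article-link" href="/p/1">Widget</a>
<a class="nav" href="/home">Home</a></div>
</body></html>"""

PAGE_B = """<html><body>
<ul><li><span><a class="teaser article" href="/item/9">Gadget</a></span></li></ul>
<a href="/about">About</a>
</body></html>"""


def extract_links(markup, keyword="article"):
    """Return (text, href) for every <a> whose class attribute contains keyword."""
    root = ET.fromstring(markup)
    results = []
    # ".//a[@class]" is a relative path: it matches <a> elements at any depth
    # that carry a class attribute, regardless of the surrounding structure.
    for a in root.findall(".//a[@class]"):
        # Python stand-in for the XPath predicate contains(@class, 'article').
        if keyword in a.get("class", ""):
            results.append(((a.text or "").strip(), a.get("href")))
    return results


# Apply the same pattern to every page and keep the results per site.
RESULTS = {name: extract_links(page) for name, page in (("A", PAGE_A), ("B", PAGE_B))}
```

The same function works on both pages even though the links sit at different depths, which is the point of using a relative path plus a loose attribute match. The trade-off is that a loose match can also pick up unwanted links, so some filtering afterwards is usually still needed.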

TL
