Scraping HTML

alfroc · June 4, 2024, 12:57pm

Hi there,
I can no longer extract the reviewer name, comment, and rating from the book reviews of an online bookstore with the XPath node. I don’t know what has changed on the site. What should I change in the query? Is there an easy way to find the correct query for the fields of interest?
Thank you very much!
Alfredo

Here is a sample workflow:
HTML_Scraper.knwf (13.2 KB)

ArjenEX · June 4, 2024, 7:36pm

Hi @alfroc

The key point is that the section containing the reviews is now a JSON object embedded in the page, and its contents cannot be queried with XPath.

What you can still do is extract that whole section as a string with the XPath /dns:html/dns:body/dns:script[7]
Once extracted, you can convert this string to an actual JSON value and run queries against it with the corresponding node.
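
For anyone who wants to check the same idea outside KNIME, here is a minimal Python sketch using only the standard library. It is not the workflow above, just an illustration. Instead of relying on the position script[7], it tries every script element and keeps the first one that parses as JSON and has a 'mainEntity' key. The helper names (ScriptCollector, find_review_json) and the sample HTML are made up for the example, and the real page may be structured differently.

```python
import json
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Collect the raw text of every <script> element in a page."""

    def __init__(self):
        super().__init__()
        self.scripts = []    # text of each completed <script> element
        self._buffer = None  # text chunks while inside a <script>, else None

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._buffer is not None:
            self.scripts.append("".join(self._buffer))
            self._buffer = None


def find_review_json(html):
    """Return the first script body that is a JSON object with 'mainEntity'."""
    collector = ScriptCollector()
    collector.feed(html)
    collector.close()
    for text in collector.scripts:
        try:
            data = json.loads(text)
        except json.JSONDecodeError:
            continue  # ordinary JavaScript (or empty), not JSON
        if isinstance(data, dict) and "mainEntity" in data:
            return data
    return None


# Invented sample page; the names and review texts are placeholders.
SAMPLE_HTML = """<html><body>
<script>var tracking = true;</script>
<script type="application/ld+json">
{"mainEntity": {"review": [
  {"author": {"name": "Anna"}, "reviewBody": "Great read"},
  {"author": {"name": "Ben"}, "reviewBody": "Too long"}
]}}
</script>
</body></html>"""

data = find_review_json(SAMPLE_HTML)
```

Searching by content rather than by index is a design choice for the sketch: a positional path such as script[7] stops working as soon as the site adds or removes a script.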

The name, comment and rating can subsequently be retrieved with associated JSONPath queries, like $['mainEntity']['review'][*]['author']['name']

If you output them as a list and then Ungroup, you’ll end up with one row per individual review.
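
As a rough illustration of what the JSONPath query plus Ungroup produces, here is a small Python sketch that walks the same path, $['mainEntity']['review'][*], and returns one record per review. Only the 'author' / 'name' path comes from the query above. The keys 'reviewBody' and 'reviewRating' are assumptions about how the comment and rating might be stored, the function name review_records is hypothetical, and the sample data is invented.

```python
def review_records(data):
    """Return one dict per entry under data['mainEntity']['review']."""
    reviews = data.get("mainEntity", {}).get("review", [])
    records = []
    for review in reviews:
        records.append({
            # Same path as $['mainEntity']['review'][*]['author']['name']
            "name": review.get("author", {}).get("name"),
            # Assumed keys; check the actual JSON for the real field names
            "comment": review.get("reviewBody"),
            "rating": review.get("reviewRating", {}).get("ratingValue"),
        })
    return records


# Invented sample data; the second review has no rating on purpose.
sample = {
    "mainEntity": {
        "review": [
            {
                "author": {"name": "Anna"},
                "reviewBody": "Great read",
                "reviewRating": {"ratingValue": "4"},
            },
            {
                "author": {"name": "Ben"},
                "reviewBody": "Too long",
            },
        ]
    }
}

rows = review_records(sample)
for row in rows:
    print(row)
```

Missing keys come back as None instead of raising an error, which is roughly how a missing value would appear in a table after ungrouping.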

The strange thing is that, according to the data, the rating is always 3,5, while the actual page shows something different. I’m not sure what’s going on there.

Anyway, here is the workflow:
HTML_Scraper V2.knwf (235.4 KB)

Hope this still helps!

alfroc · June 5, 2024, 8:01am

Hi @ArjenEX ,
excellent solution, compliments! Too bad that the ratingValue column only contains the average value and is therefore unusable for my purposes…
Thanks anyway!
Alfredo
