Trouble with XPath expression - extracting information from a webpage

I am a beginner in KNIME and in web scraping.

I have an Excel list of products with article numbers; the products can be looked up by article number on the company website. I would like to add product description and weight columns to the Excel file.
My current workflow looks like this:
[screenshot of the workflow]

Now I’m stuck on the XPath node, as I cannot find the correct XPath expression to get the information I need (error: “selected XML element is not a tag nor an attribute”).
The XML of the information I want to extract looks like this:

[screenshots of the XML structure]
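
For reference, this error typically means the query landed on something that is neither an element nor an attribute (for example a bare text node). A minimal sketch with made-up XML, using Python’s lxml in place of the XPath node:

```python
# Made-up XML, for illustration only.
from lxml import etree

doc = etree.fromstring("<product><name>Terminal box KX</name></product>")

# Selects an element -- acceptable:
print(doc.xpath("/product/name"))         # [<Element name at 0x...>]

# Selects a bare text node -- neither a tag nor an attribute:
print(doc.xpath("/product/name/text()"))  # ['Terminal box KX']
```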

It would be great if someone could give me suggestions on how to solve this.
Are there other ways to extract information from websites?

Can you share the workflow on the KNIME Hub?
Or at least show us in a screenshot how you’ve tried to configure the XPath node?

I played around with it and got it to work, but I don’t know if my test case matches your use case:


[screenshots of the test workflow and XPath node configuration]


Hi,
thank you so much for your reply. I actually don’t know how XPath works (I don’t know how to make sense of the structure of an XML file), so I asked the KNIME AI assistant to suggest an XPath query.

Here is what it suggested, and it didn’t work:
[screenshot of the suggested XPath query]

The structure of the actual website is quite complex. Here’s the webpage: 1550000 Terminal box KX, sheet steel, without gland plate

It would be great if you could explain the basics of HTML, XML, or whatever else I need to know to configure the XPath node correctly.

Thanks a lot in advance!

To learn the basics I recommend W3Schools:

XML

XPath
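
If it helps to see the same ideas outside KNIME, here is a small warm-up with made-up XML, using Python’s lxml (the article number and description come from your page; the weight value is invented for illustration):

```python
from lxml import etree

xml = """<catalog>
  <product articleNo="1550000">
    <description>Terminal box KX</description>
    <weight unit="kg">1.9</weight>
  </product>
</catalog>"""
doc = etree.fromstring(xml)

# Absolute path from the root element:
print(doc.xpath("/catalog/product/description/text()"))  # ['Terminal box KX']

# '//' matches at any depth, so intermediate levels can be skipped:
print(doc.xpath("//weight/text()"))                       # ['1.9']

# '@' selects an attribute value:
print(doc.xpath("//product/@articleNo"))                  # ['1550000']
```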


Update: I have read the XPath basics on W3Schools, then made several attempts to find the right XPath query, but failed.
The webpage’s XML is quite complex, and I am not sure which levels should be included in the XPath query and which can be left out.
Is there a best practice for defining XPath queries for complex XML documents?
Thank you.
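
A common practice for complex documents is to skip the unstable levels entirely: anchor the query on something that won’t move, such as a label text, id, or class, and let // bridge the gap. A sketch with hypothetical markup (not taken from the actual page):

```python
from lxml import etree

# Hypothetical markup standing in for a complex product page.
html = etree.HTML("""<html><body><div><div><table>
  <tr><td>Product description</td><td>Terminal box KX</td></tr>
  <tr><td>Gross weight</td><td>2.1 kg</td></tr>
</table></div></div></body></html>""")

# Brittle: /html/body/div/div/table/tr[1]/td[2] breaks as soon as the page
# adds or removes a wrapper level.

# Robust: find the row by its label text, then take the neighbouring cell.
print(html.xpath("//tr[td='Product description']/td[2]/text()"))
# ['Terminal box KX']
```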

I have looked at your case. The issue comes from the GET Request / Webpage Retriever node not producing an XML column, but rather a string column, which in this specific case is not parseable as JSON/XML.

Trying the KNIME Web Interaction (Labs) nodes on this succeeded the first time in a non-headless browser, then failed the second time due to a cookie wall. From example workflows on the Hub, the way to get around that is to know the cookie banner’s XPath beforehand, and I have no idea how to obtain that.

If someone here could help you with that issue, you would be headed in the right direction.
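
One pattern that might help with the cookie wall is locating the consent button by its visible text instead of a known path; a hypothetical sketch (the button label and markup are assumptions, not taken from the actual site):

```python
from lxml import etree

# Hypothetical consent banner markup.
html = etree.HTML("<div id='consent'><button>Accept all</button></div>")

# Match any button whose text contains the label, at any depth:
buttons = html.xpath("//button[contains(., 'Accept all')]")
print(buttons[0].text)  # 'Accept all'
```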

If you’re still not able to solve this after a few days, tag me and I’ll give you an alternative route:

This would require you to install NodePit first, though, and then Palladian (both are free).

Hello, thank you for your reply.
The output of my Webpage Retriever node did return an XML column, so I didn’t think it could be the problem. Is it not in the right format?

The XML column generated in this case barely works, probably because it doesn’t have a valid structure. You’ll get a valid one with the routes I mentioned in my previous reply. (A Python route could also be viable.)

Alternatively, there’s also a manual route (without Palladian), but I wouldn’t recommend it: if you run into an issue in the future, you’ll need some HTML knowledge and the ability to spot patterns to fix it. Here’s what I get if I break things down manually:
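
A rough Python equivalent of that manual idea, with assumed markup and an illustrative pattern (neither is taken from the actual page source):

```python
import re

# Assumed label/value table rows, for illustration only.
page = """
<tr><td>Product description</td><td>Terminal box KX</td></tr>
<tr><td>Gross weight</td><td>2.1 kg</td></tr>
"""

# Pull out each label/value pair with a pattern instead of an XPath query.
pairs = dict(re.findall(r"<td>([^<]+)</td><td>([^<]+)</td>", page))
print(pairs.get("Gross weight"))  # '2.1 kg'
```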

So many ways you can approach things in KNIME :smiley:

Ok, got it, thank you so much for the insight. I have installed NodePit and Palladian and would be glad to learn the alternative route.
And yes, there are so many possibilities in KNIME and so much to explore!

This is the Palladian route.

As you can see, there are possible “outliers” on certain webpages. Rest assured this has nothing to do with how the XPath query is written here; rather, the webpage adds two additional rows to its table on that particular page, giving it a slightly different layout than the rest.

You will need to inspect your results before exporting them to your Excel file; otherwise you’ll end up with these outliers.

To deal with the outliers, you’d have to manually isolate the affected rows from the output and re-run the workflow with a different XPath query (see the sketch below). I think you’ll have no trouble doing that on your own, but let me know if you need help.
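
In Python terms, the isolate-and-requery idea looks roughly like this (the second article number and its missing value are hypothetical):

```python
# First-pass results per article number; None marks an outlier page
# where the query returned nothing. Data is hypothetical.
results = {"1550000": "Terminal box KX", "1550010": None}

# Isolate the affected rows, then re-run just these with the adjusted query.
outliers = [article for article, value in results.items() if value is None]
print(outliers)  # ['1550010']
```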

Workflow :arrow_heading_down:

Webscraping Using Xpath by badger101.knwf (16.5 KB)


Can’t thank you enough! This is super helpful; I will definitely try this out :+1:


Update @HSCH: use this revised version if you’d like to manually inspect for oddities. It returns all rows containing the strings you want, regardless of how many copies there are in the corresponding webpage table, and regardless of their row positions when the page has more rows than usual.

In this case, the table has two rows containing the substring “Product description”, but just one row each for Gross weight and Net weight.

Using this info, you can configure the last node (GroupBy) accordingly. Suppose you want the last non-missing row for the Product description; here’s how you do it:
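
If it helps to see the logic outside KNIME, here is the same “last non-missing per group” idea sketched in pandas (column names and data are assumptions):

```python
import pandas as pd

# Hypothetical extraction output: one row per matched table entry.
rows = pd.DataFrame({
    "article": ["1550000", "1550000", "1550000"],
    "field":   ["Product description", "Product description", "Net weight"],
    "value":   [None, "Terminal box KX", "1.9 kg"],
})

# Drop missing values, then keep the last remaining value per group --
# the equivalent of a "last" aggregation in the GroupBy node.
last_non_missing = (rows.dropna(subset=["value"])
                        .groupby(["article", "field"])["value"]
                        .last())
print(last_non_missing)
```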


Revised workflow :arrow_heading_down:

Webscraping Using Xpath by badger101.knwf (27.7 KB)