Failing to scrape with XPath

yxlyxl8 · April 3, 2023, 2:56am

I want to scrape the news titles and times from https://botanwang.com/top_right_news

My XPath for the title is: //div[@id="block-system-main"]/div/div/div/div/ul/li/span[1]/span/a

I have tested it many times and I believe it is correct, but it returns nothing.

My workflow is simple: Table Creator → Webpage Retriever → XPath

AlexanderFillbrunn · April 3, 2023, 7:10am

Hi @yxlyxl8,
I think the problem is that the document declares a default namespace at the top: xmlns="http://www.w3.org/1999/xhtml". In the XPath node, you can see this in the second tab.
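
To illustrate why an unprefixed query comes back empty, here is a minimal sketch in Python using the standard library's xml.etree.ElementTree rather than the KNIME XPath node. The XHTML string is a made-up, heavily simplified stand-in for the page (only the xmlns declaration is taken from above), so treat it as an assumption, not the real markup.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the page; only the xmlns declaration mirrors the real document.
XHTML = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <div id="block-system-main">
      <ul><li><a href="/node/1">Example title</a></li></ul>
    </div>
  </body>
</html>"""

root = ET.fromstring(XHTML)

# The default namespace becomes part of every element's name.
print(root.tag)

# An unprefixed step such as "div" only matches elements in no namespace,
# so this query matches nothing even though the div is clearly there.
unprefixed = root.findall(".//div[@id='block-system-main']")
print(len(unprefixed))
```

The query is not wrong structurally; it simply asks for div elements in no namespace, while every element in the document lives in the XHTML namespace.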

To fix your query, you need to prefix each tag with dns:. I would extract the titles with the XPath query:

//dns:div[@class='content']//dns:div[@class='item-list']/dns:ul/dns:li//dns:a/text()

and you can get the times in a similar fashion. Instead of explicitly descending into every element along the path (/div/div/div/div/ul/li/span[1]/span/) it is usually more robust to use the // operator, which searches in all descendants, and couple it with a stable filter by element attributes, like I did above with the @class attribute.
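
The same prefixed query can be tried outside KNIME with a rough Python sketch like the one below, again using xml.etree.ElementTree. Several things here are assumptions: the sample markup and headlines are invented, the prefix-to-URI mapping is passed by hand (in this sketch the name dns is just a label I chose to match the query above), the path starts with .// because ElementTree does not accept absolute paths on an element, and the titles are read via .text because ElementTree does not support text().

```python
import xml.etree.ElementTree as ET

# Invented sample markup; the real page structure may differ.
XHTML = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <div class="content">
      <div class="item-list">
        <ul>
          <li><span><span><a href="/node/1">First headline</a></span></span></li>
          <li><span><span><a href="/node/2">Second headline</a></span></span></li>
        </ul>
      </div>
    </div>
  </body>
</html>"""

# Map the prefix used in the query to the document's default namespace URI.
NS = {"dns": "http://www.w3.org/1999/xhtml"}

root = ET.fromstring(XHTML)

# Every step carries the dns: prefix; // searches all descendants.
links = root.findall(
    ".//dns:div[@class='content']//dns:div[@class='item-list']/dns:ul/dns:li//dns:a",
    NS,
)
titles = [a.text for a in links]
print(titles)
```

Note that ElementTree compares @class as an exact string, so an element with several classes would need a different filter.
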
Kind regards,
Alexander

yxlyxl8 · April 5, 2023, 9:26am

Dear Alexander,

Thanks for your help!!! I have successfully scraped the titles and times!

Thanks again,
lxy
