Hey,
I used the “HTML Parser” node to get content from a website (Vogue). Now I want to extract only the text of the articles. My problem is that I don’t know how to write the path in the “XPath” node. Could you help me, please?
Hi,
if you just want the headlines of the homepage, you can use the query //h3/text() with return type String cell and Multiple tag options = Multiple Rows. To retrieve the text from the article pages, you can use //section[@data-test-id='ArticleBodyContent']//*/text(). This gives one row for each HTML element with text in the article, and you can then use, for example, a Column Expressions node with this code to put everything together:
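Outside KNIME, the same two queries and the final joining step can be approximated in a few lines of Python. This is only a sketch and not the Column Expressions code mentioned above. It assumes small, well-formed sample pages (HOMEPAGE and ARTICLE are invented stand-ins), and because the standard-library ElementTree supports only a subset of XPath without text(), the element's .text and itertext() are used in its place. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Invented, well-formed stand-ins for the real pages (assumption).
HOMEPAGE = """<html><body>
<h3>First headline</h3>
<h3>Second headline</h3>
</body></html>"""

ARTICLE = """<html><body>
<section data-test-id="ArticleBodyContent">
<p>First paragraph.</p>
<p>Second <em>paragraph</em>.</p>
</section>
</body></html>"""


def headlines(html):
    """Rough counterpart of //h3/text(): one entry per <h3> with text."""
    root = ET.fromstring(html)
    return [h.text.strip() for h in root.iter("h3") if h.text and h.text.strip()]


def article_rows(html):
    """Rough counterpart of the article query: one entry per piece of text."""
    root = ET.fromstring(html)
    section = root.find(".//section[@data-test-id='ArticleBodyContent']")
    if section is None:  # the page has no article body
        return []
    return [t.strip() for t in section.itertext() if t.strip()]


def article_text(html):
    """Put everything together: join all text and normalize whitespace."""
    root = ET.fromstring(html)
    section = root.find(".//section[@data-test-id='ArticleBodyContent']")
    if section is None:
        return ""
    return " ".join("".join(section.itertext()).split())


print(headlines(HOMEPAGE))
print(article_rows(ARTICLE))
print(article_text(ARTICLE))
```

Real pages are rarely well-formed XML, so a lenient HTML parser would be needed in practice; the sketch only illustrates what the two queries select and how the rows are merged into one string.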
Hey,
thank you for your help!
I tried to do as you said and I managed to extract the headings, but now I have problems with retrieving the whole text from the article pages.
//section[@data-test-id='ArticleBodyContent']//*/text() does not work, but I don’t really know why.
Could you help me, please?
Hi,
I don’t have access to the files you are loading in the workflow. Can you attach them as well? Otherwise I’ll have to use the Webpage Retriever, which might give different results.
Kind regards,
Alexander
Hi,
your data contains only the HTML file for the homepage, but not the ones for the individual articles. This is why the second XPath does not return anything. For the articles you need to download their HTML files as well.
Kind regards
Alexander
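To illustrate the point about the missing article files, here is a small Python sketch. It checks whether a page contains the article section at all and collects the article links from a homepage, so that those pages can be downloaded next. Everything specific here is an assumption: the sample HOMEPAGE markup, the "/article/" link pattern, the placeholder BASE_URL, and the function names are made up for illustration and are not taken from the actual site.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

BASE_URL = "https://example.com/"  # placeholder, not the real site

# Invented, well-formed stand-in for the homepage (assumption).
HOMEPAGE = """<html><body>
<h3><a href="/article/first-story">First headline</a></h3>
<h3><a href="/article/second-story">Second headline</a></h3>
<a href="/about">About</a>
</body></html>"""


def has_article_body(html):
    """True only if the page contains the article body section."""
    root = ET.fromstring(html)
    return root.find(".//section[@data-test-id='ArticleBodyContent']") is not None


def article_urls(html, base_url=BASE_URL, marker="/article/"):
    """Collect absolute URLs of links whose href contains the marker."""
    root = ET.fromstring(html)
    urls = []
    for a in root.iter("a"):
        href = a.get("href", "")
        if marker in href:
            url = urljoin(base_url, href)
            if url not in urls:  # keep first occurrence, drop duplicates
                urls.append(url)
    return urls


print(has_article_body(HOMEPAGE))  # the homepage has no article section
print(article_urls(HOMEPAGE))      # pages that still need to be downloaded
```

The article query can only return rows once the pages behind these links have been fetched and parsed in the same way as the homepage.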