I used the HTML Parser node to get content from a website (Vogue). Now I want to extract only the text of the articles. My problem is that I don't know how to write the path in the XPath node. Could you help me, please?
Here is the link to my workflow:
KNIME_project8.knwf (12.4 KB)
If you just want the headlines of the homepage, you can use the query
//h3/text() with return type String cell and Multiple tag options = Multiple Rows. To retrieve the text from the article pages, you can use
//section[@data-test-id='ArticleBodyContent']//*/text(). This returns one row for each HTML element with text in the article. You can then use, for example, a Column Expressions node with this code to put everything together:
var str = str ? (str + " " + column("article")) : column("article")
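For anyone who wants to check the logic outside KNIME, here is a rough Python sketch of the same two steps: collect every text fragment under the article body section, then join the fragments with spaces. The sample markup and the function names are made up for illustration; only the data-test-id value comes from the query above. The standard library's xml.etree.ElementTree supports only a subset of XPath (no text()), so the sketch uses itertext() as a stand-in and assumes well-formed markup.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for an article page. Only the
# data-test-id value is taken from the XPath query in this thread.
SAMPLE = """
<html><body>
  <h3>Some headline</h3>
  <section data-test-id="ArticleBodyContent">
    <p>First paragraph.</p>
    <div><p>Second paragraph.</p></div>
  </section>
</body></html>
"""


def article_fragments(markup):
    """Return the non-empty text fragments inside the article body section."""
    root = ET.fromstring(markup.strip())
    # ElementTree understands this attribute predicate, but not text().
    section = root.find(".//section[@data-test-id='ArticleBodyContent']")
    if section is None:
        # No article body on this page (e.g. a homepage): nothing to return.
        return []
    # itertext() walks all text below the section, in document order.
    return [t.strip() for t in section.itertext() if t.strip()]


def join_fragments(fragments):
    """Concatenate fragments with single spaces, like the expression above."""
    return " ".join(fragments)


fragments = article_fragments(SAMPLE)
article_text = join_fragments(fragments)
print(article_text)
```

Note that a page without the ArticleBodyContent section simply yields an empty list, which is what an XPath with no matches looks like.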
Thank you for your help!
I did as you said and managed to extract the headings, but now I have problems retrieving the full text from the article pages.
//section[@data-test-id='ArticleBodyContent']//*/text() does not work, and I don't really know why.
Could you help me, please?
Here is my workflow:
KNIME.knwf (12.6 KB)
I don’t have access to the files you are loading in the workflow. Can you attach them as well? Otherwise I’ll have to use the Webpage Retriever, which might give different results.
Here is the file:
gecrawlte Seite.zip (64.6 KB)
Your data contains only the HTML file for the homepage, not the ones for the individual articles. This is why the second XPath does not return anything. For the articles, you need to download their HTML files as well.
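As a rough illustration of that extra step, the Python sketch below collects article links from a homepage so that each page can be downloaded separately. Everything specific here is an assumption: the sample markup, the example.com base URL, and the idea that article links contain "/article/" are placeholders, not the real structure of the site.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect unique absolute URLs of links that look like articles."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Assumption: article links contain "/article/" (hypothetical pattern).
        if href and "/article/" in href:
            url = urljoin(self.base_url, href)
            if url not in self.links:
                self.links.append(url)


# Hypothetical homepage snippet, not the real site markup.
HOMEPAGE = """
<html><body>
  <a href="/article/first-story">First</a>
  <a href="/article/second-story">Second</a>
  <a href="/about">About</a>
  <a href="/article/first-story">First again</a>
</body></html>
"""

collector = LinkCollector("https://example.com/")
collector.feed(HOMEPAGE)
article_urls = collector.links
print(article_urls)
```

Each collected URL would then have to be fetched (in KNIME, for example with the Webpage Retriever) before the article XPath can return any rows.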