I would like to extract all (relevant) URLs from a website, e.g. https://www.stepstone.de/jobs/data-science, to read out the text with the ‘Web Text Scraper’ node.
The ‘Get Request’ and ‘Webpage Retriever’ nodes did not work in my tests (error: ‘Read timed out’; timeout (s): 20).
I just tried to open the page in a browser and the request timed out as well. So one explanation is that the server isn’t responding in time.
Looking at the response with Postman, however, suggests that the page uses JavaScript to block scraping. If you’d like to continue building your workflow in the meantime, I’d suggest running the search manually in a browser and saving the page as HTML. You can then read the saved page in KNIME with the XML Reader and build the downstream processing on top of it. Your best option for getting the page content directly might be the Python Integration together with Beautiful Soup.
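To give an idea of what the link extraction could look like once you have the HTML saved, here is a minimal sketch. It uses only Python's standard library (`html.parser` and `urllib.parse`) as a dependency-free stand-in for Beautiful Soup, so it is not Beautiful Soup code itself. The names `LinkCollector`, `extract_links` and `must_contain`, as well as the sample HTML and its `/jobs/example-…` links, are made up for illustration; the real page will almost certainly use a different link pattern.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # Turn relative links into absolute ones and skip duplicates.
            absolute = urljoin(self.base_url, href)
            if absolute not in self.links:
                self.links.append(absolute)


def extract_links(html, base_url, must_contain=""):
    """Return unique absolute links whose URL contains `must_contain`."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return [url for url in parser.links if must_contain in url]


# Hypothetical sample standing in for the manually saved page.
sample_html = """
<html><body>
  <a href="/jobs/example-1">Job 1</a>
  <a href="/jobs/example-2">Job 2</a>
  <a href="/jobs/example-1">Job 1 again</a>
  <a href="https://www.stepstone.de/about">About</a>
  <a>no href</a>
</body></html>
"""

job_links = extract_links(
    sample_html,
    base_url="https://www.stepstone.de/jobs/data-science",
    must_contain="/jobs/",  # assumed filter for "relevant" links
)
print(job_links)
```

For a real run you would replace `sample_html` with the content of the saved file, for example `open("saved_page.html", encoding="utf-8").read()` (the file name is an assumption). The resulting list of URLs could then be passed on to the ‘Web Text Scraper’ node.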
Hi
In addition to @stelfrich’s great post, I would suggest having a look at their API. A first search result suggests that they have one, so that would be my preferred way to go.
br