Scraping an AngularJS Website Post Load

qqilihq · August 30, 2019, 3:30pm

Thanks for the pointer @ScottF!

@mpfeifer14 The HTTP Retriever, the GET Request node, and similar techniques only download the static HTML and will thus not be of any help. AngularJS requires a JavaScript environment, which is only available in a JavaScript-capable web browser. Without it, you’ll see the website as an ancient browser without JavaScript would show it, which means placeholders instead of the actual content. So in the end this is not a question of the right XPaths, but of the way you download/“execute” the website.
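
To illustrate the placeholder problem outside of KNIME, here is a minimal Python sketch. The markup, the `shop` app, the `ProductList` controller, and the `extract_names` helper are all made up for illustration; the snippet only stands in for what a plain HTTP download of an AngularJS page might look like. The XPath itself matches fine, but since no JavaScript has run, it returns the raw template expression rather than any product names.

```python
import xml.etree.ElementTree as ET

# Hypothetical page source as a plain HTTP client would receive it.
# No JavaScript has run, so the AngularJS template is still unexpanded.
STATIC_HTML = """
<html>
  <body ng-app="shop">
    <ul ng-controller="ProductList">
      <li ng-repeat="p in products"><span class="name">{{ p.name }}</span></li>
    </ul>
  </body>
</html>
""".strip()


def extract_names(markup):
    """Return the text of every <span class="name"> found in the markup."""
    root = ET.fromstring(markup)
    return [el.text for el in root.findall(".//span[@class='name']")]


names = extract_names(STATIC_HTML)
print(names)  # the single, unrendered template placeholder
```

The same query against the DOM of a real browser session would see one `span` per product with the actual names filled in, which is why the download step matters more than the XPath.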

With the Selenium Nodes you have the JS environment mentioned above, so the dynamic content will be rendered. You can then extract it using XPath/CSS and interact with the page as a human being would.

As a disclaimer: As you have probably noticed already, the Selenium Nodes are a paid product. This is in contrast to Palladian, which we (i.e. me and colleagues, independently from KNIME) provide for free for the regular KNIME platform. Both require considerable maintenance effort, and we provide extensive support for free; the paid licenses are the way to fund these efforts.

You can check whether the Selenium Nodes work for your use case with our free one-month trial licenses, and I can assist with any specific questions (best placed in the Palladian/Selenium sub forum).

Best,
Philipp