Scraping an AngularJS Website Post Load

mpfeifer14 · August 29, 2019, 12:34am

Hello Everyone,
I was wondering if someone could help. I am trying to scrape an AngularJS website. When I use the HTTPRetriever, it only returns the HTML with the placeholders still in place, without the variable values filled in. Does anyone know how to work around this? Is there a way to let the page load fully first, or a special trick for retrieving XPaths on Angular websites?

Preferably without having to use a Selenium node, but that might be overly picky.

Thanks in advance!

ScottF · August 30, 2019, 2:37pm

Hi @mpfeifer14 -

Welcome to the forum! Since both the Palladian nodes (of which HTTPRetriever is a part) and the Selenium nodes are maintained by @qqilihq, I’ll tag him here and see if he has a good suggestion for you.

qqilihq · August 30, 2019, 3:30pm

Thanks for the pointer @ScottF!

@mpfeifer14 The HTTP Retriever (or the GET Request node), like any other technique that just downloads the static HTML, will not be of any help, as AngularJS requires a JavaScript environment, which is only available in a JavaScript-capable web browser. Without that, you’ll see the website as it would appear in an ancient browser without JavaScript (which means placeholders instead of the actual content). So in the end this is not a question of the right XPaths, but of the way you download/“execute” the website.

With the Selenium Nodes you have the JS environment mentioned above, so the dynamic content will be rendered. You can then extract it using XPath/CSS and interact with the page as a human would.
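
To illustrate why the XPath itself is not the problem, here is a minimal Python sketch using only the standard library. The markup, the `extract` helper, and the `has_unrendered_placeholders` helper are all made up for illustration and are not part of Palladian, the Selenium Nodes, or any real site. It applies the same XPath to a template as a static download would return it and to the same template as a browser would show it after AngularJS has run.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical markup, invented for this example.
# What a plain HTTP download returns: the AngularJS template, not yet evaluated.
STATIC_HTML = """<div ng-app="shop">
  <h1 class="name">{{ product.name }}</h1>
  <span class="price">{{ product.price }}</span>
</div>"""

# What the DOM looks like after a JavaScript-capable browser has run AngularJS.
RENDERED_HTML = """<div ng-app="shop">
  <h1 class="name">Widget</h1>
  <span class="price">9.99</span>
</div>"""

# AngularJS interpolation markers: {{ expression }}
PLACEHOLDER = re.compile(r"\{\{.*?\}\}")


def extract(html, xpath):
    """Return the stripped text of the first node matching xpath, or None."""
    root = ET.fromstring(html)
    node = root.find(xpath)
    return None if node is None else (node.text or "").strip()


def has_unrendered_placeholders(html):
    """True if the markup still contains {{ ... }} interpolation markers."""
    return PLACEHOLDER.search(html) is not None


XPATH = ".//h1[@class='name']"

print(extract(STATIC_HTML, XPATH))    # {{ product.name }}
print(extract(RENDERED_HTML, XPATH))  # Widget
```

The XPath matches in both cases; only the rendered markup contains the actual value. A check like `has_unrendered_placeholders` can be a quick way to tell whether a downloaded page still needs a JavaScript environment before extraction makes sense.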

As a disclaimer: As you have probably noticed already, the Selenium Nodes are a paid product (in contrast to Palladian, which my colleagues and I, independently from KNIME, provide for free for the regular KNIME platform, despite the considerable maintenance effort for both and the extensive support which we provide for free). The paid licenses are the way to fund these efforts.

You can check whether the Selenium Nodes work for your use case with our free one-month trial licenses, and I can assist with any specific questions (best placed in the Palladian/Selenium subforum).

Best,
Philipp

mpfeifer14 · August 30, 2019, 9:26pm

This worked well! The Selenium nodes operate a little differently because they approach things from the front end in. That makes them a very versatile way to do this, and they are simple in how they operate. This solved all of my issues crawling an AngularJS website in KNIME.

Thanks for the help!

system · February 29, 2020, 9:27am

This topic was automatically closed 182 days after the last reply. New replies are no longer allowed.