How to analyse a website

Hi,

I'm trying to familiarize myself with KNIME; I am an absolute KNIME beginner.

My goal is to use KNIME to evaluate who is working with which technology. As a first step, I want to evaluate the website "http://3druck.com/". The website lists various short articles, which brings me to my first problem: how can I automatically load each article into KNIME? Is that even possible?

After searching the forum I came across the "HttpRetriever" node. But when I open the "HttpResult output", it only shows the address of the website and question marks.

Many thanks in advance!

Hi, do you already know about our Example Server? It includes various text mining workflows, at least two of which analyze the KNIME forum itself. You could adapt those for analyzing the website.

Best, Iris

One additional note concerning the HttpRetriever: if you get ? (missing cell) results, retrieval of a URL failed. Additional details about the reason should be shown in the log output.

Philipp

Hi Iris, Hi Philipp,

many thanks for your answers!

I know the Example Server; I tried to adapt the "Palladian_01 Parse a website" workflow. I had some trouble with the XPath node, but in the end I was able to extract the listed articles from the website.

I solved the problem as follows:

  • get "all" HTML links to the articles
    • "Table Creator" with the address of the website --> "HttpRetriever" --> "HtmlParser" --> "XPath" --> "Column Filter"
  • get the content of each link
    • --> "HttpRetriever" --> "HtmlParser" --> "XPath" --> "Column Filter"
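For readers who want to see the first stage (collecting the article links) outside KNIME, here is a minimal Python sketch using only the standard library. It assumes, purely for illustration, that each article is wrapped in an `<article>` element; the real markup of 3druck.com is not shown in this thread, and the names `LinkCollector`, `extract_article_links` and `SAMPLE` are made up for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects href values of <a> tags nested inside <article> elements."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.article_depth = 0  # > 0 while we are inside an <article>
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self.article_depth += 1
        elif tag == "a" and self.article_depth > 0:
            href = dict(attrs).get("href")
            if href:
                url = urljoin(self.base_url, href)  # resolve relative links
                if url not in self.links:           # skip duplicates
                    self.links.append(url)

    def handle_endtag(self, tag):
        if tag == "article" and self.article_depth > 0:
            self.article_depth -= 1


def extract_article_links(html, base_url):
    """Return the article links found in html, in document order."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.links


# Hypothetical page snippet standing in for the downloaded HTML.
SAMPLE = """
<html><body>
<article><h2><a href="/post-1/">First</a></h2></article>
<article><h2><a href="http://3druck.com/post-2/">Second</a></h2></article>
<footer><a href="/impressum/">Imprint</a></footer>
</body></html>
"""

print(extract_article_links(SAMPLE, "http://3druck.com/"))
```

The second stage would download each returned URL and apply a similar extraction to the article body, which mirrors the second HttpRetriever/HtmlParser/XPath chain.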

But I have a new problem: at the moment only the first 36 articles of the website can be extracted. I guess that's because you must click "Mehr laden" ("Load more") at the bottom of the page to view more articles. How can I handle this?

Is there another way to extract articles from websites? Otherwise I have to adapt the "XPath" for every new website I want to parse.


Many thanks in advance!

Good evening,

in case you need to follow some "Get more results" link there are several options, depending on how the results are pulled into the web page:

In case the "Get more" link simply performs a page refresh, you could try building all possible URLs in advance and then process them sequentially. Traditionally, pagination is done by adding some parameter to your query URLs, such as http://example.com/results?page=1, http://example.com/results?page=2, …, http://example.com/results?page=n. If that's the case you can simply build up a list of all subsequent URLs (or better, use some looping nodes), fetch them using the HttpRetriever and extract the desired data.
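As a rough illustration of building such a URL list up front, here is a small Python sketch. The parameter name `page` and the base URL are just the example values from above; a real site may use a different parameter or an offset instead of a page number.

```python
from urllib.parse import urlencode


def build_page_urls(base_url, last_page, param="page", first_page=1):
    """Return one URL per result page: ...?page=1, ...?page=2, and so on."""
    if last_page < first_page:
        return []
    return [
        f"{base_url}?{urlencode({param: n})}"
        for n in range(first_page, last_page + 1)
    ]


for url in build_page_urls("http://example.com/results", 3):
    print(url)
```

In KNIME the resulting list would correspond to the rows fed into the HttpRetriever, either from a table or from a loop.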

However, it is much more common nowadays that additional results are integrated into a loaded web page dynamically through JavaScript, without a page reload (this is also the case for the sample URL http://3druck.com in your original post). Sometimes, pages append additional results when you scroll down the browser window (called "infinite scroll"). This makes extracting data more complicated. In general there are two options. The first: have a look at the JavaScript source of the corresponding web page, understand the logic and the necessary AJAX requests for retrieving results (usually, JSON data is used) and build a KNIME workflow which fetches your desired data (you can use a combination of the HttpRetriever and the JSON nodes).
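To make the AJAX idea a bit more concrete, here is a hedged Python sketch of the paging loop. Everything about the payload is an assumption for illustration: the keys `posts`, `title`, `url` and `has_more` are invented, and the real endpoint and JSON layout have to be read from the site's JavaScript or the browser's network inspector. The fetch step is passed in as a function so the sketch runs without network access.

```python
import json


def collect_articles(fetch_page, max_pages=50):
    """Call fetch_page(n) for n = 1, 2, ... until the payload reports no more results.

    fetch_page must return the raw JSON text for the given page number.
    """
    articles = []
    for page in range(1, max_pages + 1):
        payload = json.loads(fetch_page(page))
        # Hypothetical layout: {"posts": [{"title": ..., "url": ...}], "has_more": bool}
        articles.extend((p["title"], p["url"]) for p in payload["posts"])
        if not payload.get("has_more"):
            break
    return articles


# Canned responses standing in for the real AJAX endpoint.
FAKE_RESPONSES = {
    1: '{"posts": [{"title": "A", "url": "http://example.com/a"}], "has_more": true}',
    2: '{"posts": [{"title": "B", "url": "http://example.com/b"}], "has_more": false}',
}

print(collect_articles(FAKE_RESPONSES.__getitem__))
```

In a real run, `fetch_page` would wrap an HTTP request to the endpoint the page itself calls; the `max_pages` cap is only a safety net against endless loops.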

A further option is the Selenium Nodes, which we (i.e. the developers of the Palladian nodes) have created: Selenium allows you to drive a real web browser. This way, you can mimic human interaction (in your case, repeated scrolling or clicking on the "Mehr laden" link) and then extract your data. The Selenium Nodes are currently in beta; if you want to have a look, you can download them here. There is also a sample workflow available here (search results scraping) which is quite similar to your use case. If you encounter any issues using the Selenium Nodes, do not hesitate to get in touch (I also speak German, if that's your preferred language ;-) ).

Concerning your second question about the necessary adaptation for every website: this is usually the way you will have to go. You can, however, try to make your XPath queries more generic, which might work if your source data comes from a single topical domain, e.g. only news data (for example, I have successfully built workflows which used a dictionary of positive/negative indicators for selecting link elements from a DOM tree, and this worked quite well and accurately across different websites).
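To give a rough idea of what such an indicator dictionary could look like, here is a small Python sketch. It is only a guess at the general technique, not the actual workflow mentioned above: the word lists, the scoring rule and the threshold are all invented for illustration.

```python
# Hypothetical indicator words; a real list would be tuned to the target sites.
POSITIVE = ("article", "post", "news", "story")
NEGATIVE = ("login", "share", "tag", "category", "impressum", "contact")


def score_link(href, css_class=""):
    """Count positive minus negative indicator words in the URL and class name."""
    text = f"{href} {css_class}".lower()
    return sum(w in text for w in POSITIVE) - sum(w in text for w in NEGATIVE)


def select_article_links(candidates, threshold=1):
    """Keep the links whose score reaches the threshold."""
    return [href for href, css in candidates if score_link(href, css) >= threshold]


# (href, class attribute) pairs as they might come out of a generic //a query.
CANDIDATES = [
    ("http://example.com/news/new-printer", "post-title"),
    ("http://example.com/login", "nav"),
    ("http://example.com/category/news", "menu"),
    ("http://example.com/about", ""),
]

print(select_article_links(CANDIDATES))
```

The point of the design is that the XPath stays generic (all links) and the site-specific knowledge moves into two short word lists, which are easier to adjust than a query per website.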

Sorry this post turned out to be quite long, still hope I could give some insights :)

Best,
Philipp

Hi Philipp,

the fact that you speak German makes things considerably easier! ;-)

First of all, many thanks for your answer!

The solution with the "Selenium Nodes" sounds extremely interesting. I have already installed these beta nodes, as well as the sample workflow "SeleniumWebScraping". However, the sample workflow does not run properly for me (yet); so far I have only tested the upper path of the sample workflow. I installed the InternetExplorerDriver and selected it in the "WebDriver Factory" node of the sample workflow. When I execute the "Start WebDriver" node, IE does open with the Google website. But I don't know how to proceed from there. Do I have to enter a search term into the Google page that opens and then execute the remaining nodes "Find Elements", "Send Keys" ...? The configuration of the "Send Keys" node contains the text "did han shoot first?"; I assume that this is the search term?!

So far I only get an empty result table; this is also reported in the KNIME console for all nodes from "Find Elements" onwards.

Many thanks and best regards

Simon

I still haven't been able to get it running, but at least I now get an error in the KNIME console ;-). The workflow is as follows:
"WebDriver Factory" (InternetExplorerDriver is selected here) --> "Start WebDriver" --> "Find Elements" --> "Send Keys" --> ...

Right-clicking "Start WebDriver" and choosing Execute opens the Google website; right-clicking "Send Keys" and choosing Execute leads to an error (red X) in the upstream node "Find Elements". The following error is shown:

ERROR Find Elements 2:5 Execute failed: Unable to find elements on closed window (WARNING: The server did not provide any stacktrace information)

Command duration or timeout: 21 milliseconds

Build info: version: 'unknown', revision: 'unknown', time: 'unknown'

System info: host: 'SurfacePro', ip: XXX, os.name: 'Windows 8', os.arch: 'amd64', os.version: '6.2', java.version: '1.7.0_60'

*** Element info: {Using=name, value=q}

Session ID: 411ba696-a8b7-4140-9ff3-e01ef81ba1cf

Driver info: org.openqa.selenium.ie.InternetExplorerDriver

Capabilities [{platform=WINDOWS, javascriptEnabled=true, elementScrollBehavior=0, ignoreZoomSetting=false, enablePersistentHover=true, ie.ensureCleanSession=false, browserName=internet explorer, enableElementCacheCleanup=true, unexpectedAlertBehaviour=dismiss, version=11, pageLoadStrategy=normal, ie.usePerProcessProxy=false, cssSelectorsEnabled=true, ignoreProtectedModeSettings=false, requireWindowFocus=false, initialBrowserUrl=http://localhost:43296/, handlesAlerts=true, ie.forceCreateProcessApi=false, nativeEvents=true, browserAttachTimeout=0, ie.browserCommandLineSwitches=, takesScreenshot=true}]

Can you make anything of this?

Many thanks and best regards

Simon

 

Hi Simon,

I'm going to answer in English, as this might be of general relevance. If you have any further questions, simply drop me an e-mail to mail@seleniumnodes.com and we can also discuss in German.

The goal of the Selenium Nodes is to automate a browser, so when you're running a workflow, you should usually not interact with the browser window yourself, but let everything be done by the workflow you're executing. For the sample workflow, try the following steps: reset the "Start WebDriver" node, whose description says "open google.com". Then select the last node in the workflow, which is a "Column Filter" node with the description "search results". Execute this node, and all previous nodes should execute in sequence and perform their actions in the opened web browser. The result of the Column Filter node should then be some search results.

Important: do not close the browser window during execution. I suspect the error "ERROR Find Elements 2:5 Execute failed: Unable to find elements on closed window (WARNING: The server did not provide any stacktrace information)" was caused by a closed browser window.

Hope that helps.

Best,
Philipp

Hi Philipp,

(I'll gladly take you up on the offer to contact you by e-mail.)

many thanks for your answer!

I was not able to make the workflow work with the IE driver or the Opera/Chrome driver. (I installed both browsers and configured the respective drivers in the KNIME preferences, and I didn't close the browser window while the workflow was running.)

Finally I realised that you can choose the Firefox browser in the configuration of the "WebDriver Factory" node. That works really well. The Selenium Nodes are really outstanding!

 

Best

Simon