Webcrawler Workflow

Hi Everybody,

I am quite new and I want to try the webcrawler workflow to analyze a website and extract all of its links.

I used the following nodes (See attachment).

Now I have some problems with the XPath node. The XML code is the following (see attachment).

Now I want to extract all the links from this website.

Can anybody help me with filling in the XPath dialog?

Which XPath query should I use to get all the links separately?

Many thanks in advance.

 

Best

Vanessa

 

 

 

 

If you want to extract all hyperlinks in a document, the query should be //a. (If you actually want 'link' elements, which are usually only used in a page's header to link to RSS feeds, etc., the query would be //link.)
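For illustration outside of KNIME, here is a minimal sketch (Python with lxml; the sample document is made up) of what the two queries match:

from lxml import etree

# A made-up page: one <link> element in the header, two <a> hyperlinks in the body.
html = """<html>
  <head>
    <link rel="alternate" type="application/rss+xml" href="/feed.xml"/>
  </head>
  <body>
    <a href="https://example.org/page1">Page 1</a>
    <a href="https://example.org/page2">Page 2</a>
  </body>
</html>"""

doc = etree.fromstring(html)

# //a returns the hyperlinks ...
print([a.get("href") for a in doc.xpath("//a")])     # ['https://example.org/page1', 'https://example.org/page2']

# ... while //link only returns the header's <link> elements.
print([l.get("href") for l in doc.xpath("//link")])  # ['/feed.xml']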

Philipp

Thank you Philipp,

I tried both, but there is always a "?" in the XPath column. :/ What am I doing wrong?

 

Best

Vanessa

Hi,

the query needs to be prefixed with the namespace prefix. In the XPath node's default configuration, this should be dns, so the query is //dns:a (check the namespace tab and the XML file for the xmlns attribute).
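As a sketch of the same idea outside of KNIME (Python with lxml; the namespace URI below is the usual XHTML one, check the xmlns attribute of your actual document):

from lxml import etree

# The document declares a default namespace via xmlns, so a plain //a finds nothing;
# the query has to use a prefix bound to that namespace URI ("dns" here, mirroring
# the XPath node's default configuration).
xhtml = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <a href="https://example.org/jobs">Jobs</a>
  </body>
</html>"""

doc = etree.fromstring(xhtml)
ns = {"dns": "http://www.w3.org/1999/xhtml"}

print(doc.xpath("//a", namespaces=ns))      # [] -- no match without the prefix
print(doc.xpath("//dns:a", namespaces=ns))  # matches the <a> element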

Kind regards,
Philipp

Ahh, OK, now it works with this example. Thank you.

Now I need to try it with this site: https://recruiting.bmwgroup.de/ibs/Servlets/ibs/controller/sm

Unfortunately, the workflow is not working with this site, because the preceding page is this one: https://recruiting.bmwgroup.de/ibs/Servlets/ibs/controller/sm?event=__activate_and_reset&target=smerweitertesuche&sprache=de

On this page I set a filter to show all relevant jobs, and it is the resulting page that I actually want to crawl. But here the workflow does not work: when I use it, the page turns into the error page ("This site is not available anymore"), the same one that appears when refreshing the results page.

 

I hope somebody understands my problem and can help. :/

 

Thank you in advance.

Vanessa

 

The site you are trying to crawl has dynamically generated content which depends on the session. It is likely that the session is no longer valid when you access it from the workflow, hence the issue.

You can create a valid session by "simulating" an interaction with the site first, like submitting a search, etc. Look at the Selenium nodes; they have this capability.
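To illustrate the idea, here is a minimal sketch in Python with Selenium rather than the KNIME nodes (the submit-button selector is a guess; inspect the real page for the actual one):

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
try:
    # 1. Open the search page first, so the server creates a valid session.
    driver.get("https://recruiting.bmwgroup.de/ibs/Servlets/ibs/controller/sm"
               "?event=__activate_and_reset&target=smerweitertesuche&sprache=de")

    # 2. "Simulate" the interaction, e.g. submit the search form.
    #    (Hypothetical selector -- the real page may use a different element.)
    driver.find_element(By.XPATH, "//input[@type='submit']").click()

    # 3. Only now, within the same session, extract the links from the result page.
    links = [a.get_attribute("href") for a in driver.find_elements(By.XPATH, "//a")]
    print(links)
finally:
    driver.quit()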

Also make sure that what you are doing is allowed. Some sites have very strict policies on what you can automatically "capture" from them. Check their robots.txt.
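A quick way to check that (plain Python, assuming the robots.txt sits at the usual location):

from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://recruiting.bmwgroup.de/robots.txt")
rp.read()

url = "https://recruiting.bmwgroup.de/ibs/Servlets/ibs/controller/sm"
print(rp.can_fetch("*", url))  # False means crawling this path is disallowed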

Cheers,
Marco.

 

Hi Vanessa,

I agree with Marco; this task can be solved better with the Selenium nodes than with Palladian. If you haven't found them yet, the Selenium Nodes are available at seleniumnodes.com. You can also use XPath queries there to extract information (besides various other techniques). In case you have any specific issues or questions, don't hesitate to get in touch, ideally in the Palladian+Selenium subforum.

Good luck!
Philipp

PS: Disclaimer: I'm the main developer of both the Palladian and the Selenium nodes.

PPS: Looking at your screenshots, I would strongly recommend updating to KNIME 3. Version 2.x seems sooo retro now :)

I'd like to stress the importance of the site's terms and conditions as well. Whenever a website says that its content is strictly for personal use, or when it explicitly forbids web crawling, you should refrain from grabbing the information this way.
