XPath for content extraction... once again

Hi,

once again I have an issue with extracting content from a website...
I am trying to extract all vacancies from one website. With the Selenium nodes I was able to extract the title and the link of every vacancy (many thanks to Philipp ;-)). My question is: how can I extract the content behind each "vacancy-link" in this specific case? (I was able to do this for some other websites before.)

The example-workflow is as follows:

Table Creator with one "vacancy-link" to test (http://www.zimmer.com/careers/search/job-details.html?id=QCVFK026203F3VBQBV7V47VNG&nPostingID=7182&nPostingTargetID=22589&mask=zimextus&lg=EN) --> HttpRetriever --> HtmlParser --> XPath ... doesn't work properly

When I open the "vacancy-link" in the IE browser, right-click on the content to be extracted (the grey field) and choose "inspect element" ("Element untersuchen"), it shows the HTML code. I would say the XPath has to refer to something like div id="JD-AllFields". Unfortunately I can't find this line in the configuration of the XPath node. The ContentExtractor doesn't work properly either.

Many thanks to you in advance!

Best

Simon

 

Hi Simon,

neither the ContentExtractor nor the HtmlParser + XPath will work on that page, because the content in the grey box is loaded through an extremely messy series of JavaScripts (you can check this by disabling JavaScript in your web browser and trying to load the page ... you'll see an empty page). In such cases, the Selenium Nodes would be your first choice. You can address elements by ID directly through the FindElements node.
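
To illustrate why the static route comes up empty, here is a minimal Python sketch (standard library only) that checks whether an element with a given id is present in the HTML as delivered by the server, before any JavaScript has run. The two sample documents and the helper name has_element_with_id are made up for illustration; only the id JD-AllFields comes from this thread.

```python
from html.parser import HTMLParser


class IdFinder(HTMLParser):
    """Records whether any start tag carries the wanted id attribute."""

    def __init__(self, wanted_id):
        super().__init__()
        self.wanted_id = wanted_id
        self.found = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current tag
        if dict(attrs).get("id") == self.wanted_id:
            self.found = True


def has_element_with_id(html_text, wanted_id):
    finder = IdFinder(wanted_id)
    finder.feed(html_text)
    return finder.found


# Hypothetical static response: the box is only created later by a script.
static_html = "<html><body><script src='jobs.js'></script></body></html>"
# Hypothetical DOM after the scripts have run in a real browser.
rendered_html = "<html><body><div id='JD-AllFields'>Job text</div></body></html>"

print(has_element_with_id(static_html, "JD-AllFields"))    # False
print(has_element_with_id(rendered_html, "JD-AllFields"))  # True
```

An XPath query on the parsed static document is in the same position as the first call: the node it should match simply is not there yet.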

Best,
Philipp

Hi Philipp,

MANY thanks, once again!!

I'm able to extract the content when I open the vacancy-link with the "start WebDriver":

WebDriver Factory --> Start WebDriver (vacancy-link) --> Find Elements (by id: JDText-Field1) --> Execute JavaScript (return arguments[0].innerHTML;) works for one vacancy! Until now I have always used Selenium to get the link to each search result as a first step and then extracted the content with XPath or the ContentExtractor.
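
For readers who script Selenium directly instead of using the nodes, the same chain might look roughly like the Python sketch below. It is not what the Selenium Nodes do internally. The function names are hypothetical, and run_in_firefox assumes that the selenium package and a Firefox driver are installed. Only the id JDText-Field1 and the JavaScript snippet are taken from the workflow above.

```python
def extract_inner_html(driver, element_id):
    """Return the innerHTML of every element with the given id.

    `driver` is expected to behave like a Selenium WebDriver, i.e. to offer
    find_elements(by, value) and execute_script(script, *args).
    """
    # "id" is the locator strategy string behind Selenium's By.ID
    elements = driver.find_elements("id", element_id)
    return [
        driver.execute_script("return arguments[0].innerHTML;", element)
        for element in elements
    ]


def run_in_firefox(url, element_id="JDText-Field1"):
    """Open `url` in a new browser, extract the content, close the browser."""
    from selenium import webdriver  # imported lazily, only needed here

    driver = webdriver.Firefox()
    try:
        driver.get(url)
        return extract_inner_html(driver, element_id)
    finally:
        driver.quit()
```

On a page that fills in its content via scripts, an explicit wait before the lookup may be necessary; that is left out here to keep the sketch short.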

My next problem ;-) is:

I hope my English is understandable ;-)

Many thanks again!

Best,
Simon

 

 

Hi Simon,

(1) In general, do not perform any actions that change the browser's content while you are working on a table with multiple rows. Simulating a click will work for the first row, but it will fail for the second row, because its WebElement is no longer available (the page in the browser has changed).

I would recommend extracting all link targets (hrefs) first, and then adding a loop which performs your desired extractions step-by-step. You can either re-use one WebDriver and navigate using the Navigate node, or open and close a fresh WebDriver in each iteration.
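
As a rough Python sketch of this "collect the hrefs first, then visit them one by one" pattern with a single re-used WebDriver: the function names, the CSS selector argument and the element id argument are placeholders for illustration, and `driver` is assumed to behave like a Selenium WebDriver.

```python
def collect_hrefs(driver, css_selector):
    """Read all link targets up front, while the result page is still loaded."""
    links = driver.find_elements("css selector", css_selector)
    # Plain strings stay valid even after the browser navigates elsewhere.
    return [link.get_attribute("href") for link in links]


def extract_from_each(driver, hrefs, element_id):
    """Visit each link with the same driver and collect the innerHTML per page."""
    results = {}
    for href in hrefs:
        driver.get(href)  # navigating invalidates WebElements found earlier
        elements = driver.find_elements("id", element_id)
        results[href] = [el.get_attribute("innerHTML") for el in elements]
    return results
```

The key point is that only strings (the hrefs) are carried across page changes, never WebElements.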

(2) You can combine those branches using the Joiner (with a suitable join criterion), or the Cross Joiner to perform an "n x m" join.
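
In case the difference between the two is unclear, here is a small Python sketch of a key-based join versus an "n x m" cross join on lists of rows. It only mirrors the idea and is not how the KNIME nodes are implemented. All table contents and column names are invented for illustration.

```python
from itertools import product


def cross_join(left_rows, right_rows):
    """Pair every left row with every right row (n x m result rows)."""
    return [{**left, **right} for left, right in product(left_rows, right_rows)]


def inner_join(left_rows, right_rows, key):
    """Combine rows whose value in column `key` matches."""
    by_key = {}
    for right in right_rows:
        by_key.setdefault(right[key], []).append(right)
    return [
        {**left, **right}
        for left in left_rows
        for right in by_key.get(left[key], [])
    ]


# Hypothetical branch outputs, not taken from the actual workflow.
titles = [{"url": "u1", "title": "Engineer"}, {"url": "u2", "title": "Analyst"}]
contents = [{"url": "u2", "content": "Text B"}, {"url": "u1", "content": "Text A"}]
meta = [{"source": "careers page"}]

print(inner_join(titles, contents, "url"))  # rows matched on "url"
print(cross_join(meta, titles))             # 1 x 2 = 2 rows
```

A key-based join fits when both branches share a column such as the link; the cross join fits when one branch holds a constant that should be attached to every row of the other.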

Hope that helps,
Philipp

Hi Philipp,

the extraction of all link-targets is done! That works fine!

What do you mean by "fresh WebDriver"? The two nodes "WebDriver Factory" and "Start WebDriver"?
How can I pass the link targets to a new "fresh WebDriver" (as a first step just for one link, so without a loop)? I tried the "Flow Variables" port (see png), but that ends in an error in the Start WebDriver node (Execute failed: Factory F:\xxx not found).

Sorry for the many questions... And many thanks :-)!

Best,
Simon

Hi Simon,

just sent you an e-mail, but after looking at your screenshot again, I suspect the problem is as follows: I assume you need a "Table row to flow variable" node between the "Extract Attribute" and the "Start WebDriver" node to convert the extracted link to a variable. You can then select the variable in the "Start WebDriver" node configuration.

To perform the extraction row-wise, use a "Start chunk loop" node and the corresponding end loop node, which will run each input row in isolation.
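
As a loose analogy in Python (an assumption about how to picture it, not a description of KNIME internals): a chunk loop with chunk size 1 hands over one row per iteration, the row is turned into named variables, and the per-iteration results are stacked again at the end. All function names below are hypothetical.

```python
def chunks(rows, chunk_size=1):
    """Yield consecutive slices of the table, chunk_size rows at a time."""
    for start in range(0, len(rows), chunk_size):
        yield rows[start:start + chunk_size]


def row_to_variables(row):
    """Turn one table row into a flat name -> value mapping."""
    return dict(row)


def run_per_row(rows, process):
    """Run `process` once per row, then stack the per-iteration results."""
    collected = []
    for chunk in chunks(rows, chunk_size=1):
        variables = row_to_variables(chunk[0])
        collected.append(process(variables))
    return collected


# Hypothetical input table with one extracted link per row.
link_table = [{"href": "u1"}, {"href": "u2"}, {"href": "u3"}]
visited = run_per_row(link_table, lambda v: "visited " + v["href"])
print(visited)
```

In the workflow, `process` would correspond to the Start WebDriver / Find Elements / Execute JavaScript part, which reads the link from the variable.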

Best regards,
Philipp

Hi Philipp,

awesome... many thanks!

It works with the "Table row to flow variable" node (as a first step without a loop, so just for one vacancy-link)! I'll have a look at the loop function tomorrow.

MANY thanks.

Best
Simon