XPath for content extraction... once again

am_dbs · January 5, 2016, 9:58am

Hi,

Once again I have an issue with extracting content from a website...
I am trying to extract all vacancies from one website. With the Selenium Nodes I was able to extract the title and the link of every vacancy (many thanks to Philipp ;-)). My question is: how can I extract the content behind each "vacancy-link" in this specific case? (I was able to do this for some other websites before.)

The example workflow is as follows:

Table Creator with one "vacancy-link" to test (http://www.zimmer.com/careers/search/job-details.html?id=QCVFK026203F3VBQBV7V47VNG&nPostingID=7182&nPostingTargetID=22589&mask=zimextus&lg=EN) --> HttpRetriever --> HtmlParser --> XPath... doesn't work properly

When I open the "vacancy-link" in IE, right-click on the content to be extracted (the grey field) and choose "Inspect element" ("Element untersuchen"), it shows the HTML code. I would say the XPath has to refer to something like div id="JD-AllFields". Unfortunately I can't find this element in the configuration of the XPath node. The ContentExtractor doesn't work properly either.

Many thanks to you in advance!

Best

Simon

qqilihq · January 5, 2016, 4:50pm

Hi Simon,

Neither the ContentExtractor nor HtmlParser + XPath will work on that page, because the content in the grey box is loaded through an extremely messy series of JavaScripts. You can check this by disabling JavaScript in your web browser and reloading the page: you'll see an empty page. In that case, the Selenium Nodes would be your first choice. You can address elements by ID directly through the Find Elements node.
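
To illustrate why the static route comes back empty, here is a minimal Python sketch (standard library only, not a KNIME node). The HTML string is a made-up stand-in for what a plain HTTP request might return for such a page; the real markup is not shown in this thread, and the helper names are hypothetical.

```python
from html.parser import HTMLParser

# Hypothetical static response: the container exists, but it stays empty
# because nothing executes the script that would fill it.
STATIC_HTML = """
<html><body>
  <div id="JD-AllFields"></div>
  <script>/* a browser would load and inject the job text here */</script>
</body></html>
"""

# Tags that never have a closing tag, so they must not change the depth.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class TextById(HTMLParser):
    """Collect the text found inside elements with a given id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0      # > 0 while the parser is inside the target
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def text_by_id(html, target_id):
    """Return the stripped text inside the element with id target_id."""
    parser = TextById(target_id)
    parser.feed(html)
    return "".join(parser.chunks).strip()


print(repr(text_by_id(STATIC_HTML, "JD-AllFields")))
```

A static parser only ever sees the empty container, which is why a real browser (driven by Selenium) is needed to get the rendered text.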

Best,
Philipp

am_dbs · January 5, 2016, 6:22pm

Hi Philipp,

MANY thanks, once again!!

I'm able to extract the content when I open the vacancy-link with "Start WebDriver":

WebDriver Factory --> Start WebDriver (vacancy-link) --> Find Elements (by id: JDText-Field1) --> Execute JavaScript (return arguments[0].innerHTML;) works for one vacancy! Until now I have always used Selenium to get the link to each search result as a first step, and then extracted the content with XPath or the ContentExtractor.

My next problem ;-) is:

What can I do if I want to extract all vacancy links with this workflow as a first step: [WebDriver Factory --> Start WebDriver (all vacancy results: http://www.zimmer.com/content/zimmer-web-us/en/careers/search/jobs.html?LOV1=All&LOV2=All&LOV3=All&LOV4=All&ContractType=All&keywords=&jobnum=&Resultsperpage=50&srcsubmit=Search&statlog=1&ID=QCVFK026203F3VBQBV7V47VNG&mask=zimextus&LG=EN) --> Find Elements --> Execute JavaScript (extract the title) --> Extract Attribute (get the link to every vacancy using href)], and then extract the content of each vacancy? I tried adding [Click --> Find Elements (by id: JDText-Field1) --> Execute JavaScript], but that only opens all vacancy links and ends with an error in the second Find Elements node.
How can I use the Find Elements node to extract multiple elements on one page? If I use two Find Elements nodes in parallel (splitting the workflow), how can I put the results together in one row?

I hope my English is understandable ;-)

Many thanks again!

Best,
Simon

qqilihq · January 5, 2016, 6:54pm

Hi Simon,

(1) In general, do not perform any actions that change the browser's content when you have tables with multiple rows. Simulating a click works for the first row, but it will fail for the second row, because the WebElement is no longer available (the page in the browser has changed).

I would recommend extracting all link targets (hrefs) first, and then adding a loop which performs your desired extractions step-by-step. You can either re-use one WebDriver and navigate using the Navigate node, or open and close a fresh WebDriver in each iteration.
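
For readers who want to see the pattern outside KNIME, here is a rough sketch in Python. It assumes a Selenium-style driver object; the string locators "css selector" and "id" are the values behind Selenium's `By.CSS_SELECTOR` and `By.ID`. The function name and the link selector passed in are hypothetical, and this is not what the KNIME nodes run internally.

```python
def extract_all(driver, link_selector, content_id):
    """Collect all hrefs first, then visit each page with the same driver."""
    # Step 1: read every href as a plain string while the result page is
    # still loaded. Strings survive navigation; element handles do not.
    links = driver.find_elements("css selector", link_selector)
    hrefs = [link.get_attribute("href") for link in links]

    # Step 2: navigate page by page and extract the content of each one.
    rows = []
    for href in hrefs:
        driver.get(href)
        field = driver.find_element("id", content_id)
        html = driver.execute_script("return arguments[0].innerHTML;", field)
        rows.append((href, html))
    return rows
```

The point of the design is that nothing from the result page is touched after the first navigation, so no stale element can be hit. With a real browser one would pass in a driver created by Selenium and the id used earlier in this thread (JDText-Field1); the link selector depends on the page and is left open here.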

(2) You can combine those branches using the Joiner (with a suitable join criterion), or the Cross Joiner to perform an "n x m" join.
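
The difference between the two join types can be sketched in plain Python. This is only an illustration of the semantics on lists of dicts, not the KNIME implementation; the column names in the example are made up.

```python
from itertools import product


def inner_join(left, right, key):
    """Pair rows from two tables that share the same value in column key."""
    index = {}
    for rrow in right:
        index.setdefault(rrow[key], []).append(rrow)
    return [{**lrow, **rrow} for lrow in left for rrow in index.get(lrow[key], [])]


def cross_join(left, right):
    """Pair every left row with every right row (n x m result rows)."""
    return [{**lrow, **rrow} for lrow, rrow in product(left, right)]


# Two branches that extracted different things for the same links.
titles = [{"url": "u1", "title": "A"}, {"url": "u2", "title": "B"}]
contents = [{"url": "u2", "content": "y"}, {"url": "u1", "content": "x"}]
print(inner_join(titles, contents, "url"))
```

A join on a shared column (here the link) puts title and content into one row per vacancy, regardless of row order. A cross join is mainly useful when one side has a single row that should be attached to every row of the other side.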

Hope that helps,
Philipp

am_dbs · January 5, 2016, 10:08pm

Hi Philipp,

the extraction of all link-targets is done! That works fine!

What do you mean by "fresh WebDriver": the two nodes "WebDriver Factory" and "Start WebDriver"?
How can I pass the link targets to a new "fresh WebDriver" (as a first step just for one link, so without a loop)? I tried the "Flow Variables" port (see png), but that ends in an error in the Start WebDriver node (Execute failed: Factory F:\xxx not found).

Sorry for the many questions... And many thanks :-)!

Best,
Simon

knime_jobs_12_1.png

qqilihq · January 5, 2016, 10:50pm

Hi Simon,

I just sent you an e-mail, but after looking at your screenshot again, I suspect the problem is as follows: I assume you need a "Table row to flow variable" node between the "Extract Attribute" and the "Start WebDriver" node to convert the extracted link to a variable. You can then select the variable in the "Start WebDriver" node configuration.

To perform the extraction row-wise, use a "Start chunk loop" node and the corresponding end loop node, which will run each input row in isolation.
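
Conceptually, a chunk loop with chunk size 1 behaves roughly like the following Python sketch. It is a simplified model, not KNIME code; the function names and the placeholder extraction are hypothetical.

```python
def chunk_loop(rows, body, chunk_size=1):
    """Run body on consecutive chunks of rows and concatenate the outputs."""
    collected = []
    for start in range(0, len(rows), chunk_size):
        chunk = rows[start:start + chunk_size]
        collected.extend(body(chunk))   # the loop end gathers every result
    return collected


def extract_one(chunk):
    """Loop body: handle exactly one input row."""
    # With chunk_size=1 the chunk holds a single row. Its link plays the
    # role of the variable that is handed to the "Start WebDriver" step.
    url = chunk[0]["href"]
    return [{"href": url, "content": "content of " + url}]  # placeholder


links = [{"href": "u1"}, {"href": "u2"}]
print(chunk_loop(links, extract_one))
```

Each iteration sees only its own row, so the per-link steps that already work for a single vacancy can be reused unchanged inside the loop.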

Best regards,
Philipp

am_dbs · January 6, 2016, 10:04pm

Hi Philipp,

awesome... many thanks!

It works with the "Table row to flow variable" node (as a first step without a loop, so just for one vacancy link)! I'll have a look at the loop tomorrow.

MANY thanks.

Best
Simon
