Parsing a website

kowisoft · April 13, 2019, 7:48pm

Hey there,

I want to parse a job board (https://www.accenture.com/de-de/careers/jobsearch) on a website and extract all the titles in the ads together with the links.

I tried to adapt the “parsing the KNIME forum” workflow from the example server, but it returns only missing values. I assume this is because the forum structure may have changed in the meantime.

The main problem is that I cannot find any of the ads in the results I get from the HTML Parser node.

I just checked in the browser and the ads are there, but none of them can be found in the XPath node that I put behind the HTML Parser.

My plan was to look up their specific “identifiers” such as class names or IDs, search for them in the XPath node, and then let the node create the corresponding XPath query. But only the surrounding website framework is there (nav, footer, etc.), not the content itself.

I assume that the page loads the content dynamically. Could that be the issue?
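
One way to check this assumption outside KNIME is to look at the raw HTML that a plain HTTP request returns and see which tags and class names it actually contains. The snippet below is only a minimal sketch: `STATIC_HTML` is a made-up stand-in for the downloaded page, and the class name `job-card` is hypothetical, not taken from the Accenture site.

```python
from html.parser import HTMLParser


class ClassCollector(HTMLParser):
    """Collects the tag names and class attribute values seen in a document."""

    def __init__(self):
        super().__init__()
        self.tags = set()
        self.classes = set()

    def handle_starttag(self, tag, attrs):
        self.tags.add(tag)
        for name, value in attrs:
            if name == "class" and value:
                # A class attribute may hold several space-separated names.
                self.classes.update(value.split())


def summarize(html):
    """Returns (tags, classes) found in the given HTML string."""
    collector = ClassCollector()
    collector.feed(html)
    return collector.tags, collector.classes


# Hypothetical stand-in for the HTML returned by a plain HTTP request:
# the page frame is present, but the results container is still empty.
STATIC_HTML = """
<html><body>
  <nav class="site-nav">...</nav>
  <div id="search-results"></div>
  <footer class="site-footer">...</footer>
  <script src="app.js"></script>
</body></html>
"""

tags, classes = summarize(STATIC_HTML)
# "job-card" is an assumed class name for a single ad.
has_job_cards = "job-card" in classes
```

If the frame elements show up but the class seen in the browser's inspector does not, the ads are most likely inserted by JavaScript after the page has loaded.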

Any help is very much appreciated. Thanks!

Here’s my workflow so far…

ACCN_job_board.knwf (105.5 KB)

armingrudd · April 14, 2019, 2:59am

Hi,

If you check the page source, you will notice that the “div” element for the search results section is created by an event, so it is not part of the static HTML that the HTML Parser node receives. For these cases I suggest using the Selenium nodes in KNIME.
Here you can read a short tutorial:
https://blog.statinfer.com/rule-the-web-with-selenium-nodes-in-knime/

Best,
Armin

kowisoft · April 14, 2019, 8:08am

Hi @armingrudd …

perfect, exactly the tutorial (series) I was looking for. Thank you!

PS: I followed the amazing tutorial that @armingrudd also posted here in the forums, and now I am able to scrape the job boards to do some text mining magic.

Tyler · April 14, 2019, 8:35pm

I’m a big fan of web scraping. Did you check with Accenture before auto-querying their job boards with KNIME? Hopefully sharing this info will help you solve the problem faster.

I’d recommend asking Accenture if it’s okay to scrape their job data because it’s their website and their data.

Querying Accenture data over and over will raise a flag eventually. Depending on your purpose, it may be fine.

A good example: if they find you’re using their data without permission, Accenture may ask you to stop.

A good path: websites usually state these rules in their robots.txt file. Accenture does not; Facebook, for example, does.
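
For what it’s worth, checking robots.txt rules can be automated with Python’s standard `urllib.robotparser`. This is just a sketch: `SAMPLE_ROBOTS`, the user agent name and the example.com URLs are made up for illustration and are not the rules of any real site.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, user_agent, url):
    """Returns True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file content as a list of lines, so no network is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt content, not copied from any real site.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

jobs_ok = is_allowed(SAMPLE_ROBOTS, "my-scraper", "https://example.com/careers/jobsearch")
private_ok = is_allowed(SAMPLE_ROBOTS, "my-scraper", "https://example.com/private/report")
```

For a live site, `RobotFileParser` also offers `set_url()` and `read()` to download the file first. Note that robots.txt only covers crawling rules; it does not replace reading the site’s terms of use.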

Maybe looking through their website and asking them directly will offer quick insights into whether or not it’s cool.

Please let me know your findings; I too enjoy web scraping for job-related data. However, a few websites will ban your IP if you send too many GET requests for their content.

A great example is auto-querying a search phrase through Google to understand ranking, text analysis in the ranking, etc. You’d think “oh cool, let me do a bunch”, but after 2.5k queries you’re banned from search for the rest of the day.

I’m not the best web scraper, but after a few years of playing around I have found that my only limitation is what I’m scraping and how often. Which usually leads me to spend more time making the bot (web scraper) nicer… lol
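
By “nicer” I mostly mean spacing out requests. A rough sketch of the idea is below; the `Throttle` class and its parameter names are just my own illustration, not part of KNIME or any library, and the right interval depends on the site.

```python
import time


class Throttle:
    """Enforces a minimum pause between consecutive requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # seconds between requests
        self._clock = clock               # injectable so the logic can be tested
        self._sleep = sleep
        self._last = None                 # time of the previous request, if any

    def wait(self):
        """Sleeps if the previous call was too recent; returns the seconds slept."""
        delay = 0.0
        if self._last is not None:
            elapsed = self._clock() - self._last
            delay = max(0.0, self.min_interval - elapsed)
            if delay > 0:
                self._sleep(delay)
        self._last = self._clock()
        return delay
```

Usage would be something like creating `throttle = Throttle(5.0)` once and calling `throttle.wait()` right before every page request, so that no two requests are less than five seconds apart.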

kowisoft · April 14, 2019, 8:45pm

Thank you @Tyler …

you bring up a good point. Actually, here’s what I want to do (and it may show that Accenture, amongst others, will probably not comply with my request).

I am a procurement professional in the IT market, and here, of course, Accenture is one of the big global players. The reason I want to scrape their job board is the following:

Imagine a situation where I as a buyer call their sales and ask them …

“… hey, Mr Salesman, can you tell me what areas you’re growing in so I have a good feeling for which projects to request offers for and which not…?”

him / her:

“… well, Mr BeginnerBuyer… let me tell you, we’re good in nearly everything so just send any request you have to me…”

If I am really “Mr BeginnerBuyer” I would do so and quickly learn that what Mr Salesman told me is not 100% true. There are niches they are especially good in and others where they’re not.

The web scraper and the later text mining should give me an insight into which areas they are growing / hiring in, as a first indicator (let’s say, e.g., in a word cloud)…
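
To make the idea concrete: a word cloud is essentially a picture of term frequencies, so the core step is counting words across the scraped job titles. Below is a minimal sketch in Python; the titles in `SAMPLE_TITLES` and the tiny `STOPWORDS` set are invented for illustration and are not real scraped data.

```python
import re
from collections import Counter

# Small, hypothetical stopword list; "m", "w", "d" come from "(m/w/d)" suffixes.
STOPWORDS = {"and", "for", "in", "of", "the", "m", "w", "d"}


def term_counts(titles):
    """Counts how often each non-stopword term occurs across all titles."""
    counts = Counter()
    for title in titles:
        # Lowercase and keep only runs of letters (including German umlauts).
        words = re.findall(r"[a-zäöüß]+", title.lower())
        counts.update(word for word in words if word not in STOPWORDS)
    return counts


# Invented example titles, not taken from any real job board.
SAMPLE_TITLES = [
    "Cloud Architect (m/w/d)",
    "Senior Cloud Engineer",
    "SAP Consultant",
    "Data Engineer for Cloud Platforms",
]

top_terms = term_counts(SAMPLE_TITLES).most_common(2)
```

Comparing these counts from month to month would then hint at where hiring is picking up or slowing down.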

I would probably run such a workflow once per month to create some kind of dashboard for my most important suppliers.

Not sure if this frequency raises any flags, but I am pretty sure that they wouldn’t like me, as a customer, applying my own filters.

Tyler · April 14, 2019, 9:17pm

Sounds smart. I have many robots (apps) that do a lot of similar “good grief, I don’t want to do this task” jobs… So if you’re not selling their data or trying to outrank them with the same text (via digital marketing), I can’t see why this would be a negative. If you can see it publicly, their robots.txt isn’t super dismissive, and their “privacy page” doesn’t explicitly say NO… then hammer on!

Sounds a lot like the things I was trying to learn in code before I started learning about these tools. Great to meet you.

Best,
Tyler

kowisoft · April 14, 2019, 9:21pm

Thank you for the kind words @Tyler

I am doing this for two purposes:

1. Data science power to the people: some data scientists may not like it, but I strongly believe that this is one of the use cases that clearly demonstrates how ordinary “business process owners” (there you have it… buzzword alert!!) could leverage the power of data for their own use without having to ask others…
2. This would help me go from REactive to PROactive (and make my boss like me even more, lol).

system · April 21, 2019, 9:34pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.