Selenium Nodes: How to scroll within a section?

kowisoft · March 18, 2021, 7:12pm

Hi KNIMErs (and webcrawlers hahaha),

I am working on a workflow to screen the job postings on LinkedIn automatically. I managed to log in and search for results, and I also managed to extract the required properties using XPath (thanks to @qqilihq).

But this specific page has a scroll bar within a section (see screenshot below). I tried to use the “facebook approach” that is shared on the Selenium Nodes website (Execute JavaScript with scroll to bottom), but this does not seem to work when the scroll bar is within a “section”.

Here’s the link: 553 Lego jobs in United States (23 new)

And this is the part / scrollbar I want to move to the bottom:

Any suggestions? Thank you in advance!

qqilihq · March 20, 2021, 10:11am

Hi Phil,

as discussed yesterday, some input here. First straight to your question, then describing the rabbit hole into which I fell when playing with the data. Maybe it sparks some more ideas.

Scrolling Sections

As you highlight in the screenshot, the scrollbars on this page only apply to specific sections (technically, these are <div> elements which are vertically scrollable).

In the Facebook example, we could scroll the entire window, which works as follows (using the Execute JavaScript node):

window.scrollTo(0, document.body.scrollHeight);

In contrast to that, we first need to narrow down the scrollable element on LinkedIn (i.e. the section which shows the scroll bars). I do this with a Find Elements node where I get the element with the .jobs-search-results class. Then I pass it to Execute JavaScript, where I scroll this element (instead of the document):

/* This is the element passed from the previous Find Elements node;
 * I have selected it in the left column here. If I select a second,
 * third, … element, they would be available as arguments[1], 
 * arguments[2], … 
 */
const element = arguments[0];

/* Determine the height of the element (this includes any height
 * exceeding the current screen height)
 */
const amountToScroll = element.scrollHeight;

/* Use the previously determined height to scroll */
element.scrollTop = amountToScroll;

Loading More Results

This works fine. At least the scrolling does. Unfortunately, we’ll not get more results this way. Instead, we have to keep clicking a “Load more” button to load more results. So instead of what I described above, I built a loop which continuously clicks that button to load more data (the example below is rather dumb: it will just keep trying to click, even when there are no more results. But never mind, it works!)
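The loop itself is built from KNIME nodes, but the "dumb" part could be made a bit smarter inside the Execute JavaScript step. What follows is only a minimal sketch of that idea, not what the shared workflow does: the function name `clickLoadMoreOnce`, the `findButton` callback and the selector in the usage comment are all made up for illustration, and it assumes the node's return value is available to the workflow so the loop can stop on `false`.

```javascript
/**
 * Click the "Load more" button once, if there is one.
 *
 * `findButton` is a hypothetical callback that returns the button
 * element, or null/undefined when the page no longer shows it.
 * Returns true if a click was issued, false if there was nothing
 * left to click (button missing or disabled).
 */
function clickLoadMoreOnce(findButton) {
  const button = findButton();
  // No button, or a disabled one: assume there are no more results.
  if (!button || button.disabled) {
    return false;
  }
  button.click();
  return true;
}

// Possible use in an Execute JavaScript node (the selector is an
// assumption, not LinkedIn's actual markup):
// return clickLoadMoreOnce(() => document.querySelector('.load-more-button'));
```

Running one click per loop iteration keeps the waiting between clicks in the workflow, where the page has time to load the next batch before the button is looked up again.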

Adding some more Extract Attribute and Extract Text nodes, and some post processing using a String Manipulation (Multi Column) node I end up with a nicely extracted and structured job listings table with Lego jobs:

(Bonus) So, Where Are All These Jobs?

Looking at the “Location” column, I thought that this would be a great use case to do some spatial analysis. So let’s show the job offers on a map! The Palladian Location Extractor will allow us to transform the location strings to latitude/longitude coordinates (and it even has some magic, aka “disambiguation”, built in for properly detecting whether “Paris” is about Paris in France or Paris in Texas. A while ago I even wrote a dissertation about this topic, but this is yet another rabbit hole which is fortunately closed now).

To use the Location Extractor, it’s necessary to set up a “Location Source” in the preferences (this is the database which is used for looking up the location data). You can use the free “Geonames” service, which allows 30,000 requests per day. (More information is shown in the node documentation, and we even offer a paid alternative for people who don’t feel comfortable sending their data to a public web service.)

After running this, and doing some filtering to only get the city names (e.g. not the regions or countries), I can then visualize the companies offering Lego jobs on the map.

I have shared the workflow on my public NodePit space (it can definitely still be improved, consider it a PoC for now):

Have a good weekend,
Philipp

kowisoft · March 20, 2021, 11:46am

O-M-G!!!

Speak about rabbit holes

That is fantastic!!! Thank you so much. I can see so many interesting ideas one could realize using the spatial analysis you added, @qqilihq with some of the jobs.

filter by location, to see where certain jobs appear (on the map image you shared, you can see a lot of the jobs are in the area of “Rheinland” in Germany)
do competitor research (my initial idea) to see where your competitors are hiring (or in my case, as a procurement professional, do this for suppliers)
do some text mining with the job details link to get a heat map / word cloud of the skills that the companies you are researching are looking for

etc. etc.

I will soon post an example workflow here using your fantastic Selenium and Palladian nodes (they are AWESOME!)…

kowisoft · March 23, 2022, 12:30pm

I just realized that I never followed up on this idea. So now that I have the luxury of spending more time with KNIME and workflows due to a job change, I wanted to post my result here.

You can find the workflow, based on @qqilihq 's great work here on the Hub:

Some things I have included:

(kind of) a user interface: Just right-click the Select Competitor component and open its view to get a nice drop-down of companies to “scrape”. I then use this selection for the LinkedIn Job Search URL and for the file name (I export to Excel)
I have used the following JS code to overcome the “infinite scroll” of the LinkedIn Job Results page. I embedded it into a “hard” loop (10x) - this works really well for me.

Here’s the code I used:

window.scrollTo(0, document.body.scrollHeight)
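A fixed number of iterations is the simplest option. As a possible refinement, the script could report the content height on each pass, so the loop can end early once the height stops growing. This is just a minimal sketch under a few assumptions: `scrollToBottom` is a hypothetical helper name, and it assumes the Execute JavaScript node hands the returned value back to the workflow for comparison with the previous iteration.

```javascript
/**
 * Scroll `target` to its bottom and return the content height that
 * was measured just before scrolling. `target` can be any scrollable
 * element, or document.scrollingElement for the whole page.
 */
function scrollToBottom(target) {
  // scrollHeight is the full content height, including the part
  // that is currently outside the visible area.
  const height = target.scrollHeight;
  // The browser clamps scrollTop to the largest valid offset.
  target.scrollTop = height;
  return height;
}

// Possible use in an Execute JavaScript node, for the whole page:
// return scrollToBottom(document.scrollingElement);
```

If the returned height is the same in two consecutive iterations, no new results were loaded in between, and further scrolling is unlikely to help.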

The 2nd to last node (Row Filter node labeled: filter on selected Competitor) is only there for the case where you have a very common company name. In my internal use case, there were 2 companies with the same name, but I was looking for just 1 of them. So this is a hard-coded filter; either adjust it or simply delete that node.

I once again learned a lot, especially about extracting data with XPath which - I assume - is nice to know when you’re scraping the web

Let me know what you think.

Iris · March 23, 2022, 6:38pm

I unfortunately had to learn that crawling LinkedIn is against their terms. We also had similar ideas but had to stop.
And in addition, they constantly change their webpage, so very shortly after I did (not) make the workflow, it failed again.

kowisoft · March 23, 2022, 6:51pm

Thank you for the feedback @Iris - understood. So it is - or better was - a technical case study and not to be used in real life examples.

I guess they want to sell more of their insanely expensive LI premium licenses
