Scraping reference citations from Google Scholar with Selenium

HeidelMS · January 20, 2020, 4:06pm

Hello.

I’m trying to scrape reference citations from Google Scholar using Selenium. I tried to build a loop that clicks the multiple popup boxes on the same page that contain the citations. I also tried to find direct links to the popup elements so I could loop over those instead. Neither approach worked. If you know of a different approach, or a way to make either of these work, I would really appreciate it.

Look at the workflow pictures I attached here.
Thanks in advance.

armingrudd · January 20, 2020, 7:40pm

Hi @HeidelMS and welcome to the KNIME community forum,

Here is an example that extracts all BibTeX citations on a page using the Selenium nodes:

selenium_cite.knwf (581.9 KB)

Two points:

1. You cannot pass several elements to click at once (the WebDriver tries to click them all). I used a Row Filter to keep only the row that matches the current iteration number.
2. The Find Elements node that feeds the click must run in each loop iteration, so put it after the loop start node.
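
For readers who think in code rather than nodes, the two points above can be sketched in Python. This is only an illustration of the pattern, not the attached workflow: `click_each_in_turn` and `FakePage` are hypothetical names, and `FakePage` is a stand-in for a browser page so the example runs without Selenium.

```python
from typing import Callable, List, Sequence


def click_each_in_turn(
    find_elements: Callable[[], Sequence[object]],
    click: Callable[[object], None],
    read_result: Callable[[], str],
) -> List[str]:
    """Click matching elements one at a time, re-finding them on every pass."""
    results: List[str] = []
    index = 0
    while True:
        # Point 2: locate the elements inside the loop, so each iteration
        # works with fresh references rather than ones found before the loop.
        elements = find_elements()
        if index >= len(elements):
            break
        # Point 1: keep only the element for the current iteration
        # (the role the Row Filter plays in the workflow) and click just that one.
        click(elements[index])
        results.append(read_result())
        index += 1
    return results


class FakePage:
    """Hypothetical stand-in for a page with several 'cite' popups."""

    def __init__(self, citations: Sequence[str]) -> None:
        self._citations = list(citations)
        self._popup = ""

    def find_cite_links(self) -> List[int]:
        # Return fresh handles (plain indices here) on every call.
        return list(range(len(self._citations)))

    def click(self, handle: int) -> None:
        # Clicking a link "opens" the popup for that citation.
        self._popup = self._citations[handle]

    def popup_text(self) -> str:
        return self._popup


page = FakePage(["@article{a}", "@article{b}", "@article{c}"])
collected = click_each_in_turn(page.find_cite_links, page.click, page.popup_text)
print(collected)
```

With Selenium's Python bindings, `find_elements` could wrap `driver.find_elements(By.CSS_SELECTOR, selector)` and `click` could call `element.click()`. The selector is not given in this thread, so it would have to be worked out from the page.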

HeidelMS · January 22, 2020, 9:14am

Thank you very much! It works perfectly. I started using KNIME two months ago and had been stuck on this problem for two weeks. I also had a problem with the pagination loop, but I moved some nodes to other places and that worked too.

armingrudd · January 22, 2020, 10:31am

So you should have visited the KNIME Forum sooner.

Papoitema · October 21, 2021, 10:01pm

Hi @armingrudd, hope you are well. This model helps a lot with what I am doing. However, when I added a new Find Elements node and a Click node to extract the abstracts of the papers, I got a warning about an empty data table and only one result is shown. Could you help me solve this? I would like to extract at least 5 abstracts per topic. See the attached model, and thanks in advance.

Google scholar test.knwf (53.7 KB)

qqilihq · October 26, 2021, 5:14pm

@Papoitema Each iteration seems to return a different kind of article. Thus the extraction that worked in iteration #1 will not work for the next result.

Generally, you’ll need to make this more generic. Note, for example, that sometimes PDFs are returned and sometimes HTML pages. I’d probably collect all the result URLs from the search results in a first workflow step, and then think about how to properly extract the content in a second step.
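
As a rough illustration of that two-step idea outside KNIME, here is a minimal Python sketch that takes an already collected list of result URLs and splits it into PDF and HTML buckets, so that each bucket can get its own extraction step. `split_by_kind` and the example URLs are hypothetical, and checking the path suffix is an assumed heuristic, not something this thread prescribes.

```python
from typing import Dict, Iterable, List
from urllib.parse import urlparse


def split_by_kind(urls: Iterable[str]) -> Dict[str, List[str]]:
    """Group result URLs into 'pdf' and 'html' buckets by their path suffix."""
    buckets: Dict[str, List[str]] = {"pdf": [], "html": []}
    for url in urls:
        # Look only at the path, so query strings such as "?download=1"
        # do not hide a ".pdf" suffix.
        path = urlparse(url).path.lower()
        kind = "pdf" if path.endswith(".pdf") else "html"
        buckets[kind].append(url)
    return buckets


# Step 1 (assumed already done): URLs collected from the search results.
result_urls = [
    "https://example.org/papers/one.pdf",
    "https://example.org/articles/two",
    "https://example.org/files/THREE.PDF?download=1",
]

# Step 2: route each kind to its own extraction approach.
buckets = split_by_kind(result_urls)
print(buckets["pdf"])
print(buckets["html"])
```

The suffix check is only a first guess: a PDF can be served from a URL without a `.pdf` ending, so checking the response's Content-Type header would be more reliable.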

For PDFs, there are dedicated nodes:

For HTML pages, I’d probably extract the entire DOM source and then try this one from the Palladian plugin:

If you’re just starting, have a look at this thread for an entertaining introduction to web scraping with the Selenium and Palladian nodes:

Fingers crossed!

Papoitema · October 27, 2021, 5:57pm

Hi @qqilihq, thank you so much for the response. Let me go read the proposed thread and try out your suggestions. Thanks once again.

system · July 22, 2022, 3:16pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.