Scraping reference cites from Google Scholar with Selenium

qqilihq · October 26, 2021, 5:14pm

@Papoitema Each iteration seems to return a different kind of article. So the extraction that worked in iteration #1 will not work for the next result.

Generally, you’ll need to make this more generic. Note, for example, that sometimes PDFs are returned and sometimes HTML pages. I’d probably collect all the result URLs from the search results in a first workflow step, and then think about how to properly extract the content in a second step.
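To make the two-step idea a bit more concrete, here is a rough sketch in Python rather than KNIME nodes. It assumes step one has already produced a list of result URLs, optionally with the `Content-Type` header each one returned. The names `classify_result` and `group_results` are made up for this illustration, and the rules (header first, file extension as a fallback) are just one plausible heuristic.

```python
from urllib.parse import urlparse


def classify_result(url: str, content_type: str = "") -> str:
    """Return 'pdf', 'html', or 'unknown' for a single result URL.

    Hypothetical helper: `content_type` is the raw Content-Type header
    value, if one is available; an empty string means "not known".
    """
    # Prefer the header, ignoring parameters such as "; charset=utf-8".
    mime = content_type.split(";")[0].strip().lower()
    if mime == "application/pdf":
        return "pdf"
    if mime in ("text/html", "application/xhtml+xml"):
        return "html"

    # Fall back to the file extension in the URL path.
    path = urlparse(url).path.lower()
    if path.endswith(".pdf"):
        return "pdf"
    if path.endswith((".html", ".htm")):
        return "html"
    return "unknown"


def group_results(results):
    """Sort (url, content_type) pairs into one bucket per document kind."""
    buckets = {"pdf": [], "html": [], "unknown": []}
    for url, content_type in results:
        buckets[classify_result(url, content_type)].append(url)
    return buckets
```

Each bucket can then go to its own extraction branch in the second step, and anything that ends up in `unknown` is worth inspecting by hand before you decide how to handle it.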

For PDFs, there are dedicated nodes:

For HTML pages, I’d probably extract the entire DOM source and then try this one from the Palladian plugin:

If you’re just starting, have a look at this thread for an entertaining introduction to web scraping with the Selenium and Palladian nodes:

Fingers crossed!