Google Scholar Search

Hi everyone,

I have run into a small problem with a model I am developing. I want to crawl Google Scholar and access open-access articles containing specific key words. I am using the GET Request node to do this, but I have not been getting usable results. Any tips on how best to accomplish this? I would appreciate the help.

Hi @Papoitema , welcome to the KNIME Community.

Can you please elaborate more on what you did? Or even better, if you could share your workflow, it will be helpful.

Thank you @bruno29a, what I am trying to do is source papers from Google Scholar so that I can perform some natural language text analytics on them. I am using key words generated by earlier steps of my workflow, and those are fed to the GET Request node to run the search on Google Scholar. However, I am not getting the expected result. See the attached picture for the relevant portion of the model, and kindly advise if you know a better way for me to get freely available articles from sources such as Google Scholar or other open search engines.

Hi @Papoitema , no problem. Are you able to export your workflow and upload it here if there is no sensitive data? Or at least give me a couple of the URLs you are sending to the GET Request node, and let me know what result you expect?

I can’t see what’s being passed or done from the image :slight_smile:

Thank you @bruno29a, I understand that the workflow would make it easier to get the required help; however, it contains some sensitive data. I have enclosed pictures of the input into the GET Request node. I am using a fixed URL, in this case Google. I am expecting that by feeding in the given input, I should get relevant, freely available manuscripts containing the specific input words from, let’s say, Google Scholar. This is probably not the right approach, and any help will be appreciated.
Thank you once again for your eagerness to assist.

I see that the GET Request node only accepts URLs, and the input I am feeding it is not generating results. Perhaps you could advise on how best to approach the problem, whether by using a different node or some other way around this one.

Hi @Papoitema , I understand that you cannot share the workflow.

The images you have provided are enough for now. As you have figured out, you need to pass the URLs you want to query to the GET Request node.

GET is just a way to communicate with a website where you would usually pass some parameters to the URL.

POST is another way to communicate with a website. When you submit credentials to log into a site, a webmail for example, these credentials are sent as a POST request, so they travel in the request body rather than in the URL, as opposed to GET, where whatever is being submitted is visible in the URL.
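To make the GET versus POST difference concrete, here is a minimal Python sketch, outside of KNIME. It only builds the two request objects and does not send anything. The `example.com` URLs and the `user` field are made up for illustration.

```python
from urllib.parse import urlencode
from urllib.request import Request

# GET: the parameters are appended to the URL itself.
# (www.example.com and the parameter names are placeholders.)
get_req = Request("https://www.example.com/search?" + urlencode({"q": "knime"}))

# POST: the parameters go into the request body, not the URL.
post_body = urlencode({"user": "alice"}).encode("utf-8")
post_req = Request("https://www.example.com/login", data=post_body)

print(get_req.get_method(), get_req.full_url)
print(post_req.get_method(), post_req.full_url, post_req.data)
```

Either way, the request still needs a URL to go to, which is the part that matters for the GET Request node.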

In your case, I would assume that you want to query a URL where you would pass the key words into the URL.

Do you know what the base URL is (that is, the URL without any parameters, your starting URL)?

For example, if you were to check out the data, what URL would you go to in a browser?


Hi @Papoitema , I put a quick demo together for you. I will use Google Search as my URL.

If you do a search on google, you will notice that once you submit the search, the URL will be in the form of:
https://www.google.com/search?q=whatever+you+submitted

For example, if I submit a search for the word “test”, I will end up at:
https://www.google.com/search?q=test

Similarly if I submit a search for the word “knime”, I will end up at:
https://www.google.com/search?q=knime

And if I submit a search for a phrase, google replaces the spaces with a + sign. For example, if I want to search for “most recent movies”, I will end up at:
https://www.google.com/search?q=most+recent+movies

Do you see a pattern here? That means that I can query a search via the GET Request by attaching my key words to https://www.google.com/search?q= while making sure I replace spaces with +.
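For anyone who wants to see the same pattern outside of KNIME, here is a rough Python sketch of the URL-building step. The function name `build_search_url` is just something I made up for this example, and `quote_plus` from the standard library is used to turn spaces into `+` (it also escapes other special characters, which goes a bit further than the plain space replacement described above).

```python
from urllib.parse import quote_plus

# Base URL taken from the examples above.
BASE_URL = "https://www.google.com/search?q="

def build_search_url(keywords: str, base_url: str = BASE_URL) -> str:
    """Append the key words to the base URL, replacing spaces with '+'."""
    return base_url + quote_plus(keywords.strip())

# One URL per row of key words, like the table fed to the GET Request node.
key_words = ["test", "knime", "most recent movies"]
urls = [build_search_url(k) for k in key_words]

for url in urls:
    print(url)
```

In the workflow, the same thing is done per row before the GET Request node, so that the node receives a complete URL instead of bare key words.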

Here’s what the demo workflow looks like:
image

These are the key words I used:
image

And these are the URLs that got created:
image

This is the result:

You can then retrieve whatever info you want from the XML column.

This is obviously giving me the HTML code of the search results page, which is why I get the whole HTML document from top to bottom (head to body). But if you are querying a site that is meant to return a data feed, the results would be more straightforward.
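As a rough idea of what "retrieving info" from such an HTML response can mean, here is a small Python sketch that collects the link targets from a page. The HTML string is a made-up sample, not a real search response, and `LinkCollector` is a hypothetical helper name. Real result pages are far messier, so treat this only as an illustration of the idea.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag that is fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# Made-up sample standing in for the HTML returned by a GET request.
sample_html = """<html><head><title>Results</title></head><body>
<a href="https://example.org/paper1.pdf">Paper 1</a>
<a href="https://example.org/paper2.pdf">Paper 2</a>
<p>No link here</p></body></html>"""

collector = LinkCollector()
collector.feed(sample_html)
print(collector.links)
```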

Here’s the workflow:
GET Request for google search.knwf (12.6 KB)


Thank you @bruno29a for a very detailed explanation. I highly appreciate it. I will try the workflow or use the POST Request node and get back to you.

Thank you once again.

Hi @Papoitema , you are most welcome.

The problem here, though, is not about using GET or POST (that could be something to look into further down the line), but about the fact that you were not passing a URL to the GET Request node. The POST Request node would also require that you pass URLs to it.


You are right, and using the String Manipulation node helped in building the right URL with the respective key words.
