APIs for search engines

Hi,

In the Palladian WebSearcher, no preference [API key, identifier] can be set for Google or DuckDuckGo. Neither search engine works in the WebSearcher node [the other search engines work].

For Google I have set the API key/identifier for Google Search, but to no avail. For DuckDuckGo I have no idea whether an API key is required; all I can see is that the node does not work with it and that no API key can be set.

Is there a workaround for this?

Henk

Hi Henk,

DuckDuckGo does not require an API key and should therefore work without configuration (I just checked on my installation and it works fine). Can you give me more details about what is not working? Do you get any error messages in the console log? What was your query, how many query terms did you have (= rows in the input column), and how many results did you request (= setting in the node)?

The Google settings in the preferences (Google Custom Search API key, Google Custom Search API identifier) are required only for Google Custom Search. The others (Google, Google Blogs, Google Images, Google News, Google Plus) work without an API key. The same applies here: if you encounter any problems, please describe them in detail so that we can sort them out.

Best,
Philipp

Hi Philipp,

When I set up a Table Creator with the keywords 'knime AND palladian', both WebSearcher nodes [DuckDuckGo and Google] work.

When I add a second term on a second line [e.g. 'weka'], both nodes fail.

When I remove the second line, DuckDuckGo runs and Google fails.

I have experimented with different search terms and numbers of lines, and I cannot reproduce the problem reliably.

Here are the console messages:

ERROR     WebSearcher     Execute failed: Parse error while searching for "weka" with DuckDuckGo (request URL: "http://duckduckgo.com/d.js?l=us-en&p=1&s=20&q=weka", result String: "var q=window.location.href.indexOf('?q');if (q!=-1) q=window.location.href.replace(/^[^\?]+\?q=\??/,''); else {q=window.location.href.replace(/^http:\/\/[^\/]+\/?/,'');q=q.replace(/\_/g,' ');};q=q.replace(/\&.*$/,'');var dnd0=[{"c":"http://www.google.com/search?q="+q,"u":"http://www.google.com/search?q="+q,"a":"","d":"google.com search","t":"EOF","i":"www.google.com"}];if (nrn) nrn('d',dnd0);")
ERROR     WebSearcher     Execute failed: Exception parsing the JSON response while searching for "weka" with Google: JSONObject["responseData"] is not a JSONObject., JSON was: "{"responseData": null, "responseDetails": "Suspected Terms of Service Abuse. Please see http://code.google.com/apis/errors", "responseStatus": 403}"
ERROR     WebSearcher     Execute failed: Exception parsing the JSON response while searching for "knime AND palladian" with Google: JSONObject["responseData"] is not a JSONObject., JSON was: "{"responseData": null, "responseDetails": "Suspected Terms of Service Abuse. Please see http://code.google.com/apis/errors", "responseStatus": 403}"

Hope it makes sense to you... "Suspected Terms of Service Abuse"... whoa!

henk

Hi Henk,

Thank you for the detailed feedback. I'll have a look and try to reproduce it over the weekend.

Best,
Philipp

Hi Henk,

I had a look at this issue. Two things:

1) The "normal" Google searchers (all except Google Custom Search) use a deprecated API for searching. Apparently Google has lowered the number of queries allowed within a given time frame and blocks requests once this limit is exceeded. See also the API webpage: "Note: The Google Web Search API has been officially deprecated as of November 1, 2010. It will continue to work as per our deprecation policy, but the number of requests you may make per day will be limited. Therefore, we encourage you to move to the new Custom Search API."

2) For DuckDuckGo we use an unofficial "API" to access the search results. A small number of queries evidently works fine, but heavy use will also get you blocked.
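
To illustrate point 1: the Google response quoted in the console log earlier in this thread is still valid JSON, but "responseData" is null and "responseStatus" is 403, which is why reading the results fails. Below is a minimal Python sketch of how such a block could be recognized before the result list is read. It is not the Palladian code. The helper name classify_google_response is hypothetical, and the shape of a successful response (a "results" list inside "responseData") is an assumption about the deprecated API.

```python
import json

# Response body copied from the console log earlier in this thread.
BLOCKED_BODY = (
    '{"responseData": null, "responseDetails": "Suspected Terms of Service '
    'Abuse. Please see http://code.google.com/apis/errors", '
    '"responseStatus": 403}'
)


def classify_google_response(body):
    """Return ("ok", results) or ("blocked", reason) for one response body.

    Hypothetical helper; the "results" key is an assumed success shape.
    """
    payload = json.loads(body)
    data = payload.get("responseData")
    # A blocked request still yields valid JSON, but with a null
    # "responseData" and a non-200 "responseStatus".
    if payload.get("responseStatus") != 200 or not isinstance(data, dict):
        return "blocked", payload.get("responseDetails") or "no details given"
    return "ok", data.get("results", [])


status, detail = classify_google_response(BLOCKED_BODY)
print(status, "-", detail)
```

Checking the status field first turns the confusing "is not a JSONObject" parse error into a readable message about the query limit.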

In both cases, there is not much we can do about it (I modified the DuckDuckGo search to handle high query loads better, but I cannot promise that the fix will keep working in the long run). My recommendation: if you do heavy searching, rely on other web searchers (such as Bing) that provide an official API (and higher query volumes through paid plans).
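
If you script queries yourself, one generic way to cope with a query limit is to pause and retry with growing delays. The Python sketch below shows that idea only. It is not the change made to the DuckDuckGo searcher, it cannot lift a daily quota, and the names search_with_backoff and RateLimited are hypothetical. The actual search call is supplied by the caller and is expected to raise RateLimited when the engine refuses a query.

```python
import time


class RateLimited(Exception):
    """Raised by the caller-supplied search function when a query is refused."""


def search_with_backoff(search, query, max_attempts=4, base_delay=2.0,
                        sleep=time.sleep):
    """Call search(query), retrying with exponentially growing pauses.

    Waits base_delay, 2 * base_delay, 4 * base_delay, ... seconds between
    attempts and re-raises RateLimited once max_attempts is exhausted.
    """
    for attempt in range(max_attempts):
        try:
            return search(query)
        except RateLimited:
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

Passing the sleep function as a parameter keeps the helper easy to try out without actually waiting.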

Best,
Philipp

Hi Philipp,

Thanks for your detailed feedback and efforts. Luckily the other search engines are working. I can also append a /json switch to a search on blekko.com, which makes it possible to parse the results in R.

Henk

Hi there,

I am a beginner with KNIME and would like to query the web for the term agriculture (well, I have a list of words), but I need to work with one term first and get the workflow right. My first problem is:

I have been trying to use the WebSearcher (Google engine), but it keeps giving me an error message. Actually, all search engines return an error message; see below:

ERROR     WebSearcher     Execute failed: Exception parsing the JSON response while searching for "KNIME" with Google: JSONObject["responseData"] is not a JSONObject., JSON was: "{"responseData": null, "responseDetails": "Suspected Terms of Service Abuse. Please see http://code.google.com/apis/errors", "responseStatus": 403}"
ERROR     WebSearcher     Execute failed: Could not instantiate ws.palladian.retrieval.search.socialmedia.FacebookSearcher, exception from constructor: accessToken must not be empty
ERROR     WebSearcher     Execute failed: Could not instantiate ws.palladian.retrieval.search.web.BingSearcher, exception from constructor: accountKey must not be empty
ERROR     WebSearcher     Execute failed: Could not instantiate ws.palladian.retrieval.search.socialmedia.TwitterSearcher, exception from constructor: consumerKey must not be empty
ERROR     WebSearcher     Execute failed: HTTP error while searching for "#agriculture" with DuckDuckGo (request URL: "http://duckduckgo.com/d.js?l=us-en&p=1&s=0&q=%23agriculture"): Exception org.apache.http.conn.ConnectTimeoutException: Connect to rundmc.duckduckgo.com:3433 timed out for URL "http://duckduckgo.com/d.js?l=us-en&p=1&s=0&q=%23agriculture": Connect to rundmc.duckduckgo.com:3433 timed out

How do I solve this?

The second question:

Where can I download the web crawler workflow described in https://www.knime.org/files/knime_web_knowledge_extraction.pdf?

I cannot seem to find it on the public server.

Thank you.

Grace

Grace,

The API keys for some search engines (Facebook, Bing, and Twitter in your example) need to be set up in the KNIME preferences (KNIME > Palladian Web Searcher). Have a look at the node's documentation; we provide links to the registration pages there. For the problems with Google and DuckDuckGo search, please refer to the discussion earlier in this thread.
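
The "must not be empty" messages name the missing setting directly. As a small illustration, the Python sketch below lists which searcher lacks which value. It assumes the console lines have exactly the format quoted above, and the helper missing_credentials is hypothetical and not part of KNIME or Palladian.

```python
import re

# Matches console lines such as:
# "... Could not instantiate <class>, exception from constructor:
#  <setting> must not be empty"
MISSING_KEY = re.compile(
    r"Could not instantiate (?P<cls>[\w.]+), exception from constructor: "
    r"(?P<setting>\w+) must not be empty"
)


def missing_credentials(log_lines):
    """Map each searcher's short class name to the setting reported as empty."""
    found = {}
    for line in log_lines:
        match = MISSING_KEY.search(line)
        if match:
            short_name = match.group("cls").rsplit(".", 1)[-1]
            found[short_name] = match.group("setting")
    return found
```

Each value it reports corresponds to a key that has to be entered in the preferences before that searcher can run.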

Concerning your second question, I would recommend contacting the authors of the paper directly.

Best,
Philipp