XML attributes are not recognized in XPath (double quotation marks)

Maike · April 22, 2021, 5:14pm

Dear Community,

My aim is to retrieve search results from the Guardian API, using the following workflow as an example (it retrieves RSS feeds from the NYT):

Extraction and Tag Cloud Visualization of Named Entities from New York Times News Feeds – KNIME Hub

I am using this Guardian API request generator:

Guardian API scraper - KNIME Analytics Platform - KNIME Community Forum

Then I use the ‘Webpage Retriever’ node to translate the results into XML format, and I would like to use the XPath node, just as in the NYT example, to create separate rows for each search result.

However, my problem now is that the XPath node does not recognize the attributes, since the generated XML looks like this:

{
  "response" : {
    "status" : "ok",
    "userTier" : "developer",
    "total" : 117,
    "startIndex" : 1,
    "pageSize" : 10,
    "currentPage" : 1,
    "pages" : 12,
    "orderBy" : "newest",
    "results" : [ {
      "id" : "books/1920/dec/29/fromthearchives.poetry",
      "type" : "article",
      "sectionId" : "books",
      "sectionName" : "Books",
      "webPublicationDate" : "1920-12-29T23:46:12Z",
      "webTitle" : "Archive review: Poems by Wilfred Owen",
      "webUrl" : "https://www.theguardian.com/books/1920/dec/29/fromthearchives.poetry",
      "apiUrl" : "https://content.guardianapis.com/books/1920/dec/29/fromthearchives.poetry",
      "isHosted" : false,
      "pillarId" : "pillar/arts",
      "pillarName" : "Arts"
    }, {
      "id" : "news/1920/dec/29/leadersandreply.mainsection",
      "type" : "article",
      "sectionId" : "theguardian",
      "sectionName" : "From the Guardian",
      "webPublicationDate" : "1920-12-29T00:04:38Z",
      "webTitle" : "From the archive: A revelation of the pity of war",
      "webUrl" : "https://www.theguardian.com/news/1920/dec/29/leadersandreply.mainsection",
      "apiUrl" : "https://content.guardianapis.com/news/1920/dec/29/leadersandreply.mainsection",
      "isHosted" : false,
      "pillarId" : "pillar/news",
      "pillarName" : "News"
    } ]
  }
}

I believe the problem is the double quotation marks, but I just cannot figure out how to handle this (absolute newbie).

Your help is very much appreciated!

bruno29a · April 22, 2021, 6:35pm

Hi @Maike, the data you presented looks to be in JSON format.
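As a rough illustration outside KNIME, the following Python sketch shows why an XML tool cannot read this payload while a JSON parser can. The `detect_format` helper and the shortened `payload` string are hypothetical and only serve this example; they are not part of any workflow in this thread.

```python
import json
import xml.etree.ElementTree as ET

# Shortened, hypothetical excerpt of the response posted above.
payload = (
    '{"response": {"status": "ok", "results": '
    '[{"id": "books/1920/dec/29/fromthearchives.poetry"}]}}'
)


def detect_format(text):
    """Return 'xml', 'json', or 'unknown' depending on which parser accepts the text."""
    try:
        ET.fromstring(text)
        return "xml"
    except ET.ParseError:
        pass  # Not well-formed XML, so try JSON next.
    try:
        json.loads(text)
        return "json"
    except json.JSONDecodeError:
        return "unknown"


print(detect_format(payload))                         # the Guardian-style payload
print(detect_format("<rss><item>News</item></rss>"))  # an RSS-style payload
```

The double quotation marks are therefore not the issue in themselves: they are ordinary JSON syntax, and the text simply is not XML.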

I am not sure what you are trying to retrieve, but you can also have KNIME extract these values into columns with the JSON to Table node.

Your JSON string as input:

And converted to columns by Knime:

Here’s the sample workflow: json to table example.knwf (6.7 KB)

You can then filter out the columns that you do not need.
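For readers who want to see the same idea in code, here is a minimal Python sketch of turning the `results` array into one row per search result and keeping only some columns. It is not what the JSON to Table node does internally; the `results_to_rows` helper, the shortened `raw` string, and the chosen column names are assumptions made for this example.

```python
import json

# Shortened copy of the response from the first post (only a few fields kept).
raw = """
{
  "response": {
    "status": "ok",
    "total": 117,
    "results": [
      {
        "id": "books/1920/dec/29/fromthearchives.poetry",
        "type": "article",
        "webPublicationDate": "1920-12-29T23:46:12Z",
        "webTitle": "Archive review: Poems by Wilfred Owen"
      },
      {
        "id": "news/1920/dec/29/leadersandreply.mainsection",
        "type": "article",
        "webPublicationDate": "1920-12-29T00:04:38Z",
        "webTitle": "From the archive: A revelation of the pity of war"
      }
    ]
  }
}
"""


def results_to_rows(text, columns):
    """Parse the JSON text and return one dict per search result, limited to `columns`."""
    data = json.loads(text)
    results = data["response"]["results"]
    # A missing key becomes None instead of raising, so uneven results still load.
    return [{col: item.get(col) for col in columns} for item in results]


rows = results_to_rows(raw, ["id", "webTitle", "webPublicationDate"])
for row in rows:
    print(row)
```

Each dictionary corresponds to one table row, and the `columns` list plays the role of the column filter.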

You can definitely use XPath too to extract only what you need, but the data you presented does not include the XML header, so it’s only recognized as a JSON string.

If you provide the full response, we can use XPath.

Maike · April 22, 2021, 7:57pm

Hi Bruno,

Thank you for your help! It looks like such a good idea; however, it just doesn’t work in my workflow. I will attach the workflow; maybe that will help.
My aim is to create a separate row for every search result.

Many thanks!

WEB_API.knwf (38.8 KB)

bruno29a · April 22, 2021, 9:01pm

Hi @Maike, first of all, you should not be sharing your API keys.

So, the data being returned is not really XML, and because you use the Webpage Retriever node, it returns the full HTML code of the web page, including the surrounding HTML tags, which you don’t need.

Instead, I used the HTTP GET node, which will return only the content of the page.
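A plain HTTP GET can be sketched in Python as follows, purely as an illustration of fetching the raw response body rather than a rendered web page. The endpoint `https://content.guardianapis.com/search` and the parameter names `q` and `api-key` are assumptions based on the `apiUrl` values in the response above and should be checked against the Guardian API documentation; `build_search_url` and `fetch_results` are hypothetical helper names.

```python
import json
import urllib.parse
import urllib.request

# Assumed search endpoint; verify against the Guardian API documentation.
BASE_URL = "https://content.guardianapis.com/search"


def build_search_url(query, api_key):
    """Build the request URL; the query string is URL-encoded."""
    params = urllib.parse.urlencode({"q": query, "api-key": api_key})
    return f"{BASE_URL}?{params}"


def fetch_results(query, api_key):
    """Send a GET request and return the list under response.results."""
    url = build_search_url(query, api_key)
    with urllib.request.urlopen(url, timeout=30) as resp:
        body = resp.read().decode("utf-8")  # raw JSON text, no HTML wrapper
    return json.loads(body)["response"]["results"]


# Only the URL is built here; calling fetch_results needs a valid key and network access.
print(build_search_url("wilfred owen", "YOUR_API_KEY"))
```

Keeping the key as a placeholder (or reading it from an environment variable) also avoids sharing it by accident when the script is posted.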

After that, I do some filtering until I get to the response node.

I am assuming that what you want as rows, is basically the result array (id, type, etc), correct?

Here’s what the workflow looks like:

I’ve re-used your original workflow and whatever is in the green box is what I added.

And here’s the result of the last node:

I think that’s what you want, correct?

Here’s the workflow (note: I have removed the API key from it, so you need to re-add it): JSON to table.knwf (48.8 KB)

Maike · April 23, 2021, 12:07pm

Thanks a million!
