extracting JSON data from website--how to use string to JSON node?

Glenn_Kuly · February 26, 2023, 6:15pm

Hello, this is my first question in the forum–I’m fairly new to Knime.

I’m trying to grab some JSON data from a website, so far I’ve been able to extract the following string (shortened in the middle) using a Regex Extractor node:

shotsData = JSON.parse(‘\x7B\x22h\x22\x3A\x5B\x7B\x22id\x22\x3A\x22506859\x22,\x22minute\x22\x3A\x2217\x22,\x22result\x22\x3A\x22BlockedShot\x22,\x22X\x22\x3A\x220.779000015258789
…
\x2D03\x2020\x3A00\x3A00\x22,\x22player_assisted\x22\x3Anull,\x22lastAction\x22\x3A\x22Rebound\x22\x7D\x5D\x7D’),

I cannot figure out where to start and end this string to make it readable in a String to JSON node, nothing I’ve tried so far has worked.

TIA for any suggestions.

– Glenn

ArjenEX · February 27, 2023, 9:27am

Hi @Glenn_Kuly

Your description is a bit confusing. You mentioned you want to grab some JSON data, then you parse a json but then you’re back with a string to convert it to a JSON again?

Could you please elaborate a little bit more on your current input and expected output preferably with some actual sources. It is then more likely someone will be able to jump in and provide some help

Glenn_Kuly · February 27, 2023, 8:33pm

Hi ArjenEX, by way of background, the data I’m after is contained on this webpage [Chelsea 0 - 0 Fulham (February 03 2023) | EPL | 2022/2023 | xG | Understat.com]

I’m not able not able to extract the JSON data directly from the webpage, “shotsData = JSON.parse” is part of the string I extracted (using Webpage Retriever and Regex Extractor). I know that in Python there is a way to extract JSON date in a string like this and transform it into a usable JSON file, but not sure how to do this in Knime.

– Glenn

ArjenEX · February 27, 2023, 10:47pm

Would something like this as end result be workable for you?

Glenn_Kuly · February 28, 2023, 12:10am

Yes! That’s exactly what I’m after. How did you do that?

– Glenn

ArjenEX · February 28, 2023, 11:52am

Allright @Glenn_Kuly! Below workflow should get you moving forward.

I must admin that the mentioned website is quite strange in structure and how this data is processed. The workflow might only work for this specific tabel.

After the webpage retriever, I opt for a Xpath node to extract the script section which contains the data via the query /html/body/div/div[3]/div[4]/div/script. It’s usually a bit more solid than Regex extraction. It will output that JSON.parse string as well

In order to use it though, I found out the hard way that there is a lot of cleansing to be done before it can actually be used. I put this cleanup in a Column Expression using function

join("[",replace(substr(column("script"),32,toInt(length(column("script"))-36)),"\\x","0x"),"]")

One of the actions it does is to capture it in an array so that you can perform collection queries later on.

Next is the replacement of the hex character codes. There is probably a more efficient way to do this, probably in Python, but given the time I went for the manual route with a nested replace() function.

String to JSON node will now give you a properly formatted JSON back which you can subsequently query.

This is done with a JSON Path node. Here you can extract all data fields with their corresponding values by using collection queries, outputted as Lists. For the ids for example, if you use $[*]['h'][*]['id'] it will extract all ids for the home team. You will notice in the JSON that there is an h group and a group. If you want to extract everything for the away team, you change the query to $[*]['a'][*]['id'] or convert it to a wildcard to retrieve both at the same time. This applies to all queries off course.

Next ungroup all created lists and you’ll have the stats in a table format.

See WF:
extract JSON data from football website.knwf (62.7 KB)

Hope this helps!

Glenn_Kuly · February 28, 2023, 12:14pm

Brilliant ArjenEX, thank you so much.

– Glenn

ArjenEX · February 28, 2023, 2:38pm

No problem! If it helped you please marked the post as solution so that other KNIME users can benefit from this more easily.

Glenn_Kuly · February 28, 2023, 3:13pm

Hi ArjenEX, its far more elegant extracting the JSON string from that website using XPath node (vs Regex Extractor), I can’t entirely follow the expression you used in the Column Expressions node (I’m still wrapping my head around the expression language in Knime) but once past that point it’s pretty smooth sailing.

I did not foresee having to transform hex to string (that’s mostly where i lost the plot here).

Thank you again so much for your help!

– Glenn

ArjenEX · February 28, 2023, 4:38pm

The expression is nested and combines 3 different operations:

substr(column("script"),32,toInt(length(column("script"))-36))
This is to trim the surplus characters from the string. Otherwise the JSON cannot be created. Practically this means trimming off shotsData = JSON.parse( from the beginning and ’), from the end. Since I don’t know how the long string is each time, determine this dynamically with length() and use an off-set.
replace(**string**,"\\x","0x")
This replaces \x with 0x. \x is not a valid hex character prefix, 0x is the proper one.
join("[",**string**,"]")
This puts the whole thing into an array that allows the collection queries to be applied at the end.

The final expression is a nested presentatation of this. So in the replace() function, the **string** used as input is the output of the substr() function under point 1. And so on.

Glenn_Kuly · March 6, 2023, 4:30pm

Hi Arjen, the column expressions you walked me through has made it possible for me to extract any JSON data I want from that website. Here’s a table built from shotsData (Arsenal v Everton 02 Mar 2023);

Can’t wait to visualize this, thank you again for your generous help!

– Glenn

ArjenEX · March 6, 2023, 4:33pm

Great to hear! Liverpool - Man United is going to be interesting one to visualize

Glenn_Kuly · March 7, 2023, 1:06am

For Man U fans maybe not so much

– Glenn

bruno29a · March 7, 2023, 4:39am

Removing my Like for you!!!

LOL J/k. It would still be interesting to see the stats for that game

system · March 14, 2023, 4:40am

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.