Making XML fun again..... ;-)

takbb · May 30, 2021, 7:22pm

There have been a number of posts on the forum regarding reading XML files. Reading XML is one aspect of KNIME that I feel isn’t quite as easy as it ought to be. To be fair, reading the XML itself is simple (you just load it with an XML Reader), but it’s what you do with it afterwards that presents the challenges.

The XPath node is very capable and, in my view, fine for a handful of columns, but it just doesn’t scale well (in practical/human terms). Maybe I just get bored too easily. Doubtless it’s a clever node, and its handling of XPath is, I’m sure, wonderful, but I’ll never really appreciate it because if I have a large XML file to read, I will lose interest in trying to configure every XPath long before my coffee has gone cold!

Turning to JSON to do the job just doesn’t do it for me, especially as I still then have to overcome the issue of handling those many-to-one relationships that I would like to easily break out into separate tables.

Ultimately, for the average person who just wants to read a very basic XML file in and get on with the actual job of analysing the data, it should be (almost) as straightforward as loading a CSV file.

Now, I accept that XML files and tables don’t make easy friends under all circumstances, but after seeing a few recent forum posts, I got to wondering: why should it be any more difficult to load an XML file into KNIME than it is to load it into Excel? Why do we need to resort to “convert to JSON” and back again, in what I would suggest is a not particularly friendly or intuitive process? If my XML has 5 columns, why do I need to spell out the XPaths? Yes, why?

So over the past week or so, I’ve been looking through the forum, seeing what other people have suggested over recent months and years. Ultimately, doing anything programmatically with the XPath node is, I feel, a non-starter. If it allowed me to pass in the XPath info as arrays, then maybe it could work (and I’ve looked at some interesting and useful posts on the subject), but as it stands, you are swimming upstream the entire way and even then you have no guarantee of success. It’s also a lot of nodes, and it’s just not generic!
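To make the “XPaths as arrays” idea concrete, here is a minimal sketch in plain Python, outside KNIME. It is not how the XPath node works; it only shows what driving the extraction from a list of (column name, path) pairs could look like. The sample XML, the `extract` function and the `COLUMN_PATHS` list are all made up for illustration, and `xml.etree.ElementTree` only supports a limited subset of XPath.

```python
import xml.etree.ElementTree as ET

# Made-up sample data, purely for illustration.
XML_TEXT = """
<orders>
  <order id="1"><customer><name>Ann</name></customer><total>9.50</total></order>
  <order id="2"><customer><name>Bob</name></customer><total>12.00</total></order>
</orders>
"""

# The "array" of XPath info: (output column name, path relative to each row element).
COLUMN_PATHS = [
    ("customer", "customer/name"),
    ("total", "total"),
]


def extract(xml_text, row_path, column_paths):
    """Build one dict per row element by evaluating every column path against it."""
    root = ET.fromstring(xml_text.strip())
    rows = []
    for element in root.findall(row_path):
        # findtext returns None when the path does not match anything.
        rows.append({name: element.findtext(path) for name, path in column_paths})
    return rows


rows = extract(XML_TEXT, "order", COLUMN_PATHS)
for row in rows:
    print(row)
```

Adding a column is then a one-line change to the list rather than another round of dialog configuration, which is roughly the flexibility I was hoping for.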

So what are the other options if not the standard nodes? Well… I thought…would it be possible to write a component to do it? Why not… we’ve got Python… so the day before yesterday, and yesterday, and today… I’ve been researching… and writing… and testing…

Let’s say we had a piece of XML that looked like this:

Wouldn’t it be great if (with approximately zero configuration) it turned into this:

or if this:

turned into this:

or if, with only a minor piece of configuration
this piece of XML: demo5.xml (1.4 KB)
which contains two nested many-to-one data structures (addresses, and pets) which don’t lend themselves to being in a single table without creating repeating groups, could be turned into
this:

and this:

Well… that’s what my new toy does ;-), and if you’d like to try it out for yourself (and have fun with XML again… ) you can.
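For anyone curious about the general idea of splitting nested many-to-one structures into separate tables, here is a rough Python sketch. It is not the code inside the component, and the sample XML is hypothetical (it is not the attached demo5.xml). It assumes that every child of the root element is one row, that leaf elements are columns, and that any child with its own children is a container of repeating records; the `flatten` function and the `row_id` / `parent_row_id` key names are invented for this example.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical sample, NOT the demo5.xml attached to this post.
SAMPLE = """
<people>
  <person>
    <name>Ann</name>
    <age>34</age>
    <addresses>
      <address><city>Leeds</city><type>home</type></address>
      <address><city>York</city><type>work</type></address>
    </addresses>
    <pets>
      <pet><species>cat</species><petname>Tom</petname></pet>
    </pets>
  </person>
  <person>
    <name>Bob</name>
    <age>41</age>
    <addresses>
      <address><city>Hull</city><type>home</type></address>
    </addresses>
  </person>
</people>
"""


def flatten(xml_text):
    """Return (main_rows, child_tables) guessed from the XML structure.

    Assumptions: each child of the root is one row; leaf children become
    columns; a child with sub-elements is a container of repeating records
    that goes into its own table, linked back by parent_row_id.
    """
    root = ET.fromstring(xml_text.strip())
    main_rows = []
    child_tables = defaultdict(list)
    for row_id, record in enumerate(root):
        row = {"row_id": row_id}
        for field in record:
            if len(field) == 0:
                # Leaf element: a plain column in the main table.
                row[field.tag] = (field.text or "").strip()
            else:
                # Container element: one child-table row per nested record.
                for item in field:
                    child = {"parent_row_id": row_id}
                    for leaf in item:
                        child[leaf.tag] = (leaf.text or "").strip()
                    child_tables[field.tag].append(child)
        main_rows.append(row)
    return main_rows, dict(child_tables)


main_rows, child_tables = flatten(SAMPLE)
print(main_rows)
for table_name, table_rows in child_tables.items():
    print(table_name, table_rows)
```

The main table keeps one row per person, while addresses and pets each get their own table that can be joined back on `parent_row_id`, which avoids repeating groups in a single table. Deeper nesting, attributes and namespaces are deliberately ignored in this sketch.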

I’ve uploaded a demo workflow here:

and the XML Easy Reader 1 (prototype version 1!) can be downloaded from here:

I’d welcome feedback and obviously it’s still early days so if you find any bugs please let me know. There are some config options - see the documentation on the component, but if it’s not clear how to make it work, please let me know. Doubtless there are situations I’ve not yet considered but my hope is that it works really easily for basic (row-column style) XML, but also allows me/you to work more easily with reading more complex structures without lots of configuration. Enjoy!

tobias.koetter · May 31, 2021, 6:29am

Hi takbb,
Thanks a lot for this great component and all the research you did. We will take this component as an inspiration for the planned XML Reader rewrite and try to support autodetection of the structure as well.
Thanks for all your great contributions.
Tobias

takbb · May 31, 2021, 8:59am

Thanks @tobias.koetter, the component still needs a little refinement, and I could do with improving how it handles errors if the XML isn’t as “expected”, or if its default assumptions about the structure are wrong. At the moment there are a few times it fails rather unceremoniously, but hopefully I can improve that by looking at more examples and working out how to handle them.

I’ll also tidy up my Python code somewhat, and add a few comments (before I forget why I did certain things), but the Python code itself is reasonably small and the ideas ought to be reasonably easy to port to Java. I’ve yet to look at writing an actual node for KNIME. I decided to wait until the next release of KNIME before I download the SDK and have a play. I’m more of a Java programmer than a Python one, so I guess that should really be my next step, but I find prototyping an idea in Python really easy with KNIME.

Would be nice to see further XML support in the KNIME nodes, and happy to share ideas if that would be useful.

Brotfahrer · June 9, 2021, 1:52pm

Hello @takbb ,

I’m really interested in your XML Easy Reader 1. Unfortunately I’m not able to download your example workflow from the KNIME Hub.

@tobias.koetter: in the KNIME log I have the following entries:
2021-06-09 15:34:12,682 : WARN : ModalContext : : ExplorerURIDropUtil : : : Object at URI ‘https://hub.knime.com/takbb/spaces/Public/latest/Components/XML%20Easy%20Reader%201’ not found
2021-06-09 15:40:47,468 : WARN : ModalContext : : HubURIImporter : : : Hub request failed
javax.ws.rs.ProcessingException: javax.net.ssl.SSLHandshakeException: SSLHandshakeException invoking https://api.hub.knime.com/knime/rest/v4/repository/Users/takbb/Public/Components/XML%20Easy%20Reader%201: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
at org.apache.cxf.jaxrs.client.AbstractClient.checkClientException(AbstractClient.java:633)
at org.apache.cxf.jaxrs.client.AbstractClient.preProcessResult(AbstractClient.java:607)
at org.apache.cxf.jaxrs.client.WebClient.doResponse(WebClient.java:1145)
at org.apache.cxf.jaxrs.client.WebClient.doChainedInvocation(WebClient.java:1082)
at org.apache.cxf.jaxrs.client.WebClient.doInvoke(WebClient.java:927)
at org.apache.cxf.jaxrs.client.WebClient.doInvoke(WebClient.java:896)
at org.apache.cxf.jaxrs.client.WebClient.invoke(WebClient.java:364)
at org.apache.cxf.jaxrs.client.WebClient.get(WebClient.java:390)
at com.knime.explorer.server.internal.HubURIImporter.callHubApi(HubURIImporter.java:452)
at com.knime.explorer.server.internal.HubURIImporter.callHubEndpoint(HubURIImporter.java:409)
at com.knime.explorer.server.internal.HubURIImporter.createRepoObjectImport(HubURIImporter.java:149)
at org.knime.workbench.core.imports.URIImporter.createEntityImport(URIImporter.java:81)
at org.knime.workbench.core.imports.URIImporter.createEntityImport(URIImporter.java:105)
at org.knime.workbench.core.imports.URIImporterFinder.createEntityImportFor(URIImporterFinder.java:118)
at org.knime.workbench.explorer.view.dnd.ExplorerURIDropUtil.lambda$0(ExplorerURIDropUtil.java:175)
at org.eclipse.jface.operation.ModalContext$ModalContextThread.run(ModalContext.java:122)
Caused by: javax.net.ssl.SSLHandshakeException: SSLHandshakeException invoking https://api.hub.knime.com/knime/rest/v4/repository/Users/takbb/Public/Components/XML%20Easy%20Reader%201: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
at sun.reflect.GeneratedConstructorAccessor114.newInstance(Unknown Source)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.cxf.transport.http.HTTPConduit$WrappedOutputStream.mapException(HTTPConduit.java:1402)
at org.apache.cxf.transport.http.HTTPConduit$WrappedOutputStream.close(HTTPConduit.java:1386)
at org.apache.cxf.io.AbstractWrappedOutputStream.close(AbstractWrappedOutputStream.java:77)
at org.apache.cxf.transport.AbstractConduit.close(AbstractConduit.java:56)
at org.apache.cxf.transport.http.HTTPConduit.close(HTTPConduit.java:673)
at org.apache.cxf.interceptor.MessageSenderInterceptor$MessageSenderEndingInterceptor.handleMessage(MessageSenderInterceptor.java:63)
at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:308)
at org.apache.cxf.jaxrs.client.AbstractClient.doRunInterceptorChain(AbstractClient.java:705)
at org.apache.cxf.jaxrs.client.WebClient.doChainedInvocation(WebClient.java:1081)
… 12 more
Caused by: javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
at sun.reflect.GeneratedConstructorAccessor114.newInstance(Unknown Source)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at sun.net.www.protocol.http.HttpURLConnection$10.run(HttpURLConnection.java:1950)
at sun.net.www.protocol.http.HttpURLConnection$10.run(HttpURLConnection.java:1945)
at java.security.AccessController.doPrivileged(Native Method)
at sun.net.www.protocol.http.HttpURLConnection.getChainedException(HttpURLConnection.java:1944)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1514)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1498)
at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:480)
at sun.net.www.protocol.https.HttpsURLConnectionImpl.getResponseCode(HttpsURLConnectionImpl.java:352)
at org.apache.cxf.transport.http.URLConnectionHTTPConduit$URLConnectionWrappedOutputStream$2.run(URLConnectionHTTPConduit.java:377)
at org.apache.cxf.transport.http.URLConnectionHTTPConduit$URLConnectionWrappedOutputStream$2.run(URLConnectionHTTPConduit.java:373)
at java.security.AccessController.doPrivileged(Native Method)
at org.apache.cxf.transport.http.URLConnectionHTTPConduit$URLConnectionWrappedOutputStream.getResponseCode(URLConnectionHTTPConduit.java:373)
at org.apache.cxf.transport.http.HTTPConduit$WrappedOutputStream.processRetransmit(HTTPConduit.java:1450)
at org.apache.cxf.transport.http.HTTPConduit$WrappedOutputStream.handleRetransmits(HTTPConduit.java:1437)
at org.apache.cxf.transport.http.HTTPConduit$WrappedOutputStream.handleResponse(HTTPConduit.java:1567)
at org.apache.cxf.transport.http.HTTPConduit$WrappedOutputStream.close(HTTPConduit.java:1373)
… 19 more
Caused by: javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
at sun.security.ssl.Alerts.getSSLException(Alerts.java:198)
at sun.security.ssl.SSLSocketImpl.fatal(SSLSocketImpl.java:1967)
at sun.security.ssl.Handshaker.fatalSE(Handshaker.java:331)
at sun.security.ssl.Handshaker.fatalSE(Handshaker.java:325)
at sun.security.ssl.ClientHandshaker.serverCertificate(ClientHandshaker.java:1688)
at sun.security.ssl.ClientHandshaker.processMessage(ClientHandshaker.java:226)
at sun.security.ssl.Handshaker.processLoop(Handshaker.java:1082)
at sun.security.ssl.Handshaker.process_record(Handshaker.java:1010)
at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:1079)
at sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1388)
at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1416)
at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1400)
at sun.net.www.protocol.https.HttpsClient.afterConnect(HttpsClient.java:559)
at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.connect(AbstractDelegateHttpsURLConnection.java:185)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1570)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1498)
at sun.net.www.protocol.http.HttpURLConnection.getHeaderFields(HttpURLConnection.java:3084)
at sun.net.www.protocol.https.HttpsURLConnectionImpl.getHeaderFields(HttpsURLConnectionImpl.java:297)
at org.apache.cxf.transport.http.Headers.readFromConnection(Headers.java:281)
at org.apache.cxf.transport.http.URLConnectionHTTPConduit$URLConnectionWrappedOutputStream.updateCookiesBeforeRetransmit(URLConnectionHTTPConduit.java:337)
at org.apache.cxf.transport.http.HTTPConduit$WrappedOutputStream.handleRetransmits(HTTPConduit.java:1435)
… 21 more
Caused by: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
at sun.security.validator.PKIXValidator.doBuild(PKIXValidator.java:450)
at sun.security.validator.PKIXValidator.engineValidate(PKIXValidator.java:317)
at sun.security.validator.Validator.validate(Validator.java:262)
at sun.security.ssl.X509TrustManagerImpl.validate(X509TrustManagerImpl.java:330)
at sun.security.ssl.X509TrustManagerImpl.checkTrusted(X509TrustManagerImpl.java:237)
at sun.security.ssl.X509TrustManagerImpl.checkServerTrusted(X509TrustManagerImpl.java:132)
at sun.security.ssl.ClientHandshaker.serverCertificate(ClientHandshaker.java:1670)
… 37 more
Caused by: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
at sun.security.provider.certpath.SunCertPathBuilder.build(SunCertPathBuilder.java:141)
at sun.security.provider.certpath.SunCertPathBuilder.engineBuild(SunCertPathBuilder.java:126)
at java.security.cert.CertPathBuilder.build(CertPathBuilder.java:280)
at sun.security.validator.PKIXValidator.doBuild(PKIXValidator.java:445)
… 43 more

takbb · June 9, 2021, 2:16pm

Hi @Brotfahrer, I’m pleased to hear you are interested in the XML Easy Reader!

I’m wondering if that error message is down to something I’ve done. I recently updated both the component and the demo workflow, but I just had a look and the workflow appears to contain a newer version of the component than the one I made public!

Let me take a look and I’ll get back to you as soon as I can. Of course it might be some other cause, but chances are it’s me that has broken it since I touched it last… apologies!

takbb · June 9, 2021, 3:46pm

Hi @Brotfahrer, I have updated both the component and the demo on the hub now, so hopefully the link works again (repeated here for convenience):

I’m not sure what happened, but somewhere I think I might have managed to prematurely save an edited version of the demo workflow to the hub. Please let me know if it now works.

Just in case, I have also uploaded it here…

XML Easy Reader demo.knwf (1.2 MB)

I’d welcome feedback and suggestions. I’ve been trying it out with various publicly available XML files to see where improvements can be made, but there will always be more things I haven’t found, or thought of!

I see this as a “working prototype” towards one day perhaps porting it to a real KNIME Node.