Running a PMML file in batch mode

I am new to KNIME. I used the PMML Writer node to export a model as a PMML file. I want to run this single PMML file in batch mode. However, when I try to do that I get the error: "Workflow is locked by another KNIME instance". Does anyone know why this is happening and how to fix it?

Also, I downloaded PMML files from the Data Mining Group along with the relevant data files (stored as CSV in the same folder; http://dmg.org/pmml/pmml_examples/index.html). I want to run these PMML files in batch mode, but I get the same error.

Did you still have the GUI open when trying to run batch mode? In that case the workspace (not only the workflow) is still locked. Using KNIME in batch mode to run a PMML model generated by KNIME seems a bit odd, though. Why not make use of the full flexibility of KNIME and execute the full workflow in batch mode?

- Michael

PS: apologies for the late answer - last week's KNIME Summit kept many of us fairly busy...

I think the question on StackOverflow helped me find a solution to your problem; you can check my answer there. As a short summary: you cannot execute PMML models in KNIME, only workflows (which can contain PMML models). I think a better error message would be helpful in cases like this.
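For reference, here is a minimal sketch of what the batch invocation could look like once you point it at a workflow instead of a bare PMML file (the installation path matches your log; the workflow directory name is hypothetical and must be a folder containing a workflow.knime file, e.g. a workflow exported or copied from your workspace):

# Run a KNIME *workflow* (not a .pmml file) in batch mode.
# /path/to/MyWorkflow is a placeholder for a directory that contains
# a workflow.knime file; the flags are the documented batch options.
/knime-full_3.1.1/knime -nosplash -consoleLog \
  -application org.knime.product.KNIME_BATCH_APPLICATION \
  -workflowDir=/path/to/MyWorkflow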

Cheers, gabor

PS: I was also at the KNIME Summit (I enjoyed it very much). Thanks. :)

Hello Michael,

Thank you very much for your reply. I'm aware of the Summit and was expecting some delay, so no need to apologize (I hope the Summit was a great event and that you enjoyed it). I wanted to clarify the scenario I was using to execute the example workflows, but instead of duplicating it here, I will point you to my corresponding question on StackOverflow, which I posted while waiting for answers on this site. My question and (as of now) two answers (thank you so much, Gabor and Rosaria!) can be found here: http://stackoverflow.com/q/35645896/2872891.

For better clarity, I'm including KNIME's full output from executing my test script below:

Knime: Cannot open display:
Knime:
GTK+ Version Check
CompilerOracle: exclude javax/swing/text/GlyphView.getBreakSpot
INFO     main BatchExecutor      ===== Executing workflow . =====
ERROR    main BatchExecutor      Workflow is locked by another KNIME instance
INFO     main BatchExecutor      ========= Workflow did not execute sucessfully ============
Knime:
JVM terminated. Exit code=3
/knime-full_3.1.1/jre/bin/java
-server
-Dsun.java2d.d3d=false
-Dosgi.classloader.lock=classname
-XX:+UnlockDiagnosticVMOptions
-XX:+UnsyncloadClass
-Dknime.enable.fastload=true
-XX:CompileCommand=exclude,javax/swing/text/GlyphView,getBreakSpot
-Xmx1024m
-Dorg.eclipse.swt.internal.gtk.disablePrinting
-jar /knime-full_3.1.1//plugins/org.eclipse.equinox.launcher_1.3.100.v20150511-1540.jar
-os linux
-ws gtk
-arch x86_64
-launcher /knime-full_3.1.1/knime
-name Knime
--launcher.library /knime-full_3.1.1//plugins/org.eclipse.equinox.launcher.gtk.linux.x86_64_1.1.300.v20150602-1417/eclipse_1612.so
-startup /knime-full_3.1.1//plugins/org.eclipse.equinox.launcher_1.3.100.v20150511-1540.jar
--launcher.overrideVmargs
-exitdata 5c8007
-consoleLog
-application org.knime.product.KNIME_BATCH_APPLICATION
-workflowDir=.
-vm /knime-full_3.1.1/jre/bin/java
-vmargs
-server
-Dsun.java2d.d3d=false
-Dosgi.classloader.lock=classname
-XX:+UnlockDiagnosticVMOptions
-XX:+UnsyncloadClass
-Dknime.enable.fastload=true
-XX:CompileCommand=exclude,javax/swing/text/GlyphView,getBreakSpot
-Xmx1024m
-Dorg.eclipse.swt.internal.gtk.disablePrinting
-jar /knime-full_3.1.1//plugins/org.eclipse.equinox.launcher_1.3.100.v20150511-1540.jar

I was not running any KNIME GUI (explicitly), as I'm interested in running KNIME on the server in batch mode only. Moreover, per my understanding, specifying "-application org.knime.product.KNIME_BATCH_APPLICATION" explicitly tells KNIME not to initiate an interactive (GUI) session. Assuming that, the message "Knime: Cannot open display:" looks suspicious.
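If the display message does turn out to matter, my current working assumption (not yet verified) is that wrapping the same invocation in a virtual X server such as Xvfb should take care of it on a headless server:

# Untested assumption on my side: run the batch invocation under Xvfb so
# that the GTK/SWT layer finds a (virtual) display on a headless server.
# Requires the xvfb package; all other flags are unchanged from my log.
xvfb-run --auto-servernum /knime-full_3.1.1/knime -nosplash -consoleLog \
  -application org.knime.product.KNIME_BATCH_APPLICATION \
  -workflowDir=.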

Now, a couple of words about my goals. I thought (before reading Gabor's comment and answer) that it is possible to export the whole workflow (analytical model(s), pre-processing and the workflow graph per se) as PMML. I was confused by the materials I mention on StackOverflow, in particular those related to modular PMML.

Anyway, my goal is to allow users to create, modify and sometimes execute research workflows using their preferred environments (i.e., using the KNIME GUI locally, on their desktops or laptops), whereas the main execution tasks and workflow sharing would be done on the server side in batch mode only. Since workflow sharing is an important feature, I expected that users would be able to export their workflows (not only analytical models) in a common tool-independent format (which I thought PMML was) and share them via the server, which, in turn, would allow other users to import those workflows and execute and/or modify them using their preferred tools (not necessarily KNIME). It seems that PMML currently doesn't offer this functionality. In his excellent answer on StackOverflow, Gabor presented the basic approach for creating a workflow exportable as PMML. However, it seems to me that such workflows would still require KNIME to import and run them. Either that, or my understanding is wrong and those workflows would be portable across all tools supporting various open source PMML-based execution environments (KNIME in batch mode, Kamanja, Cascading + Pattern, Augustus, Openscoring / JPMML, R, Weka, Spark). Please advise on these aspects (here or as an answer on StackOverflow). Your clarifications and/or help will be much appreciated.

Best regards,
Aleksandr

Hello Gabor,

Again, I wanted to express my appreciation for your attention and help with this. I think that I understand your answer on StackOverflow, but please see my reply above to Michael. It adds more detail and clarifies certain things, including my expectations and goals. I hope that you will be able to offer some additional feedback on that as well.

Best regards,
Aleksandr

Hi Aleksandr,

(I am reacting only to the last paragraph; I apologise if it is not clear enough or if its analogies/metaphors are not perfect or even cause more confusion, that was not my intention.) Workflows and models have different purposes: you can think of a (PMML) model as an object in an R session that describes how to predict (the P in PMML) values based on input data, while the workflow is the whole R session that loads the data, does the preprocessing, applies the model and gets the results; practically a program. I usually think of KNIME as a visual programming language.

I am afraid there is no universal or widely accepted workflow format. Usually these formats are not even documented (SCUFL2 is a semi-exception: not widely used, but at least somewhat documented), as they are not meant to be reused in other programs, while PMML is meant to be reused. To reuse whole workflows/programs across programs, there are usually ways to go in both directions. For example, you can use KNIME from other programs via its batch execution mode (as you want to do), while you can use other programs from KNIME (in case it does not already have a better integration) using the external tool nodes, so this limitation is usually not a problem.

In my opinion KNIME has very good support for PMML (especially for creating PMML models, although it usually has no problems consuming them either). Other tools might take a different approach, but the goal of PMML was to provide a way for different products to exchange/combine/use their models, which is very easy with KNIME.

What is and what is not in the PMML models?

In it:

  • preprocessing instructions
  • description of the expected input
  • the model to predict values (and statistics) for potentially unseen input

Not in it:

  • the training data (except for certain models which consist only of that data, like k nearest neighbours), as it is not necessary for predicting values
  • the input data you want to apply it to (because in that case it would not be reusable for other data; usually that data has not even been collected when the model is created, so this would be a huge limitation)
  • instructions on where to put the predicted data or which statistics should be computed (this would also make it less generic: what if you want to see the predictions in a database, for example, and not in a CSV file?)
  • how the model was created (though some models might give a clue in case there is only a single way to compute them, but without the training data and the training parameters it is usually not possible to reconstruct the model)
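
To make the two lists above concrete, here is a toy sketch of what a PMML file typically contains (the field names, numbers and model are made up; real files from the PMML Writer node are much larger but have the same shape):

# Write a toy PMML file to illustrate the structure (hypothetical model).
cat <<'EOF' > toy-model.pmml
<?xml version="1.0"?>
<PMML version="4.2" xmlns="http://www.dmg.org/PMML-4_2">
  <Header description="Toy regression model"/>
  <!-- description of the expected input -->
  <DataDictionary numberOfFields="2">
    <DataField name="x" optype="continuous" dataType="double"/>
    <DataField name="y" optype="continuous" dataType="double"/>
  </DataDictionary>
  <!-- preprocessing instructions -->
  <TransformationDictionary>
    <DerivedField name="x_scaled" optype="continuous" dataType="double">
      <NormContinuous field="x">
        <LinearNorm orig="0" norm="0"/>
        <LinearNorm orig="10" norm="1"/>
      </NormContinuous>
    </DerivedField>
  </TransformationDictionary>
  <!-- the model to predict values for potentially unseen input -->
  <RegressionModel functionName="regression">
    <MiningSchema>
      <MiningField name="x"/>
      <MiningField name="y" usageType="target"/>
    </MiningSchema>
    <RegressionTable intercept="1.5">
      <NumericPredictor name="x_scaled" coefficient="2.0"/>
    </RegressionTable>
  </RegressionModel>
</PMML>
EOF
# Note what is absent: no training data, no input data and no output
# destination; exactly the points from the second list above.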

Because of the things not in it, you cannot use a PMML model alone to make predictions; you need further input, especially the input data you want to predict on (and, as I explained in my StackOverflow answer to the KNIME question, you also have to load the model, make the prediction and save the prediction, unless you just want to check whether the input fits the model, and all of this requires configuration).

So, what you can do is the following: generate a KNIME workflow/R script/Python program/... for each data input and PMML model for which you want to predict values, and execute it.
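
In case you stay with KNIME for the execution part, here is a hedged sketch of that pattern (the workflow directory and the workflow variable names are hypothetical; the workflow itself would have to define these variables and use them in its reader, predictor and writer nodes):

# Score every CSV in a folder with one generic workflow. The workflow at
# /workflows/apply_pmml is assumed to read input_csv, apply the model from
# pmml_file and write predictions to output_csv; -workflow.variable is the
# documented way to pass such parameters in batch mode, -reset re-executes
# the workflow from scratch on each run.
for f in /data/inputs/*.csv; do
  /knime-full_3.1.1/knime -nosplash -consoleLog \
    -application org.knime.product.KNIME_BATCH_APPLICATION \
    -workflowDir=/workflows/apply_pmml \
    -reset \
    -workflow.variable=input_csv,"$f",String \
    -workflow.variable=pmml_file,/models/model.pmml,String \
    -workflow.variable=output_csv,"${f%.csv}_scored.csv",String
done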

Hope this helps

PS: Sorry for the long reply; I hope I expressed my understanding somewhat clearly.

Hi Gabor,

I greatly appreciate your detailed clarifications here on the forum. I'm pretty comfortable with R, so I had no problems reading your explanation. However, the situation is rather unfortunate, even though I believe it is slightly better than what you have described. What I mean is that, beyond the Taverna-specific workflow language you've mentioned (SCUFL2), there are various frameworks and domain-specific languages (DSLs) that offer workflow interoperability across platforms.

An earlier example of such a language and corresponding platform is DiscoveryNet, which offers DPML (Discovery Process Markup Language), an XML-based generic workflow management language: https://en.wikipedia.org/wiki/Discovery_Net. Recently I ran across another, IMHO very powerful, workflow-focused science gateway framework and platform called gUSE (see http://guse.hu and https://en.wikipedia.org/wiki/GUSE). Interestingly enough, this framework originated in Hungary :-). It seems to be quite popular in Europe (and is part of various EU scientific initiatives), but is relatively unknown in the USA, where the most popular general scientific workflow management system is most likely Pegasus (https://pegasus.isi.edu).

Pegasus is actually the system that we plan to use for our platform, since it is already part of the infrastructural software that we build our platform upon (HUBzero). Pegasus and its workflows are used in various interesting projects; for example, it was used in the recent groundbreaking discovery of gravitational waves by the LIGO project. Pegasus is very well documented and offers rich APIs, which gives me hope that, if needed, we will be able to develop adapters that automatically convert workflows from various platforms into a single, more universal format (since most workflow engines use DAGs to represent and execute workflows).

Even though Pegasus is currently our primary target platform, I became very interested in gUSE and its architecture and implementation of workflow interoperability for science gateways. There is even a nice book on that, which I'm currently reading: http://link.springer.com/book/10.1007%2F978-3-319-11268-8. The only aspect of gUSE I'm concerned about is whether it is modular enough to let us extract or use its workflow functionality (with adapters/bridges) without relying on its portal (portlet) software, which we don't need (at least in the current incarnation of the platform we are developing). Therefore, most likely the fastest (but not easiest) way to implement these ideas is to use the Pegasus infrastructure.

Frankly, considering the generic nature of the XML-based PMML format, I think it would be a great idea to embed workflow description and execution information into PMML, similarly to how pre-processing and ensemble information is already embedded into it. If feasible, that would be of tremendous value, allowing PMML to be used not only as a universal format for analytical models, but also for complete workflows. The resulting transparent interoperability between various scientific and data analytics platforms would significantly improve the efficiency of collaboration in science and industry, leading to faster, cheaper and more significant scientific discoveries and business achievements.

This is my current thinking on the topic. If you and/or other people from the KNIME community have any comments, suggestions or advice in this regard, they will be much appreciated.

Best regards,
Aleksandr

P.S. Sorry about the long reply, but I felt it's important (and potentially beneficial for everyone) to share this info and my thoughts with you and the rest of the community.

Hi Aleksandr,

I have heard about gUSE before; Luis de la Garza presented the Generic KNIME Nodes and mentioned that they can use gUSE with KNIME, so in case you settle, at least partially, on KNIME, it might be a good idea to check out his research too. Maybe a similar approach can be applied to Pegasus or any other solution you are evaluating.

Kind Regards, gabor

Hi Gabor,

Glad to hear that you're aware of gUSE. And thank you for the references; I will definitely check out Luis' research and his open source project(s). Hope to stay in touch.

Best regards,
Aleksandr