Productizing a KNIME Workflow in a "Less Weird" Environment

My engineering team wants to productize my KNIME workflow in their production environment. They insist that the workflow cannot continue running on KNIME because it is too weird and the engineers don't know how to support it. They support native Java along with Groovy scripting. They also support running Python and R as a convenience, with limited multicore (though this could change). They sell a SaaS service and their distributed application was "born in the cloud", but they hate the idea of having a KNIME Server respond to REST calls. Even the word "KNIME" has become toxic.

I’m trying to look at this decision in a positive way. My workflow is huge (2000 nodes) and I’m hitting scalability problems. A change in direction was required anyway, though I was hoping to migrate more towards KNIME Server. I need to come up with a new architecture.

My question is, what can compete with KNIME?

I need to have a platform that:

  • can migrate my custom Java nodes
  • automatically handles the parallelization and dependencies
  • can split tasks out across many cores and many machines
  • is easy to hunt through data looking for problems
  • can pause, go back, and continue running from a midpoint
  • is easy to monitor and manage

What platform would you recommend? Or what advice would you pass on to my engineering colleagues?

I’m really worried about this migration and am skeptical about moving away from KNIME. Many data science models can be productized relatively easily once the tuning parameters have been found. But for a more convoluted data science application like mine, KNIME seems like a solid platform.

Here are some random thoughts concerning the requirements listed above:

Migration: The NodeModels I’ve developed for each of my custom KNIME nodes are heavily dependent on the KNIME libraries. Even if we keep the Java, running the same logic in a different environment would probably require a completely separate repository. I don’t think there is any way to cleanly generate KNIME nodes and, say, Python functions from the same code base.
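
One decoupling idea I’ve been playing with (sketched below; all class names like OutlierScorer and Measurement are made up for illustration) is to pull the actual algorithm out of each NodeModel into a plain Java class with no KNIME imports. The NodeModel would then only convert table rows to and from plain objects, and whatever the engineers build later could call the same class directly:

```java
// Plain Java core with no KNIME dependencies. Both the existing KNIME
// NodeModel and any future runner (Spark job, plain service, etc.) could
// call this class; only the table <-> POJO conversion stays KNIME-specific.
// All names here (OutlierScorer, Measurement, ScoredMeasurement) are made up.
import java.util.ArrayList;
import java.util.List;

public final class OutlierScorer {

    /** Minimal input record, independent of KNIME's DataRow. */
    public record Measurement(String id, double value) {}

    /** Minimal output record, independent of KNIME's DataCell types. */
    public record ScoredMeasurement(String id, double value, double score) {}

    private final double threshold;

    public OutlierScorer(double threshold) {
        this.threshold = threshold;
    }

    /** The actual node logic: score each row against a simple threshold. */
    public List<ScoredMeasurement> score(List<Measurement> input) {
        List<ScoredMeasurement> out = new ArrayList<>(input.size());
        for (Measurement m : input) {
            double score = Math.abs(m.value()) / threshold;
            out.add(new ScoredMeasurement(m.id(), m.value(), score));
        }
        return out;
    }
}
```

If something like that works, the same jar could live in both code bases, and only the thin KNIME adapter would stay behind in the workflow repository.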

Parallelization and Dependencies: Apache Spark looks like a promising way to handle the scalability, but it also means a steep learning curve and a difficult migration path from KNIME. I can use very little of MLlib, so for my own custom nodes I would probably have to develop the low-level parallelization that MLlib normally takes care of. This could be hard!
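
For what it’s worth, here is roughly what one of my simpler nodes might look like rewritten as a standalone Spark job in Java. It’s only a sketch with toy data, made-up names, and a local master; the point is that Spark spreads the work across cores and machines, but the node logic itself and the ordering between many such jobs would still be on us:

```java
// Rough, illustrative sketch of a single custom node rewritten as a Spark job.
// Data, threshold, and master URL are placeholders for the example.
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import java.util.Arrays;

public class ScoringJob {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("scoring-job")
                .setMaster("local[*]");   // placeholder; a real cluster URL in production
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // In reality the input would come from storage rather than be hard-coded.
            JavaRDD<Double> values = sc.parallelize(Arrays.asList(1.2, 3.4, 5.6, 7.8));

            // The "node logic" becomes a function applied per element; Spark handles
            // distributing it, but dependencies between jobs are still ours to manage.
            double threshold = 5.0;
            JavaRDD<Double> scores = values.map(v -> Math.abs(v) / threshold);

            scores.collect().forEach(System.out::println);
        }
    }
}
```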

Distributed Data Debugging: This is what I’m perhaps most worried about. I’m very productive with KNIME because I can spot problems in the results, quickly trace them back to their source, then experiment, fix, and try again from a midpoint in the workflow. I can’t see myself being nearly as productive trying to debug a data science application from just the source code.
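
To make this concrete, below is a rough sketch (stage names and file layout invented for the example) of the checkpoint-and-resume behaviour I’d have to rebuild by hand: each stage writes its output to disk so I can inspect it, and a re-run skips any stage whose checkpoint already exists.

```java
// Illustrative checkpoint-and-resume pattern: KNIME gives this for free by
// holding intermediate tables at every node. Stage names and the file layout
// are made up for the example.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.function.Supplier;
import java.util.stream.Collectors;

public class CheckpointedPipeline {

    private static final Path CHECKPOINT_DIR = Path.of("checkpoints");

    /** Run a stage, or reload its previous output if a checkpoint exists. */
    static List<String> stage(String name, Supplier<List<String>> compute) throws IOException {
        Path file = CHECKPOINT_DIR.resolve(name + ".txt");
        if (Files.exists(file)) {
            return Files.readAllLines(file);          // resume from a midpoint
        }
        List<String> result = compute.get();
        Files.createDirectories(CHECKPOINT_DIR);
        Files.write(file, result);                    // inspectable with any text editor
        return result;
    }

    public static void main(String[] args) throws IOException {
        List<String> cleaned = stage("clean", () -> List.of("a=1", "b=2"));
        List<String> scored  = stage("score", () ->
                cleaned.stream().map(s -> s + ";score=0.5").collect(Collectors.toList()));
        System.out.println(scored);
    }
}
```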

Monitor and Manage: It’s not just my engineering colleagues who need to support this, but also the customer support team. They just want to see a row of green lights, and a flashing red button they can click when there is a problem. KNIME makes it quick to add that kind of functionality on top of a distributed application; on anything else, all of this would need to be developed manually.
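
As a rough illustration of what “developed manually” means here, the sketch below (port, path, and stage names are all placeholders) is about the smallest status endpoint a support dashboard could poll for its green/red lights:

```java
// Minimal, illustrative status endpoint using the JDK's built-in HTTP server.
// Stage names, port, and path are placeholders; a real system would update
// STAGE_STATUS from the running pipeline.
import com.sun.net.httpserver.HttpServer;

import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class StatusEndpoint {

    // Each pipeline stage would report GREEN/RED here as it runs.
    static final Map<String, String> STAGE_STATUS = new ConcurrentHashMap<>(
            Map.of("ingest", "GREEN", "score", "GREEN", "export", "GREEN"));

    public static void main(String[] args) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        server.createContext("/status", exchange -> {
            StringBuilder body = new StringBuilder();
            STAGE_STATUS.forEach((stage, status) ->
                    body.append(stage).append(": ").append(status).append('\n'));
            byte[] bytes = body.toString().getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(200, bytes.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(bytes);
            }
        });
        server.start();
        System.out.println("Status endpoint on http://localhost:8080/status");
    }
}
```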

Migration is a common problem. Data science isn’t an end in itself, as the insights always need to be productized. KNIME is a great way to generate insights, but productization often involves other considerations.

I’d appreciate any perspectives on migration, as well as arguments I can take back to my engineering colleagues on why they should support KNIME.
