KNIME to Python exporter

Friends, I’m happy to share with you my new open-source project – knime2py.

GitHub - vitaly-chibrikov/knime2py: KNIME project to Python workbook converter

If you work in KNIME, you know how fast a new prototype of a DS project can be created in it, and, at the same time, how complex an already developed KNIME project can become. But maybe you, like me, feel the lack of an ability to export your KNIME project to .py or .ipynb files.

So, I decided to start my new open-source project, “knime2py”, a code-generation tool and KNIME→Python exporter: it parses a KNIME workflow, reconstructs its nodes and connections, and emits runnable Python “workbooks” (Jupyter notebooks or scripts) by translating supported KNIME nodes into idiomatic pandas / scikit-learn code via a pluggable node registry. Alongside the executable code, it also writes a machine-readable graph (JSON) and a Graphviz DOT file, preserving port wiring and execution order so that the generated Python mirrors the original workflow.
The list of already supported nodes can be found here: knime2py — Implemented Nodes
Even if a node you use is not implemented yet, the exporter still creates a cell for it that initializes its parameters from the node’s settings.xml.
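
To give an idea of how the code generation is organized, here is a heavily simplified sketch of the registry idea (the function names and the factory id below are made up for illustration and are not the actual knime2py code):

# Illustrative sketch of a pluggable node registry: each handler maps a KNIME
# node factory id to the Python lines the exporter should emit for that node.
NODE_REGISTRY = {}

def register(factory_id):
    def wrap(fn):
        NODE_REGISTRY[factory_id] = fn
        return fn
    return wrap

@register("org.example.CSVReaderFactory")  # made-up id, not a real KNIME factory class
def emit_csv_reader(settings, out_var):
    path = settings.get("path", "input.csv")
    return [f"{out_var} = pd.read_csv({path!r})"]

def emit_node(factory_id, settings, out_var):
    handler = NODE_REGISTRY.get(factory_id)
    if handler is None:
        # unsupported node: emit a stub cell that only initializes the
        # parameters read from the node's settings.xml
        return [f"# TODO: unsupported node {factory_id}", f"params = {settings!r}"]
    return handler(settings, out_var)

print("\n".join(emit_node("org.example.CSVReaderFactory", {"path": "data.csv"}, "df_1")))
print("\n".join(emit_node("org.example.SomeOtherFactory", {"threshold": 0.5}, "df_2")))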

Feel free to try it. I will be happy to discuss any questions.

5 Likes

Friends, I have got the first results.
I have exported this project with knime2py into an .ipynb file.

The ROC curve in KNIME:

And the ROC curve from the generated workbook:

With no manual editing of the generated code.

1 Like

Hi @VitaliiKaplan, thanks for sharing this. I’ve not had a chance to study your project yet, but the idea is fantastic and I hope to be able to take a closer look soon and give some feedback, as I’m sure will others.

1 Like

This is a great project! I love the idea of making KNIME more portable :slight_smile: Hugely admire the effort & skill required to be able to pull this off.

I currently work around the “portability” aspect by pushing most of my transformations to DuckDB inside the DB nodes & therefore having an exportable SQL query that can be run in any environment. Having some of the remaining functionality “exportable” to code would be hugely beneficial to my workflows.

I have a few suggestions / potential considerations for future development. Happy to elaborate on any of this & chat in more detail if needed!

  • focus on the I/O

In addition to the simple ETL + ML nodes, I’d probably focus the efforts on the more exotic I/O (a lot of the extraction/ingestion can already be done using DuckDB’s JDBC driver & its extensions).

I believe a lot of the connectors could be transpiled to e.g. their dlt equivalent: https://dlthub.com/.
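
As a rough illustration of the dlt idea (the pipeline and table names are made up, and only the basic dlt API is shown), a connector node could become something like:

import dlt

# a connector node translated into a dlt resource that yields records
@dlt.resource(name="customers")
def customers():
    yield [{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}]

# load locally into DuckDB; changing the destination string retargets the same code
pipeline = dlt.pipeline(pipeline_name="knime_export", destination="duckdb", dataset_name="raw")
print(pipeline.run(customers()))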

  • consider limiting pandas in favor of SQL (e.g. DuckDB) or Ibis.

While pandas is very popular and a lot of the codebase (& skill) exists in the market, I don’t believe it should be the default “target” in this day and age (at least for the simple ETL nodes), for scalability reasons (single-threaded, eager execution, etc.).

Since it’s still a “greenfield” project, I would opt for either transpiling the operations to SQL or to Ibis (https://ibis-project.org/), to be able to use a better execution engine for the job and seamlessly apply the code to a different environment as the data scales (e.g. DuckDB for single-node processing, Spark SQL/PySpark for distributed processing). SQL itself can be nicely ported between dialects using SQLGlot, which Ibis also uses under the hood (see the sqlglot API documentation).
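
For example, a sketch of moving the same query between dialects with SQLGlot:

import sqlglot

# the same relational logic, re-emitted for a different engine; dialect-specific
# functions are rewritten where SQLGlot knows a mapping
sql = "SELECT customer_id, SUM(amount) AS total FROM orders GROUP BY customer_id"
print(sqlglot.transpile(sql, read="duckdb", write="spark")[0])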

There is also an interesting project called amphi (https://amphi.ai/) where the author implemented a visual-to-pandas-code interface as well. I also know he was considering using Ibis/DuckDB as a default back-end to benefit from lazy execution & multi-threading. Of course, KNIME is way, way ahead in terms of maturity & visual development functionality.

Wishing you all the best in your development efforts! :slight_smile:

1 Like

amazing idea! I love it :heart:

I recommend making your program runnable with uv, so it can be used like a standalone tool.

I’d definitely give it a try later.

Since users may want to verify that the converted code is consistent with the KNIME workflow, I suggest creating a demo node in KNIME that directly references the converted code (but mark it so it is not converted itself in the future); maybe a simple Python node that references the generated code is enough. Then we can compare the final results in KNIME (this refers only to the table results, not the graphs) with testing nodes.
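
For example, if both sides write their result tables to CSV, the check could be as simple as this sketch (the file names are hypothetical):

import pandas as pd
from pandas.testing import assert_frame_equal

knime_result = pd.read_csv("knime_output.csv")       # table exported from the KNIME workflow
python_result = pd.read_csv("generated_output.csv")  # table produced by the converted code

# raises an AssertionError with a diff if the two tables differ
assert_frame_equal(
    knime_result.sort_values(knime_result.columns.tolist()).reset_index(drop=True),
    python_result.sort_values(python_result.columns.tolist()).reset_index(drop=True),
    check_dtype=False,  # KNIME and pandas may infer slightly different dtypes
)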

Users may also want to convert only part of a KNIME workflow. Maybe specifying a node number and starting the conversion from that node would be a useful feature? Then we could easily replace part of the nodes with Python code :star_struck:

2 Likes

Hello,

Thank you for the comments. It is a pleasure to see interest in this topic. As the project has gained its first users, I decided to focus on the deployment/delivery process. I have prepared a stable version and published a Docker image. I realize not everyone uses Docker in everyday practice, but a container image is a reliable way to eliminate dependency and version issues. OS-specific packages will follow.

If you have the Docker daemon available, you can run:
docker pull ghcr.io/vitaly-chibrikov/knime2py:latest
docker run --rm ghcr.io/vitaly-chibrikov/knime2py:latest --help

An example start script (k2p_docker.sh) is here:

Set the two variables to your paths and run it:
WORKFLOW_REL — path to your KNIME project
OUT_REL — path to the output directory

Feedback and suggestions are welcome.

@HaveF, thank you for the ideas. It would be great to pass the outputs of knime2py to the KNIME Python Script node. Even more — the UI you mentioned could itself be a custom KNIME Component. Imagine adding a component node (I will try to create it), configuring it with the path to the knime2py package (downloadable from GitHub), receiving the Python equivalent of your project, and running it within the same KNIME workflow. It sounds fantastic and is definitely worth trying.

@Add94, thank you for the detailed comment. If I understand correctly, you are proposing to translate KNIME node data transformations into SQL (or SQL-like) transformations. Indeed, when a node has a tabular input and output, its effect can be represented as an SQL query over the input whose result is the output. I had not considered it from this angle; it is very interesting.
knime2py has its own internal representation of a KNIME workflow. It reads the project’s configuration files, constructs the node graph, and transforms it into linear Python code. In principle, the same graphs could be translated not to Python but to SQL—i.e., a “knime2sql” backend. I do see challenges, especially with data-science nodes and operations that use randomness, but it appears feasible.
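In simplified form, that last step is essentially a topological walk over the reconstructed graph (an illustrative sketch, not the actual implementation; graphlib ships with Python 3.9+):

from graphlib import TopologicalSorter

# edges: node -> set of upstream nodes feeding its input ports (example wiring)
deps = {
    "CSV Reader": set(),
    "Missing Value": {"CSV Reader"},
    "Partitioning": {"Missing Value"},
    "Logistic Regression Learner": {"Partitioning"},
    "Logistic Regression Predictor": {"Logistic Regression Learner", "Partitioning"},
}

# the order of the generated cells follows the KNIME wiring
for node in TopologicalSorter(deps).static_order():
    print(f"# --- cell for node: {node} ---")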
As I understand, you are suggesting Ibis as the intermediate representation. So the pathway would be: “KNIME → Ibis/SQL IR → DuckDB/Spark/Postgres.” Right?

2 Likes

Thank you for the reply! And mostly, that’s right! With the difference that I believe the representation can still be python, but the package used for data transformations can be swapped from Pandas (which I understand is currently used in your prototype) to DuckDB.

You’d still keep the linear python transformations but those would be expressed in SQL. DuckDB is a much more powerful transformation package so I believe it would be a more reasonable default (+ thanks to its SQL api, a lot more portable to other environments).

In the sample below, you can reference previous dataframes, but the query is not executed/materialized until the show() call.

import duckdb

r1 = duckdb.sql("SELECT 42 AS i")

duckdb.sql("SELECT i * 2 AS k FROM r1").show()

This feature (lazy evaluation), plus multi-threading & a few other components, makes DuckDB 100x more scalable than pandas. DuckDB also has a lot of extensions (both core and community) that would probably help further extend the functionality.

Ibis is also just a Python package that has its own syntax but can execute queries using multiple back-ends (incl. DuckDB, Spark & others). So once it’s represented in the Ibis or DuckDB Python API (DuckDB being a more reasonable default imo), no further transpiling work is needed, as users can then apply the code output in any environment that can run Python.
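
A tiny Ibis sketch of what I mean (the expression is backend-agnostic; here it runs on the default DuckDB backend, but the same code could target Spark):

import ibis
from ibis import _

t = ibis.memtable({"i": [1, 2, 3]})
expr = t.filter(_.i > 1).mutate(k=_.i * 2)

print(ibis.to_sql(expr))  # the SQL Ibis generates for the chosen backend
print(expr.execute())     # lazily evaluated; executed here on DuckDB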

So the pathway for the package would rather look like:

“KNIME → DuckDB Python API (which has SQL syntax)” - for data transformations + common I/O (Parquet, JSON, CSV, S3, SQLite and more): Data Sources – DuckDB

“KNIME → Python dlt package” - for exotic data connectors

“KNIME → Python scikit-learn / other data science packages” - for other data science/ML functionality

and so on.

Hope this helps! :slight_smile:

2 Likes

@Add94, thank you for the explanation. I now better understand your ideas.

I think k2p is at too early a stage to implement this feature. I don’t think it’s possible to remove pandas from the generated code completely: it is ubiquitous, and not all KNIME nodes can be transformed into SQL.

This could be implemented as an additional stage during graph processing. If an “island” of consecutive nodes is purely relational, it can be replaced by a generated SQL block executed via DuckDB. That code would substitute for the pandas code in those regions. As a result, we could emit two .py variants: one using pandas and one using DuckDB.
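A rough sketch of what such a generated block could look like (DuckDB can query an in-scope pandas DataFrame directly by its variable name):

import duckdb
import pandas as pd

df_in = pd.DataFrame({"customer_id": [1, 1, 2], "amount": [10.0, 5.0, 7.5]})

# a purely relational "island" (e.g. GroupBy + Row Filter) collapsed into one SQL block
df_out = duckdb.sql("""
    SELECT customer_id, SUM(amount) AS total
    FROM df_in
    GROUP BY customer_id
    HAVING SUM(amount) > 10
""").df()

print(df_out)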

But this is for future plans. Right now I’m going to focus on usability, the UI, and node coverage.
Let’s come back to it in the coming months.

2 Likes

Friends, I have added a UI to the project.
A detailed description is here

2 Likes