Python Extension Development (Labs) release

carstenhaubold · June 15, 2022, 4:17pm

Hello KNIME Python community and friends,

We are glad to announce that starting with the KNIME Analytics Platform 4.6 release you can develop KNIME nodes in pure Python!

To get started, you just need to configure two small YAML files. Then you can develop KNIME Python node extensions in your favorite text editor or IDE, and even debug your code while your nodes are running in the KNIME Analytics Platform. See the documentation for details and a tutorial.

We are curious to hear your feedback and see what you will do with those new possibilities!

Features included in KNIME AP 4.6

Defining nodes in Python
- configuring input and output ports (tables and binary ports for now)
- defining node parameters with autogenerated dialogs
- implementing node configuration and execution
- defining views
- accessing flow variables
- reporting progress
- setting warnings
- using Python logging to write to the KNIME console
Defining node repository categories in Python
Bundling an extension into an update site that can be shared
- packages the full conda environment, such that the nodes work out of the box when the extension is installed

Features that will be coming soon

Sharing Python extensions via the Hub
Defining PortTypes
Bundling of pip packages
…

Please let us know what you are missing most.

Documentation

KNIME Pure-Python Node Development Guide (includes a tutorial)
KNIME Python Extension Development API Documentation

Disclaimer

KNIME Python Extension Development (Labs) is considered a preview and is currently under development. The API may change in the future. It is not recommended to be used in a production environment.

Feedback

We are looking forward to receiving your feedback. Please reply here with any ideas, comments, or questions. If you run into any problems please follow the Bug Reporting Best Practices.

DiaAzul · June 15, 2022, 4:35pm

Thanks,

This took me by surprise as I was expecting most development to sit within the Python node.

My biggest concern with the Python (and to an extent R) nodes is the ability of third-party components to pull in packages via pip/conda without any user validation. Are you planning a security review to establish best practice to prevent harmful packages being imported into KNIME? The first line of defence would be explicit authorisation before any packages are added to the KNIME installation via any of the package managers. If you are planning to only permit packages bundled with the node, how are you going to ensure that security notifications and updates are propagated through the system?

What you are doing is great, but given the sensitivity of the data processed in this environment, security has to take a much greater role in the development and deployment of features.

DiaAzul

DiaAzul · June 16, 2022, 1:11pm

I’ve had a chance to work through the tutorial and have a few comments. Most of them are opinionated, so I can quite understand if people have a different view.

Packaging/Build/Project Structure
The approach set out in the tutorial for laying out the project doesn’t match up with Python (Pip) package development. There is too much duplication of information across files and it makes it very difficult to work out what to place where.

I would prefer that all configuration information is in the top level of the directory, preferably in a pyproject.toml file, and that the build system reads configuration information from that. Having build information in the same directory as the code is messy.

It would also help if we could download the knime-extension package using pip (either from PyPI or GitHub) so that I can use Poetry to manage packages - this means that I can create separate dependencies for development and packaging. Generating an environment.yml for build is straightforward.
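As a rough illustration of how little is needed to produce an environment.yml from a dependency list, here is a minimal sketch. The function name, the extension name, and the dependency list are all hypothetical; which packages a KNIME Python extension actually needs is not covered here, and a real build would more likely export this from the package manager.

```python
def render_environment_yml(name, dependencies, channels=("conda-forge",)):
    """Render a minimal conda environment.yml as text."""
    lines = [f"name: {name}", "channels:"]
    lines += [f"  - {channel}" for channel in channels]
    lines.append("dependencies:")
    lines += [f"  - {dep}" for dep in dependencies]
    return "\n".join(lines) + "\n"


# Hypothetical extension name and dependency list, for illustration only.
text = render_environment_yml("my_knime_extension", ["python=3.9", "numpy", "pandas"])
print(text)
```

Keeping the dependency list in one place and generating the file from it avoids maintaining the same information by hand in several files.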

Testing
There is no obvious way to do testing or introduce breakpoints into the Python code for debugging purposes. It would be good if you could provide some examples of how you test and debug the nodes. Continuously restarting KNIME each time a change is made and then trying to work out why things fail is tiresome.

VSCode Tooltips and typing
I appreciate that it is early days, but the tooltips that come up over classes and functions could do with more detail and provide more explicit information on data types. The information in the example for column.ktype gives a type of unknown, whereas all of the knext.int32() and knext.int64() give the same PrimitiveType. It would be nice if we could test for type using isinstance which is more consistent in Python.

Shortened module name
This is really a nit-picking issue. The choice of knext to shorten the module name can read as both k_next and kn_ext in English, which is cognitively disturbing. I appreciate you have a lot of documentation with knext; however, it may be better if it was shortened to kext, which would also make it consistent with ktype.

Lambda rather than a method for filters
This is another little nit-picking issue, but you used an instance method in your column filter rather than a function. This is causing VSCode to complain about type mismatches. It may be preferable to use the following:

import knime_extension as knext

# Filter columns visible in the column_param for numeric ones
is_numeric = lambda column: (
    column.ktype == knext.double()
    or column.ktype == knext.int32()
    or column.ktype == knext.int64()
)
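The same predicate can also be written as a plain module-level function with a membership test, which linters and type checkers usually handle more gracefully than a lambda assignment. The sketch below is self-contained and uses stand-in types: PrimitiveKind and Column are hypothetical placeholders for the knext types and KNIME columns, not part of the knext API.

```python
from dataclasses import dataclass
from enum import Enum


class PrimitiveKind(Enum):
    """Hypothetical stand-in for knext.int32(), knext.int64(), knext.double(), ..."""

    INT32 = "int32"
    INT64 = "int64"
    DOUBLE = "double"
    STRING = "string"


@dataclass(frozen=True)
class Column:
    """Hypothetical stand-in for a KNIME column with a name and a ktype."""

    name: str
    ktype: PrimitiveKind


NUMERIC_TYPES = (PrimitiveKind.DOUBLE, PrimitiveKind.INT32, PrimitiveKind.INT64)


def is_numeric(column: Column) -> bool:
    """Return True for columns whose type is one of the numeric types."""
    return column.ktype in NUMERIC_TYPES


columns = [
    Column("id", PrimitiveKind.INT64),
    Column("label", PrimitiveKind.STRING),
    Column("score", PrimitiveKind.DOUBLE),
]
numeric_names = [c.name for c in columns if is_numeric(c)]
print(numeric_names)
```

Whether the real column filter parameter accepts such a function in exactly this form is an assumption; the point is only the shape of the predicate.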

Conda
What is KNIME’s approach to Conda, more specifically the difference between Anaconda and their commercially licensed repository and conda-forge? I’m not an expert on Conda, so it would be helpful to understand whether you have a policy in the area of package management.

Overall, this appears to be a simple way for developing Python packages. The difficulty is in the testing and packaging, which is weak at the moment. But willing to give this a go as there are many Python packages that would work well in a KNIME workflow.

Good work, look forward to seeing how it develops.

carstenhaubold · June 17, 2022, 2:44pm

Hi @DiaAzul, thanks so much for the detailed feedback! I’ll address a few of your points below:

Security
You are right, third party components could potentially include harmful conda packages. But they could also contain harmful script code without any dependencies. The same holds for Java KNIME node extensions. What I am getting at is: our users always need to check whether they trust the authors of the extensions they install. To ease this, KNIME offers trusted community extensions for Java extensions. We do not have the infrastructure set up yet for sharing KNIME Python extensions, but by default any extension written in Python will count as “untrusted” just like other extensions. We have to make a tradeoff between security and extensibility here, so we are offering different levels of “trust”. Does that make sense?

Packaging
Indeed, the structure of a KNIME Python extension is different from that of a Pip package. The reason is that KNIME Python extensions do not make sense standalone, they will only be run from KNIME, while Pip packages are meant to be installed and used in Python environments. If you develop a library that can be deployed via Pip, making it available in KNIME requires you to add KNIME nodes wrapping the functions of your library. We are only suggesting a structure for those KNIME wrappers, not for the Python library itself.

Adding knime-extension to Pip is a good idea!

Testing
Good point, right now there is no section about debugging in the documentation. However, it is easy to attach a remote debugger (e.g. debugpy for Visual Studio Code) to the Python code. We’ll add that to the docs!

And you do not need to restart KNIME for each change in the code if you enable debug_mode: True in the config.yml. Restarting is only required if you add nodes or if their input and output ports change. Changes inside the node class will be reflected the next time the node is configured or executed; for dialogs to work you sometimes have to re-drag the node into the workflow. We were also annoyed by restarting KNIME, so we added the debug_mode option.
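Until the docs cover it, attaching a remote debugger with debugpy could look roughly like the sketch below. The helper name, the KNIME_NODE_DEBUG environment variable, and the port are assumptions made for this example and are not part of KNIME; only debugpy.listen and debugpy.wait_for_client are actual debugpy calls.

```python
import os


def maybe_attach_debugger(port: int = 5678) -> bool:
    """Start a debugpy server and wait for an IDE to attach, if requested.

    KNIME_NODE_DEBUG is a hypothetical opt-in switch for this sketch.
    Returns True if a debugger was waited for, False otherwise.
    """
    if os.environ.get("KNIME_NODE_DEBUG") != "1":
        return False

    # Imported lazily so the module still loads when debugpy is not installed.
    import debugpy

    debugpy.listen(("localhost", port))
    debugpy.wait_for_client()  # blocks until the IDE attaches
    return True
```

Calling such a helper at the top of the extension module, and then using an "attach" launch configuration in Visual Studio Code on the same port, should let breakpoints in the node code be hit. Where exactly to place the call in a KNIME extension is left open here.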

Conda
Anaconda’s licensing strategy is currently something along the lines of: they allow you to use their packages for free if your company has fewer than 200 employees. Because we know there are larger companies using our software, we rely completely on packages from conda-forge, which do not have such restrictions. And I would suggest that extension developers do the same.

Thanks again for the detailed and constructive feedback! Those are good points you are making, we’ll take them into consideration!

DiaAzul · June 17, 2022, 3:26pm

Hi @carstenhaubold, thanks for your answers. If I may be permitted a couple more comments:

Security
I agree with and understand everything you have written; however, Python packages are another level of security risk compared with Java libraries.

My concern stems from the basic principle that most users will be unable to assess the risk associated with supply chain attacks and, even if they can, the amount of information available through KNIME is insufficient to identify the extent of any risk. Most of the detail of packages is hidden (names, version, etc.) and KNIME is capable of installing packages without asking permission from the end user. If I open the KNIME Application, where do I get a list of all the Python packages that have been installed and their versions? When a security risk in a Python package arises, how is this proactively communicated to the end user so that they can take action?
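For reference, this is the kind of inventory I have in mind. The following is a minimal sketch using only the Python standard library; it lists the packages of whichever Python environment runs it, and whether or how KNIME exposes the environments bundled with extensions is exactly the open question. The function name is made up for the example.

```python
from __future__ import annotations

from importlib.metadata import distributions


def installed_packages() -> list[tuple[str, str]]:
    """Return (name, version) pairs for the current environment, sorted by name."""
    found = set()
    for dist in distributions():
        name = dist.metadata.get("Name")
        if name:  # skip entries with broken metadata
            found.add((name, dist.metadata.get("Version") or "unknown"))
    return sorted(found, key=lambda item: item[0].lower())


for name, version in installed_packages():
    print(f"{name}=={version}")
```

A list like this, per installed extension, would be the starting point for checking versions against published security advisories.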

I can understand the principle of approved and non-approved extensions, however, in both cases KNIME needs to be clear and explicit (and have tools available) to track and communicate risk and alert users to take action when issues are identified.

It’s a process rather than technical issue that I am trying to raise. One I don’t feel that KNIME is on top of yet - though I am happy to be wrong on this.

Packaging
I understand everything that you have written, though I am coming from a different direction. I agree that KNIME packages are not Python packages; however, the majority of Python developers will have some exposure to developing Pip packages and to the expected layout of files and setup of a development environment. If you do something different, it becomes a cost to the developer and a barrier to developing Python extensions (and I am sure you would like lots of people developing great extensions).

My point is to keep things consistent with the rest of the Python ecosystem to make it easy for developers. We’ve only got a limited number of brain cells, I would prefer using them to pursue things that bring me pleasure rather than yet another programming environment.

Testing
Thanks – finally worked that out. May come back to you if I can’t set a breakpoint in VSCODE for debugging.

Conda
This ties in to the security issue. Part of Anaconda’s value proposition is that they provide a level of assurance that the supply chain is secure. I would expect that some customers would value that and prefer that KNIME pulled from the default Anaconda channel rather than conda-forge. It’s easy to say there is a low risk of a security threat through a Python package in the conda-forge channel, and you might expect that if Anaconda detected a risk it would be pulled from conda-forge, but sometimes people feel more comfortable if they can point to a license which provides assurance.

What I am trying to stimulate is discussion about an issue people might not have considered that needs exploring before a malicious actor launches an attack rather than after.

Have a great weekend.

system · September 15, 2022, 3:27pm

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.