Geofile Reader node does not allow for file browsing

johnvemery · December 8, 2022, 1:54pm

Hello,

When using the Geofile Reader node, I do not see a way to browse for a file on my computer. I am only able to copy and paste the file path into the text box. It appears the Geopackage Reader node is similar.

Can we please add a file browse option for this node? Apologies if I am missing something obvious here!

DiaAzul · December 8, 2022, 3:48pm

I’ve been working with the Geospatial Analytics extension since the release of KNIME 4.7 and my impression is that this should not have been released in its current state. The software has bugs, lacks features such as proper file dialogue boxes and has poor performance. My impression is that it was rushed out for the Austin annual jamboree and so that KNIME and Harvard could boast about their partnership. The management desire to claim victory has screwed the users with poor quality software.

The idea is good, but it should have stayed in labs and not been released as a core extension.

DiaAzul
LinkedIn | Medium | GitHub

tobias.koetter · December 8, 2022, 6:14pm

Hello @johnvemery ,
you are not missing anything. Unfortunately, the Geospatial file nodes do not support browsing as of now. We are working on support for file browsing in the Python node development framework and, once it is available, will use it in the Geospatial file nodes.
Bye
Tobias

tobias.koetter · December 8, 2022, 7:26pm

Hello @DiaAzul ,

regarding the performance problem: if you are running KNIME Analytics Platform on Windows, this could be caused by Windows Defender. The Geospatial extension is developed with the new Python node development framework, which is why KNIME starts several Python processes when executing the Geospatial nodes. In some cases, Windows Defender scans the Python processes, which slows down the execution dramatically. To prevent this, you can add the python.exe process to Windows Defender's list of exclusions. To do so, please follow the steps as described here and select Process in the last step.

(Please note: Adding the KNIME installation folder to the exclusion list does not solve the problem, since these exclusions are ignored during real-time protection even though the official documentation states otherwise.)

Be aware that this will cause Windows Defender to ignore all Python processes which might not be desired if you also run other Python scripts from untrusted sources.

We would appreciate it if you would help us improve this community extension by reporting any bugs and problems, as well as suggestions for new features and improvements, here in the forum.

Thanks
Tobias

DiaAzul · December 8, 2022, 9:55pm

@tobias.koetter

THIS IS A RED FLAG

To suggest that Windows Defender needs to be turned off for the Python executable in a public forum, which will be indexed by Google for eternity, will be the death of KNIME.

It is hard enough for Data Scientists and Business Analysts to convince IT departments to install KNIME. It is even harder to persuade them to allow KNIME to be installed with Python - the Devil’s own programming language, beloved of hackers and all who worship them. It will be impossible for anyone working in a corporate enterprise environment to get approval to install KNIME if there is even the slightest hint that security needs to be reduced.

Especially as it is not necessary.

Before you suggest adjusting the settings of Windows Defender you need to profile what is happening using the Performance Analyzer. This will tell you to what extent Microsoft Defender is having any impact on performance. It will tell you how long Microsoft Defender is analyzing the top processes, paths and file extensions.

If you run Performance Analyzer you will see that most of the time is spent analyzing the cache files that Python generates when it runs the code. As this is a one-off activity, associated with compiling the code before it is executed, it has a fixed impact on execution time and doesn’t increase with the amount of data processed by the node.

If you want to improve performance then you can look at minimising the number of times that the code is compiled by Python and scanned by Windows Defender.
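As a rough illustration of what "compiling once" means in Python terms, the sketch below uses the standard library's py_compile module to write the bytecode cache for a module ahead of time, so that later imports can reuse the cached .pyc instead of recompiling. The module name example_node.py is hypothetical, and whether Windows Defender rescans the cached file on each run is not something this sketch measures.

```python
import importlib.util
import py_compile
import tempfile
from pathlib import Path


def precompile(source_path):
    """Compile a .py file to bytecode once and return the path of the cached .pyc."""
    # doraise=True turns compile errors into exceptions instead of printed messages.
    return Path(py_compile.compile(str(source_path), doraise=True))


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical module standing in for a Python-based node's source file.
    module = Path(tmp) / "example_node.py"
    module.write_text("ANSWER = 6 * 7\n")

    pyc = precompile(module)
    # This is the location the import system itself would use for the cache.
    expected = Path(importlib.util.cache_from_source(str(module)))
    print(pyc.name)
    print(pyc == expected)
```

For a whole package, the standard library's compileall module does the same thing recursively.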

Suggesting that security should be reduced is an absolute no-no.

DiaAzul

tobias.koetter · December 16, 2022, 9:04am

Hello @DiaAzul ,
thanks very much for the detailed response and suggestions.

Just to make this clear, the workaround mentioned above is only necessary if you are using a KNIME Extension that is developed with Python (such as the Geospatial Extension) and you are experiencing major performance problems caused by the Windows Defender. This is certainly not a general recommendation we would make for everyone, and we completely agree that in the ideal case, disabling the Windows Defender would not be necessary.

Thanks to your suggestion, we investigated further. Unfortunately, we couldn’t consistently reproduce the findings on different machines, since not all machines are affected by performance problems due to Windows Defender. However, on one machine we did find the following: when executing the workflow several times, a lot of .pyc files are touched by Windows Defender. We also noted that Windows Defender spends the majority of its time scanning .pyd and .dll files, which we cannot prevent except by adding the exclusion.

We then tried to analyze under which circumstances the Windows Defender scans all those files. Ideally one would be able to use file/folder based exclusion to define precisely which files to exclude from scanning, but our tests have shown that these rules do not apply for real-time protection as already stated here. We will continue to look into the issue and see if we can find other means to reduce the performance impact of the Windows Defender on the Python nodes in KNIME.

For additional context around use of Python and KNIME together, another challenge that our Python experts are facing on Windows is that starting a Python process takes significantly longer than on Linux or Mac, independent of the Windows Defender. The Python framework executes each node in its own process to prevent unwanted side effects. To minimize the problem, we start several processes which can be used by the Python nodes. However, once they are consumed, we have to start new processes which especially for smaller data sets affects the execution time if many Python nodes are executed in a row. The number of processes is for now fixed, but we might expose this as a parameter to allow the user to tune the tradeoff between execution speed and memory consumption.
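For anyone who wants to check the process start cost on their own machine, a minimal sketch is to time how long it takes to launch a fresh interpreter that does nothing. The function name interpreter_startup_seconds is made up for this example, and the numbers say nothing about KNIME's own process handling; they only give a baseline for the platform.

```python
import subprocess
import sys
import time


def interpreter_startup_seconds(runs=3):
    """Average wall-clock time to start a fresh Python interpreter that does nothing."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        # Launch the same interpreter that runs this script and exit immediately.
        subprocess.run([sys.executable, "-c", "pass"], check=True)
        timings.append(time.perf_counter() - start)
    return sum(timings) / len(timings)


if __name__ == "__main__":
    print(f"{interpreter_startup_seconds():.3f} s per process start")
```

Running this on Windows and on Linux or Mac with the same Python version gives a feel for the gap described above, before any libraries are imported.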

One last thing to mention: if you haven’t tried this already, another thing that you could try to improve the performance is to enable the columnar table backend, which improves the data transfer between Java and Python based KNIME nodes. For more information about the columnar table backend and how to use it see here.

Thanks again for all your valuable feedback on this topic and others! We certainly don’t want to run afoul of security best practices, and are sensitive to how our responses might be received by corporate IT departments. We appreciate your patience as we work to improve the performance of Python in KNIME generally, as well as the new Geospatial extension specifically.

Bye
Tobias

DiaAzul · December 17, 2022, 4:05pm

@tobias.koetter thank you for your response.

I am typing up this post (a) because I need to write it down to rationalise things in my own head, (b) because it may be useful information to others.

What you are describing is a reflection of the difference in the way that Linux and Windows create new processes. This difference is well documented, but I will summarise below and add commentary specific to KNIME where appropriate.

The common approach to creating a new process in Linux is to fork the existing process. After the fork, the original process keeps running with the variables and memory it was allocated, and the new process initially shares that same memory. Forking is therefore quick. However, the two processes cannot both keep writing to the same memory, so whenever one of them writes to a block of memory, that block is copied first (copy-on-write). The process that did not write keeps the original block, and the process that wrote gets a new block with the changed data. So, whilst forking is quick to start, there is a performance penalty whenever either process writes to memory.
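The fork behaviour can be seen from Python with a small sketch (Unix only, since os.fork does not exist on Windows). The child starts with the parent's data, changes its own copy, and reports the value back through a pipe; the parent's value is untouched. The function name fork_demo is made up for this example.

```python
import os


def fork_demo():
    """Return (parent_value, child_value) after the child modifies its own copy."""
    if not hasattr(os, "fork"):
        raise RuntimeError("os.fork is not available on this platform (e.g. Windows)")

    data = {"value": 1}
    read_fd, write_fd = os.pipe()
    pid = os.fork()

    if pid == 0:
        # Child: it inherited 'data' from the parent without reloading anything.
        os.close(read_fd)
        data["value"] = 2  # this write only affects the child's copy
        os.write(write_fd, str(data["value"]).encode())
        os.close(write_fd)
        os._exit(0)  # leave immediately, skipping the parent's cleanup handlers

    # Parent: read what the child saw, then wait for it to finish.
    os.close(write_fd)
    child_value = int(os.read(read_fd, 16).decode())
    os.close(read_fd)
    os.waitpid(pid, 0)
    return data["value"], child_value


if __name__ == "__main__":
    print(fork_demo())
```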

Windows takes a different approach. Whenever a new process is created, it creates a new environment into which everything is loaded: in this case, the Python interpreter, the Python code and all the dependent libraries. Once created, the process runs independently of all other processes. Therefore, whilst Linux forks the process and continues to use copies of the Python interpreter, code and libraries that have already been loaded into memory, Windows needs to reload everything as if it was running the code for the first time. This is why you are seeing the compiled Python files (.pyc, .pyd, .dll) touched (but not necessarily scanned) by Windows Defender. It is a function of how new processes are created on each platform.

There is much heated discussion on which approach is better, and it is better not to get into that debate. What is important is that the approach to implement multi-tasking with Python in a cross-platform environment needs careful thought and planning.

There are two approaches that can be used:

The first applies where the tasks are IO limited (e.g. on a web server): each task spends a lot of time waiting for data and requires little CPU time to process. This is handled efficiently by the Python module asyncio. It’s not relevant to KNIME, which tends to be CPU limited rather than IO limited (in most cases).
The second option is to create an internal server to handle tasks. For the geospatial extension that would be a task runner with a work queue. The geospatial nodes would submit jobs to the work queue, and the server would execute them. The server would have multiple processes that pull work from the work queue and either update tables directly or return data to the client node. In this case the server creates its processes when it is started and only terminates them when all of the jobs have been completed or the server shuts down.
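The shape of the second option can be sketched in a few lines. To keep the example short and portable, it uses threads as stand-ins for the worker processes; a real task runner would use separate processes (for example via multiprocessing), and the jobs would then have to be picklable. The function name run_jobs and the example jobs are made up for illustration and are not part of KNIME or the geospatial extension.

```python
import queue
import threading


def run_jobs(jobs, num_workers=2):
    """Run callables from a shared work queue on long-lived workers.

    Results are returned in the same order as the submitted jobs.
    """
    work = queue.Queue()
    results = [None] * len(jobs)

    def worker():
        while True:
            item = work.get()
            if item is None:  # sentinel: no more work, shut this worker down
                return
            index, job = item
            try:
                results[index] = job()
            except Exception as exc:  # keep the worker alive if one job fails
                results[index] = exc

    # Workers are created once, up front, and reused for every job.
    workers = [threading.Thread(target=worker) for _ in range(num_workers)]
    for w in workers:
        w.start()

    # Client side: submit jobs to the queue, then one sentinel per worker.
    for index, job in enumerate(jobs):
        work.put((index, job))
    for _ in workers:
        work.put(None)

    for w in workers:
        w.join()
    return results


if __name__ == "__main__":
    print(run_jobs([lambda: 1 + 1, lambda: 2 * 3, lambda: "a".upper()]))
```

The point of the design is that the start-up cost is paid once per worker rather than once per job, which is where the per-node process start on Windows hurts most.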

What you are doing with KNIME/Python is superficially the most straightforward approach, but not necessarily the most performant.

DiaAzul

system · March 17, 2023, 4:05pm

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.