Webinar: Spring Summit - Workflow Doctor Sessions

heather.fyson · May 4, 2020, 3:00pm

Hi there,
We set up this thread for questions that were collected during the Workflow Doctor sessions of the KNIME Spring Summit - Online Edition. To make it easier to find things, we’ve grouped your questions and answers into categories.
Click the arrow next to a question to expand it and show the answer.
Enjoy! And post any missing questions you have here too!

KNIME Analytics Platform

Components & Metanodes
Nodes
Workflows
Performance
Connectors
Text Processing

Components & Metanodes

What is the difference between a component and a metanode?

Components are blueprints that can be shared: they allow you to create templates to share with others, and they really encapsulate their internals, too. Metanodes allow you to organize the nodes in a workflow. (A metanode only hides a part of a workflow as a means to organize more complex workflows.) This blog article has more information: https://www.knime.com/blog/knime-analytics-platform-40-components-are-for-sharing

How can you check the input for a component?

You can use Configuration nodes for that, e.g. “Column Filter Configuration”, or the “Table Validator” node.

Can I create a component and make it available as part of the Node Repository?

Components can currently only live in the workflow repository.

Will there be an ability to organize the configuration panel of a component? Right now, the Configuration node that was added first appears first in the panel. If you have several Configuration nodes, it would be great to be able to order those options (rather than having them ordered automatically by the time they were inserted into the component).

There is currently a ticket out to address this issue.

How does KNIME support versioning of components? This becomes especially important when the “API” of a component changes which would break a workflow using that component. How can automated updates be prevented, e.g. bind a workflow to a particular component version?

A shared component always has exactly one current state, and that state is used to update linked components. If this is not desired, it is best to save components with a different interface under another name. After dragging a shared component into a workflow, you can also disconnect it via the context menu to avoid accidental updates.

When you’re using components, is there a mechanism to prompt the user to choose an option that sets a variable for a workflow each time it is run? If you have a component that’s reading from a database, every time it’s run I want the user to select a piece of information that is used as a variable in the component, e.g. month. Will the user get a prompt to do this, or will the component just run using the last selected variable?

If you want the user to enter a variable, you can add a Breakpoint node that throws an error until a valid input is given.

Is there a mechanism to prompt a user to make a selection for a variable every time a flow is run? Like a configuration option for a component that will prompt a selection every time a downstream node is run?

Answer:

The Find node function in the Node menu is really useful for large workflows. Does this still function even when the node is inside a metanode?

Currently it only works within a single workflow level, so it does not find nodes inside a metanode, but we are working on improving the search!

Nodes

Do the interpretability nodes work on regression nodes?

Yes, in fact they are more naturally suited for regression models. In case of classification, they work on the predicted class probability instead of the predicted classes themselves.

Is there a way to sync installed nodes between different computers?

The server allows you to share “preferences”. This functionality can be used to sync the list of updates and update sites. More detail is provided in this blog article: https://www.knime.com/blog/simplify-operations-with-knime-server-management-services-for-knime-analytics-platform

How do you open the search for a node by name?

Press Ctrl+F (Cmd+F on a Mac).

Is there a way to use a flow variable in an expression inside nodes to do things like change an expression dynamically to connect to different tables for data import?

You can change expressions dynamically by editing the flow variable, e.g. with a String Manipulation (Variable) node. Then you can use that expression in another node.

Why is the concatenate (optional input) node now deprecated? It’s one of my standard nodes. How am I to work without it?

This node is deprecated because now the normal Concatenate node allows you to add new inputs dynamically.

I often find myself cascading multiple string replacer or string manipulation nodes together. Is there a way to apply multiple manipulations to multiple columns in a single node?

The Column Expressions node would help here.

In "Column Expression" column names are not sorted. That makes it difficult to find out the particular column for bigger lists.

You could insert a Column Resorter node in front of the Column Expressions node to sort the columns lexicographically by name. Here’s the link to the Column Resorter node on the KNIME Hub: https://hub.knime.com/knime/extensions/org.knime.features.base/latest/org.knime.base.node.preproc.columnresorter.ColumnResorterNodeFactory

Is there a node to connect an EC2 instance using PPK/PEM file and execute Shell Script?

Try the SSH Connection node, which you can use to connect to an EC2 instance. Note that this is only for downloading or uploading files.

Is there API functionality to remove options from the context menu in workflows and nodes, e.g. to remove the copy/paste entries?

You can add options but it is not possible to remove others, as they are provided by other plugins that might even be loaded after yours.

Will more nodes have the optional input function (e.g. Joiner)?

With 4.2, the Column Appender and Merge Variables nodes will have those ports. We are planning to revise the Joiner and are considering adding this option.

Is there a way to escape ' (apostrophes) when reading data from a CSV file? When there are apostrophes in the cells, the reader gives an error and the data is not read properly.

This is possible with the File Reader - click the “Advanced” button and go to “Quote Support”.

How can you define specifications for table reading to avoid the error where it guesses at the data type on one iteration and then it errors on the next iteration because the data is different?

In loops it is best to use the CSV Reader instead of the File Reader node.

Will it be possible to refer to a column based on a flow variable value? E.g. I have a column name as a flow variable value and I want to do some string manipulation on this column using the flow variable.

You can use the Column Expressions node, e.g. column(variable("my_column")) + 5, where my_column is a placeholder for the name of your flow variable.
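To make the indirection in that expression explicit, here is a minimal Python sketch of the same two-step lookup. It is only an illustration, not how the node is implemented; the table, the flow variable name target_column and the column name price are all made up for the example.

```python
# Hypothetical stand-ins for a KNIME table and its flow variables.
flow_variables = {"target_column": "price"}
table = [
    {"price": 10, "qty": 2},
    {"price": 20, "qty": 1},
]

# Step 1: the flow variable yields a column name.
column_name = flow_variables["target_column"]

# Step 2: that name selects the column; add 5 to its value row by row.
result = [row[column_name] + 5 for row in table]
print(result)
```

Changing the value of the flow variable changes which column is processed, without touching the expression itself.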

Is there a way to extract XML schemas from one file and map other data with it? In Talend I learned how to do this, but in KNIME I couldn't find any option for it!

If you mean you want to extract info from XML, you can use the XPath node for that.
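If you have not used XPath before, the following sketch shows the kind of query you would type into the XPath node, run here with Python's standard library instead. The XML document and its element names (orders, order, customer, total) are invented for the example, and ElementTree only supports a subset of XPath, so treat this as a way to get a feel for the queries rather than a replacement for the node.

```python
import xml.etree.ElementTree as ET

# Made-up sample document.
xml_doc = """
<orders>
  <order id="1"><customer>Ada</customer><total>12.50</total></order>
  <order id="2"><customer>Bob</customer><total>7.00</total></order>
</orders>
""".strip()

root = ET.fromstring(xml_doc)

# ".//order" selects every <order> element below the root;
# each match becomes one row of the extracted table.
rows = [
    {
        "id": order.get("id"),
        "customer": order.findtext("customer"),
        "total": float(order.findtext("total")),
    }
    for order in root.findall(".//order")
]

# A predicate in square brackets selects by attribute value.
bob = root.find(".//order[@id='2']/customer").text
print(rows, bob)
```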

Any update on Catch Error in looping nodes since last release?

Are you maybe referring to this problem: “[BUG] Active Scope End node in inactive branch not allowed”? With the next release, 4.2, there is going to be a straightforward workaround: wrap the potentially failing nodes into a component and insert that into a try-catch block.

What convex optimisation nodes (solvers) are suitable for optimising objective functions at the observation (row) level?

We have a couple of nodes for optimization in our Optimization extension (e.g. the Parameter Optimization Loop) which can be used for such tasks, but these are not tailored to convex optimization and therefore don’t exploit the convexity. However, you can always use a Python node with a Python library such as CVXPY to supplement functionality that KNIME doesn’t offer at the moment.
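As a rough idea of what row-level optimisation inside a Python node could look like, here is a dependency-free sketch. It does not use CVXPY: to keep the example runnable with the standard library only, it swaps in a plain golden-section search, which only handles a convex objective in one variable. The rows, the column names target and penalty, and the objective itself are invented for the illustration.

```python
import math

INV_PHI = (math.sqrt(5) - 1) / 2  # golden-ratio factor, about 0.618


def golden_section_min(f, lo, hi, tol=1e-6):
    """Minimise a convex function of one variable on the interval [lo, hi]."""
    a, b = lo, hi
    while b - a > tol:
        c = b - INV_PHI * (b - a)
        d = a + INV_PHI * (b - a)
        if f(c) < f(d):
            b = d  # the minimum lies in [a, d]
        else:
            a = c  # the minimum lies in [c, b]
    return (a + b) / 2


# Hypothetical input table: one independent optimisation problem per row.
rows = [
    {"target": 3.0, "penalty": 2.0},
    {"target": -1.0, "penalty": 4.0},
    {"target": 0.5, "penalty": 0.0},
]


def row_objective(row):
    # Convex in x: squared distance to the target plus an absolute-value penalty.
    return lambda x: (x - row["target"]) ** 2 + row["penalty"] * abs(x)


solutions = [golden_section_min(row_objective(r), -10.0, 10.0) for r in rows]
print(solutions)
```

For objectives with several variables or constraints, a real solver library such as the CVXPY mentioned above is the better fit; the loop over rows would stay the same.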

I am a newbie to KNIME. How do I build a machine learning model on a big file, say about 1 GB? Should I slice it into several small files, build models on each and then aggregate the results? Or is there a different way to do this?

When training a model it’s often good to use more data. If it’s not practical or desirable you can randomly sample your data with the Row Sampling node for example.
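If you would rather draw the sample before the data even reaches KNIME, one option is reservoir sampling, which picks a fixed number of rows uniformly at random in a single pass without loading the whole file. The sketch below is an assumption about how you might do that in Python; it is not what the Row Sampling node does internally, and the column names and sizes are made up.

```python
import csv
import io
import random
from typing import Iterable, List


def sample_rows(rows: Iterable[List[str]], k: int, seed: int = 0) -> List[List[str]]:
    """Reservoir sampling: keep k rows chosen uniformly at random in one pass."""
    rng = random.Random(seed)
    reservoir: List[List[str]] = []
    for i, row in enumerate(rows):
        if i < k:
            # Fill the reservoir with the first k rows.
            reservoir.append(row)
        else:
            # Replace a kept row with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = row
    return reservoir


# In-memory stand-in for a large file; for a real file,
# pass csv.reader(open(path, newline="")) instead.
data = io.StringIO("id,value\n" + "".join(f"{i},{i * 2}\n" for i in range(1000)))
reader = csv.reader(data)
header = next(reader)
sample = sample_rows(reader, k=50, seed=42)
print(header, len(sample))
```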

Can the ARIMA node accommodate multiple predictor variables? If so, how?

The current version of the ARIMA node does not support multiple predictor variables but there are extensions to the ARIMA algorithm that do allow multiple predictors and you could reach out e.g. to Python to access those algorithms.

How much of adaptation of the data organization (order and types) has to be applied to a set of data when one wants to use an existing project?

This depends on the workflow itself. A lot of nodes allow you to work with changing column names etc. (e.g. using regular expression or type-based column filters) but this will require some more work when creating the workflow.

Performance

Do you need to tune or configure KNIME to make it significantly faster in KNIME 4.0 or is it automatically tuned to be faster than KNIME 3.7? (So that we don’t need to do any manual configurations in same specification hardware)?

Most benefits are part of the standard settings. Remember to update your memory settings in knime.ini – more memory = faster runtime.

I use a Mac. Can we say that the Mac and Windows versions of KNIME are equivalent (in terms of performance, UX…)? I’d like to know if it’s worth installing Windows on the Mac via Boot Camp.

Performance is the same. I am on a Mac, and what I sometimes notice with Windows users is that Windows starts a security check that slows things down. Windows Defender and similar tools can degrade performance heavily, by a factor of up to 10. If you are on Windows, consider adding an exception for your knime.exe process or your KNIME AP installation folder.

Do you have any hints or strategies on how to avoid Java Heap Space errors in KNIME AP?

If possible, allocate more heap space to KNIME AP via the -Xmx parameter in the knime.ini
(experimental) Activate String deduplication by putting the line -XX:+UseStringDeduplication inside the knime.ini
Some nodes (e.g., Sorter, Duplicate Row Filter) provide an option to do computation in memory. Be wary when using these on huge tables
Consider setting the Memory Policy of nodes that generate huge tables to “Write tables to disc.”
If all else fails, play around with alternative in-memory table caching strategies. Currently, KNIME AP ships with two such options: -Dknime.table.cache=LRU and -Dknime.table.cache=SMALL. The former attempts to cache any table in memory. The latter attempts to cache only small tables in memory. LRU will lead to much faster workflow execution on average, but using SMALL can take some load off your heap space.
Inform us on the forum if you get a Heap Space error so we can look into what we can do about it.
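The knime.ini settings from the list above can be edited by hand in any text editor. If you want to script the change, the following Python sketch shows one way to do it. It assumes that JVM options sit one per line after the -vmargs line, and the helper name tune_knime_ini, the 8g heap size and the sample file content are all placeholders to adapt to your machine.

```python
from typing import List, Optional


def tune_knime_ini(ini_text: str, xmx: str = "8g",
                   string_dedup: bool = False,
                   table_cache: Optional[str] = None) -> str:
    """Return a copy of the knime.ini text with the requested JVM options set."""
    lines: List[str] = ini_text.splitlines()
    if "-vmargs" not in lines:
        lines.append("-vmargs")

    # Map each option prefix to the full line that should end up in the file.
    wanted = {"-Xmx": f"-Xmx{xmx}"}
    if string_dedup:
        wanted["-XX:+UseStringDeduplication"] = "-XX:+UseStringDeduplication"
    if table_cache is not None:
        wanted["-Dknime.table.cache="] = f"-Dknime.table.cache={table_cache}"

    vmargs_at = lines.index("-vmargs")
    head, jvm = lines[: vmargs_at + 1], lines[vmargs_at + 1:]
    for prefix, option in wanted.items():
        # Drop any existing line for this option, then append the new value.
        jvm = [line for line in jvm if not line.startswith(prefix)]
        jvm.append(option)
    return "\n".join(head + jvm) + "\n"


# Made-up file content for demonstration purposes.
example = "-startup\nplugins/launcher.jar\n-vmargs\n-Xmx2048m\n-Dsome.other=1\n"
tuned = tune_knime_ini(example, xmx="8g", string_dedup=True, table_cache="SMALL")
print(tuned)
```

Keep a backup of the original knime.ini before writing the result back, and restart KNIME AP for the changes to take effect.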

Is it possible to filter flow variables in the middle of the workflow? Maybe this could free some memory space, and make configuring big workflows easier.

There is no “Remove Flow Variable” node, so you cannot filter flow variables directly. If you need this badly, you can wrap the part of the workflow that creates the clutter in a component: a component lets you define the scope of the variables created inside it, so you can generate them there and not let them out.

Connectors

Is there a way to connect to a Samba file share with given credentials?

This is only possible with the node “Download Files with Authentication”

Can you write a PLS node? I currently have a workaround by using an R node

Could you elaborate on which PLS format you are referring to? I googled it and only found a music playlist format.

Is there any option that we can actually transform SAP ERP scripting to automate process through KNIME nodes?

I’m not familiar with SAP ERP scripting, but I know that we do not have a dedicated node for it. Whenever this is the case, I suggest checking whether you can reach the functionality via REST using the KNIME REST nodes, via the command line using the External Tool node, or via one of the scripting nodes (Java, Python, R) if you can write some code snippets.

Is there now a way to access a lot of data (10 GB) that I am already using in one workflow from another workflow, without writing all of the data to the HDD as a table?

You could access the file from AWS or Azure storage, for example. Try the ORC and Parquet formats, and make sure to use partitioning properly. I can point you to a couple of references.

Text Processing

Hi, are you planning to expand the field of spatial analysis?

Try having a look at this extension https://hub.knime.com/samthiriot/extensions/ch.res_ear.samthiriot.knime.shapefilesaswkt.feature/latest

Is there a node that can anonymize data? Let’s say we need to anonymize the data before we process it.

Yes, have a look at this blog post, which describes Redfield’s Privacy Extension. Further reading: https://www.knime.com/blog/data-anonymization-in-knime-a-redfield-privacy-extension-walkthrough

Will OCR (optical character recognition) capability be available? Our source data sometimes comes as scanned PDFs.

We have some OCR workflows on the hub: https://hub.knime.com/search?q=ocr which make use of this OCR node: https://hub.knime.com/BioML-Konstanz/extensions/org.knime.knip.tess4j.feature/latest/org.knime.knip.tess4j.base.node.Tess4JNodeFactory

How often are the community extensions updated on the update site? Would my extension be updated if I change it every week?

Community extensions are released together with new KNIME versions, but we do provide nightly builds.

Your text mining tools are awesome, but do you plan to include NLP tools in the mix?

Depends on what counts as NLP. You can do a lot of cool stuff with text and KNIME Deep Learning: https://kni.me/w/bbKToLKk1-kyBxgS

Will there be support for AWS Textract in addition to AWS Comprehend?

Currently we have no plans to add it natively, but you can use it via the Python nodes!