List of feature requests from Alteryx

Hi Everyone!
I am a recent convert from Alteryx to Knime. I really like this tool and have compiled a list of feature and usability improvements that could make it even better. I think something Alteryx does well is letting you build ETL flows with fewer “clicks”. Many of these feature suggestions would let a user (like myself) build ETL flows even quicker! My apologies in advance if some of these features are already available. I watched the online videos, read the Alteryx to Knime document, and did some searches throughout this community, but I could have missed a few things. Let me know if you agree/disagree or have any suggestions on whether these features may already be available!

Feature request list
Is there a way to see the feature requests that are already in progress? I see feature tickets created in other posts, but I am not sure if there is a way to see when, or in which iteration update, they will land.

Data explorer
I know I can “ctrl + F” to search data within the data explorer, but a configurable search bar, as well as the ability to search within columns would be extremely helpful.

Also noted in the node monitor comment below, but it would be great to build some of the data explorer functionality into the node monitor, such as a search box to explore certain data.
Included screenshot of Alteryx example.

Node drag and connect functionality
Super small feature, but it would be nice if you could drag a node into the vicinity of another node and auto-connect output to input. Those who have used Alteryx probably use this very frequently. It saves a few clicks on drag-and-drop and then connecting output to input. I know you can also click on a node and then double-click on the next node and it auto-connects, but this feature would be great as well!

Data records near node when running workflow
Another small feature, also noted in a previous suggestion within the community. When running a workflow in Alteryx, I can see the number of records near every tool, so I can see the flow of records and where data may be getting lost. I know you can currently hover over the output of a node to see the record count, but having it visible next to the node makes summary analysis much easier when analyzing your flow (pinpointing where data is lost rather than hovering over each node output). Alteryx example below.
[screenshot: alteryx_recordcount]

Node monitor window improvements
Output arrow click functionality to node monitor - For nodes with multiple output arrows (row splitter, joiner labs, etc.) it would be nice to be able to click on each output arrow and see the updated values in the node monitor. Currently, you need to use the port dropdown in the node monitor to select which output to see. Maybe there can be updated functionality where clicking a certain output arrow auto-selects the port # dropdown in the output.

It would be great to have a search bar included in the node monitor to easily search data within each node’s output. I frequently find myself searching for certain data elements after joins to see what is not matching. The PERFECT scenario would be to even include a search feature by column.
Alteryx example below.

Configuration window pinned to canvas
I know others have posted the desire for this as well, but I think it is worth another mention. When building an ETL flow, it would be great to have the ability to just click on the node and see a config window pinned to the canvas instead of having to double click/right click and open a new window. Alteryx does this well and it is easy to follow your workflow and adjust nodes from one to the next.

Assign hot keys to certain nodes
I know you can update certain keys in preferences, but it would be great to have the ability to assign certain nodes to keys. For example, in Alteryx, I frequently use the browse tool (data explorer equivalent). There is a shortcut, “ctrl + B”, which I found myself using all the time to quickly add the tool (node) after my current output. I think adding key bindings for nodes could make building flows significantly faster.

Group by / Pivoting tool
For both the Group By and Pivoting nodes, I do not see the ability to change the group order, pivots (column order), or aggregation method order. I saw requests on the community in 2019 and I think there was a feature request sent back then. I think a simple up and down arrow in each component section would solve this. Included screenshot shows Alteryx arrows for reordering in summarize tool (equivalent to groupby).

I will make additional posts as I see more opportunities, but I hope these feature requests are helpful! I think they all would result in less “clicks” and the ability to build and manage workflows more quickly.

Thank you,
Nick

14 Likes

Hi @namoroso,

Thank you for all the fantastic feedback. The timing actually couldn’t be better. As briefly mentioned during our last virtual summit in November, we’re currently looking into larger UI/UX improvements to KNIME Analytics Platform, so our project team was more than happy to read through your post. Please let us know any further feedback - it’s super helpful and highly appreciated.

Have a nice rest of the day (or start or end :-)).

Christian

9 Likes

My suggestions:

Please get rid of the configuration window priority. If I have a config window up, I should still be able to interact with my workflow (like open tables or add additional nodes). This is my #1 pet peeve. :stuck_out_tongue:

Auto-streaming by default outside a component. Knime should be able to see when two connected nodes are streamable and just stream them without having to put them into a component. Maybe make the connection line appear blue as an indicator of streaming?

Change the node description UX to point to the KNIME HUB (if connected to the internet)!!! The current node description window looks ancient, and the HUB is so glossy and clean. :wink:

Add keywords or tag search functionality to the node search. For example, Concatenate is Union in SQL. I should be able to find the Concatenate node by typing Union in the node search box. Moreover, if a node search does not return any results, it should generate a link to search on the HUB or in these forums.

Right-click quick-connect: I should be able to right click on a node or port and get a popup selection box for what nodes to add next by either port type or most frequently used.

8 Likes

I strongly agree about the problem of the config window dialog being modal. It makes it very difficult, for example, to copy & paste things from other KNIME workflows (other config dialogs), like parts of Python scripts.

Having said that, the Python script editor is pretty poor. Not too big of an issue, but what should be added is automatic conversion of tabs to spaces so that one can get rid of the annoying inconsistent-indentation errors. (Yeah, a bit off-topic, but on some level it belongs in the UX category.)
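The tab-to-space conversion the editor could do is trivial with Python's own `str.expandtabs`; here is a minimal sketch (the function name `normalize_indentation` is my own, purely illustrative, not an existing KNIME API):

```python
def normalize_indentation(source: str, tab_width: int = 4) -> str:
    # str.expandtabs is tab-stop aware, so mixed tab/space indentation
    # collapses to a consistent, space-only layout line by line.
    return "\n".join(line.expandtabs(tab_width) for line in source.splitlines())

# Mixed tabs and spaces, as often produced by pasting code from elsewhere:
mixed = "if x:\n\tprint(x)\n    print('done')"
print(normalize_indentation(mixed))
```

Running something like this on the editor buffer before execution would make inconsistent-indentation errors disappear.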

What I strongly disagree with @EvanB on is the “auto-streaming”. Streaming or not is context-dependent. Some nodes, like the RDKit nodes, are multi-threaded or streamable. Often you want the multi-threading over the streaming part. For me this would only make sense if each node had a default setting (normal/stream). When multiple by-default-streaming nodes are connected, then they would automatically stream. I’m still not sure this is a good approach, and with the new table backend it might affect when streaming actually makes sense and when not. (It only really makes sense for IO-limited nodes or non-multi-threaded ones.)

For the quick-connect, doesn’t the “Workflow coach” essentially cover this? I don’t use it, but maybe you would profit from it.

2 Likes

We want a sub-forum with a feature wishlist :smiley: in order to track the requests and, why not, vote on them… it would be a wonderful tool for developers and for forum users.

3 Likes

Two things are on my wishlist:

  • if an error or warning occurs during runtime, the node is listed in the console like this:
    WARN RowID 3:253:74:62 No row key column selected generate a new one
    where nodes 253/74 are metanodes and 62 is the RowID node. Mostly I rename my metanodes to more readable names, like “Summarize My MeasurementData”. The problem is that, due to the renaming, it’s almost impossible to find the nodes :slight_smile:

  • sometimes I just want to have a look at the settings/options of a specific node, especially when I use programming like in “rule-based row filter” nodes. If the upstream nodes weren’t executed or well defined, the “traffic light” is red and then I cannot access the content. This annoys me a lot!

2 Likes

Now that you mention this, I want to add that this specific warning should be removed altogether. This configuration is totally normal for users who don’t like things like “Row1_Row45_Row643” after several joins and other operations. (And no, changing the log level is not an option, as that would also hide actually relevant warnings.)

4 Likes

I strongly disagree with your strong disagreement. You cite one specific node, which you could easily override with a possible executor setting, not the array of simple data manipulation (math, string, row filter, rule) nodes where Knime suffers huge performance issues. For these, streaming execution will beat parallel processing handily.

Here is a simple experiment supporting my point: Stream vs Parallel – KNIME Hub

With regard to the workflow coach, no, I don’t use it or profit from it. Neither of us using it is exactly my point. Having it out of the area in which I am currently interacting is not a good user experience, and the suggestions are typically useless and not based on the manipulations I typically make. Having to mouse, click, type, mouse, click, drag is cumbersome. I’d bet the top 5 nodes used in Knime across the globe are row filter, column filter, rule engine, groupby, and pivot. Why not have those on a quick-list? :thinking:

1 Like

Not sure why you would disagree with my idea. If the simple data manipulation nodes streamed by default when chained, that would be just what you want, while giving other nodes the option to set their default as non-streaming.

Do you understand why the streaming in your example is faster? Simply because the result is only written to disk once, at the end of the component, versus at the end of each node. Streaming helps if IO is the limiting factor. The performance gain diminishes quickly with fewer rows or more complicated manipulations.

Also, the parallel chunk loop is of very, very limited use. It is only useful for very long-running calculations that are not already multi-threaded (usually external tools). Otherwise the setup and consolidation consume way too much time; in my case, combined, 50s while the metanode runs around 60s. Just removing the parallel chunk still leads to a 60s runtime of the metanode, which is 100% logical, as we are disk-limited, not CPU-limited, and the parallel chunk loop just adds another ton of IO on top.

The downside of “auto-streaming” is the lack of intermediate results. I would absolutely not want that during the workflow creation phase, and you don’t need it for speed, since you won’t use the full data while creating the workflow (if you have 10 million rows). Then, once your workflow or a part of it is correct, select those nodes, convert them to a component, and set the executor. That takes maybe 5 seconds. (I also suspect the component is needed anyway for technical reasons, not to mention the streaming execution is still beta.)

1 Like

If you read carefully, I disagreed with your disagreement with my idea. Do you know that when you ask someone “Do you know bla bla bla” and provide examples of what that person probably already knows, you appear extremely condescending?

I think you are missing the point of this thread: user experience and use cases. Did you read the topic of this thread, by the way? It says “from Alteryx” - an ETL/DS platform, with all the ways that product is used attached.

Ask data scientists in industry why they don’t use Knime and they will tell you probably one of two things. 1) What is Knime?/Not industry standard, or 2) it is too slow.

With regard to #1, Knime is attempting to position its product as a democratized data science platform to grow market share. Look at the Knime front page. It’s true. I’ll save you the clicks - it says “Data Science”. It’s also a really, really smart thing to do. The TAM for people who use RDKit is minuscule compared to the TAM for data science. If Knime wants to gain market share from Alteryx or RapidMiner, they will need to be viewed as one of the industry standards in DS. However, Knime will never be an industry standard if it does not address #2.

With regard to #2, Knime should NOT be slow by default. It should NOT ask users to download additional packages to speed up execution time. It should NOT assume that users even know that said packages exist or will stumble across them by chance. It should NOT ask or expect them to wrap nodes in components to avoid unnecessary disk writes. You can’t put the responsibility on people evaluating a tool, or using it for the first time, to know about advanced/optional features hidden away in some download repository they probably don’t even know about. Speed is king, and Knime knows this - that is why they are reengineering the back end for execution time! If Knime engineers are smart enough, and I know they are because I met some of them, they will figure out a way to either get a sample of intermediate results or allow execution up through a node while streaming appropriately connected streamable nodes at runtime.

I’ll bet you $5 USD that by or in version 5, streaming is turned on by default without having to use components. :stuck_out_tongue:

I would say KNIME is simply much better known in Europe than in the US/North America, especially in the “life sciences related” industries (while Alteryx is rather not very well known in Europe).

“Too slow” really depends on the use case, and there are many ways to define what “too slow” means. Development time and cost should also factor in over pure runtime performance. Of course, faster runtime is always welcome. For me, the main benefit of the new table backend will be not having to pay a serialization penalty when going to Python or R.

It would be interesting to see your example workflow as pure Python. Is it really that much faster than KNIME? Honest question. (You did adjust the Xmx setting in knime.ini according to your available RAM, or set an appropriate value when using the installer? Again, honest question. This can be a problem when “installing” from zip, and AFAIK the default is only 512MB.)

Taking your slowness claim for granted, what else does Knime bring to the table? It does come with some benefits in terms of memory usage. In pandas everything is in memory, always, and it isn’t known for being particularly memory efficient. You hit a memory wall pretty quickly, which you won’t in KNIME, as many nodes are internally streamed, meaning only a fraction of the data is ever in memory at one point. Yes, you have other options than pandas, but then it also gets more complicated and slower.
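To make the internal-streaming comparison concrete, here is a hedged pandas sketch (the file name and column are invented for illustration): summing a column chunk by chunk keeps only one chunk in memory at a time, similar in spirit to how KNIME streams rows between nodes.

```python
import pandas as pd

# Invented example data: a small CSV with a single numeric column.
pd.DataFrame({"value": range(1_000)}).to_csv("blob.csv", index=False)

# Aggregate chunk by chunk; only 100 rows are held in memory at any one time,
# so the same pattern scales to files far larger than RAM.
total = 0
for chunk in pd.read_csv("blob.csv", chunksize=100):
    total += chunk["value"].sum()

print(total)  # sum of 0..999 = 499500
```

The trade-off is exactly the one discussed above: you never materialize the full table, so you also cannot inspect intermediate results.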

The streaming execution is still beta (in KNIME Labs) and hence requires explicit installation by the user. I admit it’s been there for a surprisingly long time now, and it’s unclear why, or whether any work is being done on it. So yeah, making it production-ready and available without a package install makes sense. I agree.

Making streaming the default, however, is only fine if it doesn’t impair development/debugging use. Streaming makes it impossible to see intermediate results and where possible errors or issues occur. For that, the step-wise execution is superior. Personally, I don’t see a huge issue with having to use a component, as explained in my previous post. I use them anyway to group nodes together and keep a better overview.

Granted, Knime is more popular in Europe and in life sciences, but if they want to compete in the same verticals as Alteryx/RapidMiner/SAS/SPSS/Tableau, they’re going to need to be more proactive about execution time and user experience.

Yes, I’ve got about 24GB of addressable memory set for Knime and 8 logical processors, so that’s not the issue.

I am aware of the pandas memory limitations as well, but they are nowhere near Knime’s pipeline limitations. Python is the industry standard, even more than R now. Don’t get me wrong - I love Knime for what it does well and think that it is highly underrepresented in industries outside of life science, given its capabilities. However, Knime needs a better reception than “I’ve heard of it, but we use Python/Databricks” and “Knime is slow”.

Here’s analogous Python for what I am measuring in the streaming component. It only takes about a quarter of a second to run what is inside the component, vs 30+ seconds with Knime:

from sklearn.datasets import make_blobs
import pandas as pd
import time as t

# Create multi-d space: 10M rows, 3 features, 5 clusters, 0.1 std
features, clusters = make_blobs(n_samples = 10000000,
                  n_features = 3,
                  centers = 5,
                  cluster_std = 0.1,
                  shuffle = True)

features_df = pd.DataFrame(features)

loop = 1
timer0 = t.perf_counter()
while loop <= 5:
    print("{} {:.0f}".format('Loop', loop))
    loop = loop + 1
    added = features_df.iloc[:,0] + features_df.iloc[:,1] + features_df.iloc[:,2]

timer1 = t.perf_counter()
print("{} {:.3f}".format("Time in seconds:", timer1 - timer0))

Still wondering if you are taking my bet. :thinking:

Yeah, tried it in Python too. It’s near instantaneous, which makes sense. It’s “only” 100 million operations, while modern CPUs are in the gigaflops range. I expect a pure Java implementation without IO to be just as fast.

However, I also wrote the result out from pandas to a file, and that takes a pretty long time and results in a 1.6GB CSV file. The Python/notebook results are lost once you close it, and you would have to redo all the steps.

I take your bet. Their resources are limited, after all, and the focus (hopefully) is on the table backend. The gain from streaming without a component seems too small compared to what I would think is a big change to how Knime works. The step-wise feature is also a core advantage when building workflows, and in more complex operations IO stops being the limiting factor.

Even with streaming it would be good if such complex nodes can be multi-threaded. That is why I don’t like default streaming.

2 Likes

The two main things that I would like to see in future versions are:
1 - Ctrl-drag to make copies of nodes. This works in LabVIEW and is much easier than having to ctrl+c, ctrl+v and then find and move the node to the correct place.
2 - Having a single button to reset and run all nodes. A number of people I have spoken to want to build workflows that they can then repeat easily as new data comes in. It is possible to select all the nodes, reset them, and then run all, but it would be helpful to have this as a single click.

2 Likes

I use Spyder - can’t stand notebooks, but you can always save notebooks out. Don’t get me wrong - workflows are drastically superior to code when it comes to intuitive understanding, which is why I prefer them. My hope is that Knime improves to the point where execution speed issues are a thing of the past.

Glad that we have a gentleman’s wager. :stuck_out_tongue:

1 Like

@namoroso Love the enhancement requests. I had a number of similar requests myself. I’m excited to see what updates come in the UX revamp.

btw: would you be interested in helping update this sheet I started to help those coming from Alteryx Designer?

1 Like

Hi @DemandEngineer. Thanks for throwing together that google sheet. I actually found it last month and was using it to become more acclimated to Knime. I am not sure my skills in Alteryx exceed what has already been captured, but I will definitely update it as I explore and learn new nodes!

1 Like

Thanks for your willingness to help. I’m sure you will catch things I’ve missed or have new suggestions. Welcome to the community.

BTW: what type of work are you using KNIME for? I’m hoping to find more marketers in the community.

1 Like

@christian.dietz here are some related wishlist items:

1 Like

Here are some really powerful features that democratize data science while putting in guard rails to keep the citizen DS out of trouble: https://youtu.be/d4wrUZ9fZGM

While I know there is an AutoML component and some data discovery nodes… I think there are still some big gaps to fill, though.