Finding Knime to be slow compared to Alteryx. Any tips on speeding up my workflow? Is Knime the right tool for me?

Hi everyone,

The small company I work for has been thinking of switching to Knime after using Alteryx for a year as an alternative to using programming languages. I’ve been testing Knime to see if it’s worth using. I’ve had a go at completely re-writing one of my Alteryx import routines in Knime. So far it seems pretty good. I can easily do about 90% of the things I’m used to doing in Alteryx and I can find workarounds to do the other 10%.

Making my script run faster?

However, for large datasets, I’m finding that my Knime script runs about 10 times slower than my Alteryx script for the same data to make the same output. For 500mb of data, my Alteryx script takes 20 seconds and my Knime script takes 10 minutes.

I’d like to ask whether people can have a look at my Knime script to see if there is anything I can do to make it run faster. I’ve already had a look at several forum and blog posts about optimising Knime, including this one:

I believe I’ve done everything I can. My ini memory allocation is at -Xmx8155m. I’ve had a go at putting everything which is not in the loop in components and using simple streaming. That only achieves marginal gains. For 500mb of data, my Knime script takes 6 minutes instead of 10 minutes. For 32mb of data, my Knime script takes 24 seconds instead of 30 seconds.
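One way to confirm that the -Xmx value is really in effect is to ask the JVM directly. The following is only a minimal sketch: `HeapCheck` and `maxHeapMb` are made-up names, and run standalone it reports the heap of whichever JVM executes it, so to check KNIME itself the same `Runtime` call would presumably have to go inside a Java Snippet node. The reported figure can be somewhat lower than the configured -Xmx.

```java
class HeapCheck {

    /** Maximum heap the current JVM will attempt to use, in megabytes. */
    static long maxHeapMb() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        // With -Xmx8155m this should print a value in the region of 8000.
        System.out.println("Max heap (MB): " + maxHeapMb());
    }
}
```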

Some observations I’ve made about the script are:

  • I’ve grudgingly had to use a loop in the middle of my script to perform a groupby cumulative sum. When testing on a small or medium dataset (~300kb or ~32mb respectively), the loop computations dominate running time over the transformation tools.
  • Yet for larger data (~500mb), the transformation tools take much longer and their running time seems to be about even with the loop calculation.

Is Knime the right tool for my company?

If there is nothing we can do and this is just how long it will take to process the data, I am wondering if Knime ultimately is the right tool for us? I notice that Knime is focused towards data analysis, predictive models, etc. and that ETL appears to be a means to an end to get to that. We don’t intend to use Knime for any predictive models and only for very basic analysis (e.g. sums and means).

We want to use Knime primarily as an ETL tool for big datasets. We regularly import data of the scale 100mb to 5gb into specialty software we use. Our specialty software has strict requirements on how data can be loaded into its database, and the data we get comes in a range of formats.

I’ve attached my Knime workflow with some sample data. I’ve also attached my Alteryx workflow too.
Files for Knime forum post.zip (3.8 MB)

I don’t have Alteryx and have never used it, so it’s hard to compare anything.

Off the bat without understanding your workflow you could:

  • replace Column Expressions with Java Snippet (or string manipulation or other relevant node)

Anyone often on this forum knows I don’t like the Column Expressions node and the reason is it’s terribly slow. Look at your Timer output and you can see that.

  • Use Streaming

If I put all the nodes up to first group loop into a component and set it to streaming execution, the runtime on my machine (i7 laptop, nothing fancy) for the medium sized file goes from 16 to 10.5 seconds. However I think the streaming simply hides how slow column expression is. Replacing Column expressions with Java snippet most likely will have the same effect.

I can’t help much further, as time series isn’t my main expertise. What I noticed is that most dwell time values are 0 or missing (null). If you filter them out before the group loop, the runtime goes down to 5 seconds. They aren’t contributing anything to the dwell time sum, right? So if you could remove them from the loop and add them back later, that would help a ton (of your 130’000 rows, only 119 actually have a relevant dwell time).
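The claim that zero and missing dwell times contribute nothing can be checked with a small sketch like the one below. It is written in plain Java rather than as a KNIME node, the class and method names are made up for illustration, and it assumes a missing value should count as 0. It also shows why the rows have to be added back afterwards: a trip whose rows were all filtered out disappears from the filtered totals and has to be filled in with 0.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class DwellFilterCheck {

    /** Sum of dwell time per trip; optionally skips rows whose dwell time is null or 0. */
    static Map<String, Double> totalsByTrip(List<String> trips, List<Double> dwell, boolean skipEmpty) {
        Map<String, Double> totals = new LinkedHashMap<>();
        for (int i = 0; i < trips.size(); i++) {
            Double d = dwell.get(i);
            boolean empty = (d == null || d == 0.0);
            if (skipEmpty && empty) {
                continue; // row filtered out before the expensive step
            }
            totals.merge(trips.get(i), empty ? 0.0 : d, Double::sum);
        }
        return totals;
    }

    /** True if filtering empty rows leaves every trip total unchanged (missing trips count as 0). */
    static boolean filteringIsSafe(List<String> trips, List<Double> dwell) {
        Map<String, Double> full = totalsByTrip(trips, dwell, false);
        Map<String, Double> filtered = totalsByTrip(trips, dwell, true);
        for (Map.Entry<String, Double> e : full.entrySet()) {
            double afterFilter = filtered.getOrDefault(e.getKey(), 0.0);
            if (Double.compare(e.getValue(), afterFilter) != 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<String> trips = Arrays.asList("A", "A", "A", "B", "C", "C");
        List<Double> dwell = Arrays.asList(10.0, null, 5.0, 0.0, null, 5.0);
        System.out.println(totalsByTrip(trips, dwell, false));
        System.out.println(totalsByTrip(trips, dwell, true));
        System.out.println(filteringIsSafe(trips, dwell));
    }
}
```

In the sample data, trip B has no relevant rows at all, so it only appears in the unfiltered totals.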

Main help would be to find a way to do the moving average without the loop but time series experts will be better able to help you with that.

Hi there @Mercurius,

welcome to KNIME Community and KNIME itself!

For start I have to say I’m impressed as you really did your homework on KNIME. Here are my thoughts on this one:

  • You already discovered streaming, which is useful, but it only helps on nodes that are streamable, so the benefit depends on how many streamable nodes you’ve got in your workflow (this doesn’t mean you have to use only streamable nodes!)
  • The blog post you linked is good but a bit outdated, as certain things have changed in the last three years. Here is a bit of documentation about configuration and memory options: https://docs.knime.com/2019-12/analytics_platform_workbench_guide/index.html#configuring-knime-analytics-platform
  • Column Expressions is really a nice node and I use it a lot, but when execution time is important I tend to go with other nodes like String Manipulation, Rule Engine, Math Formula… even if that means using multiple nodes to apply the same logic (if possible of course). And we all know @beginner is not a fan of it :smiley:
  • From experience, a good gain in execution time usually comes from smarter design and better process logic. The former should come with experience (although from what I have seen you did pretty well) and the latter is on you and your company
  • Sometimes you just can’t avoid loops, but in recent releases they got much, much faster
  • comparison between software is much more than just execution time obviously

So enough with thoughts! Attached is a modified workflow where I replaced two of the three Column Expressions nodes you had. Take a look. I won’t mention my running time, as I have a rather old machine :man_facepalming:

20200203 ATP VIC code.knwf 1.knwf (132.6 KB)

P.S. Here is a link to the guide From Alteryx to KNIME in case you haven’t come across it :wink:

Br,
Ivan

Thanks @beginner and @ipazin for your replies.

So the impression I get is to avoid using loops and the column expression node where possible? But that ultimately there is a limit to how fast I can run Knime?

Thanks @beginner for your suggestion about playing around with the cumulative dwell time formula. I’ll make a separate post about it and see if anyone has suggestions.

And thanks @ipazin for your suggested alternatives to my column expression nodes. Also thank you for what you said about me doing my homework. I did try to make sure I found everything I could before making a post. I had found the From Alteryx to KNIME booklet already which was very helpful in getting me caught up to speed.

Any program has a limit to how fast it can run. I suspect your use case is simply an “edge case” where Alteryx has a dedicated “block/node” that does exactly what you need, while in KNIME you need to build it yourself, which is slower. But if speed is your ultimate goal, neither tool is probably the best choice.

Speed isn’t the only measure of “goodness”. The reason for using KNIME in my case is simple: it’s free, it does what we need, and it has the life science / chemistry plugins which other such tools lack.

An important note is the different philosophy between tools. Most tools come with large, heavy building blocks that do a ton of things. That’s nice and fast if it exactly fits your needs. KNIME is more like programming. It offers you the basic constructs (well, a bit more than that really, but let’s stick to the analogy) and you have to plug them together to create an “algorithm”. It’s more flexible and you can see in each step exactly what happens. This comes at the cost of speed (maybe? have you compared other workflows to Alteryx?) but mostly of the basic IT skill needed to use it efficiently.

You already have a lot of hints concerning performance and the construction of specific tasks. I would like to point you to a collection of entries that deal with performance (yes one day they might be put together in a single post without redundancies).

I think you can simply replace the group loop with a java snippet. As far as I understood the data is already sorted in the file (but if not you could sort it in KNIME). And then in the java snippet:

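double cumulativeSum = 0; // running total, kept between rows

// expression start
    public void snippet() throws TypeException, ColumnException, Abort {
// Enter your code here:

// c_UniquetripID1 is presumably the previous row's trip ID (a lagged column)
if (c_UniquetripID1 == null // first row
    || c_UniquetripID.equals(c_UniquetripID1)) { // same trip as the previous row

    if (c_dwelltime != null) { // exclude last row of trip
        cumulativeSum += c_dwelltime;
    }
} else { // new trip: restart the sum
    // guard against a missing value, which would otherwise throw on unboxing
    cumulativeSum = (c_dwelltime != null) ? c_dwelltime : 0;
}

out_dwelltimeSum = cumulativeSum;
// expression end
    }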

A quick sanity check indicates 100% identical results, and the Java Snippet runs in less than 300 ms. So you see, looping is slow. Very slow. You can then take @ipazin’s workflow and replace the last Column Expressions node (which in my case is by far the slowest node).
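For anyone who wants to try the idea outside KNIME, here is a minimal standalone sketch of the same single-pass approach. It is not the snippet from the workflow: the class name, method name and sample values are made up for illustration, it assumes the rows are already sorted by trip ID, and it treats a missing dwell time as contributing nothing.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Objects;

class GroupCumulativeSum {

    /**
     * Running sum of dwell time per trip in one pass.
     * Row i is described by tripIds.get(i) and dwellTimes.get(i);
     * the input must already be sorted so that each trip's rows are adjacent.
     */
    static List<Double> cumulativeByTrip(List<String> tripIds, List<Double> dwellTimes) {
        List<Double> out = new ArrayList<>(tripIds.size());
        double running = 0;
        String previousTrip = null;
        for (int i = 0; i < tripIds.size(); i++) {
            String trip = tripIds.get(i);
            if (i > 0 && !Objects.equals(trip, previousTrip)) {
                running = 0; // new trip: restart the sum
            }
            Double dwell = dwellTimes.get(i);
            if (dwell != null) { // missing values leave the sum unchanged
                running += dwell;
            }
            out.add(running);
            previousTrip = trip;
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> trips = Arrays.asList("A", "A", "A", "B", "B");
        List<Double> dwell = Arrays.asList(10.0, null, 5.0, 2.0, 3.0);
        System.out.println(cumulativeByTrip(trips, dwell)); // [10.0, 10.0, 15.0, 2.0, 5.0]
    }
}
```

Because the state is just one running total and the previous trip ID, the cost is linear in the number of rows, which is consistent with the snippet being so much faster than a loop that restarts its inner workflow for every group.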

Thanks @beginner. I’ve been wondering how to get rid of that loop. And I’m glad to hear it runs in less than 300ms. That’s very exciting :grinning:. I’m very keen to get rid of that last column expressions node. I’ll have a go at it myself and post if I get stuck.

It sounds like I’ll be making good use of Java snippets from now on. Are there any downsides to using Java snippets which I should be aware of? What’s the difference between the Java Snippet node and the Java Snippet (simple) node?

The downside is that your workflow can hide complexity. I would only really use it if there is no other way, either due to complex logic or speed. Also, you now have source code which is hard to track and maintain. So I would still consider it a “measure of last resort”, especially if the workflow will run in production.

I haven’t used the (simple) one in a long time, which tells you to use the non-simple one. As far as I remember, the main difference is that the non-simple one is like a real editor with code highlighting and code completion. I don’t actually know why the simple one still exists. Probably legacy reasons.
