Create a last day of month from a text

Hi @frepez,

For the component, you can right-click on it and “disconnect link”, but be aware that having done so, it wouldn’t get updated if I upload a change to it on the hub. For testing purposes though that shouldn’t be a problem.

How many rows have you got, and how many columns?

I just created a test table of 10 million rows, and had it perform two conversions, using both the String Manipulation (Multi Column) node and the component. Yes, the component is slower, because it has to perform actions on specific selected columns, which it needs to do in a loop. That ultimately requires appending columns (via 2 Loop start (column append) nodes and a Column Appender), and both of those take quite a chunk of time relative to everything else. I suspect that is the bottleneck.

However, for my 10 million rows and about 10 columns, the execution time of the component was 158 seconds for me. The combined processing through the String Manipulation (Multi Column) took 40-50 seconds. So yes, the component in my test was around 3-4 times slower, but I didn’t get figures anything like your 9 minutes and 2.5 hours.

This is what my test workflow looks like

How much memory have you got available on the computer, and how much has been allowed for KNIME? Have you modified the setting in the knime.ini file to increase the available memory?

Also, which version of KNIME are you using?

I don’t know if you forgot to share the workflow, or are about to share it but when you have uploaded it, I’ll take a look.

Hi @takbb, after opening the component, this notice is shown: “this is a linked component and therefore cannot be edited”, so I can't add a time monitor to see how long the sub-nodes take (or I don't know how to, I'm a newbie!!)

I'm sharing a workflow for comparison with 1/1000 of all the data, because of the upload size restriction and my slow connection
date.knwf (789.9 KB)

knime.ini is as factory default!!

My PC has 16 GB RAM

KNIME version 5.2

Sorry about my slow connection…

Hello @frepez,

for the first day of the month, the solution with join and substr in the String Manipulation (Multi Column) node is probably the best way of doing it. You can uncheck the option Insert missing values as null to get missing values where there is no date instead of “01”, although the String to Date&Time node will deal with that later as well. For getting the last day of the month you can use the Column Expressions node, as it can deal with Date&Time columns with dedicated functions. In this case you use the plusTemporal() function twice: once to add 1 month and a second time to subtract 1 day. It might work faster than the current solutions. Take a look at the example.

date_ipazin.knwf (1017.7 KB)
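
The logic of both calculations can be sketched outside KNIME in a few lines of Python. This is only an illustration, not the expressions from the attached workflow: it assumes the input is a text value in yyyymmdd form, and the function names are made up for the example.

```python
from datetime import date, timedelta


def first_day_of_month(yyyymmdd: str) -> date:
    """Keep the yyyymm part of the text and append "01" (the join/substr idea)."""
    text = yyyymmdd[:6] + "01"
    return date(int(text[:4]), int(text[4:6]), int(text[6:8]))


def last_day_of_month(yyyymmdd: str) -> date:
    """Add one month to the first day of the month, then subtract one day."""
    first = first_day_of_month(yyyymmdd)
    if first.month == 12:
        next_first = date(first.year + 1, 1, 1)
    else:
        next_first = date(first.year, first.month + 1, 1)
    return next_first - timedelta(days=1)


print(first_day_of_month("20240215"))  # 2024-02-01
print(last_day_of_month("20240215"))   # 2024-02-29
```

Starting from the first day of the month before adding the month avoids any question about what "31 January plus one month" should be.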

Br,
Ivan

@ipazin and @frepez I agree that the String concatenation of “01” for first of the month is probably the fastest solution.

The column expressions temporal functions are very useful and are certainly a good option for this problem. Useful suggestions as always Ivan! :slight_smile:

The performance issue with the component, though, is not that the specific solutions (Java Snippet, String Manipulation or even Column Expressions) are slow in isolation. It is that turning them into a re-usable generic component (so we don’t have to write the same piece of code on a “per column” basis) introduced a couple of loops.

The loops in themselves are not a problem here, but performance really drops with how they have to handle the “Column Appends” at the end of the loops, to bring the newly created columns back into the fold.

In addition, when scaling this up to the original 6 million+ rows (assuming the “1/1000th” comment was accurate), the memory requirements will mean that @frepez’s factory setting of -Xmx2048m is really going to slow performance.

I set mine up to 10GB and watching it process 6m rows, it is regularly hitting the 8GB mark, so squeezing that into just 2GB is likely to mean a lot of swapping to disk assuming it doesn’t just run out of required memory.

@frepez, I am going to look at rewriting the component “long-hand” :wink:

i.e. with no loops, using a hybrid approach for the different temporal requirements (e.g. the “01” concatenation for first of month), to see what difference it makes for the component option. I’ll make use of some of @ipazin’s suggestions too, and if I have time I’ll try it with String Manipulation (Multi Column), Java Snippet and Column Expressions nodes to test which performs best, then go with that one.

Not sure when I will be able to do it though. Might be soon, might be a few days.

In the meantime, I’d strongly recommend increasing the memory available to KNIME. Edit the knime.ini file (take a backup first!) and change the line
-Xmx2048m
to something like
-Xmx10G
(on a 16GB machine, I’d suggest 10G for KNIME, which should leave memory for other things.)

Further Information on memory settings

If you have lots of other things running though, set the memory for KNIME lower, because if KNIME is told it can have 10G of memory, but when it goes to use it the 10G is not all available, KNIME will do a rapid exit (i.e. it will just vanish with anything unsaved lost). The setting does not “allocate” memory to KNIME as such, it simply tells KNIME what it is allowed to ask for:

  • Set it too low, and if it wants more memory than it is allowed to ask for then you get a “relatively graceful” out of heap/memory message.
  • BUT set it too high and if it then asks for more than is available… well rapid and unsaved exit, I’m afraid. It’s happened to me a couple of times.
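
As a rough sketch of the edit described above, the following Python function replaces the -Xmx line in the text of a knime.ini file. It is not a KNIME tool: the function name is made up, and the sample content (including the `-Dsome.other.option=true` line) is a hypothetical stand-in for a real file.

```python
import re


def set_max_heap(ini_text: str, new_value: str) -> str:
    """Return ini_text with every -Xmx line replaced by -Xmx<new_value>."""
    lines = ini_text.splitlines()
    replaced = False
    for i, line in enumerate(lines):
        # Matches lines such as -Xmx2048m, -Xmx8114m or -Xmx10G.
        if re.fullmatch(r"-Xmx\d+[kKmMgG]?", line.strip()):
            lines[i] = "-Xmx" + new_value
            replaced = True
    if not replaced:
        raise ValueError("no -Xmx line found")
    return "\n".join(lines) + "\n"


# Hypothetical file content used only for this example.
sample = "-vmargs\n-Xmx2048m\n-Dsome.other.option=true\n"
print(set_max_heap(sample, "10G"))
```

To apply this to a real installation you would read knime.ini, keep a backup copy, write the returned text back, and restart KNIME so the new limit is picked up.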

[Edit: Actually I just thought through what I said about the component - I think String Manipulation (Multi Column) is the only solution that can be written generically to work with multiple columns, selected dynamically, without being forced to introduce a Column Append loop, so that’s the approach I think I will have to take]


hi and thanks @takbb and @ipazin,

I have never edited knime.ini, and it was set as:
-Xmx8114m

Is that OK?

Regarding the rapid and unsaved exit, is it possible to force a save before each run, so that no work is lost?

@frepez
What’s the size of your dataset?

For this one, 7 million records, but I have others with 50 million records and I need to do the same calculations

You might want to avoid loops if possible
Also is scripting an option?

Yes, sure. But I'm not a Java programmer :frowning:

Hi @ipazin, how can I reuse this “Column Expressions node”? And how can I modify this node?

Thanks!!!

Hello @frepez,

Column Expressions syntax is based on JavaScript but it offers built-in functions and placeholders similar to other KNIME nodes like String Manipulation, Math Formula, Rule Engine…

I don’t understand what you mean by “reuse”. If you don’t have it in the Node Repository you need to install it. Check here on how to install it:

Once you have it you can add as many expressions as you need by simply doing copy and paste.

Br,
Ivan


Hi @ipazin,

Your help was really nice, and “Column Expressions” is a node that I am really going to use a lot. I downloaded it and it is very useful!!! Thanks a lot.


Hi @frepez, I modified the component so that it uses a sequence of String Manipulation (Multi Column) nodes and therefore doesn’t have to perform Column Appends as part of a column loop or in looping between operations. The component now contains no loops. If you update the component, you should get the latest version.

I modified the demo workflow so that it had over 50 million rows (58275000 rows). I have a pc with 32 GB of memory and so I set the KNIME.INI file to
-Xmx26G
(26 Gigabytes available) in KNIME 4.7.7

The result was that the Column Expressions along with the String Manipulation and String to Date&Time nodes processed those rows in 3524146 milliseconds (59 minutes).

Whereas the component along with the String to Date&Time (to make your data into Local dates) processed them in 3811925 milliseconds (just over 63 minutes).

If it also put the original 4 columns back in their original (non-date) format using a further Date&Time to String node, an extra 880822 milliseconds (nearly 15 minutes) was added!

The primary reason for the additional time of the component is that it needed to convert dates back into Strings before it could process them. There will always be some kind of performance overhead in making something more generic…

If your columns had already been presented as Date datatypes, instead of strings/integers you could possibly have knocked about 10 minutes off the processing. Likewise if I wrote the component to expect the data to already be presented as Strings in yyyymmdd format, rather than generically expecting Local Date format which had to be converted, about 10 minutes could probably have been removed.

So I think the upshot is that you can use String Manipulation (Multi Column) or you can use Column Expressions to perform the job and define the expression for each task.

Or you can use the component, for ease of operation (and not having to copy/paste the expression each time) and this will give you slightly poorer performance.
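
To put the timings quoted above side by side, here is a small Python sketch that converts the millisecond figures to minutes and works out the relative overhead of the component. Only the three numbers come from the test runs; the labels and the helper name are just for illustration.

```python
MS_PER_MINUTE = 60_000

# Millisecond timings reported for the 58,275,000-row test.
timings_ms = {
    "Column Expressions route": 3_524_146,
    "component route": 3_811_925,
    "extra Date&Time to String step": 880_822,
}


def to_minutes(milliseconds: int) -> float:
    """Convert a duration in milliseconds to minutes."""
    return milliseconds / MS_PER_MINUTE


for label, ms in timings_ms.items():
    print(f"{label}: {to_minutes(ms):.1f} min")

# Relative slowdown of the component compared with the Column Expressions route.
overhead = timings_ms["component route"] / timings_ms["Column Expressions route"] - 1
print(f"component overhead: {overhead:.1%}")
```

That works out at roughly 58.7, 63.5 and 14.7 minutes, with the component about 8% slower than the Column Expressions route in this test.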

It remains to be seen, though, what happens when 50 million rows are processed in just 8GB or 10GB of memory on a 16 GB system. You can reasonably expect it to take longer. With 26GB available, KNIME at times used all of the available memory. With less memory available it will almost certainly have to page to disk, and this will reduce performance. I would therefore expect it to take much more than 50 minutes on your system. I would suggest you acquire more memory if you have this kind of data requirement.

Your setting of
-Xmx8114m
(a little under 8GB) may be acceptable on your system. On a 16GB system you could maybe increase it to 10 gigabytes with:
-Xmx10G

but that’s for you to trial. I’ll leave it running with that setting and will give an update of how long it takes on my system…
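
For reference, this is how the two settings compare in size. The sketch below is plain Python, not something KNIME provides: it assumes the usual JVM convention that the k, m and g suffixes mean multiples of 1024 and that no suffix means bytes, and the function name is made up.

```python
import re

# Bytes per unit suffix; an empty suffix means the number is in bytes.
_UNITS = {"": 1, "k": 1024, "m": 1024**2, "g": 1024**3}


def xmx_to_gib(option: str) -> float:
    """Convert an option such as -Xmx8114m into a size in GiB."""
    match = re.fullmatch(r"-Xmx(\d+)([kKmMgG]?)", option.strip())
    if match is None:
        raise ValueError(f"not an -Xmx option: {option!r}")
    number, unit = match.groups()
    return int(number) * _UNITS[unit.lower()] / 1024**3


print(round(xmx_to_gib("-Xmx8114m"), 2))  # 7.92
print(xmx_to_gib("-Xmx10G"))              # 10.0
```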

In terms of looping… as has been mentioned, loops are generally to be avoided but there is the possibility that putting the processing into a Chunk Loop so that it processes perhaps 5 million rows per iteration could improve performance as it might reduce the need to page to disk. Specific types of loops, in certain situations can improve performance, but you really need to trial that for yourself.
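
To give a feel for what chunking at 5 million rows would mean for the 58,275,000-row test table, here is a minimal Python sketch of the row ranges each iteration would cover. It only illustrates the idea of fixed-size chunks and is not how the Chunk Loop nodes are implemented; the function name is made up.

```python
def chunk_ranges(total_rows: int, chunk_size: int):
    """Yield (start, end) row ranges, end exclusive, covering total_rows."""
    for start in range(0, total_rows, chunk_size):
        yield start, min(start + chunk_size, total_rows)


chunks = list(chunk_ranges(58_275_000, 5_000_000))
print(len(chunks))  # 12
print(chunks[-1])   # (55000000, 58275000)
```

So the table would be processed in 12 iterations, the last one smaller than the rest, and only one chunk needs to be worked on at a time.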

