Ways to speed this up? 10M+ rows

Hi Folks

I'm looking for suggestions on how I can speed this up.

I have a data set that normally has 10M+ rows.

Within it I have a column with anywhere from 5 unique values up to, normally, no more than 100.

For each unique value in that column I need to filter the rows and then perform a data integrity check.

The row filter takes a massive amount of time due to the number of rows, and when I put this in a loop, well, that's multiplied by the number of unique values.

Any ideas as to how I can speed this up?

I was thinking of a quick way to split the data dynamically into the relevant groups and then run them in parallel, but I haven't figured out how yet…

cheers

Gavin

I was being daft: a Group Loop Start will help.

However, it keeps getting stuck at 50%.

I added a Sorter before it, and now that is stuck at 50%, so it appears the sorting is the problem.
Has anyone seen the Sorter get stuck with big data?

kr

Hi @Gavin_Attard, how much memory do you have and how much is available to KNIME? Memory can make a huge difference.

4 Likes

Hello @Gavin_Attard
My first thought would be: is it possible to code this in R? (Though the data traffic to Rserve would happen twice.)

But keeping it in KNIME: what if you wrap the Row Filter in a Recursive Loop?

Replace the filter with a Rule-based Row Splitter; then you can send the remaining data to the Recursive Loop End's back port.

The first iteration will be as slow as in a standard loop; however, the splitter will work with less and less data on each iteration.

I hope that this idea can improve the performance.

BR

PS: a standard Row Splitter could do the job as well.
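
If it helps to prototype the shrinking-data idea outside KNIME first, here is a minimal pandas sketch of the same pattern. The column name "category", the values and the integrity check are all made-up placeholders:

```python
import pandas as pd

# Small stand-in for the real 10M-row table ("category" and "value" are assumed names)
df = pd.DataFrame({"category": ["alpha", "beta", "gamma"] * 100_000,
                   "value": range(300_000)})

remaining = df
for cat in df["category"].unique():
    # The "splitter": matching rows go to the check, everything else stays for the next pass
    mask = remaining["category"] == cat
    group, remaining = remaining[mask], remaining[~mask]

    # Placeholder for the per-group integrity check
    print(cat, len(group), int(group["value"].isna().sum()))
```

Each pass removes one group from `remaining`, so the filter works on less and less data, just like the recursive loop.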

1 Like

I would assume that with 10M rows the data is coming from a database rather than a flat file, so I would first try to leverage the DB nodes to do the transformation in the database before loading into KNIME. Have you tried the DB Row Filter and the other DB nodes already?
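
For illustration only, here is a sketch of what pushing the per-group check into the database could look like, using sqlite3 as a stand-in; the table and column names are invented:

```python
import sqlite3

con = sqlite3.connect("warehouse.db")   # placeholder database file
query = """
    SELECT category,
           COUNT(*) AS n_rows,
           SUM(CASE WHEN value IS NULL THEN 1 ELSE 0 END) AS null_values
    FROM big_table                       -- hypothetical source table
    GROUP BY category                    -- one integrity summary per group
"""
for row in con.execute(query):
    print(row)
con.close()
```
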
br

6 Likes

Try to use streaming, including the reading node and the filter in a component. Also, to free memory, add a … node after the component.

5 Likes

@Gavin_Attard streaming has already been mentioned as an option, as has letting the database do some of the work.

I can offer up this collection of hints about KNIME and performance:

2 Likes

@Gavin_Attard

Couple of questions not yet asked:

  1. What version of KNIME are you using?
  2. Are you using the columnar table backend (based on Apache Arrow)? Preferences->KNIME->Table Backend->Default (you may need to install the Columnar Table Backend extension).
  3. How much memory have you allocated?
  4. How complex are the items you are filtering on? If strings, how long are they, and is there a high degree of correlation in the first couple of letters?

I've created a sample workflow (Scalable Row Filter 2023-07-02 PA1 – KNIME Community Hub) with 1,000,000 rows, and the average time to filter the data with 80 categories is less than 500 mSec. So, if your workflow is taking a long time, there is scope to optimise.

String matching

The first consideration is the nature of the data that you are matching. Consider the following strings:

“PROD-001-456”
“PROD-001-457”

When matching strings, the computer will compare the first letter, then the second, and so on, until it either reaches the end of the string and confirms that the strings match, or finds a letter that doesn't match and aborts early. Therefore, if your strings are long, matching is going to take a long time and the Row Filter will have a correspondingly long execution time. Also, if the first part of the string is common across all the strings, you introduce a constant delay in the matching process as each common letter is tested, until the letters which identify a difference are reached.

Can you remove redundancy in the data by trimming the strings to drop the common information, i.e. matching a substring (e.g. the last four characters) rather than all characters?
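
A minimal pandas sketch of that idea, with invented product codes, purely for illustration:

```python
import pandas as pd

# Hypothetical codes where only the last three characters differ
df = pd.DataFrame({"code": ["PROD-001-456", "PROD-001-457", "PROD-001-458"] * 1_000})

# Pre-compute a short key once, then filter on it instead of the full string
df["code_key"] = df["code"].str[-3:]
subset = df[df["code_key"] == "457"]
```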

The second optimisation is to replace the string with an integer (index number) that represents the category. Matching integers is a fast, native CPU operation, and it is what most database engineers would do (normalise the data so that the main (fact) table contains indexes that point to dimension tables holding the category descriptions). Matching on integer indices is far quicker, improves data processing speed and reduces memory usage.

[Note, the table backend uses this approach with string columns - it creates a dimension table to store common strings, and then a fact table containing rows with pointers to the actual strings. By reducing redundancy it can significantly reduce the size of data stored to disc.]

Can you add an index column (integer) that represents the string to be matched, then filter on the integer rather than the string?
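
As a rough pandas sketch (column and category names are made up), the string-to-integer replacement could look like this; `pd.factorize` plays the role of the GroupBy + Counter Generation + Joiner chain described below:

```python
import pandas as pd

df = pd.DataFrame({"category": ["alpha", "beta", "alpha", "gamma"] * 2_500_000})

# Replace the string with an integer code once...
codes, labels = pd.factorize(df["category"])
df["category_id"] = codes

# ...then filter on the integer instead of the string
target_id = list(labels).index("beta")   # look up the code for one category
subset = df[df["category_id"] == target_id]
```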

The workflow that I created considered four ways of filtering the rows:

  1. Filtering on strings, which took 531 mSec.
  2. Filtering on integers, which took 339 mSec - note, if you don't have a column of integers you need to factor in the time taken to create one. This is easily done by grouping the original data column to get the list of categories, adding an integer sequence using the Counter Generation node (Modular Data Generators extension), then joining this integer index back onto the original table with a Joiner node.
  3. I'm not 100% sure how the Row Filter node works - my guess is that if there is domain information saying there is a limited number of strings (i.e. it is categorical data, not random text), then it will use the category indices to do the match rather than the actual strings. KNIME assumes a column is categorical if there are fewer than 60 different values; this limit can be raised in the Domain Calculator by unchecking Restrict The Number of Possible Values. This reduced the row filtering time to 391 mSec, however the Domain Calculator itself took 607 mSec. If you have long strings this may give you a net benefit - I don't know your data, so I can't say.
  4. You can filter using a Joiner node. This took 361 mSec on string matching. Sometimes this approach works better than the Row Filter node (see the sketch after this list).
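
For illustration, a pandas sketch of option 4, the join-style filter (names are made up):

```python
import pandas as pd

df = pd.DataFrame({"category": ["alpha", "beta", "gamma"] * 1_000,
                   "value": range(3_000)})

# Filter via an inner join against a one-column table of wanted categories,
# the rough equivalent of using a Joiner node as a row filter
wanted = pd.DataFrame({"category": ["beta"]})
subset = df.merge(wanted, on="category", how="inner")

# isin() is the shortcut for the same semi-join
subset_alt = df[df["category"].isin(wanted["category"])]
```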

As always, your results may vary and a lot will depend upon your data. If you are not using the columnar backend you will see a big improvement by switching to it. Otherwise, if you are matching strings, you can either manipulate the strings to improve matching performance or replace them with an index.

Hope that helps

DiaAzul
LinkedIn | Medium | GitHub

7 Likes

Hi all

Thank you so much for your guidance and contributions.

@takbb and @DiaAzul

The ini file is set to 32604m, so plenty…

I think using an integer rather than a string may be a good idea. I will test that, though GroupBy performance is also slow on a data set of this size.

Optimizing using DB tools is not viable in this case, as the task involves downloading data with an unknown schema (it's GA4 BigQuery data, and the event parameters are nested in JSON, so the task is to download and parse, then perform data validation checks).

There were a couple of suggestions to use the columnar backend.
I implemented that and, disappointingly, it has very painfully slowed down workflow execution, to the point that I couldn't even get to the area of the workflow I originally wrote about.

The significant slowdown happened around loop ends, specifically Parallel Chunk Ends and Chunk Loop Ends. These took roughly 10 times longer.

Has anyone else experienced this issue with columnar tables and loop ends?

I am using default columnar settings with a 20g allocation to memory.

@Gavin_Attard

It also depends on how much physical memory you have. Within the ini file you can set the size of the heap (KNIME's in-memory storage). However, KNIME itself, extensions written in languages other than Java, and the operating system all have additional memory needs. For example, when reading and writing files, KNIME and the operating system create buffers so that data can be collated and written in chunks rather than byte by byte. Apologies if I am teaching you to suck eggs, but not everyone reading the forum has the same level of knowledge.

Making KNIME run quickly is about balancing the needs of all parts of the system, not just about allocating as much heap space as possible. This is a tricky task when dealing with large files and complex workflows that may use components whose memory needs are not well documented. It is difficult to determine which components use the heap and which have memory requirements outside of it.

There are tools to help work out what is happening, but even with these tools it is difficult to pin down what is actually causing the problem, given that there is usually a lot happening on the machine at the same time.

You may want to consider reducing the size of the heap (it may sound counter-intuitive), but what you are describing sounds like memory exhaustion at the OS level, which is driving a high number of page swaps. You can check this in Performance Monitor to see whether there is a lot of paging activity to the swap file.
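
If you prefer to script the check rather than watch Performance Monitor, a quick sketch using psutil (an assumption on my part, it is not something KNIME ships) would be:

```python
import psutil

vm = psutil.virtual_memory()
sm = psutil.swap_memory()

print(f"physical: {vm.percent}% used of {vm.total / 2**30:.1f} GiB")
print(f"swap:     {sm.percent}% used, {sm.sout / 2**20:.0f} MiB swapped out")
# A steadily growing 'sout' while the workflow runs points at paging pressure.
```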

DiaAzul

1 Like

Thanks @DiaAzul

Yes, I have 64 GB and there is plenty of headroom.

From my observations, it appears that anything that writes to disk is where the time is spent.

Writing a file to disk (KNIME table format) is super slow (15M rows). My guess is that this is the compression process, but I would expect it to be faster; the CPU is barely being used…
It's faster if I use Parquet, but I can't seem to read the data back into the nodes.

There is a bottleneck somewhere, no doubt, but I fear it's more of a core code issue than anything I can tweak.

@Gavin_Attard you could try and put a Cache node in front of the writer node.

What does this mean? What kind of data is in there that you cannot reload it?

Have you tried to write the parquet data maybe in chunks?

Have you tried suggestions from the article like using streaming? Maybe put a Cache node in front of a streaming part.

Hi
@mlauber71

Yes, when I try to read the Parquet file it won't run.

I'm a bit hesitant to use Cache as writing to disk seems to be very, very slow. I'll give it a go and see.

Can you try with a smaller sub-set to see if it does run at all?

The Cache node can also store the data in memory. But then the question is again if it would fit.

My impression is that something is still going on with your disk. Have you checked for aggressive virus scanners?

Hello @Gavin_Attard,

coming back to your initial question and problem.

10M+ rows is far from massive in my opinion. I have been dealing with 100M+ rows in the Row Filter and it was slow, but nothing I was worried about. How long does it take for you?

The wider the table, the longer it takes, so if you can do some column filtering prior to row filtering, do it. (Edit: this actually doesn't help, as the Column Filter node only hides columns in the output, so you need to use a Cache node, for example, for it to have an effect.) Additionally, long RowIDs can have a huge impact on processing, so if you have a Joiner in a loop or a Cross Joiner, make sure to use the RowID node to "reset" them. (Edit: after additional tests I came to the conclusion that the length of the RowID doesn't affect Row Filter execution time.) I have done some tests, and 15M rows with 5 columns takes around 20 seconds on my laptop. Adding columns (40) or extending the RowIDs (e.g. Row0_Row0_Row0_Row0_Row0_Row0_Row0_Row2) using a Cross Joiner prolongs this by more than 5 times. Tests were done on v4.7.4 and a solid laptop.

Hope this helps.

Br,
Ivan

5 Likes

@ipazin Long time, my friend!

And as always, your insight is most valuable.

The data set is wide, and I can't get away from that as it's the nature of this task.
However, I will try restricting the strings, and I will look into RowIDs as I was not aware of that.

I also removed the workspace folder from the antivirus scan.

Let's see if all this helps.

thanks again

1 Like

Hi guys,

I read all the comments here and I think I can add a few more tips to try…

1- Use the "RowID node" to reset/create a new sequence for the rows. It'll keep the RowIDs short and make it easier to classify the data later too.


2- Try the "Sample node" to test on part of the data and see what happens. You'll spend less time from this point on, and afterwards you can set it to use 100% of the data or just remove this node.


Here I set just 5% from the top of the data table.

3- Sometimes I use the "Partitioning node" to break the data into branches/threads, with a "Wait node" connecting the end of the first branch to the start of the second. It runs like a "Chunk Loop node", but you can see, and rerun, the part that stopped on an error.


As with the Sample node, I set 50% to split it relative to the top of the table.


I don't set any time here because I just need to wait until the previous process is finished.

After that, you can concatenate all the rows together and work with the data you need.
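
For anyone who wants to see the same split-then-concatenate pattern outside KNIME, here is a rough pandas sketch; the table, the `run_integrity_check` stub and the 50/50 split are placeholders:

```python
import pandas as pd

def run_integrity_check(part: pd.DataFrame) -> pd.DataFrame:
    # Stand-in for the real validation: flag rows with any missing value
    return part.assign(has_missing=part.isna().any(axis=1))

# Small stand-in table; in the real case this is the 15M-row set
df = pd.DataFrame({"category": ["a", "b"] * 5_000, "value": range(10_000)})

# "Partitioning node": split the table in half, relative, from the top
half = len(df) // 2
parts = [df.iloc[:half], df.iloc[half:]]

# Each branch runs in turn (the Wait node only enforces this order),
# then the concatenation step puts the rows back together
result = pd.concat([run_integrity_check(p) for p in parts], ignore_index=True)
```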

The idea is:

Workflow:
KNIME_project_Branch.knwf (32.4 KB)

As you can see here, after each Sorter node finishes, it starts the next branch and carries on.

Can you try it and give us some feedback? For me it works very well; in your case it may help even more.

Thanks,

Denis

2 Likes

Good to know. I never realised this before, but then I usually don't have millions of rows.

It begs the question: shouldn't this be addressed in the KNIME core? There does not seem to be an obvious reason why it is needed. I thought only changed data gets written to disk again, and a Row Filter should then only mark rows as removed, which should basically result in very few writes, unaffected by RowID length or table width.

1 Like

First, the simple things: what type of SSD do you have? Cheap QLC drives without DRAM can be terribly slow, close to HDD slow, with tons of small writes. Another issue is if the SSD is close to full; performance can start to slow down and drop off a cliff at some point, leading to an unusable system.
Also, if you use the columnar table backend, you can reduce the memory for KNIME in the ini file, and importantly make sure that the ini heap plus the backend memory isn't more than your system has. Otherwise it will start using the swap file, and then things also get very, very slow.

I often use Python, and the Python node also suffers from wide input, as more data needs to be moved to Python. I usually just use a Column Splitter before, and a Column Appender after, the Python node. This can lead to significant speed improvements.

In your case you could do something similar. Fix the RowIDs and then split the columns: only the column you filter on goes into the Row Filter. Then with a Reference Row Filter you can remove the same rows from the other columns and use a Column Appender. I'm not sure this is faster, but given @ipazin's tests, with wide tables being so slow, it might indeed be.
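
As a rough pandas sketch of that narrow-filter-then-reattach pattern (column names are invented; in pandas a plain boolean mask on the full table is usually enough, this just mirrors the node layout):

```python
import pandas as pd

df = pd.DataFrame({"category": ["alpha", "beta"] * 5_000,
                   "wide_1": range(10_000), "wide_2": range(10_000)})

# "Column Splitter": only the filter column goes into the row filter
key = df[["category"]]
kept = key[key["category"] == "beta"]      # cheap: one narrow column

# "Reference Row Filter" + "Column Appender": pull the same rows from the wide part
wide = df.drop(columns=["category"])
result = wide.loc[kept.index].join(kept)
```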

And while writing this, of course there is always "the hammer": just stream your file into KNIME using a Python Source node (or R). You can then filter "rows" while reading them ("rows" in quotes because it also works if a row spans multiple lines; just read what belongs to a row, then decide whether to keep it or not). This will also solve any memory issues if lots of rows are removed by the filtering. There is a reason some purists don't like GUI tools; here you are likely running into limitations, and in pure Python you could probably make it a lot faster.
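
A minimal sketch of that streaming idea in plain Python/pandas, with an invented file name and filter column:

```python
import pandas as pd

keep_categories = {"alpha", "beta"}   # whatever the integrity check needs
chunks = []

# Stream the file in slices instead of loading all 10M+ rows at once;
# rows are filtered while they are read, so memory stays bounded.
for chunk in pd.read_csv("big_export.csv", chunksize=1_000_000):   # placeholder file
    chunks.append(chunk[chunk["category"].isin(keep_categories)])

filtered = pd.concat(chunks, ignore_index=True)
```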

Another "hack" could be to do as you did, then load all the rows into a local SQLite (or DuckDB) database, filter in the database, and load them again. This gives you the full power of SQL, and you can run all the filters in one go.
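
A rough sketch of the DuckDB variant (file, table and column names are invented):

```python
import duckdb

con = duckdb.connect("local.duckdb")   # throwaway local database file

# Load the already-parsed export once...
con.execute("CREATE OR REPLACE TABLE events AS SELECT * FROM 'big_export.parquet'")

# ...then run all per-category checks in one SQL pass
checks = con.execute("""
    SELECT category,
           COUNT(*) AS n_rows,
           SUM(CASE WHEN value IS NULL THEN 1 ELSE 0 END) AS null_values
    FROM events
    GROUP BY category
""").fetchdf()
con.close()
```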

3 Likes

How about using a Chunk Loop? I think it could work.