A few uncommon tasks

Dear Knimers, today I dealt with a few uncommon tasks. This post is intended to be a sharing session, but I hope it’s not one-directional.

  1. Removing rows that contain missing values for ALL columns:

Context: The Row Filter Node can’t handle multiple columns at once. Meanwhile, the Missing Value Node treats all selections as ‘OR’ instead of ‘AND’.

The solution I came up with is the Rule-Based Row Filter Node:

As you can see, I had to manually write the script to include all columns and to use the ‘AND’ operator. Imagine if the number of columns is >20; that would be irritating, I guess.
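For illustration, here’s the kind of rule I mean, with hypothetical column names Col1 to Col3 and the node set to exclude matching rows:

    MISSING $Col1$ AND MISSING $Col2$ AND MISSING $Col3$ => TRUE

Every extra column means another MISSING clause written by hand.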

I wonder how you would approach this differently?

  2. Dealing with the downstream tasks after the Loop End (Column Append) Node.

Context: Each iteration produces duplicate columns of the same attribute. In order to use the Concatenate Node to join the iterations together under the same attribute, the columns need to be split (and then renamed). Splitting the columns using the Column Splitter Node has its own difficulties:

i) If one chooses Manual Selection then, as the name suggests, it’s manual work, and the node has to be used repeatedly to split off the columns of each iteration.
ii) If one chooses Wildcard/Regex, there’s a pattern that can be utilized (each column has the iteration number appended to its name), but the first iteration (Iteration 0) doesn’t carry that suffix.
iii) If one chooses Type Selection, that’s not helpful, since the groups of iteration columns can’t be distinguished by column type.

Here’s a simple version of how a table may look upon being regurgitated by the Loop End (Column Append) Node:

[Screenshot: sample table with appended iteration columns]

The solution I came up with is:

This converts the table into:

[Screenshot: converted table]

  3. Another uncommon task I encountered today is to classify the profile images returned by the Twitter API into three groups:

i) Non-human images
ii) Image of a male human
iii) Image of a female human

This is useful for demographic descriptive analysis. I’ve heard that there are libraries out there for human skin detection and gender recognition. I’ve also heard that KNIME is adopting unsupervised zero-shot models; although these are more focused on NLP, based on a few presentations I watched on YouTube it seems they can also be used for image classification. I’m hoping that in the future there’ll be example workflows for this purpose.

At the moment, I have no workaround for such image classifications. Till then, I’ll leave the thread open for commentaries/discussions. Have a great day!

1 Like

@badger101

For your rows with missing values you can aggregate all columns into a set and then exclude those without a set. I’m not sure how it would perform with lots of columns and data - whether the set creates new data objects (inefficient), or just a list of pointers to the data objects (efficient). But it is the quickest and most flexible way I have found to perform this task.
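One way to express the final filtering step (a sketch; the aggregate column name below is hypothetical, and it assumes the aggregator’s “Missing” option is left unticked so an all-missing row yields a missing set cell):

    // Rule-based Row Filter, set to exclude matching rows
    MISSING $Set(all columns)$ => TRUE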

Screenshot 2022-09-28 171145

3 Likes

Hi @badger101

Was working on #1 when @DiaAzul beat me to it. I would opt for the normal concat option rather than Set; this creates just a normal string.

3 Likes

Just drag it on and you are done.

6 Likes

Thank you @DiaAzul , it’s amazing how a small thing can change an outcome. I had already tried that workaround before writing this forum post. Intuitively, I ticked the Missing Values box because I thought the Row Filter Node that follows would filter the missing values out.

What I didn’t know was that if I leave the box unticked, as you did up there, the rows with all missing values will produce a missing cell! Thank you for this discovery!

1 Like

@ArjenEX Thank you too ! :grin:

@iCFO That’s from a community extension I’ve never heard of. Will definitely check it out! Thanks for sharing!

2 Likes

I have to regularly use “add empty rows” / “remove empty rows” to deal with financial spreadsheet inputs and outputs. I keep them in my favorites for quick access.

2 Likes

Would it be possible to upload the workflow behind number 2? For some reason it screams Column Rename (Regex) and Loop End (Column Appender) to me.

2 Likes

Sharing Is Caring.knwf (58.6 KB)

Here you go @ArjenEX .

The Table Creator has the sample table that mimics the result of a Loop End (Column Append) Node.

1 Like

@badger101 , you’re welcome.

@ArjenEX , you could concatenate strings; however, with my coding hat on, string concatenation is always a memory-churning activity (you can’t just modify strings in memory; you always end up copying both parts of the string to another part of memory to create the new string and then releasing the original memory). With a set, I am making a guess that the set just creates a set of pointers to the original objects and doesn’t create new ones. I don’t have the code, so I can’t comment with absolute certainty, but that is how I was thinking.

3 Likes

I tested both:

Context: Columns = 9, Rows = 5876; some columns contain images.

Both give similar results.

Execution time:

Set = 1 minute. Concatenate = Instant :rofl:

3 Likes

@badger101 first of all, brilliant thread

@DiaAzul @ArjenEX @badger101 I always thought that concat would be faster than set, and that seems to be the case based on @badger101 's last post. For that reason I would have used concat as @ArjenEX did instead of set, but I’m intrigued by what @DiaAzul said.

For me, I would think that concat is simply appending the new data at the end and that’s it, while with set, it would probably still append the data, but it needs to keep track of where each piece of the data is in the set.

@iCFO great share there, I was not aware of this node. I usually use the same method as @ArjenEX in my previous workflows to determine whether there are any empty rows.

3 Likes

Hi @badger101, I second @bruno29a 's sentiment. Good to see a discussion thread like this, as we can all learn something new from other people’s ideas.

With your question about the output from Column Append, I wonder if there is something that could be done in the loop (or maybe by using a different type of loop end?). I know these kinds of situations occur, but I was struggling to think up an example flow that created this situation, and I’d be interested to know whether the original flow could be re-worked so as to prevent this happening in the first place.

But anyway, assuming it does happen and there isn’t a remedy…

Your flow handles the job, and all I could do was think of some refinements to make it more generic, and also to have it rename the final output so it has the column names taken from the original dataset.

I also extended your sample data, adding an additional column “Year” and then adding an additional Iteration. This was just to prove the code still worked for more iteration columns than just 2.

Because I had added an extra column, I needed to manually change the Math Formula from a divisor of 2 to a divisor of 3. I then added some additional nodes (Extract Column Header, Transpose and Insert Column Headers); their purpose is to use the original column names from the data table to rename the resultant columns from your loop.

This appears to work:

but I could see it should be made more generic.

The main thing was that prior knowledge of the number of columns in the original data set should not be required. If we know a pattern that identifies a repeat column (#Iter … ), then by filtering those out using a Column Filter we can use Extract Table Dimensions to count the columns that are left. These are the “original” columns, and we can apply that count into the Math Formula node as the divisor.
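As a made-up worked example of that arithmetic:

    columns in looped output       = 9
    columns matching "#Iter.*"     = 6
    original columns (the divisor) = 9 - 6 = 3
    iterations                     = 9 / 3 = 3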

And finally, if this is something you or others (or future me!) often face, I figured it ought to be turned into a component :slight_smile: Just because…

And if it is going to be a component, it should be flexible enough to handle other “repeated column” patterns using regex. So it allows that to be configured, with a default of #Iter.*

The component can be found here:

The sample workflow here:

Sharing Is Caring - plus component.knwf (63.1 KB)

3 Likes

Hi @takbb , let me take some time to go through your post. I’ll give a reply to it. In the meantime, I think there’s something interesting about the Concat versus Set debate. The trial I did earlier wasn’t done concurrently. Just now, after closing the workflow window, running the garbage collector, and reopening the workflow to conduct an unbiased trial, here’s what I found when I initiated both tests at the same time.

The Concat executes faster than the Set. Right after it finishes (at about the one-minute mark), the Set finishes instantly. There are two possible reasons:

  1. It could be that the slower counterpart (Set) ‘cheats’ (something to do with cached memory?). This possible cheating reminds me of the rivalry between Celera and the HGP in the Human Genome Project back in the 90’s.

  2. It could also mean that Set picks up its momentum right after a certain point.

Since I don’t have the technical knowledge as to how the memory system works, I can’t choose a side.

Lastly, I’ve also tested whether removing the image column alters the result (since I assume loading images might affect the execution time):

[Screenshot: timing results without the image column]

It appears that column type may not have anything to do with the comparative performance. :man_shrugging:

1 Like

@badger101

Thanks for doing a comparison between set and concat, that is very helpful. Following on from your work, I would posit that the following may be happening.

  1. I made an assumption that set would be faster than concat on the basis that the original objects would not be copied to the aggregation column. On reflection, I don’t think this assumption is correct. It seems that, when creating a set, KNIME copies all the objects to the new column, then creates a set object to encapsulate them. Lots of work, which is why set is slower.

  2. Concat doesn’t copy the object, it only concatenates the string representation of the object. For strings that is the string, but for numbers it is the string representation of the number, for images it is a string representing the image (not the whole image). Only one object is created in the aggregation column which is the final string.

  3. KNIME caches data in memory for each node. I suspect that the difference in processing time for the set processed after concat is because the concat triggered data to be cached in memory; then set could benefit from the already loaded data.

  4. We are all missing the fastest option in the Column Aggregator node, which is Missing value count. This counts the number of missing values in the row. If the missing value count equals the number of columns then all columns are missing data. You can also calculate percentage missing values with this approach if you want to set a threshold for missing values. It’s also quicker than both set and concat.
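A minimal sketch of that last check, assuming 9 data columns and an aggregate column named “Missing value count” (both assumptions), in a Rule-based Row Filter set to exclude matching rows:

    // all 9 columns in this row are missing
    $Missing value count$ = 9 => TRUE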

This is all a bit moot, as there is a community node that does it all for us :joy:

4 Likes

Hi @badger101, to do a fair comparison I would run the two trials separately and do them a number of times. I’d also add a Timer node at the end to capture the execution time for each.

Running a thread where both nodes are executing at the “same time” to see which visually finishes first doesn’t really guarantee anything. Whilst in computing terms both might run in their own “thread”, actual parallel concurrent processing can never be guaranteed.

Let’s assume for the sake of argument that it is a single processor, or more specifically, a single core. In this situation parallel processing is actually an illusion…

At any one moment only one thread can be executing. It runs until it yields execution to another thread, at which point the other thread picks up and continues for a short time before it too yields and it’s back to the first thread again. This continual task switching repeatedly occurs until one and then the other thread completes.

There are a number of reasons why a thread will yield execution. It may be that it is allocated a specific slice of time by the job scheduler (built into the Java VM or maybe the operating system). It may be that it is allowed to process until it reaches specific points such as when it has to access physical hardware such as the disk and allows itself to be interrupted whilst the much slower peripherals are accessed by other components in the system.

Whatever the mechanism, it might be that just by fluke one thread is able to complete before the other even though its actual processing time has been longer. The other might have spent a lot of time waiting to complete simply because the first thread didn’t yield “fairly”.

In KNIME processing terms, I don’t know the technical details for the concat versus set options but there are some things I could infer.

Assuming you are using concatenate rather than concatenate(unique) I would typically expect Concat to be faster than Set. A Set requires additional processing because it contains no duplicates. You might therefore expect a processing overhead for “set management”.

On the other hand, Concatenate allows duplicates, and so all it does is continually append data to the end. As has been mentioned earlier in your thread, simple String concatenation in Java is memory intensive, as every time a string is concatenated Java creates a new object to store the new, larger string. However, some Java classes such as StringBuffer or StringBuilder don’t do this and instead maintain a single growable buffer (memory reference), so they don’t suffer from the same issues. Without getting more technical, I suspect that Concatenation doesn’t suffer from the memory issues associated with inefficient String coding.
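As a rough Java illustration of that difference (a generic sketch, not KNIME’s actual implementation; pieces stands for some String[] of row values):

    // Naive concatenation: each "+=" copies both halves into a brand-new
    // String object, so building n pieces costs O(n^2) character copies.
    String naive = "";
    for (String piece : pieces) {
        naive += piece;
    }

    // StringBuilder appends into one growable buffer, reallocating only
    // occasionally, so the total cost is roughly O(n).
    StringBuilder sb = new StringBuilder();
    for (String piece : pieces) {
        sb.append(piece);
    }
    String efficient = sb.toString();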

So concatenation might use more memory because it doesn’t remove duplicates. Set might use less memory, but I’d expect it to require slightly more processing. The actual result may depend to an extent on the size of your data items in each row.

Concatenating very large items that contain lots of duplication might be slower, because of overall memory availability, than removing duplicates in that same data set. If the system has to start switching out pages of memory to accommodate high memory usage, this could be disastrous for performance. So I don’t think it is totally predictable for all scenarios, but if I had to place a bet, and I had plenty of memory, I’d generally go with Concatenate.

[Edit… I think @DiaAzul beat me to it… lol]

3 Likes

Hi again @takbb , first of all, thank you for such a detailed post.

As a component, that column rename element is a nice addition. I haven’t checked the component, but based on what you’ve written (and please correct me if I’m wrong), for this particular issue of mine it seems that the component needs to be fed column names matching #Iter.*, right? I have no doubt that it’ll work when all column names contain that substring. As I mentioned in the post, the issue with the Loop End (Column Append) Node is that the first iteration (Iteration 0) doesn’t have that substring. Here’s a dummy outcome from a real result of that node to re-emphasize my point:

Since the info we have at hand is how many columns we have per group of iterations, it’s possible to use this info to rename the first x columns such that they include the #Iter 0 substring, before feeding the table into your component. What I can do is manually rename the first x columns using the Column Rename Node. I can see that this is doable. Do you have any pointers as to how I can do it in a less tedious manner using the Column Rename (Regex)? (I’m guessing @ArjenEX might have figured out something by now, since this node was mentioned earlier.) I would appreciate any workaround that automates things as much as possible.
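Perhaps a negative lookahead could do it? An untested sketch of a Column Rename (Regex) configuration (the exact suffix format here is just a guess on my part):

    Search String : ^(?!.*#Iter)(.*)$
    Replacement   : $1 (#Iter 0)

In theory that would append the suffix only to names that don’t already contain ‘#Iter’.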

Let me first point out that, as far as example workflows with the Twitter API on the hub are concerned, iterating over query variations in loops can be considered uncommon. Most of the workflows up there show people querying high volumes of Tweets and analyzing them one query at a time. While that’s something I do too, my task also involves many variations of that one query.

The typical loop end node is not able to hold the result from a previous query, as the incoming results overwrite the table. The typical way to overcome this hurdle is by writing the results to an output file, as shown in this workflow on the hub (not mine), or as suggested in this post.

It’s just my personal choice to find my own way around it, which is by using the Loop End (Column Append) Node. I like to keep everything inside the workflow, and I don’t like creating unnecessary files that I won’t use. It saves me space and avoids clutter in the destination folder. Again, it’s just my personal choice :grin:

1 Like

Hi @badger101 the component should deal with it automatically. In fact it relies on the #Iter pattern (or whatever pattern is typed into the component’s config dialog) NOT being present in the first n columns, and from this it determines the value of n.

It’s intended that it be fully automatic. If your repeat columns contain the stated pattern in their name, then no matter how many columns wide or how many iterations, the component should do the rest.

2 Likes

@takbb If that’s so, I believe that is the solution for #2 . I’ll continue with my work tomorrow and will test it out! Thank you, in the meantime!

2 Likes