multi row formula

Hi team,

facing problem in middle of workflow and the problem is i have to use expression with unique rows and lag column(i mean -1 row. )

so basically i have scenario where some data has duplicate value but with unique identifier ( there are also duplicates also with unique ids) and i have to merge two column before grouping it to get unique data and than i have to use some expression like:-
{IIF(isnull([Row-1:RecordID]), 1,[Row-1:RecordID]+1)}

i had used lagged function to get -1 records and the problem is lagged worked for all the records so i have used chunk loop and lagged function to get -1 record for specific groups and column expression for condition.

now the problem is, i have some where around 70K groups and i can use only 1 chuck row for specific groups. so for 70K it will take around 2chatr0 hours. 1 record 1 sec.

is there any other option i can go with

Hi @Hammy

Could you share a small data sample / wf depicting what you are doing? :slight_smile: I have to admit I didn’t completely get it yet
First thought though I had is to increase the RAM that KNIME AP is allowed to use (see the first section in THIS blog post).

Alice

2 Likes

Sorry for not being able to explain it to you; however, my workflow is in Alteryx, and I am migrating to Knime. and answer to your question about ram, i have given dedicated 16GM ram to knime.

so basically there are few formula which is being used.

  1. grouping 2 columns with condition {IIF(isnull([Row-1:RecordID]), 1,[Row-1:RecordID]+1)}
    recordID is index for unq identifier

  2. creating new column (new end_date)–IIF([RecordID]=1, [end_date],

IIF(isnull([end_date]), null(),

IIF(isnull([Row-1:new end_date]), null(),

IIF([end_date]>=ToDate(Datetimeadd([Row-1:new end_date],1,‘days’)), [end_date], [Row-1:newend_date]))))

let me try to create a demo workflow

Multi row Alteryx to KNIME conversion was mentioned on the forum some time ago as well.

Maybe this already helps you further and otherwise indeed a sample with input and expected output.

1 Like

unable to upload zip file

The forum restricts the size of uploads. It’s best to create a sample workflow with some anonymized data, export it, upload the .knwf and draft your expected output. When exporting, uncheck the option to reset the workflow.

mig_alt_kni.knwf (33.2 KB)

if you want we can connect on call to resolve this

@ArjenEX did you get any solution.

No, I had a look but don’t fully understand what you’re trying to achieve.

Some question that come into mind:

  • What is your expected output?
  • What is “value” in your case? It isn’t named as such in your workflow.

  • Is RecordID the same as _id?

  • IIF([end_date]>=ToDate(Datetimeadd([Row-1:new end_date],1,‘days’)), [end_date], [Row-1:newend_date])))) For someone without Alteryx syntax knowledge, this is a bit tricky to understand.

Based on your chunk approach I suspect that you want the evaluate the end dates per group as identified by the Mix column.

First group as example:

Per “Mix” group, the latest end date should be taken and the other row(s) should be updated accordingly? Something like this:

image

Please provide more guidance and what your expected output is.

Basically I need to apply the formula on a particular group. For example there are 4 rows in a particular group so I need formula to check for the conditions in the particular group and apply (row - 1 formula )on this 4 rows after this is done it should consider another group.

Please refer new end date column as expected output.

or we can get on quick call today?

That can be done with a Group Loop. There are plenty of examples on the Hub but the basic principle is that it will process all rows which belong to the same group as per your selected input column (most suitable memberid or Mix)

2 Likes

if we use group loop start, it will take somewhere around 10–15 hours to execute the workflow.
may be i am missing out something, that need to be figured it out.
we can not depend on loop chunk or group loop.

Hi @denisfi

please refer to this workflow

@Hammy maybe you could try and formulate your problem again. Do you want to detect the latest date per group? Or is it something else? Also maybe you leave out all columns not necessary to solve the tasks and name the columns you will need accordingly so we might be able to understand.

2 Likes

I’m also a bit unclear on what is being looked for here.

It might help if the purpose of the function could be described in words instead of trying to write it as a formula (eg 'if the previous row and current row are in the same group, use the end date from the previous row, otherwise use the end date from the current row.).

A few general ideas suggest themselves to me:

  • If you were after the same value for each member of a group, you may be able to use a group node to find some aggregate value (last end date, first end date etc) for each group, then use a join node to add the value back to each row.
  • The Column Expressions node allows you to reference values in previous or next rows. You need to turn this on in the advanced tab in the node, but it would allow you to perform calculations based on the current and previous row.
1 Like

Thankyou @mlauber71 @justin.m @ArjenEX for your support. We’d try all your suggestions and solutions that are available in the forums.

But the solution that we’re looking for is not available in the forum, and some similar topics are being reported, and those are not helpful, as I have already tried them. I believe this is a new problem that has never been reported.

We were able to solve that problem, but it will take 15 to 20 hours to complete. However, we can’t depend on it. We need solutions that can optimize the workflow; is there any other way we can use it?

FYI: The formulas that we use are complicated.

@Hammy I would still suggest you try and formulate your problem with a clearly described example illustrating what kind of formula you would need.

If pure computing power is the issue you could try parallelization combined with group loop. It might need some experimenting.

1 Like

From looking at your original post, it looks like you are trying to do something like “if a row is missing RecordID, increment the RecordID from the previous member of the same group by one, otherwise use 1.”

If so, one of the problems may be that using the row offset in column expression node only returns values from the input table, it won’t return a new column or updated value that happened in the node itself.

Given an input table like this:

Ordering Group RecordID
1 A 1
2 A 2
3 A ?
4 A ?
5 B 3
6 B 1
7 B ?
8 C 5
9 C 6
10 C ?
11 C ?
12 D 7
13 D ?
14 D ?
15 E ?

One possiblity would be to update the RecordID with a column expression like:
if (column(“RecordID”)==missingValue())
{
if (column(“Group”)!=column(“Group”,-1))
{1}
else
{
if(column(“RecordID”,-1)!=missingValue())
{column(“RecordID”,-1)+1}
}
}
else
{column(“RecordID”)}

Assuming your input is sorted by group, this will return the existing RecordID if there is one. If there isn’t it will return a 1 if it’s the first item in a group, otherwise it will increment the previous row’s RecordID.

The only problem is that if there are multiple missing values, one after the other, it will only fill the first one. But you can use a recursive loop to run the same column expression node repeatedly until every value is filled in. It’s still using a loop, but I’m pretty sure it will be faster than a group loop.

This is what I ended up with:

Ordering Group RecordID
1 A 1
2 A 2
3 A 3
4 A 4
5 B 3
6 B 1
7 B 2
8 C 5
9 C 6
10 C 7
11 C 8
12 D 7
13 D 8
14 D 9
15 E 1

All the missing RecordID values were filled in after 2 iterations.

You may be able to use a similar approach to make other formulae work.

Depending on what you are trying to do, a Rank node might also be useful. It would let you enumerate items within a group, and you could then perform calculations using those numbers.

2 Likes

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.