Output collection

I want to apply the following logic to an input file (.csv) which has 3 columns: MEMBER ID, EFFECTIVE DATE and END DATE.

For each row

    For each month 'c' between EFFECTIVE DATE and END DATE

        WRITE <MEMBER ID, c>

Sort the data on 'c'

For each 'c'

    Count the number of unique MEMBER IDs

    WRITE <c, COUNT>

I do not know if this can be done using the existing loop support in Knime, because I believe it is only possible to loop over the input rows.

Just to make sure I understand what you are asking:


You have a set of intervals A (in a csv) and a fixed set of months or dates B.

For each month/date in B, you want to know for how many cases it falls within the interval in A.

Ellert van Koperen - My requirement does not have two files A and B. It has only one file, which has three columns: <Member ID, Eff Date, End Date>.

However, what I want is similar to what you have described, but I would like to apprise you of the volumes of A and B here:

A can have records numbering in the ones to tens of millions - for example, say 4M.

B is a fixed set of months spanning at least 4 years - 48 records.

For each month in B, you are checking whether it falls within the interval of each record in A, i.e. 48 * 4M = 192M comparisons, which is not at all an optimal approach.

If you know of any approach to reduce the number of comparisons, please let me know.

I do indeed see the issue here. One tricky point to add is that you mentioned you want to count the unique member_ids, meaning the intervals might clash/overlap. Interesting problem...


There are basically two approaches to this problem:

- write a program or script that reads the csv one row at a time and does all the operations on each row at once (see the sketch after this list). This would perform pretty well.

- do it the Knime way: expand your dataset, then group by. This would perform rather awfully but would be easy to build. And as computer power is cheap, and the time of a human is not, just throw more power at it. Crude but effective.
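For illustration, here is a minimal sketch of the first approach in Python; the file name, the header row, the column order and the MM/DD/YYYY date format are all assumptions based on the examples in this thread:

    import csv
    from collections import defaultdict
    from datetime import date

    INPUT = "members.csv"  # hypothetical file name

    def parse_us_date(text):
        # MM/DD/YYYY, as in the examples in this thread
        m, d, y = (int(part) for part in text.split("/"))
        return date(y, m, d)

    def months_between(start, end):
        # Yield the first of every month from start's month through end's month.
        y, m = start.year, start.month
        while (y, m) <= (end.year, end.month):
            yield date(y, m, 1)
            m, y = (1, y + 1) if m == 12 else (m + 1, y)

    members_per_month = defaultdict(set)  # month -> member ids active in it

    with open(INPUT, newline="") as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row, assuming there is one
        for member_id, eff, end in reader:
            for month in months_between(parse_us_date(eff), parse_us_date(end)):
                members_per_month[month].add(member_id)

    for month in sorted(members_per_month):
        print(month.strftime("%m/%Y"), len(members_per_month[month]))

Because each month maps to a set, a member that appears in several overlapping rows is still counted only once per month, which matches the unique-count requirement above.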

So, are you going to do this on a regular basis, or is this a one-time operation?
If it is a one-time thing, I would say just let Knime chew on that data for a weekend.
If, on the other hand, you need to do this on a regular basis, it would be worthwhile to invest the time to make a dedicated program for it.


The logical Knime approach would be to 'explode' the dataset, in other words to convert the intervals into separate data points for each month they cover. After that operation you will have a very, very large dataset indeed.
This might cause problems, which you can alleviate by chopping your dataset into chunks and processing these chunks one at a time. In that case you need to be careful to keep the same member_ids together in one chunk, as otherwise your counts will be off.
Then apply a filter for each month (so split the data into N sets) and group each set on the member_id. You might be able to use a loop for that, though the size of the sets will not help. A sketch of the chunking idea follows below.
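As a rough illustration of that chunking idea outside of Knime (a sketch only; the column names, the chunk size and the pandas usage are my assumptions): if the rows are sorted by member id and chunks are cut at member boundaries, the per-month unique counts of the chunks can simply be added together.

    import pandas as pd

    def explode_to_months(chunk):
        # One (member id, month) row per calendar month an interval covers.
        rows = []
        for r in chunk.itertuples(index=False):
            for month in pd.period_range(r.EFFECTIVE_DATE, r.END_DATE, freq="M"):
                rows.append((r.MEMBER_ID, month))
        return pd.DataFrame(rows, columns=["MEMBER_ID", "MONTH"])

    def count_unique_chunked(df, chunk_size=100_000):
        df = df.sort_values("MEMBER_ID").reset_index(drop=True)
        totals, start = None, 0
        while start < len(df):
            end = min(start + chunk_size, len(df))
            # Extend the cut so one member never straddles two chunks.
            while end < len(df) and df.MEMBER_ID[end] == df.MEMBER_ID[end - 1]:
                end += 1
            exploded = explode_to_months(df.iloc[start:end])
            counts = exploded.groupby("MONTH")["MEMBER_ID"].nunique()
            totals = counts if totals is None else totals.add(counts, fill_value=0)
            start = end
        return totals

Keeping a member's rows inside one chunk is what makes the per-chunk counts safe to sum: no member can contribute to the same month's count from two different chunks.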


Ellert van Koperen - This is a repetitive process. It would be done at least at the end of every quarter.

I understand your Knime approach, but I do not know how to implement the 'explode' part in Knime. This is a pivot-and-expand-like operation and I cannot find a suitable node for it. To illustrate, I need Knime logic which can take a single record

<MEM1001,01/01/2011,12/31/2011> as Input and would return

<MEM1001,01/01/2011>

<MEM1001,02/01/2011>

<MEM1001,03/01/2011>

.....

<MEM1001,12/01/2011> as Output.

srao,

Outline of a solution, as I interpret the requirement (a short script sketch of one loop iteration follows the list):

  • Split input column into array of 3 (prefix, start date, end date)
  • Parse texts to dates (clean or handle 3rd column's closing bracket)
  • Row to variable loop start, feed vars into "time generator" node
    • Variable-controlled start and end date, abundant number of rows to make sure there are always enough
    • (maybe) Mask time
    • "GroupBy" date field
  • Loop end
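For what it's worth, here is what one iteration of that loop should produce for the example record, sketched as a script rather than actual Knime nodes (the pandas call is my own shorthand for the time-generator step):

    import pandas as pd

    # The example record from above: <MEM1001,01/01/2011,12/31/2011>
    member_id, eff, end = "MEM1001", "01/01/2011", "12/31/2011"
    for month in pd.period_range(eff, end, freq="M"):
        print(member_id, month.start_time.strftime("%m/%d/%Y"))
    # -> MEM1001 01/01/2011, MEM1001 02/01/2011, ... MEM1001 12/01/2011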

Not extremely efficient, but it should work for reasonable input table sizes.

Cheers
E