I do see the issue here. One tricky point is that you mentioned you want to count the unique member_ids, meaning the intervals might clash/overlap. Interesting problem...
There are basically two approaches to this problem:
- write a program or script that reads the csv one row at a time and performs all the operations on each row as it goes. This would perform quite well.
- do it the KNIME way: expand your dataset, then group by. This would perform rather awfully but would be easy to build. And as computer power is cheap and the time of a human is not, just throw more power at it. Crude but effective.
So, are you going to do this on a regular basis, or is this a one time operation?
If it is a one-time thing, I would say just let KNIME chew on that data for a weekend.
If on the other hand you need to do this on a regular basis it would be worthwhile to invest the time to make a dedicated program for it.
The logical KNIME approach would be to 'explode' the dataset, in other words to convert each interval into a separate data point for every month it covers. After that operation you will have a very, very large dataset indeed.
This might cause problems, which you can alleviate by chopping your dataset into chunks and processing the chunks one at a time. In doing so you need to be careful to keep the same member_ids together in one chunk, as otherwise your counts will be off target.
Then apply a filter for each month (so split the data into N sets) and group each set by member_id. You might be able to use a loop for that, though the size of the sets will not help.
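To make the explode-then-group idea concrete outside of KNIME, here is a rough Python sketch (the row layout of (member_id, start, end) is an assumption): the first step turns each interval into one row per month it covers, and the second step groups those rows by month and counts distinct member_ids, mirroring the filter + GroupBy stage described above.

```python
from collections import defaultdict
from datetime import date

def explode_to_months(rows):
    """Turn (member_id, start, end) intervals into one (year, month, member_id)
    row per month covered -- the 'explode' step."""
    out = []
    for member_id, start, end in rows:
        y, m = start.year, start.month
        while (y, m) <= (end.year, end.month):
            out.append((y, m, member_id))
            m += 1
            if m > 12:
                y, m = y + 1, 1
    return out

def unique_members_per_month(exploded):
    """Group exploded rows by month; count distinct member_ids per group."""
    groups = defaultdict(set)
    for y, m, member_id in exploded:
        groups[(y, m)].add(member_id)
    return {ym: len(ids) for ym, ids in groups.items()}
```

Note how the exploded list is much larger than the input (one row per member per month), which is exactly why chunking by member_id matters at scale.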
Ellert van Koperen - This is a repetitive process. It would be done at least at the end of every quarter.
I understand your KNIME approach, but I do not know how to implement the 'explode' part in KNIME. This is a pivot-and-expand-like operation and I cannot find a suitable node in KNIME. To illustrate, I need KNIME logic which can take a single record
<MEM1001,01/01/2011,12/31/2011> as Input and would return