flattening data

Hi Everyone,

I was wondering if there are any nodes out there to "flatten" datesets. This type of functionality is useful in working with time oriented records.

The idea is to "flatten" the dataset I.e. to take multiple rows per id and create a single row per id. The user would need to specify the id column, grouping column(s) and the variable(s). The output dataset would contain one record per id and one column per grouping column(s) and original columns.

The main utility of this node would be to help in the creation of temporal attributes.
Of course there are still going to be problems with irregularly spaced observations due to the time when the entity started or stopped being tracked.

If anyone has any ideas on solving this type of issue please contribute. If no one has can we put this on the Knime wish list please?

I fumbled around with creating a good example here but couldn't and since I can't upload files I would be happy to email over a basic example file of the input and output data if anyone wants it.

Best Regards,

Jay

Hi,

I suprised how few tools out there which have data manipulation capability actually have this ability built in.

Is anyone doing this in Knime already in some way which I am missing?

Input:
Id Date Color Count Cost
1 2005-01-01 Red 5 57
1 2005-02-01 Red 2 13
2 2005-02-01 Green 7 198
2 2005-03-01 Blue 32 550
...

Output
Id Color | 2005-01-01 Color | 2005-02-01 Color | 2005-03-01 Count | 2005-01-01 Count | 2005-02-01 Count | 2005-03-01 Cost | 2005-01-01 Cost | 2005-02-01 Cost | 2005-03-01
1 Red Red ? 5 2 ? 57 13 ?
2 ? Green Blue ? 7 32 ? 198 550
...

The output is not well aligned with the long column names but I hope this show the general idea.

I would love to hear everyone's opinion.

Best Regards,

Jay

We have been looking into time series stuff for a while now (you saw some of the results, I believe...) and one of the problems is to determine if a time series should be kept in a column or a row. Additionally, when - as you point out above - you have unevenly distributed measurements over time, it is hard to squeeze several of those into one matrix either way.

The main problem is that we don't have a real application over here, which drives development so we don't quite know which direction makes more sense in order to solve real problems...

- Michael

Hi,

What approach makes sense really depends on what is going to be done with the data.

Transaction data something like:
2007-06-01 Cust001 Product1 1 500.00
2007-06-01 Cust953 Product42 100 21580.83
2007-06-02 Cust060 Product1 3 1500.00
2007-06-03 Cust953 Product85 10 45100.00

...is not a time series by the strict definition as I understand it.

If one wants to create an input dataset for a univariate time series model for example then time would stay on the N and the M would become the (aggregated) variable of interest. This the type of input for univariate ARMA, ARIMA, etc… You would have a separate series per product for example.

What I'm thinking is how to aggregate this same transaction data for input into more general models. The format would be N = customers, M equals variables which are possibly aggregates (maybe filtered somehow), current individual measurements, lags of the individual measurements or aggregates and the target would be leading values, possibly an aggregate of multiple leading values.

Now if one was to allow each customer into the dataset more then once (across time) then we are working with panel data, in my case unbalanced panel data as each customer will appear in the dataset in possibly differing frequencies. The ideal case would be to have the ability to "window" over the data to create this type of dataset following everything in the previous paragraph.

Best Regards,

Jay

I introduced my work group to KNIME yesterday.  I received a question about the ability to flatten records.  This thread speaks to the issue exactly, but it stops without an answer back in 2007.

Q: is KNIME now capable of flattening data?  Can it take multiple records per case and append n+1, n+2, etc. records and append the contents as new columns?

*****

Coincidentally, I've been trying to do something similar in R.  The challenge in R (and possibly KNIME) is handling data structures correctly; in R, an important constraint is data frames (matrices) need to match dimensions.  I've ended up writing (actually appending) flattened records of varying length (atomic string vectors) to a file, then reopening the file as a table and reading shorter records (with no data in later columns) set to a NA value.

I've heard about "flattening" in the context of networks or trees and about "pivoting" in the context of flat tables, "flat" as in having no relation with any other table.

Anyhow, you can use Pivot on its own (there is also Unpivot), or GroupBy with List aggregation function followed by Cell Splitter or Split Collection Column (there is also Create Collection Column if needed before GroupBy).

You'll end up with the same constraint as for data frames in R but KNIME will create the necessary columns for you. If you don't like that, you could create a collection column (or not if you don't need to identify separate objects) and use GroupBy with Concatenate aggregation function - you'll end up with a single string column, meaning freedom inside of the string all while respecting the table structure.

For data frames in R, you can use Hadley Wickham's tidyr or reshape2 packages.

Thanks, Geo.  I appreciate you taking the time to respond and I'll definitely check your suggestions.