Downsampling of timeseries


I have very large time series on a Hadoop cluster. These are measurements with a sample rate of, e.g., 100 Hz. I’d like to retrieve these time series with Hive at a lower sample rate, such as 1 Hz. So I’d like to keep every 100th row, and it must happen on the cluster because the whole time series is too large to download and downsample locally.

Do you have a suggestion for how I can do this in KNIME?

I see that there are suggestions for a SQL query on Stack Overflow, but I’m not good at SQL, and I also have the problem that I don’t see the RowID column in my DB Query node. Should I use the ROW_NUMBER function?

Thank you,

What fields does the data have regarding ordering? Does it have a Unix timestamp or a datetime field that you can sort by? Since the data is stored in a distributed fashion, it has no inherent row order like a KNIME table does, unless some field specifies the order. Maybe you can use that to your advantage: if you can derive a field holding the second in which each sample was taken, you could group by that field with the FIRST() aggregation function to do your sampling.
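
As a rough sketch of the group-by-second idea, the snippet below keeps the first sample of each second. It assumes a hypothetical table `measurements` with a millisecond Unix timestamp column `ts_ms` and a `value` column; none of these names come from the original data. Because FIRST() is not a standard SQL aggregate, the sketch swaps in ROW_NUMBER() partitioned by second and keeps the rows where it equals 1. The query runs here against an in-memory SQLite database (3.25 or newer, for window functions) through Python only so that it can be tried locally. Hive also supports ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...), but this exact statement is untested on Hive and may need adjusting.

```python
import sqlite3

# Hypothetical schema: measurements(ts_ms, value), ts_ms = Unix time in ms.
# ROW_NUMBER() numbers the samples inside each one-second bucket in time
# order; keeping rn = 1 retains the first sample of every second.
DOWNSAMPLE_SQL = """
SELECT ts_ms, value
FROM (
    SELECT ts_ms,
           value,
           ROW_NUMBER() OVER (
               PARTITION BY CAST(ts_ms / 1000 AS BIGINT)  -- one bucket per second
               ORDER BY ts_ms
           ) AS rn
    FROM measurements
) AS ranked
WHERE rn = 1
ORDER BY ts_ms
"""

BASE_MS = 1_700_000_000_000  # arbitrary start time, aligned to a full second


def build_demo_db(seconds=3, rate_hz=100):
    """Create an in-memory table holding a synthetic signal at rate_hz."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE measurements (ts_ms INTEGER, value REAL)")
    step_ms = 1000 // rate_hz
    conn.executemany(
        "INSERT INTO measurements VALUES (?, ?)",
        ((BASE_MS + i * step_ms, float(i)) for i in range(seconds * rate_hz)),
    )
    return conn


def downsample_to_1hz(conn):
    """Return the first (ts_ms, value) sample of each second, in time order."""
    return conn.execute(DOWNSAMPLE_SQL).fetchall()


if __name__ == "__main__":
    for row in downsample_to_1hz(build_demo_db()):
        print(row)
```

In KNIME, the SELECT statement would presumably go into a DB Query node, with `measurements` replaced by that node's reference to its input table. If the sampling is not perfectly regular, bucketing by second like this still yields one row per second, whereas taking every 100th row by a global row number would drift.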
Kind regards,

