Suggestions for a workflow using time series data

3_viktor · October 16, 2019, 12:49pm

Hello everybody,

this is the first workflow I have ever built with KNIME, and also my first data mining task, so I hope I can get some help here.

The data comes from a streaming platform. I am using logging data (records of when customers have problems with the streaming service). My knowledge of this topic is limited, but I will try to explain my current progress and what my task is. I started with some data preparation, after which I have the following data structure:

eventTime - time when the error occurred
deviceId - device identifier
errorCode - specific code, which can be used to get more information about the error

In the first step I created time ranges (15 minutes) and counted the errors in each time period. Now I want to do some time series analysis. I need some algorithms/strategies to detect:

trends
jumps (e.g. the count was static at 5, but suddenly jumped to 10 and stays at 10)
peaks/anomalies (static at 5, suddenly 15 for one range, then back to 5)

Do you have any suggestions for which methods I could use for each of these points? One more problem is that I need to do an analysis for every error type (there are about 20000). Looking at each graph would take too much time. For example: I want the top 30 errors that have an upward trend. Is there a solution for this? It would be helpful if an algorithm could simply tell me whether an error is causing problems or not.

EDIT: One more problem I have is that there is a daily seasonality. In the evening there are more errors than during the rest of the day.
Here is an image showing a week:

It would be nice if someone could help me with my problem. Thanks in advance!

Best regards

Viktor

Corey · October 22, 2019, 3:16pm

Hi @3_viktor, welcome to the forum! Sorry it’s taken a minute to get back to you.

I think you’ve described what you’re looking to do fairly well, but there’s tons to say on this topic, so I’ll just try to summarize some ideas that might be useful for now.

I’m glad you mention the seasonality; it’s definitely going to be helpful to remove that before getting into the other analysis. A common way to handle this is with differencing. This means that if you know your data has a 24 hour seasonality, you can take every data point and subtract from it the data point 24 hours prior. This makes it much easier to see other patterns in your data.
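If you want to try the differencing idea outside of KNIME first, here is a minimal sketch in plain Python. It assumes the counts are already in a list of 15-minute bins (so 96 bins per day); the function name and the sample data are made up for illustration.

```python
def seasonal_difference(counts, period):
    """Subtract from each value the value exactly one season earlier."""
    if period <= 0:
        raise ValueError("period must be positive")
    # The first `period` values have no earlier counterpart and are dropped.
    return [counts[i] - counts[i - period] for i in range(period, len(counts))]


# With 15-minute bins, a 24 hour season spans 96 bins.
BINS_PER_DAY = 24 * 60 // 15

# Hypothetical data: two days with the same evening bump,
# plus one extra spike on the second day.
day = [5] * 72 + [12] * 24
counts = day + day
counts[BINS_PER_DAY + 10] += 20

diffed = seasonal_difference(counts, BINS_PER_DAY)
```

In this toy series the evening bump cancels out, so the only non-zero value left in `diffed` is the extra spike.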

You can do that with a combination of the Lag Column node and the Math Formula node, but we also have a set of components we recently published. They can be found on the examples server under components > time series.

The Inspect Seasonality component will generate an ACF (autocorrelation function) plot of your data, which is helpful for verifying that your data has a 24 hour seasonal component. Then you can use the Remove Seasonality component to remove it.
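For intuition about what the ACF plot shows, here is a rough sketch of the autocorrelation at a single lag in plain Python. This is not how the component is implemented, just the textbook sample formula; the function name and the data are assumptions for illustration.

```python
def autocorrelation(values, lag):
    """Sample autocorrelation of `values` at the given lag (0 <= lag < len)."""
    n = len(values)
    mean = sum(values) / n
    variance_sum = sum((v - mean) ** 2 for v in values)
    if variance_sum == 0:
        return 0.0  # a constant series has no meaningful autocorrelation
    covariance_sum = sum(
        (values[i] - mean) * (values[i + lag] - mean) for i in range(n - lag)
    )
    return covariance_sum / variance_sum


# Hypothetical week of 15-minute bins with more errors every evening.
day = [5] * 72 + [15] * 24
week = day * 7

acf_one_day = autocorrelation(week, 96)   # lag of 24 hours
acf_half_day = autocorrelation(week, 48)  # lag of 12 hours
```

A value close to 1 at the 24 hour lag (and a much lower one at 12 hours) is the kind of pattern you would look for in the plot.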

Once the seasonality is removed it should be much easier to see anomalies and unusual spikes in your data.

You also mention wanting to look for a trend. The most direct way to do this would be with a linear regression model, but you could also try aggregating your data by week or even month and graphing it if the goal is simply to gain understanding.
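Since you have thousands of error codes, one way to avoid looking at every graph is to fit a line per error code and rank by slope. A minimal sketch in plain Python, assuming you already have one list of daily counts per error code; the error codes, the numbers, and the function names are all hypothetical.

```python
def trend_slope(counts):
    """Least-squares slope of counts against their index 0, 1, 2, ..."""
    n = len(counts)
    if n < 2:
        return 0.0
    x_mean = (n - 1) / 2
    y_mean = sum(counts) / n
    numerator = sum((i - x_mean) * (y - y_mean) for i, y in enumerate(counts))
    denominator = sum((i - x_mean) ** 2 for i in range(n))
    return numerator / denominator


def top_rising(series_by_code, top_n):
    """Return up to top_n (code, slope) pairs with the largest positive slope."""
    slopes = [(code, trend_slope(counts)) for code, counts in series_by_code.items()]
    rising = [pair for pair in slopes if pair[1] > 0]
    return sorted(rising, key=lambda pair: pair[1], reverse=True)[:top_n]


# Hypothetical daily counts for four error codes over one week.
daily_counts = {
    "E100": [5, 5, 6, 5, 5, 6, 5],
    "E200": [2, 4, 6, 8, 10, 12, 14],
    "E300": [9, 8, 7, 6, 5, 4, 3],
    "E400": [1, 1, 2, 2, 3, 3, 4],
}

ranking = top_rising(daily_counts, top_n=2)
```

The raw slope favors error codes with high counts; if that is not what you want, dividing each series by its mean before fitting is one possible adjustment.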

Finally, you talk about anomaly detection and give a specific example. A simple option could be to use a Rule Engine node to check whether your data stays above some threshold. If you’re looking for a more general anomaly detection technique, autoencoders work well in settings like this. These are models designed simply to reproduce their input, so records they reconstruct poorly stand out as anomalies. We can talk more about this if that’s a direction you want to go.
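To make the threshold idea concrete, here is a small sketch in plain Python that separates your two cases: a short excursion above the threshold is labelled a peak, and a long one a jump. The threshold, the minimum length, and the function names are assumptions you would need to tune for your data.

```python
def runs_above(counts, threshold):
    """Return (start_index, length) for each run of values above threshold."""
    runs = []
    start = None
    for i, value in enumerate(counts):
        if value > threshold:
            if start is None:
                start = i
        elif start is not None:
            runs.append((start, i - start))
            start = None
    if start is not None:  # the series ends while still above the threshold
        runs.append((start, len(counts) - start))
    return runs


def classify_runs(counts, threshold, min_jump_length):
    """Label each run as 'jump' (long) or 'peak' (short)."""
    return [
        ("jump" if length >= min_jump_length else "peak", start, length)
        for start, length in runs_above(counts, threshold)
    ]


# Hypothetical counts: one isolated spike, then a lasting level shift.
counts = [5, 5, 15, 5, 5, 5, 10, 10, 10, 10]
events = classify_runs(counts, threshold=8, min_jump_length=3)
```

A fixed threshold only makes sense after the daily pattern has been removed, otherwise every evening would be flagged.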

Let me know which parts of this sound useful or interesting, and we can talk more specifically.
-Corey.

3_viktor · October 22, 2019, 4:52pm

Hey @Corey,

thank you for your reply. Since I posted the question I have been playing around with some nodes, and I now have a better feel for KNIME and how it works. I will check out the seasonality inspector and remover. For trend analysis I already tried linear regression on daily time ranges. With the seasonality removed it should work fine as well.

As for the anomaly detection, I thought about using DBSCAN. Is it a good choice? It detects clusters and noise, and in my understanding the noise objects are the outliers. I will also try the Rule Engine and the Keras autoencoder.
My problem is that I am doing this for my bachelor thesis and time is passing so quickly. Next week is already my last week. I hope I can find a second solution besides the linear regression.

-Viktor

Corey · October 22, 2019, 5:43pm

If you’re looking for anomalies across error and/or device IDs, DBSCAN is a fine place to start, as you’ll have many features.

If you’re looking for anomalies in individual time series it may not be ideal. So it just depends on the exact goal.

If you do decide to go with DBSCAN, just be sure to normalize your features.
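The reason is that DBSCAN works on distances, so a feature with a large range would otherwise dominate the others. A minimal sketch of z-score scaling in plain Python, assuming one row of numeric features per error code; the two features shown (total count and number of affected devices) and the function name are just hypothetical examples.

```python
from statistics import mean, pstdev


def zscore_columns(rows):
    """Scale every column to mean 0 and standard deviation 1."""
    columns = list(zip(*rows))
    means = [mean(column) for column in columns]
    stds = [pstdev(column) for column in columns]
    return [
        # A constant column carries no information, so it is mapped to 0.0.
        [(v - m) / s if s > 0 else 0.0 for v, m, s in zip(row, means, stds)]
        for row in rows
    ]


# Hypothetical features per error code: [total errors, affected devices].
features = [
    [1000.0, 3.0],
    [2000.0, 1.0],
    [3000.0, 2.0],
]

scaled = zscore_columns(features)
```

After scaling, both columns contribute on the same scale to the distance between two error codes.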

For completeness, one other interesting unsupervised option is the isolation forest, which is included in the H2O implementation. If the goal here is to write a paper, it could be interesting to compare them.

https://kni.me/n/dwNoOWB9Mv_flTTm
https://kni.me/w/M4L07k5M8Voy-tui

Best of luck on your project!

3_viktor · October 22, 2019, 5:48pm

Thank you very much for your quick reply. I will look into it.

Actually, I am comparing different strategies. DBSCAN and Isolation Forest do more or less the same thing: they both search for outliers. What I am evaluating are different approaches to the data mining problem. I don’t have much time left, so I hope trend analysis and outlier detection will be enough.

Thanks for your help.

-Viktor

system · October 29, 2019, 5:48pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.