Help choosing analytics algorithm

mlauber71 · June 28, 2018, 4:15pm

I think you could do two things. You could treat your outlier group as the label/target (1/0 or TRUE/FALSE) and the rest of the data as explanatory variables. You would have to remove the time column, since it ‘leaks’ the information you want to find. You could then use an algorithm like the Random Forest Learner *1), which also gives you a list of the most important variables: the ones that distinguish the outliers from the regular cases.
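Outside of KNIME, the same idea can be sketched in Python with scikit-learn. This is only an illustration on synthetic data: the column names (`batch_size`, `retries`, `noise`, `duration`, `is_outlier`) and the 90% quantile used to mark outliers are assumptions, not taken from your data.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the transaction table (all columns are hypothetical).
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "batch_size": rng.integers(1, 100, n),
    "retries": rng.integers(0, 5, n),
    "noise": rng.normal(size=n),
})
# Assumed relationship: duration is driven mostly by batch_size.
df["duration"] = (0.5 * df["batch_size"] + 2.0 * df["retries"]
                  + rng.normal(scale=2.0, size=n))
# Mark the slowest 10% of transactions as the outlier group (1/0 target).
df["is_outlier"] = (df["duration"] > df["duration"].quantile(0.9)).astype(int)

# Drop the target and the time column: duration would leak the answer.
X = df.drop(columns=["duration", "is_outlier"])
y = df["is_outlier"]

model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X, y)

# Impurity-based importances, highest first.
importances = pd.Series(model.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

For the second variant (predicting the time directly), you would swap in `RandomForestRegressor`, use `duration` as the target and drop `is_outlier` instead.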

Alternatively, you could use the time of each transaction as the target and try to predict it directly (in that case you would have to remove the 1/0 outlier marker instead). Again, you could look at the most important variables.

As preparation, you could use a correlation matrix to see which variables correlate strongly with the duration of the operation.
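As a rough sketch of that correlation check in Python with pandas (synthetic data again; the column names are made up for illustration):

```python
import numpy as np
import pandas as pd

# Synthetic example table; the columns are hypothetical.
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "batch_size": rng.integers(1, 100, n),
    "retries": rng.integers(0, 5, n),
    "noise": rng.normal(size=n),
})
df["duration"] = (0.5 * df["batch_size"] + 2.0 * df["retries"]
                  + rng.normal(scale=2.0, size=n))

# Pearson correlation matrix of all numeric columns.
corr_matrix = df.corr()

# Correlation of every variable with the duration, strongest first.
corr_with_duration = (
    corr_matrix["duration"]
    .drop("duration")
    .sort_values(key=lambda s: s.abs(), ascending=False)
)
print(corr_with_duration)
```

Keep in mind that Pearson correlation only captures linear relationships, so a low value does not prove that a variable is irrelevant.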

If you are looking for rules that are easy to read and interpret, you could start with a simple decision tree and check whether the rules that produce the highest scores tell you something. Depending on your case, this approach might not be enough, or there might not be a clear differentiation. Always be careful if the results seem too good to be true.
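A minimal scikit-learn sketch of such a tree, printing its rules as text and comparing training accuracy with accuracy on held-out data. The data is synthetic and the column names, the depth limit of 3 and the 70/30 split are assumptions chosen for the example:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic data; columns are hypothetical.
rng = np.random.default_rng(1)
n = 600
X = pd.DataFrame({
    "batch_size": rng.integers(1, 100, n),
    "retries": rng.integers(0, 5, n),
})
duration = (0.5 * X["batch_size"] + 2.0 * X["retries"]
            + rng.normal(scale=2.0, size=n))
y = (duration > duration.quantile(0.9)).astype(int)  # 1 = outlier

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)

# A shallow tree keeps the rules short enough to read.
tree = DecisionTreeClassifier(max_depth=3, random_state=1)
tree.fit(X_train, y_train)

rules = export_text(tree, feature_names=list(X.columns))
print(rules)

# Compare training accuracy with accuracy on unseen data.
train_acc = tree.score(X_train, y_train)
test_acc = tree.score(X_test, y_test)
print(f"train accuracy: {train_acc:.3f}, test accuracy: {test_acc:.3f}")
```

If the score on the held-out data is near perfect, it is worth checking whether a column that encodes the target (such as the time itself) is still among the inputs.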

A good metric to determine the quality of your model is Gini or AUC for 1/0 predictions and root-mean-square error (RMSE) for numeric predictions.
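To make these metrics concrete, here is a small plain-Python sketch. It computes AUC as the share of (positive, negative) pairs that the model ranks correctly, derives Gini as 2 * AUC - 1, and computes RMSE. The function names and the tiny example inputs are invented for illustration; in practice you would normally use the scorer of your tool or library.

```python
import math

def auc(y_true, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum(
        (1.0 if p > q else 0.5 if p == q else 0.0)
        for p in pos
        for q in neg
    )
    return wins / (len(pos) * len(neg))

def gini(y_true, scores):
    """Gini coefficient as a rescaled AUC: 0 = random ranking, 1 = perfect."""
    return 2.0 * auc(y_true, scores) - 1.0

def rmse(y_true, y_pred):
    """Root-mean-square error of numeric predictions."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))

# Tiny made-up examples.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auc(labels, scores))   # 0.75
print(gini(labels, scores))  # 0.5
print(rmse([3.0, 5.0], [2.0, 7.0]))
```

An AUC of 0.5 (Gini of 0) corresponds to random guessing, so values need to be clearly above that to be useful. RMSE is in the same unit as the target, here the transaction time.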

*1)
https://nodepit.com/workflow/public-server.knime.com%3A80%2F04_Analytics%2F13_Meta_Learning%2F02_Learning_a_Random_Forest