Help choosing analytics algorithm

Hi,

I’m new to machine learning, and am trying to apply it to some data I have, but I don’t know which method or algorithm to select for the problem I’m trying to solve.

I have 3 datasets:

  1. Set of transactions. Each record consists of a unique ID for the transaction, a timestamp of when the transaction was processed, how long the transaction took to process, and the size of the transaction. The size of the transaction is the primary feature that influences the processing time.

  2. Set of transactions that are outliers for their size. For example, the expected time for a transaction of size 1 is 100-150ms; this dataset will have all transactions with times that exceed the expected times for their size (like all transactions of size 1 with times > 150ms).

  3. Set of input data for each transaction. Each transaction contains lots of data (features?) that gets processed, like codes, costs, type, etc.

What I’m trying to do: for the transactions that are outliers, determine which features of the input data may be correlated with the high processing time. Do all the outlier transactions of size 1 have a data feature in common that may be causing their unusual processing time?

Is this a good candidate for applying a machine learning method/algorithm?

If so, what’s the best way to proceed?

Thanks :slight_smile:

You can try Logistic Regression or wait for a better recommendation.
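For example, a minimal sketch with scikit-learn, assuming you already have a numeric feature table `X` (without the time column) and a 0/1 outlier label `y`; all names here are placeholders:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Standardize the inputs so the coefficient sizes are comparable.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X, y)

# Coefficients far from zero hint at features that push a transaction
# towards being an outlier.
coefs = model.named_steps["logisticregression"].coef_[0]
for name, c in sorted(zip(X.columns, coefs), key=lambda t: -abs(t[1])):
    print(f"{name}: {c:+.3f}")
```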

I think you could do two things. You could treat your outlier group as the label/target (1/0 or TRUE/FALSE) and the rest of the data as explanatory variables. You would have to remove the time column, since it ‘leaks’ the information you want to find. You could then use an algorithm like the Random Forest Learner *1) that also gives you a list of the most important variables, i.e. the ones that distinguish the outliers from the regular cases.
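A rough sketch of that first approach in Python/scikit-learn (the table `df` and column names like `is_outlier` and `processing_time_ms` are placeholders for however your merged data is organized):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Assumed: df is datasets 1-3 joined on the transaction ID, with an added
# 0/1 column marking whether the transaction is in the outlier set.
y = df["is_outlier"]
X = df.drop(columns=["transaction_id", "is_outlier",
                     "processing_time_ms"])  # drop the time: it leaks the label
X = pd.get_dummies(X)  # one-hot encode categorical inputs like codes/type

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

rf = RandomForestClassifier(n_estimators=500, random_state=0)
rf.fit(X_train, y_train)

# Which input features separate outliers from regular transactions?
importances = pd.Series(rf.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(15))
```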

You could also just use the time of each transaction and try to predict it directly (but you would have to remove the outlier 1/0 marker). Again, you could look at the most important variables.
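The regression variant could look like this (again only a sketch, with the same assumed column names as above):

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Predict the processing time directly; drop the 0/1 outlier marker so the
# model has to explain the time from the input features alone.
y_time = df["processing_time_ms"]
X_time = pd.get_dummies(
    df.drop(columns=["transaction_id", "processing_time_ms", "is_outlier"]))

Xt_train, Xt_test, yt_train, yt_test = train_test_split(
    X_time, y_time, random_state=0)

rf_reg = RandomForestRegressor(n_estimators=500, random_state=0)
rf_reg.fit(Xt_train, yt_train)

# The most important variables for explaining the processing time.
importances = pd.Series(rf_reg.feature_importances_, index=X_time.columns)
print(importances.sort_values(ascending=False).head(15))
```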

As preparation, you could use a correlation matrix to see which variables have a high correlation with the processing time.
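In pandas that is essentially one line (column name assumed):

```python
# Correlation of every numeric column with the processing time.
corr = df.corr(numeric_only=True)["processing_time_ms"]
print(corr.sort_values(ascending=False))
```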

If you are looking for rules that are easy to read and interpret, you could start with a simple decision tree. You can check whether the rules that produce the highest scores tell you something. Depending on your case, this approach might not be enough, or there might not be a clear differentiation. Always be careful if the results seem too good to be true.
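A small sketch of that, again with scikit-learn and the placeholder names from above; `export_text` prints the learned rules in a readable if/else form:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Keep the tree shallow so the rules stay readable.
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X_train, y_train)  # same split as in the random-forest sketch

print(export_text(tree, feature_names=list(X_train.columns)))
```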

A good metric to determine the quality of your model is Gini or AUC for 1/0 predictions, and root-mean-square error (RMSE) for numeric predictions.
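Both are available in scikit-learn, e.g. (sketch; `rf`, `rf_reg` and the test splits are the assumed objects from the sketches above):

```python
from sklearn.metrics import mean_squared_error, roc_auc_score

# AUC for the 1/0 outlier model (Gini = 2 * AUC - 1).
auc = roc_auc_score(y_test, rf.predict_proba(X_test)[:, 1])
print("AUC:", auc, " Gini:", 2 * auc - 1)

# RMSE for the numeric time model, evaluated on its held-out split.
rmse = mean_squared_error(yt_test, rf_reg.predict(Xt_test)) ** 0.5
print("RMSE:", rmse)
```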

*1)
https://nodepit.com/workflow/public-server.knime.com%3A80%2F04_Analytics%2F13_Meta_Learning%2F02_Learning_a_Random_Forest
