Proper data structure for machine learning using contact history data

HorstenT · December 4, 2019, 3:28pm

Hi,

I am completely new to Knime and experimenting with machine learning.

I would like to predict a buy/no-buy decision. I have the following data set:
Client ID, contact date, result of contact, type of product sold.

Basically, for the same client ID I have more than one contact attempt. Some contacts result in a purchase, some do not. For example, within a specific time period I could have 10 contacts with a specific client, of which only 2 ended in a purchase.

Problem:
I do not know how to handle “contact history” to prepare an input table for a predictor, e.g. a decision tree. In KNIME examples, like “churn prediction”, the input tables have 1 client = 1 row. In my case, 1 client = multiple rows (because each contact generates a new record/row in the CRM). How do I best approach this? Shall I build some supporting statistical indicators, such as “average time between contacts”, to transform my table into a 1 client = 1 row table? Or somehow aggregate these rows into one? Both seem highly impractical.

Thank you for any suggestions
HT

HansS · December 4, 2019, 6:45pm

Hi @HorstenT, welcome to the forum.

Some thoughts about your question. Take some time to get a clear view of your business question before you translate it into a data science question. What exactly do you want to predict?

whether a customer will buy or not
whether a contact will lead to a sale
E.g. in a certain situation? Within a period of time? Or something else?

I guess your question is more like the first one, and in that case it is best to end up with a table based on individual customers. The way to come from transactions to individual customers is the Pivot Node.
You will end up with a table like:
customer_id ; contact_1_features, contact_2_features (…) contact_n_features
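
To make the target layout concrete, here is a minimal sketch in Python with pandas (not a KNIME workflow). The sample rows and the column names `client_id`, `date`, `result` and `contact_no` are assumptions for illustration only; in KNIME the equivalent step would be done with the pivoting node mentioned above.

```python
import pandas as pd

# Hypothetical contact log: one row per contact, as a CRM might export it.
contacts = pd.DataFrame({
    "client_id": [1, 1, 1, 2, 2],
    "date": pd.to_datetime(
        ["2019-01-05", "2019-02-10", "2019-03-01", "2019-01-20", "2019-04-02"]
    ),
    "result": ["no_buy", "buy", "no_buy", "no_buy", "buy"],
})

# Number each client's contacts chronologically (1 = first contact).
contacts = contacts.sort_values(["client_id", "date"])
contacts["contact_no"] = contacts.groupby("client_id").cumcount() + 1

# Reshape to one row per client and one column per (feature, contact number).
wide = contacts.set_index(["client_id", "contact_no"])[["date", "result"]]
wide = wide.unstack("contact_no")
wide.columns = [f"contact_{n}_{feature}" for feature, n in wide.columns]
wide = wide.reset_index()

print(wide)
```

Clients with fewer contacts than the maximum get missing values in the trailing columns, so the model or a preprocessing step has to handle those.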

Another thing to consider: if there is an order in the contacts, make sure that every group of variables in the contact_features describes the same event (e.g. first visit, or first time a brochure was sent…).

The next step is to derive new features within a group of contact_features and between groups of contact_features (e.g. time between the contacts). Maybe you can create some RFM variables (Recency, Frequency, Monetary value).
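
As a rough illustration of such derived features, the pandas sketch below computes the average time between contacts plus simple RFM values per client. Everything in it is assumed for the example: the sample rows, the `bought` flag, the `amount` column (the data set described in this thread has no monetary value) and the reference date used for recency.

```python
import pandas as pd

# Hypothetical contact log; "amount" is an assumed column, not in the original data.
contacts = pd.DataFrame({
    "client_id": [1, 1, 1, 2, 2],
    "date": pd.to_datetime(
        ["2019-01-05", "2019-02-10", "2019-03-01", "2019-01-20", "2019-04-02"]
    ),
    "bought": [0, 1, 0, 0, 1],
    "amount": [0.0, 120.0, 0.0, 0.0, 80.0],
})
reference_date = pd.Timestamp("2019-06-01")  # assumed "today" for recency

# Days between consecutive contacts of the same client.
contacts = contacts.sort_values(["client_id", "date"])
contacts["days_since_prev"] = contacts.groupby("client_id")["date"].diff().dt.days

# Contact-level aggregates: one row per client.
features = contacts.groupby("client_id").agg(
    n_contacts=("date", "size"),
    avg_days_between_contacts=("days_since_prev", "mean"),
)

# RFM computed on purchases only.
purchases = contacts[contacts["bought"] == 1]
rfm = purchases.groupby("client_id").agg(
    last_purchase=("date", "max"),
    frequency=("bought", "sum"),
    monetary=("amount", "sum"),
)
rfm["recency_days"] = (reference_date - rfm["last_purchase"]).dt.days

features = features.join(rfm.drop(columns="last_purchase")).reset_index()
print(features)
```

Clients without any purchase would end up with missing RFM values after the join, which is itself informative and needs an explicit treatment before modelling.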

And yes, this is highly impractical. Data understanding, cleaning and preparation cost a lot of time and effort, but they will result in better predictions.

Hope this helps
gr. Hans

beginner · December 5, 2019, 7:02am

Like Hans said, you need to transform the data accordingly. It’s hard to say how exactly; that depends on your exact problem.

It sounds like time matters? You could create columns for each time period with the number of contacts, items sold, total sales value, etc.
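
One possible reading of "columns for each time period", sketched in pandas: count contacts and sales per client and calendar quarter, then spread the quarters into columns. The quarterly granularity, the sample rows and the `bought` flag are assumptions; any other period length works the same way.

```python
import pandas as pd

# Hypothetical contact log with a 0/1 purchase flag.
contacts = pd.DataFrame({
    "client_id": [1, 1, 1, 2, 2],
    "date": pd.to_datetime(
        ["2019-01-05", "2019-02-10", "2019-03-01", "2019-01-20", "2019-04-02"]
    ),
    "bought": [0, 1, 0, 0, 1],
})

# Assign each contact to a calendar quarter, e.g. "2019Q1".
contacts["period"] = contacts["date"].dt.to_period("Q").astype(str)
grouped = contacts.groupby(["client_id", "period"])

# One column per period for the number of contacts and the number of sales.
n_contacts = grouped.size().unstack(fill_value=0).add_prefix("contacts_")
n_sales = grouped["bought"].sum().unstack(fill_value=0).add_prefix("sales_")

per_period = n_contacts.join(n_sales).reset_index()
print(per_period)
```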

What will surely also matter is the ratio of sales to number of contacts. How many contacts are needed until you make a sale to customer x? Maybe this is even the most important factor.
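
A small pandas sketch of these two indicators, the sales-to-contacts ratio and the number of contacts needed before the first sale. The sample rows and all column names are made up for illustration.

```python
import pandas as pd

# Hypothetical contact log; client 3 never bought anything.
contacts = pd.DataFrame({
    "client_id": [1, 1, 1, 2, 2, 3],
    "date": pd.to_datetime(
        ["2019-01-05", "2019-02-10", "2019-03-01",
         "2019-01-20", "2019-04-02", "2019-02-15"]
    ),
    "bought": [0, 1, 0, 0, 1, 0],
})

# Number each client's contacts chronologically (1 = first contact).
contacts = contacts.sort_values(["client_id", "date"])
contacts["contact_no"] = contacts.groupby("client_id").cumcount() + 1

# Sales per contact for each client.
summary = contacts.groupby("client_id").agg(
    n_contacts=("bought", "size"),
    n_sales=("bought", "sum"),
)
summary["conversion_rate"] = summary["n_sales"] / summary["n_contacts"]

# Contact number at which the first sale happened (missing if no sale yet).
first_sale = contacts[contacts["bought"] == 1].groupby("client_id")["contact_no"].min()
summary["contacts_until_first_sale"] = first_sale

print(summary)
```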

EDIT: note that for this to work you need many customers with many contacts and multiple sales. If most customers only ever made 1-2 purchases, it is hard to draw any conclusions from them.

ipazin · December 5, 2019, 1:59pm

Hi there @HorstenT,

welcome to KNIME Community and machine learning!

To add to the already well-rounded suggestions, questions and ideas from @HansS and @beginner: if you want to predict whether a client will make a purchase or not, you need explanatory variables that can lead you to that conclusion. The only explanatory variable in your data set (or at least in the one you presented) is the date, and it is not very explanatory. So I would suggest you try to get more variables on your clients to predict their behavior. You have to ask yourself what might impact a client’s decision to buy or not to buy your product(s). Some answers might be product type, price, seller, the client’s need for it, time of day, attempt number x to sell the same product…

Hope this helps.

Br,
Ivan

HorstenT · December 6, 2019, 11:06pm

Thank you all for your comprehensive comments. It seems I have to go back to school :) Your comments inspired me to spend more time on data understanding/exploration and building hypotheses (so basically some data mining), and only then think about modelling.
Best,
HT
