I am new to predictive analytics and machine learning. I have been looking at a lot of use cases online and in KNIME on classification algorithms, e.g. Random Forest. What I have noticed in a lot of the sample data is that the records already come classified. An example would be KNIME’s churn prediction model using decision trees. In the sample data there is already a field classifying the customer as one who would churn, so why would we run this through a model when we already know which customers have churned? I have seen the same scenario with sample datasets online where the data is already classified. In short, I do not understand why the datasets already come classified, and why we run them through a classification model if we already know the outcome.
When you do classification, you already know the available classes. In the case of the churn data set you have two classes: churned and not churned. To train a classifier, you have to know the class of each training record in the first place. The model learns from the observations where you know whether a customer churned, and you can then use it to predict whether new customers are likely to churn.
This is just a high-level explanation from a novice in data science; I hope it helps anyway.
Thank you. This affirmed what I had assumed, which leads me to my next question. How do we now feed in new data for the model to actually predict? In the churn example quoted above, we already know whether the customer churned or not. So how do we implement this in KNIME on new data for new customers where we do not already know the outcome?
Thanks in advance for your assistance
You could have a look at the KNIME Hub, where you can search for complete workflows as well as nodes and see in which workflows a certain node is used.
There is also an example workflow on the churn data set using Decision Trees.
Churn Prediction Workflow
As @laaaarsi already pointed out, the idea is that you use a dataset where you know the correspondence between some features (for example, customer details) and the variable that you want to model (for example, whether the customer will churn in the next month or not). In KNIME this is done with Learner nodes. Conceptually, these Learner nodes learn some pattern or dependence of the variable of interest on the input features. Then you can apply those patterns to new data, in other words, to data that does not have the variable of interest. This prediction step is done with Predictor nodes.
In general, I would suggest that you have a look at our e-learning course, and in particular the Data Mining section.
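To make the Learner/Predictor split concrete outside of KNIME, here is a minimal Python sketch. It is not KNIME code and not how the Decision Tree nodes are implemented: the function names `learn_stump` and `predict`, the two feature columns, and all the numbers are made up for illustration. The "learner" finds a single threshold rule from rows where the churn outcome is known, and the "predictor" applies that rule to rows that have no churn column.

```python
# Toy illustration of the Learner / Predictor split (not KNIME code).
# All feature names and values below are hypothetical.

def learn_stump(rows, labels):
    """Learner: find the one feature/threshold rule with the fewest training errors."""
    best = None  # (errors, feature_index, threshold)
    for f in range(len(rows[0])):
        for threshold in sorted({r[f] for r in rows}):
            # Rule: predict churn (1) when the feature value exceeds the threshold.
            errors = sum((r[f] > threshold) != bool(y) for r, y in zip(rows, labels))
            if best is None or errors < best[0]:
                best = (errors, f, threshold)
    _, feature, threshold = best
    return {"feature": feature, "threshold": threshold}


def predict(model, rows):
    """Predictor: apply the learned rule to rows that have no label."""
    return [int(r[model["feature"]] > model["threshold"]) for r in rows]


# Historical customers: [support_calls, months_as_customer], outcome known.
train_rows = [[1, 24], [0, 36], [5, 3], [4, 6], [2, 18], [6, 2]]
train_labels = [0, 0, 1, 1, 0, 1]  # 1 = churned, 0 = stayed

model = learn_stump(train_rows, train_labels)

# New customers: same columns, but no churn column yet.
new_rows = [[5, 4], [1, 30]]
predictions = predict(model, new_rows)
print(predictions)
```

The point of the sketch is only the shape of the process: the labeled rows are needed once, to build `model`, and after that the model can be applied to any new rows that have the same feature columns.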
Hi there all,
@ElijahL08, @laaaarsi welcome to KNIME Community!
Well, some datasets are already classified, and that is a good thing. Why run data for which you already know the output through a classification model? To be able to see how well the model is performing (training time!). If it scores with high accuracy, we can run the model on data for which we do not know the output (deploy it); if it does not score well, we should try changing something in the model or try a different model (back to training!).
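As a rough illustration of that scoring step, here is a small Python sketch. Everything in it is an assumption made for the example: the stand-in rule `predict_churn`, the toy numbers, and the 0.8 cutoff are not taken from the KNIME workflow. It also assumes the scored rows were held back from training, which is the usual way to get an honest accuracy figure.

```python
# Toy sketch of scoring a model on labeled data it was not trained on.
# The rule, the data, and the 0.8 cutoff are hypothetical.

def predict_churn(support_calls):
    """Stand-in for a trained model: flag customers with more than 2 support calls."""
    return 1 if support_calls > 2 else 0


def accuracy(predicted, actual):
    """Fraction of rows where the prediction matches the known outcome."""
    matches = sum(p == a for p, a in zip(predicted, actual))
    return matches / len(actual)


# Held-out customers whose outcome is known but was hidden from training.
test_calls = [0, 3, 5, 1, 4]
test_labels = [0, 1, 1, 0, 0]  # 1 = churned, 0 = stayed

predicted = [predict_churn(c) for c in test_calls]
score = accuracy(predicted, test_labels)
print(f"accuracy: {score:.2f}")

if score >= 0.8:
    print("good enough: apply the model to unlabeled data")
else:
    print("not good enough: adjust the model or try another one")
```

Because the outcome is already known for these rows, the predictions can be checked against it. That check is what is not possible on genuinely new customers, and it is why the labeled data is valuable.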
Thanks for the help All. I have more clarity now!