I am really confused. The node description of the Decision Tree Learner states:
"Most of the techniques used in this decision tree implementation can be found in "C4.5 Programs for machine learning", by J.R. Quinlan and in "SPRINT: A Scalable Parallel Classifier for Data Mining", by J. Shafer, R. Agrawal, M. Mehta."
My questions are the following:
1. Is the algorithm used in this node C4.5, SPRINT, or CART?
2. If you choose Gain Ratio as the quality measure, are you automatically applying the C4.5 algorithm?
3. The same question for the Gini index: does choosing it automatically select a particular algorithm?
4. My final question is about the number of threads: I do not understand what this setting does.
I hope someone can help me.
I had a look at the code and think I can answer some of your questions:
- From what I see, the algorithm is not true SPRINT, because it does not presort the data per column. It does, however, use multiple threads to build the branches of the tree in parallel. It can handle missing values and prunes the tree once it is built, so I think this is mainly the C4.5 algorithm, just parallelized.
- I think the same algorithm is always applied, just with a different split criterion.
- The number of threads determines for how many splits a new thread is started. After the first split there are 2 threads; after one of those branches is split, there are 3. Once the configured number of threads is reached, the remaining branches are split without creating additional threads.
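To illustrate the second point, here is a minimal sketch of what "same algorithm, different split criterion" can look like. It is not taken from the node's code: the function names (`gain_ratio`, `gini_gain`, `best_split`) and the toy data are my own, and the formulas are the usual textbook definitions, which may differ in detail from what the node computes.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy of a list of class labels (0 for an empty list)."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def gain_ratio(parent, children):
    """Information gain divided by split information (textbook C4.5 measure)."""
    n = len(parent)
    info_gain = entropy(parent) - sum(len(ch) / n * entropy(ch) for ch in children)
    split_info = -sum(len(ch) / n * log2(len(ch) / n) for ch in children if ch)
    return info_gain / split_info if split_info > 0 else 0.0


def gini_gain(parent, children):
    """Reduction in Gini impurity achieved by the split."""
    n = len(parent)
    return gini(parent) - sum(len(ch) / n * gini(ch) for ch in children)


def best_split(parent, candidates, quality):
    """The search is identical for every measure; only `quality` changes."""
    return max(candidates, key=lambda name: quality(parent, candidates[name]))


# Hypothetical toy data: two candidate splits of eight rows.
parent = ["yes"] * 4 + ["no"] * 4
candidates = {
    "A": [["yes"] * 4, ["no"] * 4],                              # separates the classes
    "B": [["yes", "yes", "no", "no"], ["yes", "yes", "no", "no"]],  # separates nothing
}

print(best_split(parent, candidates, gain_ratio))
print(best_split(parent, candidates, gini_gain))
```

The tree-growing loop stays the same in both calls; the quality measure is just a function that scores a candidate split.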
I hope that helps.
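P.S. If it helps with the threads question, here is a toy model of how I read the thread limit. It only counts threads and does not build a tree or start real threads; the function name `simulate_thread_growth` and its parameters are made up for this illustration and do not come from the node's code.

```python
def simulate_thread_growth(num_splits, max_threads):
    """Return the number of active threads after each split (toy model)."""
    active = 1  # the thread that starts building the tree
    history = []
    for _ in range(num_splits):
        # One branch stays on the current thread; the other branch gets a
        # new thread only while the configured limit has not been reached.
        if active < max_threads:
            active += 1
        history.append(active)
    return history


print(simulate_thread_growth(num_splits=5, max_threads=3))
# prints [2, 3, 3, 3, 3]
```

With a limit of 3, the first two splits each add a thread and every later split is handled by the threads that already exist.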