Plot Decision Tree Boundaries

Hi everybody, I was looking at some slides posted by Dean Abbot about Model Ensembles (see this URL: http://bit.ly/2daRspH), and slide 18 shows the decision boundaries for several models.

https://s14.postimg.org/71gc1huk1/Slide.png

I wonder how I can plot such boundaries. I assume beforehand that it must be a scatter plot.

I am also attaching an example workflow, but I could not manage to make the above plot.

Thank you

 

Hi,

You may want to use a data generator node - the "quasirandom data generator" node creates particularly density-equilibrated patterns for this purpose.
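
Outside KNIME, the same idea can be sketched in a few lines of Python; the scipy quasi-Monte Carlo module and matplotlib used below are only an illustration of the "density-equilibrated" pattern, not part of the workflow:

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import qmc

# Quasi-random (Sobol) points: low-discrepancy, so they cover the unit square
# evenly; n is a power of two to keep the Sobol balance properties.
n = 1024
quasi = qmc.Sobol(d=2, scramble=True, seed=0).random(n)

# Plain pseudo-random points of the same size for comparison: clumpier coverage.
uniform = np.random.default_rng(0).random((n, 2))

fig, axes = plt.subplots(1, 2, figsize=(8, 4), sharex=True, sharey=True)
axes[0].scatter(quasi[:, 0], quasi[:, 1], s=4)
axes[0].set_title("Quasi-random (Sobol)")
axes[1].scatter(uniform[:, 0], uniform[:, 1], s=4)
axes[1].set_title("Pseudo-random")
plt.show()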

E

Thank you E, I managed to make such a plot with the Data Generator node, but now I wonder whether such plots can be done using actual data, or only with random data?

In the attached image there is a comparison of Actual Data, Normalized Actual Data and Random Data.

https://s10.postimg.org/d7ot9rht5/Captura.png

I am also uploading the example workflow. 

Best Regards


You can use the SMOTE node to resample from the data.  SMOTE is usually used to resample underweighted classes, but you can also resample generically.  Then use the Stresser node to add noise to the columns that you will be using as your X and Y, and apply your predictor to that new, noisy resampled data.
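
Roughly the same idea, sketched in Python outside KNIME; the imbalanced-learn and scikit-learn packages and the toy data below are only assumed stand-ins for the SMOTE and Stresser nodes:

import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Toy stand-in for the actual data: two numeric columns, two classes,
# deliberately imbalanced so SMOTE has something to resample.
X, y = make_classification(n_samples=600, n_features=2, n_informative=2,
                           n_redundant=0, weights=[0.8, 0.2], random_state=0)

# 1. Resample with SMOTE (here it synthesises extra minority-class points;
#    a sampling_strategy dict can request extra points for every class).
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)

# 2. Add noise to the resampled columns so the points spread beyond the
#    original data lines (a rough stand-in for the Stresser node).
rng = np.random.default_rng(0)
X_noisy = X_res + rng.normal(scale=0.3, size=X_res.shape)

# 3. Train on the original data and predict the noisy, resampled points.
model = DecisionTreeClassifier(max_depth=4).fit(X, y)
pred = model.predict(X_noisy)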

 

Mike

https://www.linkedin.com/in/michael-stansky-2590465

 

Mauuuu,

As I'm struggling to find robust webspace to upload solutions, just a textual comment on your workflow:

Learning a new model from generated data leads to the unequal response areas your screenshot shows. In order to get the original model's response areas, you need to "predict" the generated data's values with the originally trained model, not a re-trained one. I guess that's what you referred to with "using actual data" - just use the model trained on actual data, but feed it with random dots from across the entire spectrum of the original data's range.

I presume Mike's approach would create a "fuzzy" picture similar to your original data lines (plus some resampled points), but not a data range canvas filled entirely with coloured dots (which I believe you are still after).

Cheers
E

Thank you Michael and E, I am attaching a picture of the result: https://s13.postimg.org/8i2jpz12f/Image.png

Regarding Michael's approach you are right, but I do not understand to which set of data I should apply the decision tree rules. In this case I am applying them to 100% of the data, neither the training set nor the test set alone. Do you mean that I should apply the rules only to the training data?

I am providing a Google Drive link to the workflow, as it weighs 12 MB including the data, mostly because of the SMOTE node (which also took a while to run): http://bit.ly/2ekhanK

Best Regards

 

Mauuuuu,

During most of the week I'm also behind a firewall preventing me from uploading anything anywhere - so let me try to explain this a little better:

  1. You generate (quasi-)random data across the full range of actual data
  2. You split the actual data into training and test set
  3. You train the model on the training set
  4. You predict not only the test set (and/or the training set, or the full set), but also the random data.

The final element of step 4 (predicting the random data) should give you the desired "prediction area/boundary" graphs. HTH - if not I'll try to mod your workflow from home sometime this week.
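
If it helps before I get to the workflow, here is a minimal Python/scikit-learn sketch of the four steps; the iris columns below are only placeholders for your actual data:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Actual data: two iris columns stand in for the columns you want to plot.
X, y = load_iris(return_X_y=True)
X = X[:, :2]

# Step 1: (quasi-)random points across the full range of the actual data.
rng = np.random.default_rng(0)
X_random = rng.uniform(X.min(axis=0), X.max(axis=0), size=(5000, 2))

# Step 2: split the actual data into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Step 3: train the model on the training set only.
model = DecisionTreeClassifier(max_depth=4).fit(X_train, y_train)

# Step 4: predict the random points with the same trained model (no re-training).
pred = model.predict(X_random)

# Scatter plot of the random points coloured by prediction = the response areas,
# with the test points drawn on top for reference.
plt.scatter(X_random[:, 0], X_random[:, 1], c=pred, s=4, cmap="Set2")
plt.scatter(X_test[:, 0], X_test[:, 1], c=y_test, cmap="Set2", edgecolors="k")
plt.show()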

Cheers
E

Thank you E, if you can help me with that I will appreciate it, just to be sure.

Best regards

Mau

Hi,

Almost there - I managed to get the solution finished, but unfortunately it's firewalled off, so I cannot share the workflow. Let's start with a snapshot, then:

https://docs.google.com/uc?export=download&id=0B2wdJk9swWpQVkdZekllbFQ5MncwQ25vbGVkRGg5NHVjS2NV

The result looks like this - it would have been prettier to plot Col0 and Col8, which are covered by rules ;-), but I went with your example for recognisability:

https://docs.google.com/uc?export=download&id=0B2wdJk9swWpQUzJsQmZXeFBaSkE

If you still need it in workflow form just let me know - shouldn't take too long to repeat the exercise on my home setup. :-)

Cheers
E

Thank you for sharing, E, the efficiency of this apparently simple approach is impressive! I wouldn't have thought of the first step. Do you know of any recommended book or article for further reference?

Kind regards

Geo

Geo,

Not sure it's terribly efficient (for large data sets the I/O overhead involved might be prohibitive), but it's quick and easy to build for sure. I'm not really aware of any articles or such, except the broader field of sensitivity analysis perhaps. That's where the quasirandom methods for model analysis come from, described in policy analysis research here:

http://www.andreasaltelli.eu/file/repository/Saltelli_Lesson_Sens_Analysis.pdf

It involves loads of "exaggerated political faith in prediction models" lingo, so as a professional data skeptic it's a heart-warming read. ;-)

Cheers
E

Thanks !

Hi E, thank you very much, I managed to plot the decision tree boundaries with columns 0 and 8.

https://s16.postimg.org/j6e3u28et/Image.png

I am also attaching the Google Drive link to the workflow in case someone needs it:

https://drive.google.com/file/d/0B7RFEz-M2e3TZnhwVnBRNjJHSXc/view?usp=sharing

Thanks again

Mau