The Applied Data Science Workshop, Second Edition by Alex Galea

Author: Alex Galea
Language: eng
Format: epub
Publisher: Packt Publishing Pvt. Ltd.
Published: 2020-07-21T00:00:00+00:00


Figure 4.17: A decision tree from a Random Forest ensemble, where max_depth=5

From the preceding graph, we can see that each path is limited to five consecutive splits as a result of setting max_depth=5. At each branch, scikit-learn's decision tree algorithm selects the feature split that best separates the classes in the training data, as measured by node impurity. Consider the following section of the tree:
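As a rough sketch of the setup behind this figure: the snippet below trains a depth-limited Random Forest and confirms the cap on each tree. The synthetic data here is an assumption standing in for the preprocessed Human Resource Analytics features, not the book's exact code.

```python
# Sketch: train a Random Forest with depth-limited trees and inspect them.
# make_classification stands in for the HR dataset used in the chapter.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=8, random_state=0)

# max_depth=5 caps every path at five consecutive splits, as in Figure 4.17
model = RandomForestClassifier(n_estimators=50, max_depth=5, random_state=0)
model.fit(X, y)

# Each fitted ensemble member is a DecisionTreeClassifier; check the depth cap
depths = [tree.get_depth() for tree in model.estimators_]
print(max(depths))  # no tree is deeper than 5

# To render a diagram like Figure 4.17 for a single tree, you could use:
# from sklearn.tree import plot_tree
# plot_tree(model.estimators_[0], filled=True)
```

The `filled=True` option produces the colored, purity-shaded boxes discussed below.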

Figure 4.18: A section of the decision tree where a split is made on the last_evaluation ≤ 0.445 condition

Here, we can see that 1,926 training samples from the top node have been split on the last_evaluation ≤ 0.445 condition, resulting in a child node that's pure (on the left) with 208 "no" samples, and a child node that's mixed (on the right) with 1,544 "no" samples and 1,149 "yes" samples. Recall that "no" corresponds to employees who are still working at the company, while "yes" corresponds to those who have left.
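The purity the algorithm optimizes at each split can be checked by hand. Below is a minimal sketch of the Gini impurity calculation for the two child nodes just described, using the class counts quoted above:

```python
def gini(counts):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

# Left child: 208 "no" samples and no "yes" samples -> perfectly pure
left = gini([208, 0])

# Right child: 1,544 "no" and 1,149 "yes" samples -> still quite mixed
right = gini([1544, 1149])

print(round(left, 3), round(right, 3))  # 0.0 0.489
```

An impurity of 0 means every sample in the node shares one label, while values near 0.5 indicate a roughly even two-class mix, which is why the right-hand node is drawn in a lighter shade.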

The orange boxes represent nodes where the majority of samples are labeled "no", and the blue boxes represent nodes where the majority of samples are labeled "yes". The shade of each box, from light to dark, indicates the confidence in the majority class, which corresponds to the purity of that node.

Note

To access the source code for this specific section, please refer to https://packt.live/30FSdOZ.

You can also run this example online at https://packt.live/2ACdbUc.

This concludes our exercise on Random Forests and takes us to the end of our initial modeling research on the Human Resource Analytics dataset. In this exercise, we learned how to train Random Forests and explored how their decision tree constituents are composed.

Although we trained a variety of models in this section, we only worked through one end-to-end example, in which data was loaded, split into training and test sets, used to train a model, and then scored. After that, we built on earlier work to keep the modeling process simple.

In the next section, you'll have the opportunity to work through a full modeling activity, from loading the preprocessed dataset to scoring and comparing the final results.





