Data Smart by Jordan Goldmeier

Author: Jordan Goldmeier [Goldmeier, Jordan]
Language: eng
Format: epub
ISBN: 9781119931393
Publisher: Wiley
Published: 2023-09-20T00:00:00+00:00


SPLITTING A FEATURE WITH MORE THAN TWO VALUES

In the RetailMart example, all the independent variables are binary. You never have to decide how to split the training data when you create a decision tree—the 1s go one way, and the 0s go the other. But what if you have a feature that has all kinds of values?

For example, let's say you work for a large email service, and you want to know if an email address is alive and can receive mail. One of the metrics used to do this is how many days have elapsed since someone sent an email to that address.

This feature isn't anywhere close to being binary! So if you train a decision tree that uses this feature, how do you determine what value to split it on?

It's actually really easy.

There's only a finite number of values you can split on: at most, one unique value per record in your training set. And there are probably some addresses in your training set that share the same number of days since you last sent to them, so the real count is usually smaller.

You need to consider only these values. If you have four unique values to split on from your training records (say 10 days, 20 days, 30 days, and 40 days), splitting on 35 is no different from splitting on 30, because both thresholds send exactly the same records down each branch. So, you just check the impurity score you get from splitting on each value, and you pick the one that gives you the least impurity. Done!
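
The search described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses Gini impurity as the impurity score, treats a split on t as "value <= t goes left, everything else goes right," and the names gini and best_threshold, along with the sample records, are hypothetical rather than taken from the book.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 for an empty or pure list)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_threshold(values, labels):
    """Try each unique feature value t as a 'value <= t' split.

    Returns (threshold, weighted impurity) for the split with the least
    size-weighted impurity across the two branches.
    """
    n = len(values)
    best_t, best_score = None, float("inf")
    for t in sorted(set(values)):
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        if not right:
            continue  # every record lands on one side, so this is not a split
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best_score:
            best_t, best_score = t, score
    return best_t, best_score


# Hypothetical training records: days since the last send to an address,
# and whether that address turned out to be alive (1) or dead (0).
days = [10, 10, 20, 30, 40, 40]
alive = [1, 1, 1, 0, 0, 0]

threshold, impurity = best_threshold(days, alive)
print(threshold, impurity)  # 20 0.0
```

With these made-up records, splitting at 20 days separates the live addresses from the dead ones perfectly, so its weighted impurity is 0.0, while splitting at 10 or 30 leaves one mixed branch and scores 0.25. Only the values that actually appear in the data are tried, which is the point of the sidebar.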


