Machine Learning with R - Fourth Edition by Brett Lantz

Author: Brett Lantz
Language: eng
Format: epub
Publisher: Packt
Published: 2023-11-15T00:00:00+00:00


For a more in-depth look at the clusters, we can examine the coordinates of the cluster centroids using the teen_clusters$centers component, which is as follows for the first four interests:

> teen_clusters$centers

    basketball    football      soccer   softball
1  0.362160730  0.37985213  0.13734997  0.1272107
2 -0.094426312  0.06691768 -0.09956009 -0.0379725
3  0.003980104  0.09524062  0.05342109 -0.0496864
4  1.372334818  1.19570343  0.55621097  1.1304527
5 -0.186822093 -0.18729427 -0.08331351 -0.1368072

The rows of the output (labeled 1 to 5) refer to the five clusters, while the numbers across each row indicate the cluster’s average value for the interest listed at the top of the column. Because the values are z-score-standardized, positive values are above the overall mean level for all teenagers and negative values are below the overall mean.

For example, the fourth row has the highest value in the basketball column, which means that cluster 4 has the highest average interest in basketball among all the clusters.
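The same comparison can be made programmatically rather than by eye. The following is a minimal sketch that retypes the four columns of centroid values printed above into a matrix named centers (a stand-in for teen_clusters$centers, which has 36 columns in the full analysis) and asks which row holds the largest value in each column:

```r
# Centroid coordinates copied from the printed output (first four interests only).
# `centers` is a stand-in name for teen_clusters$centers.
centers <- matrix(
  c( 0.362160730,  0.37985213,  0.13734997,  0.1272107,
    -0.094426312,  0.06691768, -0.09956009, -0.0379725,
     0.003980104,  0.09524062,  0.05342109, -0.0496864,
     1.372334818,  1.19570343,  0.55621097,  1.1304527,
    -0.186822093, -0.18729427, -0.08331351, -0.1368072),
  nrow = 5, byrow = TRUE,
  dimnames = list(as.character(1:5),
                  c("basketball", "football", "soccer", "softball"))
)

# For each interest (column), find the cluster (row) with the largest centroid value
top_cluster <- apply(centers, 2, which.max)
print(top_cluster)
```

For these four interests, every column points to cluster 4, which matches the reading of the basketball column given above.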

By examining whether clusters fall above or below the mean level for each interest category, we can discover patterns that distinguish the clusters from one another. In practice, this involves printing the cluster centers and searching through them for any patterns or extreme values, much like a word search puzzle but with numbers. The following annotated screenshot shows a highlighted pattern for each of the five clusters, for 18 of the 36 teenager interests:

Figure 9.12: To distinguish clusters, it can be helpful to highlight patterns in the coordinates of their centroids
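The search for patterns can also be partly automated. The sketch below does not use the teenager data; it simulates a small, made-up dataset with three built-in groups, clusters it with kmeans(), and then replaces each centroid coordinate with a flag so that above-average and below-average interests stand out. The column names, the Poisson rates, the seed, and the 0.5 cutoff are all assumptions chosen for illustration.

```r
set.seed(123)  # arbitrary seed so the simulated data are reproducible

# Hypothetical data: 300 "users" by 4 invented interest counts, three built-in groups
n <- 100
sim <- rbind(
  cbind(rpois(n, 4),   rpois(n, 4),   rpois(n, 0.3), rpois(n, 0.3)),
  cbind(rpois(n, 0.3), rpois(n, 0.3), rpois(n, 4),   rpois(n, 4)),
  cbind(rpois(n, 0.3), rpois(n, 0.3), rpois(n, 0.3), rpois(n, 0.3))
)
colnames(sim) <- c("sport_a", "sport_b", "hobby_a", "hobby_b")

sim_z <- scale(sim)  # z-score standardize each column
sim_clusters <- kmeans(sim_z, centers = 3, nstart = 10)

# Flag each centroid coordinate: "+" well above the mean, "-" well below, "." near it
cutoff <- 0.5  # hypothetical threshold, not taken from the text
flags <- ifelse(sim_clusters$centers > cutoff, "+",
         ifelse(sim_clusters$centers < -cutoff, "-", "."))

print(round(sim_clusters$centers, 2))
print(noquote(flags))
```

Applied to the real analysis, the same ifelse() step could be run on teen_clusters$centers; the flags only narrow down where to look and do not replace a careful reading of the actual values.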

Given this snapshot of the interest data, we can already infer some characteristics of the clusters. Cluster four is substantially above the mean interest level on nearly all the sports, which suggests that this may be a group of athletes per The Breakfast Club stereotype. Cluster three includes the most mentions of cheerleading, dancing, and the word “hot.” Are these the so-called princesses?

By continuing to examine the clusters in this way, it is possible to construct a table listing the dominant interests of each of the groups. In the following table, each cluster is shown with the features that most distinguish it from the other clusters, and The Breakfast Club identity that seems to most accurately capture the group’s characteristics.

Interestingly, cluster five is distinguished by the fact that it is unexceptional: its members had lower-than-average levels of interest in every measured activity. It is also the single largest group in terms of the number of members. How can we reconcile these apparent contradictions? One potential explanation is that these users created a profile on the website but never posted any interests.

Figure 9.13: A table can be used to list important dimensions of each cluster
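The explanation offered for cluster five, that its members never posted any interests, is consistent with how z-score standardization treats zeros. The toy example below uses invented counts for a single interest keyword; it is only meant to show that a profile with zero mentions receives a negative standardized value whenever other profiles have positive counts.

```r
# Hypothetical counts of one interest keyword across ten profiles;
# seven profiles never mention it (values invented for illustration)
mentions <- c(0, 0, 0, 0, 0, 0, 0, 3, 5, 2)

# z-score standardization: subtract the mean, then divide by the standard deviation
mentions_z <- as.vector(scale(mentions))

print(round(mentions_z, 2))

# Every zero-count profile falls below the overall mean after standardization
print(all(mentions_z[mentions == 0] < 0))
```

A group made up largely of such inactive profiles would therefore sit below the mean on every interest, and it could easily be the largest group if many users never filled in their profiles.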

When sharing the results of a segmentation analysis with stakeholders, it is often helpful to apply memorable and informative labels known as personas, which simplify and capture the essence of the groups, such as The Breakfast Club typology applied here. The risk in adding such labels is that they can obscure the groups’ nuances and possibly even offend the group members if negative stereotypes are used. For wider dissemination, provocative labels like “Criminals”


