Machine Learning For Absolute Beginners: A Plain English Introduction (Second Edition) by Oliver Theobald



Figure 1: An example of k-NN used to predict the class of a new data point

As seen in Figure 1, the scatterplot enables us to compute the distance between any two data points. The data points on the scatterplot have already been categorized into two clusters. Next, a new data point whose class is unknown is added to the plot. We can predict the category of the new data point based on its relationship to existing data points.
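To make the distance calculation concrete, the straight-line (Euclidean) distance between two data points can be computed directly from their x and y coordinates. Below is a minimal sketch in Python; the coordinates are hypothetical stand-ins for two points on the scatterplot.

import math

# Two hypothetical data points, given as (x, y) coordinates.
point_a = (1.0, 1.2)
point_b = (3.0, 3.2)

# Euclidean (straight-line) distance between the two points.
distance = math.sqrt((point_a[0] - point_b[0]) ** 2
                     + (point_a[1] - point_b[1]) ** 2)
print(distance)  # 2.828...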

First, though, we must set k to determine how many neighboring data points are used to classify the new data point. If we set k to 3, k-NN analyzes the new data point’s relationship to its three nearest neighbors only. Selecting the three nearest neighbors returns two Class B data points and one Class A data point. With k set to 3, the model’s prediction for the new data point is therefore Class B, because two of the three nearest neighbors belong to that class.
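This voting scheme is available in Scikit-learn’s KNeighborsClassifier. The sketch below uses hypothetical coordinates arranged like the two clusters in Figure 1; with k set to 3, two of the new data point’s three nearest neighbors belong to Class B, so Class B is returned.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy dataset: points already labeled
# Class A (0) and Class B (1).
X = np.array([
    [1.0, 1.2], [1.5, 1.8], [1.2, 0.9],   # Class A
    [3.0, 3.2], [3.5, 2.9], [2.8, 3.5],   # Class B
])
y = np.array([0, 0, 0, 1, 1, 1])

# k is set to 3, so only the three nearest neighbors vote.
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X, y)

# Predict the class of a new, unlabeled data point.
new_point = np.array([[2.2, 2.3]])
print(model.predict(new_point))  # [1] -> Class B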

The number of neighbors, defined by k, is crucial in determining the results. In Figure 1, you can see that the classification changes depending on whether k is set to 3 or 7. It is therefore recommended that you test a range of k values to find the best fit and avoid setting k too low or too high. Setting k to an odd number will also eliminate the possibility of a statistical stalemate, in which the nearest neighbors split evenly between the two classes and return an invalid result. The default number of neighbors is five when using Scikit-learn.
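One straightforward way to test a range of k values is to score each candidate with cross-validation and keep the best performer. The sketch below is illustrative only, using Scikit-learn’s built-in Iris dataset as stand-in data.

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Try a range of odd k values and report the
# cross-validated accuracy of each.
for k in [1, 3, 5, 7, 9, 11]:
    model = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(model, X, y, cv=5)
    print(f"k={k}: mean accuracy {scores.mean():.3f}")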

Although k-NN is generally an accurate technique and simple to learn, storing an entire dataset and calculating the distance between each new data point and all existing data points places a heavy burden on computing resources. k-NN is therefore generally not recommended for use with large datasets.

Another potential downside is that it can be challenging to apply k-NN to high-dimensional data with many features. Measuring multiple distances between data points in a three- or four-dimensional space is taxing on computing resources, and it also becomes harder to perform accurate classification. Reducing the total number of dimensions, through a dimensionality reduction algorithm such as Principal Component Analysis (PCA) or by merging variables, is a common strategy to simplify and prepare a dataset for k-NN analysis.
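As a sketch of this strategy, PCA and k-NN can be chained in a Scikit-learn pipeline so that the number of dimensions is reduced before any distances are measured. The dataset (Scikit-learn’s built-in digits data, with 64 features per sample) and the choice of 10 components are illustrative assumptions.

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipeline = make_pipeline(
    StandardScaler(),            # put all features on the same scale
    PCA(n_components=10),        # reduce 64 dimensions to 10
    KNeighborsClassifier(n_neighbors=5),
)
pipeline.fit(X_train, y_train)
print(pipeline.score(X_test, y_test))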





