Machine Learning for Time-Series with Python by Ben Auffarth

Author: Ben Auffarth
Language: eng
Format: epub
Tags: COM062000 - COMPUTERS / Data Modeling & Design, COM004000 - COMPUTERS / Intelligence (AI) & Semantics, COM051360 - COMPUTERS / Programming Languages / Python
Publisher: Packt
Published: 2021-10-28T17:06:28+00:00


Anomaly detection

In anomaly detection, we want to identify sequences that are notably different from the rest of the series. Anomalies or outliers can sometimes be the result of measurement error or noise, but they could indicate changes to behavior or aberrant behavior in the system under observation, which could require urgent action.

An important application of anomaly detection is automatic real-time monitoring of potentially complex, high-dimensional datasets.

It's time for an attempt at a definition (after D.M. Hawkins, 1980, "Identification of Outliers"):

Definition: An outlier is a data point that deviates so significantly from other observations that it could have been generated by a different mechanism.

Let's start with a plot, so we can see how an anomaly might look graphically. This will also provide us context for our discussion.

Anomaly detection methods can be divided into univariate and multivariate methods. Parametric anomaly detection methods, through their choice of distribution parameters (for example, the arithmetic mean), place an assumption on the underlying distribution, often the Gaussian distribution. These methods flag as outliers the points that deviate from the model's assumptions.

In the simplest case, we can score each observation $x_i$ by its z-score with respect to the distribution parameters:

$$z_i = \frac{x_i - \hat{\mu}}{\hat{\sigma}}$$

The z-score measures the distance of each point from the moving average or sample mean, $\hat{\mu}$, in units of the moving or sample standard deviation, $\hat{\sigma}$. It is positive for values that lie above the mean, and negative for those that lie below the mean.

A point $x$ is then flagged as an outlier when $|x - \hat{\mu}| / \hat{\sigma} > \epsilon$. In this formula, $\hat{\mu}$ and $\hat{\sigma}$ are the estimated mean and standard deviation of the time-series and $x$ is the point that we want to test. Finally, $\epsilon$ is a threshold that depends on the confidence interval we are interested in; often, 2 or 1.96 is chosen, corresponding to a confidence interval of 95% under the normal distribution. In this way, outliers are points that would occur 5% or less of the time.
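As a minimal sketch of z-score thresholding, the snippet below computes $z_i = (x_i - \hat{\mu}) / \hat{\sigma}$ over a whole series and flags points with $|z_i| > \epsilon$. The function name `zscore_outliers`, the synthetic series, and the injected anomaly are illustrative assumptions, not taken from the book; a moving mean and standard deviation could be substituted for the global estimates used here.

```python
import numpy as np


def zscore_outliers(x, epsilon=1.96):
    """Return z-scores and a boolean mask of points with |z| > epsilon.

    Hypothetical helper for illustration; uses the global sample mean
    and sample standard deviation rather than moving estimates.
    """
    x = np.asarray(x, dtype=float)
    mu_hat = x.mean()            # estimated mean
    sigma_hat = x.std(ddof=1)    # sample standard deviation
    z = (x - mu_hat) / sigma_hat
    return z, np.abs(z) > epsilon


# Synthetic data (an assumption): Gaussian noise with one injected anomaly.
rng = np.random.default_rng(0)
series = rng.normal(loc=10.0, scale=1.0, size=200)
series[50] = 20.0  # injected anomaly, far above the mean

z, is_outlier = zscore_outliers(series)
print("z-score of injected point:", z[50])
print("number of flagged points:", int(is_outlier.sum()))
```

Because the threshold of 1.96 corresponds to roughly 5% of normally distributed points, a handful of ordinary observations will usually be flagged alongside the injected one; a larger $\epsilon$ trades sensitivity for fewer false alarms.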

The z-score assumes normally distributed data; however, the mean and standard deviation used in the outlier formula above can be replaced by other measures that do away with this assumption. Measures such as the median or the interquartile range (as discussed in Chapter 2, Time-Series Analysis with Python) are more robust to outliers and to departures from normality.

The Hampel filter (also: Hampel identifier) is a special case of this, where the median and the median absolute deviation (MAD) are employed:

$$\frac{|x - \operatorname{median}(X)|}{\operatorname{MAD}(X)} > \epsilon$$

In this equation, the sample mean is replaced by the (sample) median and the standard deviation by the MAD, which is defined as:

$$\operatorname{MAD}(X) = \operatorname{median}\left(|x_i - \operatorname{median}(X)|\right)$$

The median, in turn, is the middle number in a sorted list of numbers.

In the Hampel filter, each observation, $x$, is compared to the median. For normally distributed data, the MAD multiplied by a constant of about 1.4826 estimates the standard deviation, so with this scaling the Hampel filter behaves like the z-score, and $\epsilon$ can be chosen the same way as for the z-score.
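The Hampel rule can be sketched in a few lines of NumPy. The version below is a simplified, global (non-windowed) variant: it compares every observation to the median of the whole sample. The function name `hampel_outliers` and the `scale` parameter are assumptions for illustration; the default of 1.4826 is the usual constant that makes the MAD comparable to the standard deviation for normal data, and setting `scale=1.0` gives the unscaled rule. The sketch also assumes the MAD is non-zero.

```python
import numpy as np


def hampel_outliers(x, epsilon=1.96, scale=1.4826):
    """Flag points whose distance from the median exceeds epsilon scaled MADs.

    Hypothetical helper for illustration; assumes the MAD is non-zero.
    """
    x = np.asarray(x, dtype=float)
    med = np.median(x)                 # replaces the sample mean
    mad = np.median(np.abs(x - med))   # replaces the standard deviation
    score = np.abs(x - med) / (scale * mad)
    return score > epsilon


# Toy data (an assumption): six ordinary values and one extreme value.
data = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 100.0])
flags = hampel_outliers(data)
print(flags)
```

Here the median is 4 and the MAD is 2, so the extreme value does not distort the estimates it is tested against, which is the main motivation for using the median and MAD instead of the mean and standard deviation.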

In the multivariate case, the outlier function can be expressed as the distance (or, inversely, the similarity) to a point in the model distribution such as its center of gravity, the mean. For example, we could take the covariance of the new observation with the mean.
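One common way to make a covariance-aware distance to the mean concrete is the Mahalanobis distance. The text above does not name it, so treat this as a swapped-in illustration rather than the author's method. The sketch below estimates the mean and covariance from the data and scores each row by $\sqrt{(x - \hat{\mu})^\top \hat{\Sigma}^{-1} (x - \hat{\mu})}$; the function name and the synthetic data are assumptions.

```python
import numpy as np


def mahalanobis_distances(X):
    """Distance of each row of X to the sample mean, scaled by the covariance.

    Hypothetical helper for illustration; assumes the sample covariance
    matrix is invertible.
    """
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    cov = np.cov(X, rowvar=False)      # columns are variables
    cov_inv = np.linalg.inv(cov)
    diff = X - mean
    # Quadratic form diff_i^T * cov_inv * diff_i for every row i.
    d2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)
    return np.sqrt(d2)


# Synthetic data (an assumption): two positively correlated variables,
# plus one point that contradicts the correlation.
rng = np.random.default_rng(42)
X = rng.multivariate_normal(
    mean=[0.0, 0.0], cov=[[1.0, 0.8], [0.8, 1.0]], size=300
)
X[0] = [3.0, -3.0]  # injected anomaly

distances = mahalanobis_distances(X)
print("most distant row:", int(distances.argmax()))
```

The injected point is not extreme in either variable on its own terms, but it lies against the direction of correlation, which is why a distance that accounts for the covariance singles it out. A threshold on the distance then plays the role that $\epsilon$ plays in the univariate case.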


