
Exploring the Impact of Outliers in Statistical Analysis
Aug 23, 2024
Preparing your data for statistical analysis means, in short, removing factors that could distort the results of your analysis, and doing so in a methodical and theoretically defensible way.
Many sample statistics, such as the mean, can be shown to be approximately normally distributed in large samples. For a normal distribution, about 99.7% of values fall within 3 standard deviations of the mean, so a value far outside that range is unusual.
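To make the 3 standard deviation figure concrete, here is a minimal sketch in Python. It assumes simulated normal data with a made-up mean of 50 and standard deviation of 5, and the helper names are hypothetical. Using 3 standard deviations as a cutoff for flagging extreme values is a common rule of thumb, not a method prescribed by this post.

```python
import random
import statistics


def fraction_within_k_sigma(values, k=3.0):
    """Share of values lying within k sample standard deviations of the mean."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    inside = sum(1 for v in values if abs(v - mean) <= k * sd)
    return inside / len(values)


def flag_beyond_k_sigma(values, k=3.0):
    """Values lying more than k sample standard deviations from the mean."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [v for v in values if abs(v - mean) > k * sd]


random.seed(0)  # fixed seed so the simulated sample is reproducible

# Simulated "clean" data: the mean of 50 and sd of 5 are arbitrary assumptions.
clean = [random.gauss(50.0, 5.0) for _ in range(10_000)]
fraction_clean = fraction_within_k_sigma(clean)
print(fraction_clean)  # expected to be near 0.997 for normally distributed data

# Add one made-up extreme value and check that the rule flags it.
contaminated = clean + [500.0]
flagged = flag_beyond_k_sigma(contaminated)
print(500.0 in flagged)
```

One caveat with this kind of check: the extreme value itself inflates the mean and standard deviation used to judge it, so a simple cutoff like this can be less sensitive than it looks.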
Removing extreme data points (outliers)
An outlier is simply an atypical data point: an observation that lies far from the main cluster of data, either in the y direction or in the (or one of the) x direction(s). Imagine Bill Gates walking into a room during a comparison of education and net worth. That single point would influence the linear regression equation so strongly that the fitted line would no longer make sense. For this reason, outliers that change the fit are sometimes called influential observations. Not every outlier is influential, however, as the examples below show.
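The Bill Gates scenario can be sketched in a few lines of Python. All of the numbers below are invented for illustration, including the extreme net worth, and `ols_fit` is a hypothetical helper that applies the standard least squares formulas for a single predictor.

```python
def ols_fit(xs, ys):
    """Least squares slope and intercept for a single predictor."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


# Hypothetical room: years of education and net worth in thousands of dollars.
education = [10, 12, 12, 14, 16, 16, 18, 20]
net_worth = [40, 60, 55, 80, 100, 110, 130, 150]

slope_before, intercept_before = ols_fit(education, net_worth)

# One extreme visitor: 14 years of education and a made-up, enormous net worth.
slope_after, intercept_after = ols_fit(education + [14], net_worth + [100_000_000])

print(slope_before)  # positive: more education goes with higher net worth
print(slope_after)   # negative: the single point reverses the fitted trend
```

In this made-up sample the extra point sits slightly below the average education level, so its huge net worth tilts the line downward and the fitted relationship changes sign.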
Here are some graphs of OLS regressions, along with their residual plots, that show the effect of outliers[1].

(1) There is one outlier far from the other points, though it appears to have only minimal influence on the line.
(2) There is one outlier on the right, though it is quite close to the least squares line, which suggests it wasn’t very influential in changing the slope of the regression line.
(3) There is one point far away from the others, and this outlier appears to pull the least squares line up on the right; examine how the line around the primary cloud of residuals doesn’t appear to fit very well.

(4) There is a primary set of points and then a small point cloud of four outliers. The secondary cloud, based on the residuals, appears to be influencing the line somewhat strongly, making the least squares line fit poorly almost everywhere. There might be an interesting explanation for the dual clouds, which is something that could be investigated.
(5) There is no obvious trend in the main cloud of points and the outlier on the right appears to largely control the slope of the least squares line.
(6) There is one outlier far from the cloud; however, it falls quite close to the least squares line and does not appear to be very influential.
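The cases above judge influence by eye. One common way to put a number on it is Cook's distance, which combines the size of a point's residual with its leverage (how far its x value lies from the mean of x). The sketch below is an illustration under assumptions: the data are invented, `cooks_distances` is a hypothetical helper for a single-predictor regression, and Cook's distance is a technique brought in here, not one used in the graphs above.

```python
def cooks_distances(xs, ys):
    """Cook's distance of every point in a single-predictor OLS fit."""
    n = len(xs)
    p = 2  # fitted parameters: intercept and slope
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    intercept = y_bar - slope * x_bar
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    mse = sum(e * e for e in residuals) / (n - p)
    # Leverage grows with the squared distance of x from the mean of x.
    leverages = [1 / n + (x - x_bar) ** 2 / sxx for x in xs]
    return [
        (e * e / (p * mse)) * h / (1 - h) ** 2
        for e, h in zip(residuals, leverages)
    ]


# Made-up main cloud: y is roughly 2x with small fixed deviations.
cloud_x = list(range(1, 11))
noise = [0.3, -0.4, 0.2, 0.1, -0.3, 0.3, -0.1, -0.2, 0.4, -0.3]
cloud_y = [2 * x + e for x, e in zip(cloud_x, noise)]

# Outlier far to the right but on the trend, similar to cases (2) and (6).
d_near = cooks_distances(cloud_x + [20], cloud_y + [40.0])

# Outlier far to the right and far off the trend, similar to case (3).
d_far = cooks_distances(cloud_x + [20], cloud_y + [15.0])

print(d_near[-1])  # small: the point agrees with the line it helps define
print(d_far[-1])   # large: the point drags the line away from the cloud
```

A frequently quoted rule of thumb treats a Cook's distance above 1 as worth a closer look, though the threshold is a convention and not a formal test.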
[1] David M Diez, Christopher D Barr, & Mine Çetinkaya-Rundel, OpenIntro Statistics.





