This article introduces simple regression analysis, illustrates it with a small example, and shows how to perform the calculation in Python.
1. Simple regression analysis
1.1 Introduction
In layman’s terms, given a bunch of data points \({(x_i, y_i), i = 1, …, n}\) in two dimensions, simple regression analysis finds the straight line that fits them best:
$$
y=\alpha +\beta \cdot x
$$
where \(\alpha\) and \(\beta\) are the so-called intercept and slope respectively. Such an equation can be used to predict unknown values.
Before diving into the details, let’s take an example. Below are 15 data points, the heights and masses (weights) of a sample of American women aged 30–39.
# Height (m)
x = [1.47, 1.50, 1.52, 1.55, 1.57, 1.60, 1.63, 1.65, 1.68, 1.70, 1.73, 1.75, 1.78, 1.80, 1.83]
# Mass (kg)
y = [52.21, 53.12, 54.48, 55.84, 57.20, 58.57, 59.93, 61.29, 63.11, 64.47, 66.28, 68.10, 69.92, 72.19, 74.46]
We make a scatter plot of the above data points.
Fig. 1: An example illustrating simple regression analysis
Clearly, no matter how we draw the straight line, some points remain at a distance from it. This deviation is known as the error term, denoted by \(\varepsilon\). Thus, each data point \((x_i, y_i)\) can be described precisely by,
$$
y_i = \alpha + \beta \cdot x_i + \varepsilon_i
$$
So the question naturally arises: how do we find the ‘best’ line? One possible answer is to find the line that minimizes the sum of squared errors. This approach is known as ordinary least squares (OLS). Formally, find \(\alpha\) and \(\beta\) such that,
$$
\min_{\alpha ,\,\beta}Q(\alpha,\beta) =\sum_{i=1}^{n}\varepsilon_{i}^{\,2}=\sum_{i=1}^{n}(y_{i}-\alpha -\beta x_{i})^{2}
$$
We can derive \(\alpha\) and \(\beta\) using calculus, the geometry of inner product spaces, or simply by expanding \(Q\) as a quadratic expression.
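Whichever route we take, setting the partial derivatives of \(Q\) with respect to \(\alpha\) and \(\beta\) to zero yields the familiar closed-form estimates,
$$
\hat{\beta} = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^{n}(x_i-\bar{x})^{2}}, \qquad \hat{\alpha} = \bar{y} - \hat{\beta}\,\bar{x}
$$
where \(\bar{x}\) and \(\bar{y}\) are the sample means.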
1.2 Calculate using Python
scipy.stats.linregress is used to calculate a linear least-squares regression for two sets of measurements.
import scipy.stats

# Calculate a linear least-squares regression for two sets of measurements.
slope, intercept, r_value, p_value, std_err = scipy.stats.linregress(x, y)

# Returns
slope      # float, slope of the regression line
intercept  # float, intercept of the regression line
r_value    # float, Pearson correlation coefficient
p_value    # float, two-sided p-value for a hypothesis test whose null hypothesis is that the slope is zero
std_err    # float, standard error of the estimated slope
Applying scipy.stats.linregress to the above example, we obtain,
slope 61.2721865421
intercept -39.0619559188
R-squared 0.989196922446
p-value 3.60351533955e-14
standard error 1.77592275222
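As a sanity check, the closed-form estimates from Section 1.1 can be recomputed directly with NumPy; the following minimal sketch reproduces the slope and intercept above up to rounding.
import numpy as np

x = np.array([1.47, 1.50, 1.52, 1.55, 1.57, 1.60, 1.63, 1.65, 1.68, 1.70, 1.73, 1.75, 1.78, 1.80, 1.83])
y = np.array([52.21, 53.12, 54.48, 55.84, 57.20, 58.57, 59.93, 61.29, 63.11, 64.47, 66.28, 68.10, 69.92, 72.19, 74.46])

# closed-form OLS estimates
x_bar, y_bar = x.mean(), y.mean()
beta = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
alpha = y_bar - beta * x_bar
print(beta, alpha)  # approximately 61.27 and -39.06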
As described earlier, the slope is denoted by \(\beta\) and the intercept by \(\alpha\), so we can write down the fitted equation. It can be used to predict the mass for a given height.
$$
y = -39.06 + 61.27 \cdot x
$$
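For example, for a height of \(x = 1.70\) m, the fitted line predicts a mass of \(-39.06 + 61.27 \times 1.70 \approx 65.1\) kg, close to the observed value of 64.47 kg.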
R value
The Pearson product-moment correlation coefficient, also known as r, R, or Pearson’s r, is a measure of the strength and direction of the linear relationship between two variables. It is defined as the (sample) covariance of the variables divided by the product of their (sample) standard deviations.
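In sample form,
$$
r = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^{2}}\,\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^{2}}}
$$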
R squared
In statistics, the coefficient of determination, denoted \(R^2\) or \(r^2\) and pronounced “R squared”, is a number that indicates the proportion of the variance in the dependent variable that is predictable from the independent variable.
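It can be computed from the residuals as
$$
R^{2} = 1 - \frac{\sum_{i=1}^{n}(y_i-\hat{y}_i)^{2}}{\sum_{i=1}^{n}(y_i-\bar{y})^{2}}
$$
where \(\hat{y}_i\) are the fitted values. For simple linear regression, \(R^2\) is exactly the square of Pearson’s r, which is why the script below reports r_value**2.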
P value
In frequentist statistics, the p-value is a function of the observed sample results (a test statistic) relative to a statistical model, measuring how extreme the observation is under that model. Here the null hypothesis is that the slope is zero, so a very small p-value (as above) is strong evidence for a nonzero slope.
Standard error
In regression analysis, the term “standard error” is also used in the phrase standard error of the regression to mean the ordinary least squares estimate of the standard deviation of the underlying errors.
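Note that the std_err returned by scipy.stats.linregress is the standard error of the estimated slope; under the usual assumptions it is given by
$$
s_{\hat{\beta}} = \sqrt{\frac{\tfrac{1}{n-2}\sum_{i=1}^{n}\hat{\varepsilon}_i^{\,2}}{\sum_{i=1}^{n}(x_i-\bar{x})^{2}}}
$$
where \(\hat{\varepsilon}_i = y_i - \hat{\alpha} - \hat{\beta} x_i\) are the residuals.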
2. The source code
The source code of the above example is hosted on my GitHub, here.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from __future__ import division, print_function

import scipy.stats
import matplotlib.pyplot as plt
import numpy as np


def main():
    '''
    This example concerns the data set from the ordinary least squares article.
    This data set gives average masses for women as a function of their height
    in a sample of American women of age 30–39.
    '''
    # Height (m)
    x = [1.47, 1.50, 1.52, 1.55, 1.57, 1.60, 1.63, 1.65, 1.68, 1.70, 1.73, 1.75, 1.78, 1.80, 1.83]
    # Mass (kg)
    y = [52.21, 53.12, 54.48, 55.84, 57.20, 58.57, 59.93, 61.29, 63.11, 64.47, 66.28, 68.10, 69.92, 72.19, 74.46]

    # Step 1: regression analysis
    slope, intercept, r_value, p_value, std_err = scipy.stats.linregress(x, y)
    print("slope\t", slope)
    print("intercept\t", intercept)
    print("R-squared\t", r_value ** 2)
    print("p-value\t", p_value)
    print("standard error\t", std_err)

    # Step 2: plot the graph
    fig, ax = plt.subplots()
    # scatter plot of the raw data
    ax.plot(x, y, 'bx', markersize=10)
    # plot the fitted straight line
    new_x = np.linspace(min(x), max(x), 1000)
    new_y = slope * new_x + intercept
    ax.plot(new_x, new_y, 'r', linewidth=2)
    ax.set_xlabel('Height (m)')
    ax.set_ylabel('Mass (kg)')
    plt.grid()
    out_file = 'simple_regression_analysis.png'
    plt.savefig(out_file, bbox_inches='tight')
    plt.show()


if __name__ == '__main__':
    main()
References:
[1] Wikipedia: Simple linear regression
[2] Wikipedia: Coefficient of determination