Simple regression analysis using Python

This article introduces simple regression analysis illustrated with a simple example and shows how to calculate it using Python.

1. Simple regression analysis

1.1 Introduction

In layman’s terms, given a bunch of data points ${(x_i, y_i), i = 1, …, n}$, simple regression analysis is to find a straight line that fits them the best. (two dimensions)

$$y=\alpha +\beta \cdot x$$

where $\alpha$ and $\beta$ are so-called intercept and slop respectively. Such a equation can be used for predicting the unknown value.

Before diving into details, let’s take an example. There are 15 data points, the height and mass (weight), from a sample of American women of age 30–39.

# Height (m)
x = [1.47, 1.50, 1.52, 1.55, 1.57, 1.60, 1.63, 1.65, 1.68, 1.70, 1.73, 1.75, 1.78, 1.80, 1.83]

# Mass (kg)
y = [52.21, 53.12, 54.48, 55.84, 57.20, 58.57, 59.93, 61.29, 63.11, 64.47, 66.28, 68.10, 69.92, 72.19, 74.46]


We make a scatter plot for the above data points. Fig. 1: An example illustrates simple regression analysis

Clearly, no matter how we plot the straight line, some points have a distance to that line. This is known as the error term, denoted by $\varepsilon$. Thus, each data point $(x_i, y_j)$ can be precisely described by,

$$y_i = \alpha + \beta \cdot x_i + \varepsilon_i$$

So very naturally the question arises: how do we find the ‘best’ line? One possible solution is to find a line that minimizes the sum of squared errors. Such a approach is so-called ordinary least squares (OLS). Formally, find $\alpha$ and $\beta$ such that,

$$\min_{\alpha ,\,\beta}Q(\alpha,\beta) =\sum_{i=1}^{n}\varepsilon_{i}^{\,2}=\sum_{i=1}^{n}(y_{i}-\alpha -\beta x_{i})^{2}$$

We can derive $\alpha$ and $\beta$ by using either calculus, the geometry of inner product spaces, or simply expanding to a quadratic expression.

1.2 Calculate using Python

scipy.stats.linregress is used to calculate a linear least-squares regression for two sets of measurements.

import scipy.stats

# Calculate a linear least-squares regression for two sets of measurements.
slope, intercept, r_value, p_value, std_err = scipy.stats.linregress(x, y=None)

# Returns
slope       # float, slope of the regression line
intercept   # float, intercept of the regression line
r_value     # float, correlation coefficient
p_value     # float, two-sided p-value for a hypothesis test whose null hypothesis is that the slope is zero.
std_err     # float, standard error of the estimate


Apply scipy.stats.linregress to the above example, we have,

slop            61.2721865421
intercept       -39.0619559188
R-squared       0.989196922446
p-value         3.60351533955e-14
standard error  1.77592275222


As described earlier, slope is denoted by $\beta$ and intercept is denoted by $\alpha$. We are able to derive an equation. It can be used to predict the height with the known mass and vice versa.

$$y = -39.06 + 61.27 \cdot x$$

R value

Pearson product-moment correlation coefficient, also known as r, R, or Pearson’s r, a measure of the strength and direction of the linear relationship between two variables that is defined as the (sample) covariance of the variables divided by the product of their (sample) standard deviations.

R squared

In statistics, the coefficient of determination, denoted $R^2$ or $r^2$ and pronounced “R squared”, is a number that indicates the proportion of the variance in the dependent variable that is predictable from the independent variable.

P value

In frequentist statistics, the p-value is a function of the observed sample results (a test statistic) relative to a statistical model, which measures how extreme the observation is.

Standard error

In regression analysis, the term “standard error” is also used in the phrase standard error of the regression to mean the ordinary least squares estimate of the standard deviation of the underlying errors.

2. The source code

The source code of the above example is hosted on my GitHub, here.

#!/usr/bin/env python
# -*- coding: utf-8 -*-#
from __future__ import division

import scipy.stats
import matplotlib.pyplot as plt
import numpy as np

def main():

'''
This example concerns the data set from the ordinary least squares article.
This data set gives average masses for women as a function of their height
in a sample of American women of age 30–39.
'''

x = [1.47, 1.50, 1.52, 1.55, 1.57, 1.60, 1.63, 1.65, 1.68, 1.70, 1.73, 1.75, 1.78, 1.80, 1.83]
y = [52.21, 53.12, 54.48, 55.84, 57.20, 58.57, 59.93, 61.29, 63.11, 64.47, 66.28, 68.10, 69.92, 72.19, 74.46]

# Step 1: regression analysis
slope, intercept, r_value, p_value, std_err = scipy.stats.linregress(x,y)

print "slop\t", slope
print "intercept\t", intercept
print "R-squared\t", r_value**2
print "p-value\t", p_value
print "standard error\t", std_err

# Step 2: plot the graph
fig, ax = plt.subplots()

# plot scatter
ax.plot(x, y, 'bx', markersize=10)

# plot the straight line
new_x = np.linspace(min(x), max(x), 1000)
new_y = slope * new_x + intercept

ax.plot(new_x, new_y, 'r', linewidth=2)
ax.set_xlabel('Height (m)')
ax.set_ylabel('Mass (kg)')
plt.grid()

out_file = 'simple_regression_analysis.png'
plt.savefig(out_file, bbox_inches='tight')

plt.show()

if __name__ == '__main__':
main()


References:
 Wikipedia: Simple linear regression
 Wikipedia: Coefficient of determination