What Are the Assumptions of Linear Regression?
Linear regression is one of the most widely used statistical methods in econometrics, data science, and many other fields. It helps quantify the relationship between a dependent variable and one or more independent variables. However, for linear regression to produce reliable, unbiased, and interpretable results, several key assumptions must be satisfied. These assumptions form the foundation of the classical linear regression model and guide both model construction and evaluation.
Understanding these assumptions is essential not only for building accurate models but also for diagnosing problems and improving results. This article explains the main assumptions of linear regression in a clear and practical way.
1. Linearity
The first and most fundamental assumption is linearity. It states that the relationship between the dependent variable and the independent variables is linear in parameters.
This does not necessarily mean that the relationship must be a straight line in terms of the variables themselves. Instead, it means that the model can be expressed as a linear combination of coefficients. For example:
\[
Y = \beta_0 + \beta_1 X + \varepsilon
\]
or even:
\[
Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \varepsilon
\]
Both are linear in parameters (the betas), even though the second includes a squared term.
Why it matters:
If the true relationship is not linear and the model assumes linearity, the estimates may be biased and misleading.
How to check:
Scatter plots, residual plots, and adding polynomial or interaction terms can help assess and address non-linearity.
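As an illustrative sketch (using numpy and synthetic data, not any particular dataset from this article), the quadratic model above can be estimated by ordinary least squares precisely because it is linear in the betas: the squared term is just another column in the design matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 200)
# True relationship: Y = 1 + 2X + 0.5X^2 + noise
y = 1.0 + 2.0 * x + 0.5 * x**2 + rng.normal(0, 1, size=x.size)

# Design matrix with intercept, x, and x^2: the model is still linear in the betas.
X = np.column_stack([np.ones_like(x), x, x**2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

residuals = y - X @ beta
print(beta)  # estimates land close to the true values [1.0, 2.0, 0.5]
```

A systematic pattern in `residuals` when plotted against `x` would signal that the chosen functional form is still missing something.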
2. Independence of Errors
The second assumption is that the error terms are independent of each other. In other words, the residuals (differences between observed and predicted values) should not be correlated.
This assumption is especially important in time series data, where observations are often sequential and may influence each other.
Why it matters:
If errors are correlated (a problem known as autocorrelation), standard errors may be underestimated, leading to incorrect conclusions about statistical significance.
How to check:
- Durbin-Watson test
- Plotting residuals over time
- Examining autocorrelation functions
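The Durbin-Watson statistic is simple enough to compute by hand; a minimal sketch with numpy and simulated residuals (values near 2 suggest independence, values well below 2 suggest positive autocorrelation):

```python
import numpy as np

def durbin_watson(residuals):
    # DW = sum of squared successive differences / sum of squared residuals
    diff = np.diff(residuals)
    return np.sum(diff**2) / np.sum(residuals**2)

rng = np.random.default_rng(1)
independent = rng.normal(size=500)

# AR(1) errors: each residual carries over 0.8 of the previous one.
autocorrelated = np.empty(500)
autocorrelated[0] = rng.normal()
for t in range(1, 500):
    autocorrelated[t] = 0.8 * autocorrelated[t - 1] + rng.normal()

print(durbin_watson(independent))     # near 2: no autocorrelation
print(durbin_watson(autocorrelated))  # well below 2: positive autocorrelation
```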
3. Homoscedasticity (Constant Variance of Errors)
Homoscedasticity means that the variance of the error terms is constant across all levels of the independent variables.
In simple terms, the spread of residuals should be roughly the same regardless of the value of the predictors.
Why it matters:
If the variance changes (heteroscedasticity), the model’s estimates remain unbiased, but they are no longer efficient, and standard errors may be incorrect.
How to check:
- Residual vs. fitted value plots
- Breusch-Pagan or White tests
Example of violation:
If the spread of residuals grows (fans out) as the fitted values increase, this indicates heteroscedasticity.
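A minimal sketch of the Breusch-Pagan idea with numpy and simulated heteroscedastic data: regress the squared residuals on the predictors and compute the LM statistic as n times the R-squared of that auxiliary regression.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
x = rng.uniform(1, 10, n)
y = 2.0 + 3.0 * x + rng.normal(0, x)  # noise sd grows with x: heteroscedastic

# Fit OLS and collect residuals.
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

# Breusch-Pagan: regress squared residuals on the predictors; LM = n * R^2.
u2 = resid**2
gamma, *_ = np.linalg.lstsq(X, u2, rcond=None)
r2 = 1 - np.sum((u2 - X @ gamma) ** 2) / np.sum((u2 - u2.mean()) ** 2)
lm = n * r2
print(lm)  # compare against the chi-square critical value (3.84 at 5%, 1 df)
```

A large LM statistic rejects the null of constant error variance.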
4. No Perfect Multicollinearity
This assumption requires that the independent variables are not perfectly linearly related to each other.
Perfect multicollinearity occurs when one predictor can be exactly expressed as a linear combination of others.
Why it matters:
If perfect multicollinearity exists, the regression model cannot uniquely estimate coefficients. Even high (but not perfect) multicollinearity can make estimates unstable and difficult to interpret.
How to check:
- Correlation matrix
- Variance Inflation Factor (VIF)
Example:
Including both “total income” and “income after tax” in a model may introduce multicollinearity if one is closely tied to the other.
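The VIF for predictor j is 1/(1 - R²) from regressing that predictor on all the others. A sketch with numpy and simulated data mirroring the income example above (variable names and coefficients are illustrative):

```python
import numpy as np

def vif(X, j):
    # Regress column j on the remaining columns; VIF = 1 / (1 - R^2).
    y = X[:, j]
    others = np.column_stack([np.ones(len(y)), np.delete(X, j, axis=1)])
    beta, *_ = np.linalg.lstsq(others, y, rcond=None)
    resid = y - others @ beta
    r2 = 1 - np.sum(resid**2) / np.sum((y - y.mean()) ** 2)
    return 1.0 / (1.0 - r2)

rng = np.random.default_rng(3)
total_income = rng.normal(50, 10, 300)
income_after_tax = 0.7 * total_income + rng.normal(0, 1, 300)  # nearly collinear
age = rng.normal(40, 12, 300)                                  # unrelated predictor

X = np.column_stack([total_income, income_after_tax, age])
print([round(vif(X, j), 1) for j in range(3)])  # first two VIFs are large, third is near 1
```

A common rule of thumb treats VIF values above 10 as a sign of problematic multicollinearity.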
5. Zero Conditional Mean (Exogeneity)
One of the most critical assumptions is that the expected value of the error term, given the independent variables, is zero:
\[
E(\varepsilon \mid X) = 0
\]
This means that the independent variables are not correlated with the error term.
Why it matters:
If this assumption is violated, the model suffers from endogeneity, leading to biased and inconsistent estimates.
Common causes of violation:
- Omitted variable bias
- Measurement error
- Simultaneity (two-way causality)
Example:
If you are modeling wages based on education but omit ability (which affects both education and wages), the error term will be correlated with education.
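The wage example can be simulated to show the bias directly. In this sketch (numpy, synthetic data, all coefficients chosen for illustration), the true effect of education on wages is 1.0, but the short regression that omits ability inflates it:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 2000
ability = rng.normal(size=n)
education = 0.5 * ability + rng.normal(size=n)               # ability raises education
wage = 1.0 * education + 2.0 * ability + rng.normal(size=n)  # true education effect: 1.0

# Short regression omitting ability: education absorbs part of ability's effect.
X_short = np.column_stack([np.ones(n), education])
b_short, *_ = np.linalg.lstsq(X_short, wage, rcond=None)

# Long regression including ability recovers the true coefficient.
X_long = np.column_stack([np.ones(n), education, ability])
b_long, *_ = np.linalg.lstsq(X_long, wage, rcond=None)

print(b_short[1], b_long[1])  # biased upward vs. close to 1.0
```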
6. Normality of Errors (for Inference)
The assumption of normality states that the error terms are normally distributed.
This assumption is not strictly necessary for estimating coefficients, but it is important for conducting hypothesis tests and constructing confidence intervals, especially in small samples.
Why it matters:
Non-normal errors can affect the validity of statistical tests, particularly when sample sizes are small.
How to check:
- Histogram of residuals
- Q-Q plots
- Shapiro-Wilk test
Note:
With large sample sizes, the Central Limit Theorem often reduces the importance of this assumption.
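A quick numerical check, without any plotting, is to compute the skewness and excess kurtosis of the residuals; both are near zero for normal errors (this is the idea behind the Jarque-Bera test). A sketch with numpy and simulated residuals:

```python
import numpy as np

def skewness(r):
    r = r - r.mean()
    return np.mean(r**3) / np.mean(r**2) ** 1.5

def excess_kurtosis(r):
    r = r - r.mean()
    return np.mean(r**4) / np.mean(r**2) ** 2 - 3.0

rng = np.random.default_rng(5)
normal_resid = rng.normal(size=5000)
skewed_resid = rng.exponential(size=5000) - 1.0  # mean zero, but right-skewed

print(skewness(normal_resid), excess_kurtosis(normal_resid))  # both near 0
print(skewness(skewed_resid), excess_kurtosis(skewed_resid))  # clearly nonzero
```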
7. No Measurement Error in Independent Variables
Linear regression assumes that the independent variables are measured without error.
Why it matters:
Measurement errors in predictors can lead to biased and inconsistent estimates, typically biasing coefficients toward zero (attenuation bias).
Example:
If income is reported inaccurately in a survey, regression results using income as a predictor may be distorted.
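Attenuation bias is easy to demonstrate by simulation. In this sketch (numpy, synthetic data, with a true slope of 0.8 chosen for illustration), adding classical measurement error to the predictor shrinks the estimated slope toward zero:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 5000
true_income = rng.normal(50, 10, n)
spending = 0.8 * true_income + rng.normal(0, 5, n)  # true slope: 0.8

# Survey income is reported with noise (classical measurement error).
reported_income = true_income + rng.normal(0, 10, n)

def slope(x, y):
    X = np.column_stack([np.ones(len(x)), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[1]

print(slope(true_income, spending))      # close to 0.8
print(slope(reported_income, spending))  # attenuated toward zero
```

With equal signal and noise variance, as here, the slope is attenuated by roughly half.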
8. Correct Model Specification
The model must be correctly specified, meaning:
- All relevant variables are included
- No irrelevant variables are added
- The functional form is appropriate
Why it matters:
Misspecification can lead to biased coefficients and poor predictive performance.
Examples of misspecification:
- Omitting an important variable
- Using a linear model when the true relationship is nonlinear
- Ignoring interaction effects
Consequences of Violating Assumptions
Not all assumption violations have the same impact. Here is a quick summary:
- Linearity violation: biased estimates
- Autocorrelation: incorrect standard errors
- Heteroscedasticity: inefficient estimates and misleading inference
- Multicollinearity: unstable coefficients
- Endogeneity: biased and inconsistent estimates
- Non-normality: unreliable hypothesis tests (in small samples)
Understanding which assumption is violated helps determine the appropriate remedy.
Practical Approaches to Address Violations
When assumptions are not met, several techniques can help:
- Transformations: log or square root transformations for non-linearity or heteroscedasticity
- Robust standard errors: to handle heteroscedasticity
- Instrumental variables: to address endogeneity
- Time-series models: for autocorrelation
- Regularization methods (e.g., Ridge, Lasso): for multicollinearity
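As one example of these remedies, ridge regression has a simple closed form: add a penalty to the normal equations. A minimal sketch with numpy and two nearly collinear synthetic predictors (the penalty value and data are illustrative, and the intercept is left unpenalized):

```python
import numpy as np

def ridge(X, y, alpha):
    # Closed-form ridge: solve (X'X + alpha*I) b = X'y, leaving the intercept unpenalized.
    n, k = X.shape
    Xd = np.column_stack([np.ones(n), X])
    penalty = alpha * np.eye(k + 1)
    penalty[0, 0] = 0.0
    return np.linalg.solve(Xd.T @ Xd + penalty, Xd.T @ y)

rng = np.random.default_rng(7)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(0, 0.01, n)         # nearly collinear with x1
y = 1.0 + x1 + x2 + rng.normal(0, 1, n)  # each predictor truly contributes 1.0

ols = ridge(np.column_stack([x1, x2]), y, alpha=0.0)
shrunk = ridge(np.column_stack([x1, x2]), y, alpha=5.0)
# OLS splits the shared effect erratically between x1 and x2;
# ridge keeps the two coefficients stable and close to each other.
print(ols[1:], shrunk[1:])
```

Note that ridge trades a small amount of bias for a large reduction in variance, which is exactly the instability that multicollinearity creates.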
Model diagnostics should always be part of the regression workflow.
Conclusion
The assumptions of linear regression are essential for ensuring that results are reliable, interpretable, and statistically valid. While real-world data rarely satisfies all assumptions perfectly, understanding them allows researchers and analysts to detect problems and apply appropriate solutions.
In practice, linear regression is as much about diagnosing and refining models as it is about estimating relationships. By carefully checking assumptions such as linearity, independence, homoscedasticity, and exogeneity, one can significantly improve both the quality and credibility of empirical analysis.