
SIG Quantitative Researcher Intern Interview Question: Linear Regression Assumptions and Violations
Linear regression is a cornerstone of quantitative analysis and modeling, especially in financial firms like Susquehanna International Group (SIG). Understanding its foundational assumptions is crucial for aspiring quantitative researcher interns. In interviews, one of the most common—and critical—questions is about the assumptions underlying linear regression and the practical implications of violating them. This article offers an exhaustive guide to linear regression assumptions, their importance, how to detect violations, and the impact of these violations on your models. Whether you're preparing for an SIG Quantitative Researcher Intern interview or seeking to master regression for your quantitative toolset, this resource has you covered.
Introduction to Linear Regression
Linear regression is a statistical technique for modeling the relationship between a dependent variable and one or more independent variables. Simple linear regression deals with one independent variable, while multiple linear regression involves several. The method estimates parameters (coefficients) to minimize the difference between observed values and model predictions, typically using the Ordinary Least Squares (OLS) approach.
The standard linear regression equation is:
$$ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_p x_p + \varepsilon $$
- \( y \): Dependent variable (response)
- \( x_1, x_2, ..., x_p \): Independent variables (predictors)
- \( \beta_0 \): Intercept
- \( \beta_1, ..., \beta_p \): Coefficients for each predictor
- \( \varepsilon \): Random error term
Linear regression is widely used for its interpretability and computational efficiency. However, its validity depends on several statistical assumptions. Knowing these assumptions—and how to test and respond to their violations—is a fundamental skill for any quantitative researcher.
Key Assumptions of Linear Regression
The reliability of linear regression analysis is based on several key assumptions. Violations can lead to biased, inefficient, or misleading results. Let's explore each assumption in detail.
1. Linearity
Assumption: The relationship between the independent variables and the dependent variable is linear.
Mathematically, this means that the expected value of \( y \) given the predictors is a linear function:
$$ E(y|X) = \beta_0 + \beta_1 x_1 + \ldots + \beta_p x_p $$
Why it matters: If the true relationship is nonlinear, the linear model will systematically misestimate the relationship, leading to biased predictions and misleading inference.
2. Independence of Errors
Assumption: The residuals (errors) are independent of each other.
Formally,
$$ Cov(\varepsilon_i, \varepsilon_j) = 0 \quad \forall i \neq j $$
Why it matters: If errors are correlated (e.g., in time series or panel data), standard errors will be underestimated, inflating the significance of predictors and leading to unreliable hypothesis tests.
3. Homoscedasticity (Constant Variance of Errors)
Assumption: The variance of the residuals is constant across all levels of the independent variables.
Mathematically:
$$ Var(\varepsilon_i) = \sigma^2 \quad \text{for all } i $$
Why it matters: If residual variance changes with the level of predictors (heteroscedasticity), OLS estimates remain unbiased but are no longer efficient, and the usual standard errors become unreliable.
4. Normality of Errors
Assumption: The residuals are normally distributed.
$$ \varepsilon_i \sim N(0, \sigma^2) $$
Why it matters: Normality underpins exact hypothesis tests and confidence intervals. The OLS estimator remains unbiased even if normality is violated, but small-sample inference (p-values, confidence intervals) may be invalid.
5. No Perfect Multicollinearity
Assumption: No independent variable is a perfect linear function of others.
Why it matters: Perfect multicollinearity makes it impossible to estimate unique regression coefficients, as the predictors are redundant.
6. No Measurement Error in Predictors
Assumption: The independent variables are measured without error.
Why it matters: Errors in predictors lead to biased and inconsistent parameter estimates (attenuation bias).
Detailed Explanation of Each Assumption and Its Violations
1. Linearity
Description: The model should capture the true relationship between predictors and response as a straight line (or hyperplane with multiple predictors).
How to Check
- Plot residuals vs. predicted values or each predictor. Non-random patterns suggest nonlinearity.
- Use partial residual plots or component plus residual (CR) plots.
Consequences of Violation
- Biased estimates of coefficients.
- Poor predictive performance.
- Misleading conclusions about variable importance.
Remedies
- Apply nonlinear transformations (log, square root, polynomial terms).
- Use generalized additive models (GAM) or other nonlinear models.
```python
# Example: Adding a polynomial term
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.arange(10).reshape(-1, 1)
y = 2 * X.squeeze() + 0.5 * (X.squeeze() ** 2) + np.random.normal(size=10)

# Add polynomial term so the linear model can capture the curvature
X_poly = np.hstack((X, X ** 2))
model = LinearRegression().fit(X_poly, y)
print(model.coef_)
```
2. Independence of Errors
Description: Each observation’s error should be unrelated to the others. This is especially important in time series or clustered data.
How to Check
- Durbin-Watson test (for autocorrelation in residuals).
- Plot residuals over time or sequence.
Consequences of Violation
- Underestimated standard errors, leading to spurious statistical significance.
- Biased inference.
Remedies
- Use time series models (AR, ARMA, ARIMA).
- Include lagged dependent variables or predictors.
- Cluster-robust standard errors.
```python
# Example: Durbin-Watson test in statsmodels (reuses X and y from above)
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

model = sm.OLS(y, sm.add_constant(X)).fit()
dw = durbin_watson(model.resid)
print('Durbin-Watson statistic:', dw)
```
3. Homoscedasticity (Constant Variance of Errors)
Description: The spread of residuals should be roughly the same across all levels of the predicted values or independent variables.
How to Check
- Plot residuals versus predicted values; look for "fan" or "cone" shapes indicating non-constant variance.
- Breusch-Pagan or White test for formal assessment.
Consequences of Violation
- Inefficient estimates (not minimum variance).
- Standard errors (and thus confidence intervals and p-values) are incorrect.
Remedies
- Transform the dependent variable (e.g., log-transform).
- Use weighted least squares regression.
- Robust standard errors (e.g., White's robust SE).
```python
# Example: Breusch-Pagan test in statsmodels
import statsmodels.stats.api as sms

test = sms.het_breuschpagan(model.resid, model.model.exog)
print('Lagrange multiplier statistic:', test[0])
print('p-value:', test[1])
```
4. Normality of Errors
Description: The error terms should be normally distributed; this matters most in small samples, where inference relies on it.
How to Check
- Histogram or Q-Q plot of residuals.
- Shapiro-Wilk, Kolmogorov-Smirnov, or Anderson-Darling tests.
Consequences of Violation
- Invalid confidence intervals and hypothesis tests.
- Estimates remain unbiased but inference is problematic for small samples.
- Central Limit Theorem often mitigates this for large samples.
Remedies
- Transform the dependent variable.
- Use bootstrapping for inference.
- Non-parametric or robust regression methods.
```python
# Example: Q-Q plot of residuals against the normal distribution
import scipy.stats as stats
import matplotlib.pyplot as plt

stats.probplot(model.resid, dist="norm", plot=plt)
plt.show()
```
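Among the remedies listed above, bootstrapping sidesteps the normality assumption entirely by resampling the data. A minimal sketch of a bootstrap confidence interval for a slope, using heavy-tailed (deliberately non-normal) errors assumed for illustration:

```python
# Sketch: bootstrap 95% confidence interval for a regression slope
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=150)
y = 1.5 * x + rng.standard_t(df=3, size=150)  # heavy-tailed, non-normal errors

slopes = []
for _ in range(1000):
    idx = rng.integers(0, len(x), size=len(x))  # resample (x, y) pairs with replacement
    xb, yb = x[idx], y[idx]
    slopes.append(np.polyfit(xb, yb, 1)[0])     # OLS slope of the resample

lo, hi = np.percentile(slopes, [2.5, 97.5])
print(f"95% bootstrap CI for slope: [{lo:.2f}, {hi:.2f}]")
```

No normality assumption is needed; the percentile interval comes straight from the empirical distribution of resampled slopes.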
5. No Perfect Multicollinearity
Description: No predictor is a perfect linear combination of others. High (but imperfect) correlation among predictors is called multicollinearity.
How to Check
- Calculate correlation matrix for predictors.
- Variance Inflation Factor (VIF) for each predictor.
Consequences of Violation
- Perfect multicollinearity: Model cannot estimate unique coefficients.
- High multicollinearity: Large standard errors, unstable coefficient estimates, unreliable inference.
Remedies
- Remove or combine highly correlated predictors.
- Apply dimensionality reduction (e.g., PCA).
- Regularization techniques (Ridge or Lasso regression).
```python
# Example: Calculate VIF for each column (reuses X from above)
from statsmodels.stats.outliers_influence import variance_inflation_factor

X = sm.add_constant(X)
vif = [variance_inflation_factor(X, i) for i in range(X.shape[1])]
print('VIF:', vif)
```
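Of the remedies above, regularization is often the simplest to apply. The sketch below builds two nearly collinear predictors (the correlation structure is assumed for illustration) and compares OLS with Ridge, which shrinks the unstable coefficients:

```python
# Sketch: Ridge regression stabilizes coefficients under near-collinearity
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(3)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.05, size=100)   # nearly a copy of x1
X = np.column_stack([x1, x2])
y = x1 + x2 + rng.normal(size=100)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)
print("OLS coefficients:  ", ols.coef_)    # can be large and offsetting
print("Ridge coefficients:", ridge.coef_)  # shrunk toward stable values
```

Ridge trades a little bias for a large reduction in variance, which is exactly the failure mode multicollinearity creates.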
6. No Measurement Error in Predictors
Description: Predictors are assumed to be measured accurately. In reality, measurement error can bias coefficient estimates towards zero (attenuation bias).
How to Check
- Compare multiple sources or repeated measurements if available.
- Instrumental variables can be used if a suitable instrument is available.
Consequences of Violation
- Attenuation bias: Estimates of coefficients are biased towards zero.
- Loss of statistical power.
Remedies
- Improve measurement accuracy.
- Use instrumental variable regression.
Note: the two-stage least squares estimator with this bracketed formula syntax comes from the linearmodels package rather than statsmodels itself.

```python
# Example: Instrumental variables (2SLS) with linearmodels
# (df is assumed to hold columns y, x1, endogenous x2, and instrument z)
from linearmodels.iv import IV2SLS

iv_model = IV2SLS.from_formula('y ~ 1 + x1 + [x2 ~ z]', data=df).fit()
print(iv_model.summary)
```
Summary Table: Linear Regression Assumptions and Violations
| Assumption | Check | Violation Impact | Remedies |
|---|---|---|---|
| Linearity | Residual plots, CR plots | Biased coefficients, poor fit | Transformations, add polynomial terms, nonlinear models |
| Independence | Durbin-Watson, residual plots | Underestimated SE, spurious significance | Time series models, cluster-robust SE |
| Homoscedasticity | Residuals vs. fits, BP/White tests | Inefficient estimates, incorrect SE | Transform y, weighted regression, robust SE |
| Normality | Histogram, Q-Q plot, normality tests | Invalid inference (small n) | Transform y, bootstrap, robust regression |
| No Multicollinearity | Correlation, VIF | Unstable estimates, inflated SE | Remove/merge predictors, PCA, regularization |
| No Measurement Error | Multiple sources, IV regression | Attenuation bias, loss of power | Improve measurement, use IV |
Practical Examples: Detecting and Addressing Violations
1. Visual Inspection of Residuals
Plotting residuals is the first and most intuitive diagnostic tool for multiple assumptions:
```python
import matplotlib.pyplot as plt

plt.scatter(model.fittedvalues, model.resid)
plt.axhline(0, color='red', linestyle='--')
plt.xlabel('Fitted values')
plt.ylabel('Residuals')
plt.title('Residuals vs Fitted Values')
plt.show()
```
Interpretation:
- No pattern: Likely linearity and homoscedasticity met.
- Funnel shape: Heteroscedasticity likely present (variance of errors increases/decreases with fitted values).
- Curvature: Linearity assumption is likely violated.
- Clusters or serial patterns: Independence assumption may be violated (especially in time series data).
2. Checking for Multicollinearity
Multicollinearity can be subtle, especially in high-dimensional data. The Variance Inflation Factor (VIF) quantifies how much the variance of a given regression coefficient is inflated due to multicollinearity.
```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Assuming X is a DataFrame with predictors plus constant
vif_data = pd.DataFrame()
vif_data["feature"] = X.columns
vif_data["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
print(vif_data)
```
Interpretation:
- VIF < 5: Generally acceptable.
- VIF > 10: Indicates serious multicollinearity; consider removing or combining predictors.
3. Testing Homoscedasticity
The Breusch-Pagan test is a formal statistical test for heteroscedasticity:
```python
from statsmodels.stats.diagnostic import het_breuschpagan

# model is a fitted statsmodels regression object
test_stat, p_value, _, _ = het_breuschpagan(model.resid, model.model.exog)
print(f"Breusch-Pagan p-value: {p_value}")
```
A small p-value (e.g., < 0.05) suggests evidence of heteroscedasticity.
4. Testing Normality of Residuals
Use a Q-Q plot or Shapiro-Wilk test to assess normality:
```python
import scipy.stats as stats
import matplotlib.pyplot as plt

# Q-Q plot
stats.probplot(model.resid, dist="norm", plot=plt)
plt.show()

# Shapiro-Wilk test
stat, p = stats.shapiro(model.resid)
print(f'Shapiro-Wilk p-value: {p}')
```
A small p-value (e.g., < 0.05) from the Shapiro-Wilk test suggests non-normality.
5. Checking for Independence (Autocorrelation)
The Durbin-Watson statistic is widely used for detecting autocorrelation in time series data:
```python
from statsmodels.stats.stattools import durbin_watson

dw_stat = durbin_watson(model.resid)
print(f"Durbin-Watson statistic: {dw_stat}")
```
Durbin-Watson values range from 0 to 4:
- ~2: No autocorrelation
- <2: Positive autocorrelation
- >2: Negative autocorrelation
6. Remedies for Common Violations
- Nonlinearity: Add polynomial or interaction terms, use non-linear models.
- Heteroscedasticity: Transform the dependent variable (e.g., log or sqrt), use robust standard errors, or weighted least squares.
- Multicollinearity: Remove or combine correlated predictors, use regularization techniques (Ridge/Lasso), or principal component analysis (PCA).
- Non-normality: Transform the outcome, use bootstrapping or robust regression, especially for small samples.
- Autocorrelation: Use time series models that account for autocorrelation, such as ARIMA or add lagged variables.
- Measurement Error: Use instrumental variables or improve measurement precision.
Real-World Examples of Assumption Violations in Quantitative Research
Example 1: Stock Return Prediction and Heteroscedasticity
In finance, modeling stock returns often results in heteroscedastic residuals—volatility tends to cluster in time. Standard OLS regression may underestimate risk in such cases. Quantitative researchers may use GARCH models or robust standard errors to address this.
Example 2: Multicollinearity in Economic Indicators
Suppose you are modeling economic growth using unemployment rate, inflation, and consumer confidence index. These predictors can be highly correlated, leading to inflated standard errors and unstable coefficient estimates. One solution is to use Principal Component Regression (PCR) to reduce dimensionality and remove multicollinearity.
Example 3: Nonlinearity in Option Pricing
The relationship between option price and underlying asset price is nonlinear (Black-Scholes model). Using a linear model without transformation will yield poor results. Transforming variables or using nonlinear regression is essential for accurate modeling.
Example 4: Autocorrelated Errors in High-Frequency Trading
In high-frequency trading, price movements are often autocorrelated over very short intervals. Using OLS regression without accounting for this can mislead risk estimates and trading signals. Time series models or including lagged variables often remedy this violation.
Why Understanding Assumptions Is Critical for SIG Quantitative Researcher Interns
SIG and similar firms operate in fast-paced, data-rich environments where model accuracy is paramount. As a Quantitative Researcher Intern, you will often build, test, and deploy statistical models to inform trading strategies, risk management, and portfolio optimization.
Failing to check model assumptions can lead to:
- False confidence in model predictions.
- Misleading risk assessments and financial losses.
- Poor generalization to new data.
Conversely, a deep understanding of assumptions and diagnostics demonstrates quantitative rigor—a quality SIG values highly.
Interview Tips: How to Discuss Linear Regression Assumptions at SIG
- Be Comprehensive: List all key assumptions, not just linearity and normality.
- Explain Why Each Matters: Relate each assumption to model reliability and inference.
- Discuss Detection: Reference both graphical (plots) and statistical tests.
- Offer Remedies: Show you know how to address violations in practice.
- Relate to Finance: Use examples relevant to trading, time series, or financial modeling.
Sample answer excerpt:
"In linear regression, the key assumptions are linearity, independence, homoscedasticity, normality of errors, no multicollinearity, and accurate measurement of predictors. Violations can bias estimates or lead to unreliable inference. For example, if residuals are heteroscedastic, I'd check residual plots and perform a Breusch-Pagan test, then remedy with robust standard errors or a log transformation. In financial time series, independence is often violated, so I'd consider time series models. Diagnosing and addressing these issues ensures model robustness and reliability in trading applications."
Frequently Asked Questions (FAQ) on Linear Regression Assumptions
| Question | Answer |
|---|---|
| Do all assumptions need to be perfectly met? | No. Some violations (e.g., normality with large samples) are less critical due to the Central Limit Theorem. However, severe violations of linearity, independence, or multicollinearity can invalidate results. |
| Is linear regression robust to outliers? | No. Outliers can distort parameter estimates and residuals. Consider robust regression methods or carefully inspect for outliers. |
| How important is normality of errors? | Primarily important for inference (p-values, confidence intervals), especially in small samples. OLS estimates remain unbiased even if normality is violated. |
| What is the most common violation in financial data? | Heteroscedasticity and autocorrelated errors are common, due to volatility clustering and temporal dependencies in financial time series. |
| How to handle multicollinearity in high-dimensional finance data? | Use regularization (Ridge/Lasso), principal component regression, or remove/reduce correlated predictors. |
Conclusion: Mastering Linear Regression Assumptions for SIG Interviews and Beyond
A strong grasp of linear regression assumptions and the consequences of their violation is essential for any aspiring SIG Quantitative Researcher Intern. These concepts form the basis for sound model-building, accurate inference, and robust quantitative analysis in finance and beyond. In your interviews and practical work, be ready to discuss not just what the assumptions are, but why they matter and how you would diagnose and address violations with both statistical rigor and practical solutions.
By mastering these principles, you demonstrate to SIG—and any data-driven organization—that you are prepared to build reliable, interpretable, and actionable models in the complex world of quantitative finance.
Further Reading and Resources
- Assumptions of Linear Regression (Statistics Solutions)
- Scikit-learn: Linear Models
- Statsmodels: Regression Analysis
- Ordinary Least Squares (Wikipedia)
For SIG Quantitative Researcher Intern interviews, being able to articulate and apply your knowledge of linear regression assumptions will set you apart as a thoughtful, analytical, and practical candidate. Good luck!
