
Quant Interview Question on Regression R-squared
When interviewing for data science or analytics roles, you’ll often encounter questions about regression metrics—especially R-squared (\( R^2 \)). A deceptively simple interview scenario is: You add 12 new features to your linear regression model. R² improves from 0.71 to 0.91. Is your model better? Most candidates confidently answer “yes.” But to impress a seasoned interviewer, you need to dig deeper. Let’s explore why, despite the significant jump in \( R^2 \), this isn’t the whole story—and what you should say instead.
Understanding R-squared in Regression
What is R-squared?
R-squared (\( R^2 \)), or the coefficient of determination, is one of the most widely used metrics to evaluate the fit of a linear regression model. It quantifies the proportion of variance in the dependent variable that is predictable from the independent variables.
Mathematically, \( R^2 \) is defined as:
$$ R^2 = 1 - \frac{SS_{res}}{SS_{tot}} $$
- \( SS_{res} \): The sum of squares of residuals (difference between observed and predicted values).
- \( SS_{tot} \): The total sum of squares (variance of the observed data from the mean).
A higher \( R^2 \) value—closer to 1—suggests a better fit. So it’s tempting to equate a jump from 0.71 to 0.91 with substantial model improvement. But is that really true?
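To make the formula concrete, here is a minimal sketch computing \( R^2 \) by hand from the two sums of squares (the observed and predicted values below are invented purely for illustration):

```python
import numpy as np

# Toy observed values and model predictions (illustrative numbers only)
y_obs = np.array([3.0, 5.0, 7.0, 9.0, 11.0])
y_pred = np.array([2.8, 5.3, 6.9, 9.4, 10.6])

ss_res = np.sum((y_obs - y_pred) ** 2)        # residual sum of squares
ss_tot = np.sum((y_obs - y_obs.mean()) ** 2)  # total sum of squares
r2 = 1 - ss_res / ss_tot
print(f"R² = {r2:.4f}")
```

Computing it this way makes the key dependence obvious: \( SS_{tot} \) is fixed by the data, so the only moving part as you change the model is \( SS_{res} \).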
The Fatal Flaw: R² Always Rewards More Features
The fundamental weakness of R-squared is that it can only increase or stay the same when you add more features, regardless of whether those features are meaningful or just random noise.
Why Does This Happen?
Every time you add a new variable (feature) to your regression model, you’re giving your model more flexibility to fit the training data. Even if the new variable is unrelated to the outcome, the model can use it to explain random variation—effectively memorizing quirks of the training data.
Let’s make this concrete:
- Suppose you have a model with one predictor, and you add a column of completely random numbers as a new feature.
- The model will use this random feature to fit the data slightly better, reducing the residual sum of squares (\( SS_{res} \)), and thus increasing \( R^2 \).
- Add 12 such columns, and the effect compounds. \( R^2 \) can jump significantly—even if the new features have zero predictive value.
Key point: The formula for \( R^2 \) does not penalize for model complexity or the addition of irrelevant features.
Equation Recap
Let’s look at the regression equation:
$$ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_p x_p + \epsilon $$
Every new feature (\( x_{p+1} \), \( x_{p+2} \), etc.) increases the model’s ability to explain variance in training data, but not necessarily in unseen data.
The Essay Analogy: Padding with Irrelevance
To grasp the intuition behind why rising \( R^2 \) can be misleading, consider the essay analogy:
- Imagine a student writing an essay assignment that requires a minimum word count.
- The student adds several paragraphs, but they are off-topic and add no value to the core argument.
- The essay is now longer and appears more comprehensive at first glance. But does it actually say anything more of substance?
In regression, adding more features—relevant or not—increases the “word count” of your model. It makes the model look more impressive by the metric of \( R^2 \), but might not deliver any real improvement in predictive power or generalizability.
What Interviewers Really Want: Metrics That Tell the Truth
To answer the interview question like a pro, explain that while \( R^2 \) went up, this is expected behavior due to the metric’s design. The real test is whether the new features genuinely improve model performance. For that, we use metrics that balance fit with complexity and assess performance on unseen data.
1. Adjusted R-squared
Adjusted R-squared (\( \bar{R}^2 \)) modifies the \( R^2 \) formula to penalize the addition of new predictors that don’t improve the model sufficiently. It’s calculated as:
$$ \bar{R}^2 = 1 - \frac{(1 - R^2)\,(n - 1)}{n - p - 1} $$
- \( n \) = number of observations
- \( p \) = number of predictors (features)
Key takeaway: Adjusted \( R^2 \) only increases if the new features improve the model more than would be expected by chance.
Python Example: Comparing R² and Adjusted R²
```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Simulated data
np.random.seed(0)
X = np.random.rand(100, 1)
y = 3 * X.squeeze() + np.random.randn(100) * 0.5

# Add 12 columns of random noise as new features
X_noise = np.random.rand(100, 12)
X_full = np.hstack([X, X_noise])

# Fit models
model_basic = LinearRegression().fit(X, y)
model_full = LinearRegression().fit(X_full, y)

# Compute scores
r2_basic = r2_score(y, model_basic.predict(X))
r2_full = r2_score(y, model_full.predict(X_full))
n, p_basic, p_full = len(y), X.shape[1], X_full.shape[1]
adj_r2_basic = 1 - (1 - r2_basic) * (n - 1) / (n - p_basic - 1)
adj_r2_full = 1 - (1 - r2_full) * (n - 1) / (n - p_full - 1)

print("R² basic:", r2_basic)
print("Adjusted R² basic:", adj_r2_basic)
print("R² with noise:", r2_full)
print("Adjusted R² with noise:", adj_r2_full)
```
Notice how Adjusted \( R^2 \) may decrease or remain unchanged when you add irrelevant features, even as \( R^2 \) rises.
2. AIC and BIC: Balancing Fit and Complexity
AIC (Akaike Information Criterion) and BIC (Bayesian Information Criterion) are model selection metrics that explicitly penalize model complexity.
| Metric | Formula | Interpretation |
|---|---|---|
| AIC | \( 2k - 2\ln(L) \) | Lower is better; penalizes number of parameters (\( k \)) |
| BIC | \( \ln(n)k - 2\ln(L) \) | Stronger penalty for complexity; lower is better |
- \( k \): Number of parameters (features)
- \( n \): Number of observations
- \( L \): Likelihood of the model
Key point: Both AIC and BIC will only decrease if the new features meaningfully improve model fit, after accounting for their added complexity.
Python Example: Calculating AIC and BIC
```python
import statsmodels.api as sm

# Add a constant for the intercept
X_basic = sm.add_constant(X)
X_full_sm = sm.add_constant(X_full)

model_basic = sm.OLS(y, X_basic).fit()
model_full = sm.OLS(y, X_full_sm).fit()

print("AIC (basic):", model_basic.aic)
print("BIC (basic):", model_basic.bic)
print("AIC (full):", model_full.aic)
print("BIC (full):", model_full.bic)
```
If AIC and BIC increase after adding features, the gain in fit doesn't justify the added complexity, despite the higher R-squared.
3. Validation R-squared: Does Your Model Generalize?
The most important test: does your model’s performance improve on unseen data? This is where validation R-squared comes in.
- Training R²: Computed on the data used to fit the model.
- Validation/Test R²: Computed on new data not seen during training.
If your training \( R^2 \) jumps from 0.71 to 0.91 after adding 12 features, but your validation \( R^2 \) stays at 0.71 (or even drops), you’ve simply overfit the training data—memorizing noise instead of learning real patterns.
Python Example: Train vs. Validation R²
```python
from sklearn.model_selection import train_test_split

# Split data
Xtr, Xval, ytr, yval = train_test_split(X_full, y, test_size=0.3, random_state=42)

# Fit model
model = LinearRegression().fit(Xtr, ytr)

# R² on training and validation
r2_train = r2_score(ytr, model.predict(Xtr))
r2_val = r2_score(yval, model.predict(Xval))

print("Training R²:", r2_train)
print("Validation R²:", r2_val)
```
A large gap between training and validation \( R^2 \) is a red flag for overfitting.
Best Interview Answer: "Is your model better?"
Here’s how you might combine the above insights into a stellar answer:
“While the R-squared value increased from 0.71 to 0.91 when I added 12 new features, this does not necessarily mean the model is better. R-squared always increases or stays the same when more features are added, even if they’re just random noise. This is a known limitation because R-squared doesn’t penalize for model complexity.
To truly assess model improvement, I would check metrics that account for complexity, such as Adjusted R-squared, AIC, and BIC. Most importantly, I would look at the model’s performance on validation or test data. If validation R-squared doesn’t improve, then the model is likely just overfitting the training data, and the new features aren’t adding real predictive value.”
Why Does R-squared Behave This Way? (Mathematical Insight)
Let’s briefly see why R-squared can only stay constant or increase when adding features.
- By adding more variables, your model always has at least as much flexibility to fit the data. The residual sum of squares (\( SS_{res} \)) can never increase.
- Since \( SS_{res} \) decreases (or stays the same), and \( SS_{tot} \) is constant, the value of \( 1 - \frac{SS_{res}}{SS_{tot}} \) can never decrease.
This is why, even if you add features that are just random noise, \( R^2 \) will not go down.
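You can check this monotonicity numerically: refitting after appending even a pure-noise column never increases the residual sum of squares. A quick sketch on simulated data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.random((100, 1))
y = 3 * X.ravel() + rng.normal(0, 0.5, 100)

def rss(model, X, y):
    """Residual sum of squares for a fitted model."""
    resid = y - model.predict(X)
    return np.sum(resid ** 2)

rss_before = rss(LinearRegression().fit(X, y), X, y)

# Append a pure-noise column and refit the larger (nested) model
X_plus = np.hstack([X, rng.random((100, 1))])
rss_after = rss(LinearRegression().fit(X_plus, y), X_plus, y)

print("RSS before:", rss_before, " RSS after adding noise column:", rss_after)
```

The smaller model is always available to the larger one (set the new coefficient to zero), so the larger model's RSS can only shrink or stay the same.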
Visualizing Overfitting: R-squared vs. Adjusted R-squared
A powerful way to understand overfitting is to plot both \( R^2 \) and \( \bar{R}^2 \) as you add more features.
Example: Synthetic Regression
```python
import matplotlib.pyplot as plt

r2s, adjr2s = [], []
features_range = range(1, 21)
for p in features_range:
    # True predictor plus (p - 1) pure-noise columns
    X_p = np.hstack([X, np.random.rand(100, p - 1)])
    model = LinearRegression().fit(X_p, y)
    r2 = r2_score(y, model.predict(X_p))
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
    r2s.append(r2)
    adjr2s.append(adj_r2)

plt.plot(features_range, r2s, label='R²')
plt.plot(features_range, adjr2s, label='Adjusted R²')
plt.xlabel('Number of features')
plt.ylabel('Score')
plt.legend()
plt.title('R² vs Adjusted R² as Features Increase')
plt.show()
```
Typically, you'll see \( R^2 \) creep steadily upward, while Adjusted \( R^2 \) stays flat or declines as noise features pile up.
Common Interview Mistakes: What Not to Say
- “R² went up, so the model is definitely better.” — This ignores overfitting and the lack of complexity penalty in \( R^2 \).
- “The model explains 91% of the variance now, so I trust it.” — Training variance explained often overstates true predictive performance.
- “Adding features always improves a regression model.” — Only in terms of fitting the training data, not in terms of generalization.
Tip: Always bring up validation metrics and overfitting in your answers.
Follow-Up Questions (and How to Answer)
Q: What if Adjusted R² goes down?
A decrease in Adjusted R² after adding new features indicates that the added variables are not truly contributing to predictive power. In fact, they are likely introducing noise or redundancy, which harms the model’s ability to generalize. This is Adjusted R²’s main advantage over plain R²—it signals when a model is being overfit through unnecessary complexity.
Q: Can you explain why AIC and BIC are preferred for model selection?
AIC and BIC are considered superior for model selection because they weigh two competing factors: the goodness of fit and the simplicity (parsimony) of the model. While both reward models that fit the data well, they also apply a penalty for the number of parameters (features). BIC applies a heavier penalty than AIC, especially as sample size increases. This discourages overfitting and encourages selecting models that are both accurate and simple. In practice, the model with the lowest AIC or BIC is preferred.
Q: What if validation R² decreases after adding features?
If your validation R² decreases after adding new features—even as training R² increases—it’s a clear sign of overfitting. The model has started to memorize the training data, including its random noise, but this does not translate to better performance on unseen data. Always prioritize validation metrics over training metrics when evaluating model improvements.
Practical Guidelines: Avoiding the R² Trap in Real Projects
Understanding R² is crucial beyond interviews—it’s essential for real-world data science. Here are some practical tips to ensure your regression modeling is robust:
- Always check Adjusted R², AIC, and BIC alongside R². Don’t rely solely on R² to judge improvement.
- Use cross-validation. Instead of a single validation split, use k-fold cross-validation to get a better estimate of your model’s out-of-sample performance.
- Perform feature selection. Use methods like recursive feature elimination, LASSO regression, or tree-based feature importances to remove irrelevant predictors.
- Visualize residuals. After adding features, check residual plots for patterns that might indicate overfitting or model misspecification.
- Interpretability matters. More features make models harder to interpret. Only include variables that make business or scientific sense.
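The cross-validation tip above can be sketched with scikit-learn's cross_val_score, which defaults to R² scoring for regressors (synthetic data for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# One true predictor, plus 20 columns of pure noise
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 1))
y = 2 * x.ravel() + rng.normal(0, 0.5, 200)
X_noisy = np.hstack([x, rng.normal(size=(200, 20))])

# 5-fold cross-validated R² for each model (scoring defaults to R²)
cv_true = cross_val_score(LinearRegression(), x, y, cv=5)
cv_noisy = cross_val_score(LinearRegression(), X_noisy, y, cv=5)

print("Mean CV R², true feature only:", cv_true.mean())
print("Mean CV R², with 20 noise features:", cv_noisy.mean())
```

Because each fold's score comes from held-out data, the noise-laden model can no longer profit from memorizing quirks of the training set; its cross-validated R² will typically be no better, and often worse, than the simpler model's.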
Deep Dive: R² and Noise—A Demonstration
Let’s see a simple simulation to illustrate just how easily R² is fooled by noise features:
```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

np.random.seed(42)
n, noise_features = 200, 20

# True signal: y depends on x
x = np.random.randn(n, 1)
y = 2 * x.squeeze() + np.random.randn(n) * 0.5

# Add pure noise features
noise = np.random.randn(n, noise_features)
X_all = np.hstack([x, noise])

# Fit model with true feature only
model_true = LinearRegression().fit(x, y)
r2_true = r2_score(y, model_true.predict(x))

# Fit model with true + noise
model_noise = LinearRegression().fit(X_all, y)
r2_noise = r2_score(y, model_noise.predict(X_all))

print("R² with true variable only:", r2_true)
print("R² with 20 noise features added:", r2_noise)
```
You’ll find that R² increases when you add the noise features, even though they offer no real predictive value. This is why experts caution against using R² as the sole indicator of model quality.
Summary Table: Regression Metrics for Model Comparison
| Metric | How It Works | Penalizes Complexity? | Best Use |
|---|---|---|---|
| R² | Measures fit to training data. | No | Quick check, but not for model selection. |
| Adjusted R² | Modifies R² to penalize excess features. | Yes | Comparing nested models; feature selection. |
| AIC | Balances fit and complexity using likelihood. | Yes | Model selection in large feature sets. |
| BIC | Like AIC, but stronger penalty for large n. | Yes | More conservative model selection. |
| Validation/Test R² | Measures fit on unseen data. | Indirectly (via split data) | Assessing generalization/overfitting. |
Advanced Interview Discussion: Nested Models and Feature Selection
For advanced roles, you might be asked about nested models or feature selection strategies. Here’s how these concepts relate:
- Nested Models: A model is “nested” inside another if it contains a subset of the other model’s predictors. Adjusted R² is designed for comparing nested models; AIC and BIC can compare nested and non-nested models alike, as long as they are fit to the same data.
- Feature Selection: Rather than adding all possible features, use data-driven or domain-driven methods to retain only those that add value. Penalized regression methods like LASSO automatically drive the coefficients of irrelevant features to zero.
```python
from sklearn.linear_model import LassoCV

# Use LASSO for feature selection
lasso = LassoCV(cv=5).fit(X_all, y)
selected = np.where(lasso.coef_ != 0)[0]
print(f"Selected features: {selected}")
```
Key Takeaways for Your Interview (and Beyond)
- Never judge a model by R² alone. High R² can be achieved by overfitting, especially when adding many features.
- Mention Adjusted R², AIC, BIC, and validation R² in your answers. These metrics reflect true model improvement.
- Explain the limitations. Articulate why R² increases with more features—even if they’re random noise.
- Emphasize generalization. Always validate model performance on unseen data.
- Use analogies. The essay analogy helps interviewers see you understand the concept beyond formulas.
Sample Model Comparison Workflow
```python
# 1. Split data
Xtr, Xval, ytr, yval = train_test_split(X_all, y, test_size=0.3, random_state=0)

# 2. Fit models: true feature only vs. all features
model_trivial = LinearRegression().fit(Xtr[:, :1], ytr)
model_full = LinearRegression().fit(Xtr, ytr)

# 3. Compute metrics
def adjusted_r2(r2, n, p):
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

metrics = []
for mdl, cols in [(model_trivial, 1), (model_full, Xtr.shape[1])]:
    r2_tr = r2_score(ytr, mdl.predict(Xtr[:, :cols]))
    r2_val = r2_score(yval, mdl.predict(Xval[:, :cols]))
    adj_r2_tr = adjusted_r2(r2_tr, len(ytr), cols)
    adj_r2_val = adjusted_r2(r2_val, len(yval), cols)
    metrics.append((r2_tr, adj_r2_tr, r2_val, adj_r2_val))

print("Training R², Adjusted R², Validation R², Validation Adjusted R²")
for row in metrics:
    print(row)
```
This approach helps you compare not just fit but also generalizability and penalized metrics—essential for serious model evaluation.
Conclusion: How to Impress in Your Data Science Interview
When faced with the classic regression interview question about a dramatic R² increase after adding features, don’t fall into the trap of equating that with a better model. Instead:
- Show you understand the mathematical limitation of R²—it cannot decrease as you add more variables.
- Use the essay analogy to illustrate your point.
- Immediately bring up Adjusted R², AIC, BIC, and validation R² to demonstrate you know how to truly assess model quality.
- Stress the importance of generalization and avoiding overfitting.
By framing your answer this way, you’ll stand out as a candidate who understands not just the mechanics of regression, but the deeper principles of sound statistical modeling—exactly what top interviewers are looking for.
Remember: In data science, the best models aren’t the ones that fit the training data most closely—they’re the ones that make the most accurate predictions on data the model has never seen.
