Data Scientist Interview Question - Amazon
A common interview question for data scientist roles concerns the different correlation measures and when to apply each.
Question: Explain the correlation measures used for different types of variables: continuous, nominal, and ordinal (nominal and ordinal both being categorical).
Correlation Measures Between Different Types of Variables
Correlation quantifies the strength and direction of the association between two variables. The appropriate measure depends on the types of variables being paired:
- Continuous vs. Continuous
- Continuous vs. Categorical (binary or multi-group)
- Categorical vs. Categorical (nominal)
- Ordinal vs. Ordinal
- Continuous vs. Ordinal
1. Continuous vs. Continuous Variables
Use: Pearson, Spearman, Kendall
Method | When to Use | Assumptions |
---|---|---|
Pearson Correlation (r) | When the relationship is linear and there are no extreme outliers | Measures linear association only; approximate bivariate normality matters for the usual significance test, not for the coefficient itself |
Spearman’s Rank Correlation (ρ) | When the data are skewed, contain outliers, or follow a monotonic but not necessarily linear relationship | Assumes monotonicity; no normality assumption |
Kendall’s Tau (τ) | When the dataset is small or has many tied ranks | Assumes monotonicity; the tau-b variant corrects for ties |
Example
- Pearson: Relationship between height and weight
- Spearman: Relationship between income and happiness (monotonic but not linear)
- Kendall: Small sample ranking of customer satisfaction and product rating
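The difference between the three coefficients is easiest to see on data that is perfectly monotonic but not linear. The following is a minimal sketch in plain Python, written out by hand to expose the formulas; the data and the helper names (`pearson`, `average_ranks`, `spearman`, `kendall_tau_b`) are illustrative assumptions. In practice, `scipy.stats.pearsonr`, `scipy.stats.spearmanr`, and `scipy.stats.kendalltau` compute the same quantities and also return p-values.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson's r: covariance scaled by both standard deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: Pearson's r computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def kendall_tau_b(x, y):
    """Kendall's tau-b: concordant minus discordant pairs, adjusted for ties."""
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied on both variables: excluded from every count
        elif dx == 0:
            ties_x += 1
        elif dy == 0:
            ties_y += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    pairs = concordant + discordant
    return (concordant - discordant) / sqrt((pairs + ties_x) * (pairs + ties_y))


# Hypothetical data: y grows with x, but quadratically rather than linearly.
x = [1, 2, 3, 4, 5]
y = [v ** 2 for v in x]

r = pearson(x, y)
rho = spearman(x, y)
tau = kendall_tau_b(x, y)
print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}, Kendall tau-b = {tau:.3f}")
```

Both rank-based coefficients reach 1 because the ordering of `y` matches the ordering of `x` exactly, while Pearson's r stays below 1 because the points do not fall on a straight line.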
2. Continuous vs. Categorical Variables
Use: ANOVA, t-test, Point-Biserial, Rank-Biserial
Method | When to Use | Assumptions |
---|---|---|
t-test | Comparing the mean of a continuous variable between two groups (binary categorical variable) | Approximate normality within each group; equal variances (otherwise use Welch's t-test) |
ANOVA (Analysis of Variance) | Comparing the mean of a continuous variable across three or more categorical groups | Normality within groups, equal variances across groups, independent observations |
Point-Biserial Correlation | When a continuous variable is correlated with a binary categorical variable (e.g., Male/Female vs. Salary) | Mathematically identical to Pearson's r with the binary variable coded 0/1 |
Rank-Biserial Correlation | When a binary variable is paired with an ordinal or non-normal continuous variable | Effect size associated with the Mann-Whitney U test; no normality assumption |
Example
- t-test: Comparing salaries between males and females
- ANOVA: Comparing test scores between different education levels
- Point-Biserial: Relationship between gender (0/1) and height
- Rank-Biserial: Relationship between pass/fail (0/1) and test scores
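The t-test and ANOVA are significance tests, while the biserial coefficients are strength measures, and the two views are closely linked. The sketch below, in plain Python with made-up data, computes the point-biserial and rank-biserial correlations, recovers the two-sample t statistic from the point-biserial r, and reports eta squared as an effect size to accompany ANOVA. Eta squared is not mentioned in the table above and is added here as an assumption about what an interviewer might ask next; all function names are illustrative. With SciPy, `scipy.stats.pointbiserialr`, `scipy.stats.ttest_ind`, `scipy.stats.f_oneway`, and `scipy.stats.mannwhitneyu` cover the corresponding tests.

```python
from math import sqrt
from statistics import mean


def point_biserial(binary, values):
    """Point-biserial r: Pearson's r when one variable is coded 0/1."""
    g1 = [v for b, v in zip(binary, values) if b == 1]
    g0 = [v for b, v in zip(binary, values) if b == 0]
    n, n1, n0 = len(values), len(g1), len(g0)
    m = mean(values)
    sd_pop = sqrt(sum((v - m) ** 2 for v in values) / n)  # population SD
    return (mean(g1) - mean(g0)) / sd_pop * sqrt(n1 * n0 / n ** 2)


def t_from_r(r, n):
    """Pooled-variance two-sample t statistic implied by a point-biserial r."""
    return r * sqrt((n - 2) / (1 - r ** 2))


def rank_biserial(binary, values):
    """Share of cross-group pairs won by group 1 minus the share it loses."""
    g1 = [v for b, v in zip(binary, values) if b == 1]
    g0 = [v for b, v in zip(binary, values) if b == 0]
    wins = sum(a > b for a in g1 for b in g0)
    losses = sum(a < b for a in g1 for b in g0)
    return (wins - losses) / (len(g1) * len(g0))


def eta_squared(groups):
    """Between-group sum of squares as a fraction of the total sum of squares."""
    all_values = [v for g in groups.values() for v in g]
    grand = mean(all_values)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups.values())
    ss_total = sum((v - grand) ** 2 for v in all_values)
    return ss_between / ss_total


# Hypothetical data: pass/fail flag (0/1) against test scores.
passed = [0, 0, 0, 1, 1, 1]
scores = [40, 50, 45, 70, 65, 80]

r_pb = point_biserial(passed, scores)
t_stat = t_from_r(r_pb, len(scores))
r_rb = rank_biserial(passed, scores)

# Hypothetical data: test scores grouped by education level.
by_education = {
    "high_school": [60, 65, 70],
    "bachelor": [70, 75, 80],
    "master": [80, 85, 90],
}
eta2 = eta_squared(by_education)

print(f"point-biserial r = {r_pb:.3f}, implied t = {t_stat:.2f}")
print(f"rank-biserial r = {r_rb:.3f}, eta squared = {eta2:.3f}")
```

The rank-biserial value is exactly 1 here because every passing score exceeds every failing score; the point-biserial value is lower because it also depends on how spread out the scores are within each group.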
3. Categorical vs. Categorical Variables
Use: Chi-Square, Cramér’s V, Phi Coefficient, Contingency Coefficient
Method | When to Use | Assumptions |
---|---|---|
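Chi-Square Test | When testing whether two nominal (unordered) variables are independent; it is a significance test, not a measure of strength | Independent observations; expected frequencies of at least 5 in most cells |
Cramér’s V | When measuring strength of association in a contingency table | Works for tables of any size (equals the absolute value of φ for 2x2); ranges from 0 to 1 |
Phi Coefficient (φ) | When both variables are binary (2x2 contingency table) | Equivalent to Pearson's r on the two 0/1-coded variables; ranges from -1 to 1 |
Contingency Coefficient (C) | When both variables are categorical and the tables being compared have the same dimensions | Ranges from 0 to a maximum below 1 that depends on table size: sqrt((k-1)/k), where k = min(rows, columns) |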
Example
- Chi-Square: Relationship between education level and voting preference
- Cramér’s V: Relationship between job role and department
- Phi Coefficient: Relationship between smoking (yes/no) and lung disease (yes/no)
- Contingency Coefficient: Relationship between marital status and region
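All four measures in this section derive from the same chi-square statistic, which the following plain-Python sketch makes explicit. The 2x2 counts are invented for illustration and the function names are assumptions. Note that this sketch applies no continuity correction, whereas `scipy.stats.chi2_contingency` applies Yates' correction to 2x2 tables by default, so its statistic would differ on the same data unless `correction=False` is passed.

```python
from math import sqrt


def chi_square(table):
    """Pearson chi-square statistic and sample size for a table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat, n


def cramers_v(table):
    """Cramér's V: chi-square scaled by n and the smaller table dimension."""
    stat, n = chi_square(table)
    k = min(len(table), len(table[0]))
    return sqrt(stat / (n * (k - 1)))


def contingency_coefficient(table):
    """Pearson's contingency coefficient C = sqrt(chi2 / (chi2 + n))."""
    stat, n = chi_square(table)
    return sqrt(stat / (stat + n))


def phi(table):
    """Signed phi coefficient for a 2x2 table [[a, b], [c, d]]."""
    (a, b), (c, d) = table
    return (a * d - b * c) / sqrt((a + b) * (c + d) * (a + c) * (b + d))


# Hypothetical counts. Rows: smoker yes/no. Columns: lung disease yes/no.
smoking_vs_disease = [[30, 20], [10, 40]]

chi2, n = chi_square(smoking_vs_disease)
v = cramers_v(smoking_vs_disease)
phi_value = phi(smoking_vs_disease)
c = contingency_coefficient(smoking_vs_disease)
c_max = sqrt((2 - 1) / 2)  # upper bound of C for a 2x2 table

print(f"chi2 = {chi2:.2f}, V = {v:.3f}, phi = {phi_value:.3f}, C = {c:.3f} (max {c_max:.3f})")
```

For a 2x2 table, Cramér's V and the absolute value of φ coincide, and C is smaller than both because its ceiling for this table size is about 0.707 rather than 1.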
4. Ordinal vs. Ordinal Variables
Use: Spearman’s Rank, Kendall’s Tau, Gamma, Somers’ D
Method | When to Use | Assumptions |
---|---|---|
Spearman’s Rank Correlation (ρ) | When both variables are ordinal (ranked) | Monotonic relationship, no normality assumption |
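Kendall’s Tau (τ) | When both variables are ordinal and the dataset is small or has many ties | The tau-b variant adjusts for tied ranks |
Goodman-Kruskal Gamma | When both variables are ordinal with few categories and the direction of association is of interest | Symmetric; ignores tied pairs, so it tends to be larger in magnitude than Kendall’s tau; no assumption of linearity |
Somers’ D | When one ordinal variable is treated as the outcome and the other as the predictor | Asymmetric: D(Y given X) generally differs from D(X given Y); pairs tied only on the outcome stay in the denominator |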
Example
- Spearman/Kendall: Relationship between satisfaction rating (1–5) and performance rating (1–5)
- Gamma: Relationship between job experience level (junior, mid, senior) and likelihood of promotion (low, medium, high)
- Somers’ D: Relationship between education level (predictor) and income level (outcome)
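Gamma, Kendall's tau-b, and Somers' D share the same numerator (concordant minus discordant pairs) and differ only in how tied pairs enter the denominator. The sketch below counts the pairs once and derives all three; the ratings are invented and the function names are illustrative assumptions. SciPy offers `scipy.stats.kendalltau` (tau-b by default) and `scipy.stats.somersd` for the same quantities.

```python
from itertools import combinations
from math import sqrt


def pair_counts(x, y):
    """Count concordant, discordant, x-only-tied, and y-only-tied pairs."""
    concordant = discordant = tied_x = tied_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied on both variables: not used by any measure here
        elif dx == 0:
            tied_x += 1
        elif dy == 0:
            tied_y += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    return concordant, discordant, tied_x, tied_y


def ordinal_measures(x, y):
    """Gamma, tau-b, and Somers' D with y as the outcome and x as the predictor."""
    c, d, tx, ty = pair_counts(x, y)
    return {
        # Gamma drops every tied pair from the denominator.
        "gamma": (c - d) / (c + d),
        # Tau-b keeps pairs tied on one variable only, symmetrically.
        "tau_b": (c - d) / sqrt((c + d + tx) * (c + d + ty)),
        # Somers' D(y | x) keeps all pairs that are not tied on the predictor x.
        "somers_d": (c - d) / (c + d + ty),
    }


# Hypothetical ratings on a 1-5 scale.
satisfaction = [1, 2, 2, 3, 4, 5]
performance = [1, 1, 2, 3, 3, 5]

counts = pair_counts(satisfaction, performance)
measures = ordinal_measures(satisfaction, performance)
print(counts)
for name, value in measures.items():
    print(f"{name} = {value:.3f}")
```

With no discordant pairs, gamma reports a perfect association of 1, while tau-b and Somers' D are lower because they penalize the tied pairs that gamma discards.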
5. Continuous vs. Ordinal Variables
Use: Spearman’s Rank, Kendall’s Tau, Point-Biserial
Method | When to Use | Assumptions |
---|---|---|
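Spearman’s Rank Correlation | When the continuous and ordinal variables have a monotonic relationship | No normality assumption |
Kendall’s Tau | When the dataset is small or the ordinal variable has few levels, producing many tied ranks | Monotonicity assumption; use the tau-b variant to adjust for ties |
Point-Biserial Correlation | When the ordinal variable has only two levels (dichotomous) | For inference, assumes the continuous variable is approximately normal within each level |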
Example
- Spearman/Kendall: Relationship between employee experience (years) and performance rating (1–5)
- Point-Biserial: Relationship between exam score and pass/fail status