Data Scientist Interview Question - Amazon

A common question in data scientist interviews concerns the different correlation measures and when to apply each.

Question: Explain the correlation measures between different types of variables - between continuous, nominal, categorical, and ordinal variables.

Correlation Measures Between Different Types of Variables

Correlation measures the strength and direction of the association between two variables. The appropriate method depends on the types of the variables being compared:

  1. Continuous vs. Continuous
  2. Continuous vs. Categorical
  3. Categorical vs. Categorical
  4. Ordinal vs. Ordinal
  5. Continuous vs. Ordinal

1. Continuous vs. Continuous Variables

Use: Pearson, Spearman, Kendall

Method | When to Use | Assumptions
Pearson Correlation (r) | When the relationship is linear and both variables are roughly normally distributed | Assumes linearity; normality matters mainly for significance tests; sensitive to outliers
Spearman’s Rank Correlation (ρ) | When data is not normally distributed or has a monotonic but not necessarily linear relationship | Assumes monotonicity but not normality
Kendall’s Tau (τ) | When you have small datasets or many tied ranks | More robust for small samples, assumes monotonicity

Example

  • Pearson: Relationship between height and weight
  • Spearman: Relationship between income and happiness (monotonic but not linear)
  • Kendall: Small sample ranking of customer satisfaction and product rating
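
To make the difference between the three coefficients concrete, here is a minimal sketch that computes them from their definitions using only the standard library. The function names and the toy data (y as the cube of x) are illustrative assumptions, not part of the original question. In practice, scipy.stats provides pearsonr, spearmanr, and kendalltau, which also return p-values.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson r: summed co-deviations divided by the product of deviation norms."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def average_ranks(values):
    """Ranks from 1 to n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        for k in range(start, end + 1):
            ranks[order[k]] = (start + end) / 2 + 1
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman's rho is Pearson's r applied to the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def kendall_tau_b(x, y):
    """Kendall's tau-b from concordant and discordant pairs, corrected for ties."""
    concordant = discordant = tied_x = tied_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x2 - x1, y2 - y1
        if dx == 0 and dy == 0:
            continue  # tied on both variables: counted nowhere
        if dx == 0:
            tied_x += 1
        elif dy == 0:
            tied_y += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    pairs = concordant + discordant
    return (concordant - discordant) / sqrt((pairs + tied_x) * (pairs + tied_y))


# Hypothetical data: perfectly monotonic, but clearly not linear.
x = [1, 2, 3, 4, 5, 6]
y = [v ** 3 for v in x]

print("Pearson: ", pearson(x, y))
print("Spearman:", spearman(x, y))
print("Kendall: ", kendall_tau_b(x, y))
```

On this data the two rank-based coefficients reach 1 because the ordering of y matches the ordering of x exactly, while Pearson stays below 1 because the points do not fall on a straight line.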

2. Continuous vs. Categorical Variables

Use: ANOVA, t-test, Point-Biserial, Rank-Biserial

Method | When to Use | Assumptions
t-test | Comparing the mean of a continuous variable between two groups (binary categorical variable) | Normality of the continuous variable within each group; the standard (Student's) version also assumes equal variances
ANOVA (Analysis of Variance) | Comparing the mean of a continuous variable across multiple categorical groups | Normality and equal variance within groups
Point-Biserial Correlation | When a continuous variable is correlated with a binary categorical variable (e.g., Male/Female vs. Salary) | Equivalent to Pearson with the binary variable coded 0/1
Rank-Biserial Correlation | When one variable is binary and the continuous variable is non-normal | Rank-based; similar to Spearman but for binary categories

Example

  • t-test: Comparing salaries between males and females
  • ANOVA: Comparing test scores between different education levels
  • Point-Biserial: Relationship between gender (0/1) and height
  • Rank-Biserial: Relationship between pass/fail (0/1) and test scores
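
The measures in this section are closely related, which a small worked example can show. The sketch below uses only the standard library; the function names and the pass/fail scores are hypothetical. It computes the point-biserial and rank-biserial coefficients, then checks that with two groups the t statistic implied by the point-biserial r, squared, equals the one-way ANOVA F statistic. In practice, scipy.stats offers pointbiserialr, ttest_ind, and f_oneway.

```python
from math import sqrt
from statistics import mean, pstdev


def split(group, values):
    """Separate values into the group coded 0 and the group coded 1."""
    zeros = [v for g, v in zip(group, values) if g == 0]
    ones = [v for g, v in zip(group, values) if g == 1]
    return zeros, ones


def point_biserial(group, values):
    """r_pb = (M1 - M0) / s_n * sqrt(p * q), with s_n the population SD of all values."""
    zeros, ones = split(group, values)
    p, q = len(ones) / len(values), len(zeros) / len(values)
    return (mean(ones) - mean(zeros)) / pstdev(values) * sqrt(p * q)


def rank_biserial(group, values):
    """Share of cross-group pairs where group 1 is higher, minus the share where it is lower."""
    zeros, ones = split(group, values)
    wins = sum(a > b for a in ones for b in zeros)
    losses = sum(a < b for a in ones for b in zeros)
    return (wins - losses) / (len(ones) * len(zeros))


def anova_f(groups):
    """One-way ANOVA: returns the F statistic and eta squared (effect size)."""
    all_values = [v for g in groups for v in g]
    grand = mean(all_values)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, ss_between / (ss_between + ss_within)


# Hypothetical data: pass/fail status (0/1) and test scores.
passed = [0, 0, 0, 0, 1, 1, 1, 1]
scores = [40, 45, 50, 55, 60, 65, 70, 75]

r_pb = point_biserial(passed, scores)
r_rb = rank_biserial(passed, scores)
# Pooled-variance t statistic implied by r_pb: t = r * sqrt((n - 2) / (1 - r^2)).
t_stat = r_pb * sqrt((len(scores) - 2) / (1 - r_pb ** 2))
f_stat, eta_sq = anova_f(split(passed, scores))

print("point-biserial:", r_pb)
print("rank-biserial: ", r_rb)
print("t from r:", t_stat, " F:", f_stat, " eta squared:", eta_sq)
```

With exactly two groups, eta squared coincides with the squared point-biserial correlation, so the t-test, ANOVA, and point-biserial views describe the same relationship in different units.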

3. Categorical vs. Categorical Variables

Use: Chi-Square, Cramér’s V, Phi Coefficient, Contingency Coefficient

Method | When to Use | Assumptions
Chi-Square Test | When both variables are nominal (unordered categories) and you want to test for independence | Assumes expected frequencies of at least 5 in most cells
Cramér’s V | When measuring strength of association in a contingency table of any size | Derived from chi-square; ranges from 0 to 1; equals |φ| for 2x2 tables
Phi Coefficient (φ) | When both variables are binary (2x2 contingency table) | Equivalent to Pearson computed on 0/1-coded data
Contingency Coefficient (C) | When both variables are categorical | Ranges from 0 to a maximum below 1 that depends on table size, so values are hard to compare across tables of different sizes

Example

  • Chi-Square: Relationship between education level and voting preference
  • Cramér’s V: Relationship between job role and department
  • Phi Coefficient: Relationship between smoking (yes/no) and lung disease (yes/no)
  • Contingency Coefficient: Relationship between marital status and region
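
All four measures in this section can be derived from the same contingency table. The following is a minimal standard-library sketch with hypothetical counts for the smoking and lung disease example; the function names are illustrative. It computes the chi-square statistic without a continuity correction, whereas scipy.stats.chi2_contingency applies Yates' correction to 2x2 tables by default, so its statistic would differ on the same table.

```python
from math import sqrt


def chi_square(table):
    """Pearson chi-square statistic: sum of (observed - expected)^2 / expected."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for row, row_total in zip(table, row_totals):
        for observed, col_total in zip(row, col_totals):
            expected = row_total * col_total / n
            stat += (observed - expected) ** 2 / expected
    return stat


def cramers_v(table):
    """V = sqrt(chi2 / (n * (k - 1))), with k the smaller of the row and column counts."""
    n = sum(sum(row) for row in table)
    k = min(len(table), len(table[0]))
    return sqrt(chi_square(table) / (n * (k - 1)))


def contingency_coefficient(table):
    """Pearson's C = sqrt(chi2 / (chi2 + n)); its maximum is below 1."""
    n = sum(sum(row) for row in table)
    chi2 = chi_square(table)
    return sqrt(chi2 / (chi2 + n))


def phi(table):
    """Signed phi coefficient for a 2x2 table [[a, b], [c, d]]."""
    (a, b), (c, d) = table
    return (a * d - b * c) / sqrt((a + b) * (c + d) * (a + c) * (b + d))


# Hypothetical counts: rows = smoker yes/no, columns = lung disease yes/no.
table = [[30, 10],
         [10, 30]]

print("chi-square:", chi_square(table))
print("Cramér's V:", cramers_v(table))
print("phi:       ", phi(table))
print("C:         ", contingency_coefficient(table))
```

For a 2x2 table, Cramér's V equals the absolute value of phi, while the contingency coefficient comes out smaller because of its lower ceiling.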

4. Ordinal vs. Ordinal Variables

Use: Spearman’s Rank, Kendall’s Tau, Gamma, Somers’ D

Method | When to Use | Assumptions
Spearman’s Rank Correlation (ρ) | When both variables are ordinal (ranked) | Monotonic relationship, no normality assumption
Kendall’s Tau (τ) | When both variables are ordinal with small datasets | Works well with tied ranks
Goodman-Kruskal Gamma | When both variables are ordinal and you are interested in directional association | No assumption of linearity; ignores tied pairs
Somers' D | Similar to Gamma but asymmetric | Used when one variable is treated as dependent on the other

Example

  • Spearman/Kendall: Relationship between satisfaction rating (1–5) and performance rating (1–5)
  • Gamma: Relationship between job experience level (junior, mid, senior) and likelihood of promotion
  • Somers' D: Relationship between education level and income level
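
Gamma and Somers' D are both built from counts of concordant and discordant pairs and differ only in how they treat ties. The sketch below shows this with the standard library; the function names and the small ordinal dataset are hypothetical. SciPy provides scipy.stats.somersd for Somers' D; the Gamma function here is written by hand from its definition.

```python
from itertools import combinations


def pair_counts(x, y):
    """Count concordant, discordant, x-only-tied, and y-only-tied pairs."""
    concordant = discordant = tied_x = tied_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x2 - x1, y2 - y1
        if dx == 0 and dy == 0:
            continue  # tied on both variables: used by neither measure
        if dx == 0:
            tied_x += 1
        elif dy == 0:
            tied_y += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    return concordant, discordant, tied_x, tied_y


def goodman_kruskal_gamma(x, y):
    """Gamma = (C - D) / (C + D); tied pairs are ignored entirely."""
    c, d, _, _ = pair_counts(x, y)
    return (c - d) / (c + d)


def somers_d(x, y):
    """Somers' D of y given x: pairs tied only on y stay in the denominator."""
    c, d, _, tied_y = pair_counts(x, y)
    return (c - d) / (c + d + tied_y)


# Hypothetical ordinal codes: experience level (1-3) and promotion likelihood (1-3).
experience = [1, 1, 2, 2, 3, 3]
likelihood = [2, 1, 2, 3, 1, 3]

print("pair counts:", pair_counts(experience, likelihood))
print("Gamma:      ", goodman_kruskal_gamma(experience, likelihood))
print("Somers' D:  ", somers_d(experience, likelihood))
```

Because Gamma drops every tied pair, it is never smaller in magnitude than Somers' D on the same data; swapping the arguments of somers_d treats the other variable as the dependent one.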

5. Continuous vs. Ordinal Variables

Use: Spearman’s Rank, Kendall’s Tau, Point-Biserial

Method | When to Use | Assumptions
Spearman’s Rank Correlation | When continuous and ordinal data have a monotonic relationship | No normality assumption
Kendall’s Tau | When dataset is small or has tied ranks | Monotonicity assumption
Point-Biserial Correlation | If ordinal variable is dichotomous (binary) | Assumes normality in continuous variable

Example

  • Spearman/Kendall: Relationship between employee experience (years) and performance rating (1–5)
  • Point-Biserial: Relationship between exam score and pass/fail status