blog-cover-image

Meta Data Scientist Interview Questions with SQL and Python Examples

In this article, we’ll solve and explain two common Meta Data Scientist interview questions: a complex SQL problem involving multiple tables, and a Python data manipulation scenario involving CSV parsing. Each solution includes a detailed breakdown of the concepts and reasoning, ensuring you’re well-prepared for your next big interview.


Meta Data Scientist Interview Question


Q1. SQL: Unique Employees per Project per Month

Question: Given the following tables:

  • employees(id, unixname, team, role, days_since_started)
  • projects(id, name, ...)
  • commits(id, file_path, proj_id, auth_id, timestamp)

Find the number of unique employees per project per month.

Step 1: Understand the Tables and Relationships

  • employees: Each row represents an employee, uniquely identified by id.
  • projects: Each row represents a project, uniquely identified by id.
  • commits: Each row represents a code commit. proj_id is the project, auth_id is the employee (author), timestamp is when the commit was made.

We want, for each project and each month, to count how many unique employees made commits.

Step 2: Identify Relevant Columns and Aggregations

We need to:

  • Group commits by proj_id (project) and by month.
  • For each group, count distinct auth_id (unique employees).

To extract the month, we’ll need to parse the timestamp (assume it's in a standard SQL datetime format).

Step 3: Write the SQL Query


SELECT
    c.proj_id,
    p.name AS project_name,
    DATE_TRUNC('month', c.timestamp) AS month,
    COUNT(DISTINCT c.auth_id) AS unique_employee_count
FROM
    commits c
    JOIN projects p ON c.proj_id = p.id
GROUP BY
    c.proj_id,
    p.name,
    DATE_TRUNC('month', c.timestamp)
ORDER BY
    c.proj_id,
    month;

Step 4: Detailed Explanation

  • DATE_TRUNC('month', c.timestamp): Normalizes each timestamp to the first day of its month. This allows us to group all commits from the same month together. You may also use EXTRACT(YEAR FROM c.timestamp) and EXTRACT(MONTH FROM c.timestamp) in some SQL dialects.
  • COUNT(DISTINCT c.auth_id): Counts the number of unique employees who made commits for that project in that month.
  • We join projects to get the human-readable project name.
  • Grouping on proj_id, project_name, and month yields the desired breakdown.

Step 5: SQL Dialect Considerations

Different SQL engines (PostgreSQL, MySQL, BigQuery, etc.) handle month extraction slightly differently. Here are some alternatives:

  • PostgreSQL: DATE_TRUNC('month', timestamp)
  • MySQL: DATE_FORMAT(timestamp, '%Y-%m-01')
  • BigQuery: EXTRACT(YEAR FROM timestamp), EXTRACT(MONTH FROM timestamp)

Step 6: Sample Output

proj_id project_name month unique_employee_count
1 Messenger 2024-02-01 5
1 Messenger 2024-03-01 7
2 WhatsApp 2024-02-01 3

Step 7: Mathematical Representation

Let \( P \) be the set of projects, \( M \) the set of months, and \( E \) the set of employees. For each project \( p \) and month \( m \):

\[ \text{UniqueEmployeeCount}_{p,m} = \left| \{ e \in E : \exists c \in \text{commits}, c.proj\_id = p, \text{month}(c.timestamp) = m, c.auth\_id = e \} \right| \]

Step 8: Performance Notes

  • Add indexes on commits.proj_id, commits.timestamp, and commits.auth_id for efficiency.
  • If commits is very large, consider partitioning by proj_id or timestamp.

Q2. Python: Building Employee Lists per Project from a CSV

Question: Imagine you have a CSV file:


john_doe,android,ios,infra
bob_law,is,backend
jane_doe,frontend

Each row contains an employee (first column), followed by the projects they work on. Write a script that reads this CSV and creates a data structure for each project, listing the employees who work on it.

Step 1: Problem Analysis

  • Input: CSV file, rows are employee, project1, project2, ....
  • Output: For each project, a list of employees working on it.

For example, from the above data, the result should be:


{
    "android": ["john_doe"],
    "ios": ["john_doe"],
    "infra": ["john_doe"],
    "is": ["bob_law"],
    "backend": ["bob_law"],
    "frontend": ["jane_doe"]
}

Step 2: Step-by-Step Solution in Python

  1. Read the CSV file line by line.
  2. For each row, extract the employee name (first element) and the list of projects (remaining elements).
  3. For each project, add the employee to the list of employees for that project.

Step 3: Implementation


from collections import defaultdict
import csv

def employees_per_project(csv_filename):
    project_employees = defaultdict(list)
    with open(csv_filename, newline='') as csvfile:
        reader = csv.reader(csvfile)
        for row in reader:
            if not row:
                continue  # skip empty lines
            employee = row[0].strip()
            # Avoid adding empty project names
            for project in row[1:]:
                project = project.strip()
                if project:
                    project_employees[project].append(employee)
    return dict(project_employees)

Let’s break down the code:

  • defaultdict(list): Automatically creates a list for each new project encountered.
  • csv.reader: Safely handles parsing CSV, including commas inside quoted strings.
  • For each row, the first column is the employee name; the rest are project names.
  • Each non-empty project name gets the employee added to its list.

Step 4: Example Usage


# Save the CSV data in 'projects.csv'
result = employees_per_project('projects.csv')
print(result)
# Output:
# {'android': ['john_doe'], 'ios': ['john_doe'], 'infra': ['john_doe'], 'is': ['bob_law'], 'backend': ['bob_law'], 'frontend': ['jane_doe']}

Step 5: Handling Edge Cases

  • Duplicate Employees: If the same employee appears for the same project multiple times, the above code will add them as many times. To ensure unique employees per project, use a set instead of a list:

from collections import defaultdict
import csv

def employees_per_project_unique(csv_filename):
    project_employees = defaultdict(set)
    with open(csv_filename, newline='') as csvfile:
        reader = csv.reader(csvfile)
        for row in reader:
            if not row:
                continue
            employee = row[0].strip()
            for project in row[1:]:
                project = project.strip()
                if project:
                    project_employees[project].add(employee)
    # Convert sets back to lists for output
    return {k: list(v) for k, v in project_employees.items()}

Step 6: Mathematical Representation

Let \( E \) be the set of employees, \( P \) the set of projects. For each project \( p \), define

\[ \text{Employees}(p) = \{ e \in E : e \text{ lists } p \text{ as one of their projects} \} \]

Step 7: Using Pandas for Larger CSVs

For very large files or when you want to leverage DataFrame capabilities, pandas is a good choice.


import pandas as pd

def employees_per_project_pandas(csv_filename):
    df = pd.read_csv(csv_filename, header=None)
    project_employees = {}
    for _, row in df.iterrows():
        employee = str(row[0]).strip()
        for project in row[1:]:
            if pd.isna(project):
                continue
            project = str(project).strip()
            if not project:
                continue
            if project not in project_employees:
                project_employees[project] = []
            project_employees[project].append(employee)
    return project_employees

Step 8: Extending the Data Structure

If you want to store additional information (such as employee roles, teams, etc.), consider using dictionaries of dictionaries:


# project_employees = {
#     "android": [{"name": "john_doe", "role": "devops"}, ...],
#     ...
# }

Step 9: Output as a Table

Project Employees
android john_doe
ios john_doe
infra john_doe
is bob_law
backend bob_law
frontend jane_doe

Conceptual Takeaways

SQL: Group By, Aggregation, and Date Handling

  • Group By: Used to aggregate data per project and per month.
  • COUNT(DISTINCT ...): Counts unique values (here, unique employees per group).
  • Date Truncation: Collapses timestamps into months for grouping.

Python: CSV Parsing, Dictionaries, and Data Structures

  • CSV Parsing: Handled safely using Python’s built-in csv module.
  • Defaultdict: Simplifies code for accumulating lists or sets.
  • Handling Duplicates: Use set for uniqueness.
  • Pandas: Useful for large data and tabular manipulation.

Practice and Next Steps

Mastering these types of interview questions is crucial for success in a Meta Data Scientist interview. Practice writing SQL queries that aggregate and join multiple tables, especially using GROUP BY with date functions and COUNT(DISTINCT ...). In Python, get comfortable with file I/O, dictionary/set manipulations, and using libraries like csv and pandas.

Remember: in interviews, always explain your reasoning step by step, clarify assumptions, and consider edge cases.


Summary

  • SQL Question: Focused on grouping, joining, and unique counting with date normalization.
  • Python Question: Centered on parsing semi-structured data and reversing mappings using dictionaries.
  • Core Concepts: Aggregations, uniqueness, data structures, and scalability for large data.

With these explanations and solutions, you’re better equipped to tackle Meta Data Scientist interview questions confidently and efficiently.

Related Articles