

Top 4 SQL Data Scientist Interview Questions from Meta

Are you preparing for a data scientist interview at Meta (formerly Facebook)? The selection process at Meta is rigorous and competitive, demanding not only strong technical skills but also the ability to solve real-world problems at scale. In this article, we delve into the top 4 data scientist interview questions from Meta—questions that test your skills in SQL, algorithms, and your understanding of data processing. Each question is followed by a detailed answer and thorough explanation, ensuring you’re equipped to ace your interview.



1. SQL Recommendation Query: Suggest Pages Based on Your Friends’ Likes

One of the most common interview scenarios at Meta involves recommendation systems. Here, you are given two tables:

  • friends: Contains pairs of users and their friends.
  • likes: Contains pairs of users and the pages they have liked.

The task: Write an SQL query that recommends new pages to a user, based on pages liked by their friends, excluding pages the user already likes.

Table Definitions

  • friends: user_id, friend_id
  • likes: user_id, page_id

Requirements

  • The recommendation should only include pages that the user has not already liked.
  • Each page recommended should be liked by at least one of the user's friends.

SQL Solution


SELECT
    f.user_id AS user,
    l1.page_id AS recommended_page
FROM
    friends f
    JOIN likes l1 ON f.friend_id = l1.user_id
WHERE
    l1.page_id NOT IN (
        SELECT page_id FROM likes WHERE user_id = f.user_id
    )
GROUP BY
    f.user_id, l1.page_id;

Explanation

  • friends f: We iterate over each user and their friends.
  • likes l1: Join with likes to find all pages liked by each friend.
  • WHERE clause: Excludes pages already liked by the user. The correlated subquery fetches all pages the user has already liked, so only new pages survive.
  • GROUP BY: Groups by user and page, so each recommended page appears only once per user, even if several friends like it.

Optimized Version (using LEFT JOIN and DISTINCT)


SELECT DISTINCT
    f.user_id AS user,
    l.page_id AS recommended_page
FROM
    friends f
    JOIN likes l ON f.friend_id = l.user_id
    LEFT JOIN likes l2 ON l2.user_id = f.user_id AND l2.page_id = l.page_id
WHERE
    l2.page_id IS NULL

Here, the LEFT JOIN ensures we only recommend pages not already liked by the user (where l2.page_id IS NULL).
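The anti-join version is easy to sanity-check on a toy dataset with Python's built-in sqlite3 module. This is a minimal sketch: the table and column names follow the schema above, while the sample users and pages are invented for illustration.

```python
import sqlite3

# In-memory database with the two tables from the problem statement
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE friends (user_id TEXT, friend_id TEXT);
    CREATE TABLE likes (user_id TEXT, page_id TEXT);
    INSERT INTO friends VALUES ('alice', 'bob'), ('alice', 'carol');
    INSERT INTO likes VALUES
        ('alice', 'p1'),            -- alice already likes p1
        ('bob', 'p1'), ('bob', 'p2'),
        ('carol', 'p3');
""")

# Anti-join: pages liked by friends, minus pages the user already likes
rows = conn.execute("""
    SELECT DISTINCT f.user_id, l.page_id
    FROM friends f
    JOIN likes l ON f.friend_id = l.user_id
    LEFT JOIN likes l2 ON l2.user_id = f.user_id AND l2.page_id = l.page_id
    WHERE l2.page_id IS NULL
    ORDER BY l.page_id
""").fetchall()

print(rows)  # [('alice', 'p2'), ('alice', 'p3')] -- p1 is excluded
```

Page p1 is liked by bob but dropped from the recommendations because alice already likes it, which is exactly the anti-join at work.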

Concepts Involved

  • Joins (INNER JOIN, LEFT JOIN)
  • Subqueries
  • Set difference in SQL
  • Recommendation logic

2. Algorithm: Find the Removed Element between Two Lists

This is a popular Meta interview question that tests your understanding of algorithms, data structures, and computational efficiency.

Problem Statement: Given a list A of objects and another list B that contains the same elements as A, possibly in a different order, except that one element has been removed, find that removed element.

Example


A = [5, 7, 9, 3, 2]
B = [7, 3, 5, 2]
# The removed element is 9

Solution 1: Using Hash Maps

Count the frequency of each element in B, then scan A and decrement the counts. The first element of A whose count is already exhausted is the removed one.


from collections import Counter

def find_removed(A, B):
    count = Counter(B)
    for num in A:
        if count[num]:
            count[num] -= 1
        else:
            return num

Solution 2: Using Sum

If the elements are numbers, the difference between the sums of the two lists gives the removed element. This works even with duplicates.


def find_removed(A, B):
    return sum(A) - sum(B)

Solution 3: Using XOR (Bit Manipulation)

For integer lists (even with duplicates), XORing all elements in both lists will cancel out the common elements, leaving the removed one.


def find_removed(A, B):
    result = 0
    for num in A + B:
        result ^= num
    return result
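As a quick check, all three solutions can be run side by side on the example lists. They are renamed here so they can coexist in one file; each returns 9 on the sample data.

```python
from collections import Counter

A = [5, 7, 9, 3, 2]
B = [7, 3, 5, 2]

def by_counter(A, B):
    # Decrement B's counts while scanning A; the first exhausted count is the answer
    count = Counter(B)
    for num in A:
        if count[num]:
            count[num] -= 1
        else:
            return num

def by_sum(A, B):
    # Numeric lists only: the difference of the sums is the removed element
    return sum(A) - sum(B)

def by_xor(A, B):
    # Integer lists only: shared elements cancel out pairwise under XOR
    result = 0
    for num in A + B:
        result ^= num
    return result

print(by_counter(A, B), by_sum(A, B), by_xor(A, B))  # 9 9 9
```

All three run in O(n) time; the sum and XOR versions use O(1) extra space, while the Counter version uses O(n) but works for arbitrary hashable objects.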

Mathematical Explanation

If the lists are:

\( A = [a_1, a_2, \ldots, a_n] \), \( B = [b_1, b_2, \ldots, b_{n-1}] \)

Then,

\( \text{Removed element} = \sum_{i=1}^{n} a_i - \sum_{j=1}^{n-1} b_j \)

Concepts Involved

  • Hash tables / dictionaries
  • Set operations
  • Bitwise manipulation
  • Time and space complexity analysis

3. SQL: Frequency Table with Two Joins, GROUP BY, ORDER BY, and NULL Handling

Another common Meta data science interview question focuses on your ability to write complex SQL queries, especially involving multiple joins and aggregations.

Problem Statement

Given a set of tables, write an SQL query to compute a frequency table (i.e., count of occurrences) of a certain attribute that requires two joins. Also, explain how to apply GROUP BY and ORDER BY. Discuss how to handle NULLs.

Example Schema

  • users (user_id, name)
  • orders (order_id, user_id, product_id)
  • products (product_id, category)

Suppose you want to compute a frequency table of the number of orders per category.

SQL Query


SELECT
    p.category,
    COUNT(*) AS order_count
FROM
    orders o
    JOIN products p ON o.product_id = p.product_id
    JOIN users u ON o.user_id = u.user_id
GROUP BY
    p.category
ORDER BY
    order_count DESC

Explanation

  • Two Joins: orders is joined with products (to get the category) and users (to link orders to users).
  • GROUP BY: Groups the result by product category, so you get a row per category.
  • ORDER BY: Orders the categories by the number of orders, descending.
  • COUNT(*): Counts the number of orders in each group.

Handling NULLs

If there are rows in orders without matching products or users, you may want to use LEFT JOIN instead of INNER JOIN. This will include all orders, even those with missing product or user information.


SELECT
    p.category,
    COUNT(*) AS order_count
FROM
    orders o
    LEFT JOIN products p ON o.product_id = p.product_id
    LEFT JOIN users u ON o.user_id = u.user_id
GROUP BY
    p.category
ORDER BY
    order_count DESC

To count only non-NULL categories (note that filtering on p.category in the WHERE clause effectively turns the LEFT JOIN on products back into an INNER JOIN):


SELECT
    p.category,
    COUNT(*) AS order_count
FROM
    orders o
    LEFT JOIN products p ON o.product_id = p.product_id
    LEFT JOIN users u ON o.user_id = u.user_id
WHERE
    p.category IS NOT NULL
GROUP BY
    p.category
ORDER BY
    order_count DESC
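The effect of the LEFT JOIN on NULL handling can be demonstrated with Python's built-in sqlite3 module. This is a minimal sketch: the schema follows the example above, the rows are invented, and the users join is omitted since it does not change the counts.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (product_id INTEGER, category TEXT);
    CREATE TABLE orders (order_id INTEGER, user_id INTEGER, product_id INTEGER);
    INSERT INTO products VALUES (10, 'books'), (11, 'toys');
    INSERT INTO orders VALUES
        (100, 1, 10), (101, 1, 10), (102, 1, 11),
        (103, 1, 99);   -- product 99 does not exist in products
""")

# LEFT JOIN keeps the orphan order; it falls into a NULL category group
rows = conn.execute("""
    SELECT p.category, COUNT(*) AS order_count
    FROM orders o
    LEFT JOIN products p ON o.product_id = p.product_id
    GROUP BY p.category
    ORDER BY order_count DESC, p.category
""").fetchall()

print(rows)  # 'books' has 2 orders; the orphan order shows up under None
```

With an INNER JOIN instead, order 103 would silently disappear from the frequency table, which is exactly the difference the problem asks you to discuss.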

Concepts Involved

  • Joins (INNER and LEFT)
  • GROUP BY and aggregation
  • ORDER BY for sorted results
  • NULL handling in SQL queries

4. SQL: Updating a Cumulative Sum Table with Real-Time Data

Meta often tests your data engineering skills by asking you to write SQL for updating historical summary tables using incremental data. Here’s a classic scenario:

Problem Statement

Given two tables:

  • lifetime_counts: Contains date, user_id, song_id, and count (cumulative sum of listens up to that day for each user-song pair).
  • daily_listens: Contains real-time data of user-song listens on a particular day (date, user_id, song_id, count).

At the end of each day, you need to update lifetime_counts by:

  1. Adding the day's listen counts from daily_listens to the corresponding cumulative count in lifetime_counts for each user_id and song_id.
  2. If a user-song pair does not exist in lifetime_counts, insert it with the day's count.

Example Data

lifetime_counts:
  • 2024-06-01, user1, song1, 10
  • 2024-06-01, user2, song2, 5

daily_listens:
  • 2024-06-02, user1, song1, 2
  • 2024-06-02, user3, song3, 4

SQL Solution (Using MERGE or Upsert)

Many modern SQL databases support the MERGE statement or upsert variants such as INSERT ... ON DUPLICATE KEY UPDATE in MySQL and INSERT ... ON CONFLICT DO UPDATE in PostgreSQL. Both forms require a primary key or unique constraint on (user_id, song_id) so the database can detect the conflict.

Example for PostgreSQL


INSERT INTO lifetime_counts (date, user_id, song_id, count)
SELECT
    '2024-06-02' AS date, user_id, song_id, count
FROM
    daily_listens
WHERE
    date = '2024-06-02'
ON CONFLICT (user_id, song_id) DO UPDATE
SET
    count = lifetime_counts.count + EXCLUDED.count,
    date = EXCLUDED.date;

Example for MySQL


INSERT INTO lifetime_counts (date, user_id, song_id, count)
SELECT
    '2024-06-02' AS date, user_id, song_id, count
FROM
    daily_listens
WHERE
    date = '2024-06-02'
ON DUPLICATE KEY UPDATE
    count = count + VALUES(count),
    date = VALUES(date);

Explanation

  • INSERT ... SELECT: Inserts the day's new listens.
  • ON CONFLICT/ON DUPLICATE KEY UPDATE: If the pair already exists, update the count by adding the new listens; otherwise, insert a new row.
  • date: Update the date to the latest (or keep as needed).
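The upsert can be exercised end to end with Python's built-in sqlite3 module, which supports ON CONFLICT ... DO UPDATE since SQLite 3.24. The sample rows mirror the example data above; note the UNIQUE constraint that makes the conflict detection work.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE lifetime_counts (
        date TEXT, user_id TEXT, song_id TEXT, count INTEGER,
        UNIQUE (user_id, song_id)          -- required for ON CONFLICT
    );
    CREATE TABLE daily_listens (date TEXT, user_id TEXT, song_id TEXT, count INTEGER);
    INSERT INTO lifetime_counts VALUES
        ('2024-06-01', 'user1', 'song1', 10),
        ('2024-06-01', 'user2', 'song2', 5);
    INSERT INTO daily_listens VALUES
        ('2024-06-02', 'user1', 'song1', 2),
        ('2024-06-02', 'user3', 'song3', 4);
""")

# Upsert the day's listens: add to existing counts, insert new pairs
conn.execute("""
    INSERT INTO lifetime_counts (date, user_id, song_id, count)
    SELECT date, user_id, song_id, count
    FROM daily_listens
    WHERE date = '2024-06-02'
    ON CONFLICT (user_id, song_id) DO UPDATE SET
        count = lifetime_counts.count + excluded.count,
        date = excluded.date
""")

rows = conn.execute(
    "SELECT user_id, song_id, count FROM lifetime_counts ORDER BY user_id"
).fetchall()
print(rows)
# [('user1', 'song1', 12), ('user2', 'song2', 5), ('user3', 'song3', 4)]
```

The existing (user1, song1) pair is incremented from 10 to 12, the untouched (user2, song2) pair is left alone, and the new (user3, song3) pair is inserted, matching the two requirements in the problem statement.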

Concepts Involved

  • Upsert logic (MERGE, ON CONFLICT, ON DUPLICATE KEY UPDATE)
  • Batch data processing
  • Incremental update of summary tables
  • Cumulative sum logic

Conclusion

Interviewing for a data scientist role at Meta requires a deep understanding of both theoretical and practical concepts in data science, statistics, and SQL. The questions above are among the most frequently asked and are designed to test your problem-solving ability, technical proficiency, and attention to real-world scalability.

Mastering these questions—ranging from SQL recommendations, algorithmic puzzles, complex joins with aggregation, to data pipeline updates—will give you a significant advantage in your Meta data scientist interview. Practice writing clean, efficient code, and always be ready to explain the reasoning behind your solutions.

Best of luck with your Meta data scientist interview!
