

Top 5 Data Scientist Interview Questions from Amazon and Meta

Data science interviews at top tech companies like Amazon and Meta are known for their challenging and insightful questions. These questions not only test your technical proficiency in SQL, algorithms, and data manipulation, but also your ability to interpret real-world data scenarios, optimize for performance, and communicate your thought process clearly. In this article, we break down and answer the top 5 data scientist interview questions commonly asked at Amazon and Meta, explaining all relevant concepts and providing detailed solutions. Whether you’re preparing for your next big interview or want to strengthen your data science skills, this comprehensive guide will be invaluable.

1. Meta SQL Interview Question: Calculating the Overall Friend Request Acceptance Rate

Question:

Given two tables:

  • Friend_request (requester_id, sent_to_id, time)
  • Request_accepted (acceptor_id, requestor_id, time)

Find the overall acceptance rate of requests.

Understanding the Problem

We are given two tables. Friend_request logs all friend requests sent between users, and Request_accepted logs only those friend requests that have been accepted. The acceptance rate is defined as:

$$ \text{Acceptance Rate} = \frac{\text{Number of Friend Requests Accepted}}{\text{Total Number of Friend Requests Sent}} $$

Approach

  • Count the total number of friend requests from Friend_request.
  • Count the number of accepted requests from Request_accepted.
  • Divide the number of accepted requests by the total number sent.

SQL Solution


SELECT 
  (SELECT COUNT(*) FROM Request_accepted) * 1.0 / 
  (SELECT COUNT(*) FROM Friend_request) AS acceptance_rate;

Explanation

The subqueries count the total records in Request_accepted and Friend_request. We multiply by 1.0 to ensure we get a decimal (floating point) result, not integer division.
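As a quick sanity check, the same overall calculation can be reproduced in Python on a toy version of the two tables (the sample rows below are made up for illustration):

```python
# Toy versions of the two tables (made-up sample rows).
friend_request = [(1, 2), (1, 3), (2, 3), (3, 4)]   # (requester_id, sent_to_id)
request_accepted = [(2, 1), (3, 2)]                 # (acceptor_id, requestor_id)

# Acceptance rate = accepted requests / total requests sent.
acceptance_rate = len(request_accepted) / len(friend_request)
print(acceptance_rate)  # 2 accepted out of 4 sent -> 0.5
```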

Alternatively, you can use a LEFT JOIN to match requests that were accepted:


SELECT 
  SUM(CASE WHEN ra.requestor_id IS NOT NULL THEN 1 ELSE 0 END) * 1.0 / COUNT(*) AS acceptance_rate
FROM Friend_request fr
LEFT JOIN Request_accepted ra
  ON fr.requester_id = ra.requestor_id AND fr.sent_to_id = ra.acceptor_id;

This query joins each request to its corresponding acceptance (if any), sums up the accepted ones, and divides by the total number of requests.

If the interviewer asks for the acceptance rate per user, we can write a more detailed query that covers every user, including those who never sent a request:

WITH all_users AS (
    SELECT requester_id as user_id FROM Friend_request
    UNION
    SELECT sent_to_id as user_id FROM Friend_request
    UNION
    SELECT acceptor_id as user_id FROM Request_accepted
    UNION
    SELECT requestor_id as user_id FROM Request_accepted
),
sent_stats AS (
    SELECT 
        requester_id as user_id,
        COUNT(DISTINCT sent_to_id) as requests_sent
    FROM Friend_request
    GROUP BY requester_id
),
accepted_stats AS (
    SELECT 
        requestor_id as user_id,
        COUNT(DISTINCT acceptor_id) as requests_accepted
    FROM Request_accepted
    GROUP BY requestor_id
)
SELECT 
    au.user_id,
    COALESCE(ss.requests_sent, 0) as total_requests_sent,
    COALESCE(asr.requests_accepted, 0) as total_requests_accepted,
    CASE 
        WHEN COALESCE(ss.requests_sent, 0) = 0 THEN 0
        ELSE ROUND(COALESCE(asr.requests_accepted, 0) * 100.0 / ss.requests_sent, 2)
    END as acceptance_rate_percentage
FROM all_users au
LEFT JOIN sent_stats ss ON au.user_id = ss.user_id
LEFT JOIN accepted_stats asr ON au.user_id = asr.user_id
ORDER BY au.user_id;
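If pandas is allowed, the per-user version can be sketched like this (toy data; column names mirror the tables in the question):

```python
import pandas as pd

# Toy versions of the two tables (made-up rows).
fr = pd.DataFrame({"requester_id": [1, 1, 2], "sent_to_id": [2, 3, 3]})
ra = pd.DataFrame({"acceptor_id": [2], "requestor_id": [1]})

# Distinct requests sent and accepted, per requesting user.
sent = fr.groupby("requester_id")["sent_to_id"].nunique().rename("sent")
accepted = ra.groupby("requestor_id")["acceptor_id"].nunique().rename("accepted")

# Align the two series and compute a percentage, treating missing counts as 0.
rates = pd.concat([sent, accepted], axis=1).fillna(0)
rates["rate_pct"] = (rates["accepted"] * 100 / rates["sent"]).round(2)
print(rates)  # user 1: 50.0%, user 2: 0.0%
```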

2. Meta SQL Interview Question: Who Has the Most Friends?

Question:

Given the following tables, how would you know who has the most friends?

  • REQUESTS (date, sender_id, accepter_id)
  • ACCEPTED (accepted_at, accepter_id, sender_id)

Understanding the Problem

Each row in ACCEPTED indicates a successful friend connection between accepter_id and sender_id. Friendship is mutual, so each accepted request makes both participants friends with each other. We need to count, for each user, the total number of unique friends, and find the user with the most friends.

Approach

  • For every row in ACCEPTED, both accepter_id and sender_id gain a friend.
  • Stack the accepter_id and sender_id columns together, grouping by user to count unique friends.
  • Find the user(s) with the maximum count.

SQL Solution


WITH friends AS (
  SELECT accepter_id AS user_id, sender_id AS friend_id FROM ACCEPTED
  UNION
  SELECT sender_id AS user_id, accepter_id AS friend_id FROM ACCEPTED
)
SELECT user_id, COUNT(DISTINCT friend_id) AS num_friends
FROM friends
GROUP BY user_id
ORDER BY num_friends DESC
LIMIT 1;

Explanation

  • The WITH friends AS ... CTE stacks both directions of the friendship.
  • We count DISTINCT friend_id for each user_id to avoid double-counting.
  • ORDER BY num_friends DESC LIMIT 1 gives the user with the most friends.
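The "stack both directions" idea maps directly to plain Python with a dict of sets (sample rows are made up):

```python
from collections import defaultdict

# Toy ACCEPTED rows: (accepter_id, sender_id); data invented for illustration.
accepted = [(2, 1), (3, 1), (4, 1)]

# Friendship is mutual: record each accepted request in both directions.
friends = defaultdict(set)
for accepter, sender in accepted:
    friends[accepter].add(sender)
    friends[sender].add(accepter)

# User with the most unique friends.
most = max(friends, key=lambda u: len(friends[u]))
print(most, len(friends[most]))  # user 1 has 3 friends
```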

3. Amazon SQL Interview Question: LEFT JOIN with WHERE Filter

Question:

You have:

  • Table 1: 1 million records, columns: ID, AGE
  • Table 2: 100 records, columns: ID, SALARY

Given SQL:


SELECT A.ID, A.AGE, B.SALARY
FROM TABLE1 A
LEFT JOIN TABLE2 B
  ON A.ID = B.ID
WHERE B.SALARY > 50000

How many records would be returned? (The interviewer then asks how you would modify the query.)

Understanding the Problem

A LEFT JOIN returns all rows from TABLE1, and the matching rows from TABLE2. If there's no match, B.SALARY is NULL.

However, the WHERE B.SALARY > 50000 filter removes all rows where B.SALARY is NULL or less than or equal to 50,000. In effect, this turns the LEFT JOIN into an INNER JOIN on the condition that B.SALARY > 50000.

Number of Records Returned

Since TABLE2 has only 100 records, and only rows with SALARY > 50000 survive the filter, the result is capped by TABLE2: at most one row per qualifying TABLE2 record (assuming ID is unique in both tables).

  • If every TABLE2 ID exists in TABLE1 and every SALARY > 50000, at most 100 rows are returned.
  • If only 10 rows in TABLE2 have SALARY > 50000, at most those 10 matching rows from TABLE1 are returned.

So, unless otherwise specified, the answer is: the number of rows in TABLE2 where SALARY > 50000 that also have a matching ID in TABLE1 (at most 100).

Modification to the Query

The interviewer likely wants to include rows from TABLE1 even when there's no matching TABLE2 row (i.e., keep B.SALARY as NULL), while still filtering on salary where available.

This can be achieved by moving the B.SALARY > 50000 into the ON clause:


SELECT A.ID, A.AGE, B.SALARY
FROM TABLE1 A
LEFT JOIN TABLE2 B
  ON A.ID = B.ID AND B.SALARY > 50000

Now, all 1 million rows from TABLE1 are returned. If there's a matching B.ID with SALARY > 50000, B.SALARY is filled; otherwise, it's NULL.

Key Takeaways

  • Filtering in WHERE after a LEFT JOIN can turn it into an INNER JOIN.
  • Putting the condition in the ON clause preserves LEFT JOIN behavior, returning all rows from the left table.
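The WHERE-vs-ON difference is easy to demonstrate with pandas on a tiny example (table contents invented; filtering the right table before a left merge plays the role of the ON-clause condition):

```python
import pandas as pd

t1 = pd.DataFrame({"ID": [1, 2, 3], "AGE": [25, 30, 35]})
t2 = pd.DataFrame({"ID": [1, 2], "SALARY": [60000, 40000]})

# WHERE-style: filter after the left join; NULL salaries are dropped (inner-join effect).
where_style = t1.merge(t2, on="ID", how="left").query("SALARY > 50000")

# ON-style: restrict the right table first; the left join then keeps every TABLE1 row.
on_style = t1.merge(t2[t2["SALARY"] > 50000], on="ID", how="left")

print(len(where_style), len(on_style))  # 1 vs 3
```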

4. Amazon Coding Interview: Aggregating Data from a Large CSV File

Question:

Given a CSV file with columns ID and Quantity (50 million records, 2GB file size), write a program in any language to aggregate (sum) the Quantity column.

Understanding the Problem

We need to efficiently process a large CSV file and compute the sum of the Quantity column. This is a classic case for streaming the file line by line to keep memory usage low.

Python Solution


import csv

def aggregate_quantity(file_path):
    total_quantity = 0
    with open(file_path, 'r', newline='') as csvfile:
        reader = csv.DictReader(csvfile)
        for row in reader:
            total_quantity += int(row['Quantity'])
    return total_quantity

# Example usage:
# print(aggregate_quantity('large_file.csv'))

Explanation

  • csv.DictReader reads the file row by row.
  • We sum the Quantity field as integers, keeping a running total.
  • Only a small amount of memory is used regardless of file size.
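The same streaming idea fits in a single generator expression; here a small in-memory CSV stands in for the 2 GB file:

```python
import csv
import io

# In-memory stand-in for the large CSV file (made-up rows).
data = io.StringIO("ID,Quantity\n1,10\n2,5\n3,7\n")

# Stream row by row and sum Quantity without loading the file into memory.
total = sum(int(row["Quantity"]) for row in csv.DictReader(data))
print(total)  # 10 + 5 + 7 = 22
```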

Optimization Tips

  • If the file has no header, use csv.reader and access by index.
  • For faster performance, consider using pandas.read_csv with usecols and chunksize for chunked processing.

Pandas Chunked Solution


import pandas as pd

def aggregate_quantity_pandas(file_path):
    total = 0
    for chunk in pd.read_csv(file_path, usecols=['Quantity'], chunksize=10**6):
        total += chunk['Quantity'].sum()
    return total

5. Meta SQL Interview Question: Longest User Visit Streak as of Yesterday

Question:

Given a table with columns user_id and date (the dates users visited the platform), find the top 100 users with the longest continuous streak of visiting the platform as of yesterday.

Understanding the Problem

We need to compute, for each user, the length of their most recent continuous streak of daily visits up to (and including) yesterday.

Key Concepts

  • Streak: A sequence of consecutive days of activity.
  • We are interested in streaks ending on yesterday’s date.

SQL Solution

Let’s assume the table is called user_visits (user_id, visit_date), with at most one row per user per day (MySQL date syntax below).


WITH sorted_visits AS (
  SELECT 
    user_id, 
    visit_date,
    ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY visit_date DESC) AS rn
  FROM user_visits
  WHERE visit_date <= CURRENT_DATE - INTERVAL 1 DAY
),
streaks AS (
  SELECT
    user_id,
    visit_date,
    rn,
    DATE_ADD(visit_date, INTERVAL rn DAY) AS streak_group
  FROM sorted_visits
),
user_streaks AS (
  SELECT
    user_id,
    COUNT(*) AS streak_length
  FROM streaks
  WHERE streak_group = CURRENT_DATE
  GROUP BY user_id
)
SELECT user_id, streak_length
FROM user_streaks
ORDER BY streak_length DESC
LIMIT 100;

Explanation

  • sorted_visits: Assigns a row number to each visit per user, ordered descending by date, considering only dates up to yesterday.
  • streaks: Builds a group identifier by combining each visit date with its row number; every date in an unbroken run of consecutive days ending yesterday maps to the same value.
  • user_streaks: For each user, counts the rows in that group, i.e. the length of the streak ending yesterday. Users who did not visit yesterday produce no rows, which is the intended behavior.

This approach isolates the current streak and returns the 100 users with the longest streaks.
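For intuition, the streak-ending-yesterday logic for a single user can be written as a short backward walk in Python (dates invented; "yesterday" is pinned for the example):

```python
from datetime import date, timedelta

# One user's visit dates (made up); pretend yesterday was 2024-06-03.
visits = {date(2024, 6, 1), date(2024, 6, 2), date(2024, 6, 3)}
yesterday = date(2024, 6, 3)

# Walk backwards from yesterday while each day was visited.
streak, day = 0, yesterday
while day in visits:
    streak += 1
    day -= timedelta(days=1)
print(streak)  # 3-day streak ending yesterday
```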


Bonus: Meta SQL Interview Question: Tracking Feature Usage Over Time

Question:

Given a table that tracks every time a user turns a feature on or off with columns: user_id, action ("on" or "off"), date, and time:

  • How many users turned the feature on today?
  • How many users have ever turned the feature on?
  • In a table that tracks the status of every user every day, how would you add today’s data to it?

Understanding the Problem

  • Turning on today: Users who performed an "on" action today (distinct users).
  • Ever turned on: Users with at least one "on" action at any time.
  • Daily status tracking: Maintain a per-user, per-day status table and append today's end-of-day statuses.

Daily Status Tracking: How to Update the Status Table

Suppose you have a table feature_usage with (user_id, action, date, time), and you maintain a table user_feature_daily_status with (user_id, date, status), where status indicates whether the feature was ON or OFF at the end of each day for every user.

1. How many users turned the feature on today?


SELECT COUNT(DISTINCT user_id) AS num_users_turned_on_today
FROM feature_usage
WHERE action = 'on' AND date = CURRENT_DATE;


This query simply counts distinct users who performed the 'on' action today.

2. How many users have ever turned the feature on?


SELECT COUNT(DISTINCT user_id) AS num_users_ever_turned_on
FROM feature_usage
WHERE action = 'on';


This query counts distinct users who have at any time turned the feature on.

3. Adding Today's Status to the Daily Status Table

To update the user_feature_daily_status table for today, determine for each user whether the feature is ON or OFF at the end of today:

  • For each user with activity today, take the latest action (by date and time); 'on' means status ON, 'off' means OFF.
  • Users with no action today keep yesterday's status, so carry it forward.


WITH latest_action AS (
  SELECT
    user_id,
    action,
    ROW_NUMBER() OVER (
      PARTITION BY user_id
      ORDER BY date DESC, time DESC
    ) AS rn
  FROM feature_usage
  WHERE date = CURRENT_DATE
)
INSERT INTO user_feature_daily_status (user_id, date, status)
SELECT user_id, CURRENT_DATE, action AS status
FROM latest_action
WHERE rn = 1
UNION ALL
SELECT s.user_id, CURRENT_DATE, s.status
FROM user_feature_daily_status s
WHERE s.date = CURRENT_DATE - INTERVAL 1 DAY
  AND s.user_id NOT IN (SELECT user_id FROM latest_action);


The CTE finds the most recent action per user for today; the UNION ALL carries forward yesterday's status for users with no activity today, so every user gets a row for the new day.
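The end-of-day update logic (latest action wins, otherwise carry yesterday's status forward) can be sketched in Python with dictionaries; all rows below are invented:

```python
# Today's feature events: (user_id, action, time), made-up sample data.
events_today = [
    (1, "on",  "09:00"),
    (1, "off", "17:00"),
    (2, "on",  "12:00"),
]
yesterday_status = {1: "on", 2: "off", 3: "on"}

# Latest action per user wins: process events in time order, overwriting earlier ones.
latest = {}
for user, action, time in sorted(events_today, key=lambda e: e[2]):
    latest[user] = action

# Users with no event today keep yesterday's status.
today_status = {**yesterday_status, **latest}
print(today_status)  # {1: 'off', 2: 'on', 3: 'on'}
```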


Key Concepts Explained

1. LEFT JOIN vs INNER JOIN in SQL

A LEFT JOIN returns all records from the left table and matched records from the right table. If there's no match, the result is NULL on the right side. An INNER JOIN returns only records with matches in both tables.

  • LEFT JOIN: all rows from the left table, plus matched rows from the right (NULL if no match); can include rows without a match.
  • INNER JOIN: only rows where there is a match in both tables; never includes rows without a match.

2. Aggregating Large Datasets

When working with very large datasets (like 50 million records), streaming the data line by line or processing in chunks (e.g., using pandas with chunksize) is critical for efficient memory usage and speed.

3. Calculating Streaks in SQL

A common technique for finding streaks is to use the difference between date and row number. For consecutive dates numbered in ascending order, date - ROW_NUMBER() is constant, so you can group on this value to identify streaks.

4. Maintaining Daily Status Tables

To track daily user status, always use the latest action for that day. If there are multiple actions, pick the one with the latest timestamp.


Sample Datasets for Practice

Here are a few example datasets you can use to practice these interview questions:

Friend_request

  requester_id | sent_to_id | time
  1            | 2          | 2024-06-01 10:00:00
  2            | 3          | 2024-06-01 11:00:00

Request_accepted

  acceptor_id | requestor_id | time
  2           | 1            | 2024-06-01 10:30:00

Summary Table: Questions and Key Concepts

  Company | Question Topic           | Key SQL/Python Concept             | What It Tests
  Meta    | Acceptance Rate          | Aggregation, subqueries, LEFT JOIN | Basic aggregation, join logic
  Meta    | Most Friends             | UNION, GROUP BY, COUNT(DISTINCT)   | Set manipulation, counting relationships
  Amazon  | LEFT JOIN & WHERE Clause | JOIN types, WHERE vs ON clause     | Effect of filtering on joins
  Amazon  | Large CSV Aggregation    | File streaming, aggregation        | Efficiency, Python coding
  Meta    | Longest Streak           | Window functions, streak grouping  | Advanced SQL, analytics
  Meta    | Feature Usage Events     | Window functions, status tracking  | Temporal analytics, ETL logic

Interview Tips for Data Science Positions at Amazon & Meta

  • Master SQL fundamentals. Practice subqueries, window functions, and different types of joins.
  • Understand business context. Know why you’re aggregating, which denominator to use, and how to interpret results.
  • Think about edge cases. Null values, duplicate rows, and non-matching keys are common pitfalls.
  • Communicate clearly. Explain each step of your logic. Your approach matters as much as the final answer.
  • Be efficient with code. For coding questions, show that you can process large data efficiently (streaming, chunking, etc.).
  • Practice with real datasets. Use open datasets to simulate the scale and complexity of the interview environment.

Conclusion

The data scientist interview process at Amazon and Meta is rigorous and highly competitive. By mastering these top interview questions—ranging from SQL aggregations and joins to streak calculations, feature tracking, and big data processing—you’ll be well-prepared for the technical challenges ahead. Remember, these questions not only test your technical skills but also your analytical thinking and communication. Practice regularly, focus on understanding the underlying business logic, and you'll be set for success in your next data science interview.

Best of luck on your data science journey!
