
Top 5 Data Scientist Interview Questions from Amazon and Meta
Data science interviews at top tech companies like Amazon and Meta are known for their challenging and insightful questions. These questions not only test your technical proficiency in SQL, algorithms, and data manipulation, but also your ability to interpret real-world data scenarios, optimize for performance, and communicate your thought process clearly. In this article, we break down and answer the top 5 data scientist interview questions commonly asked at Amazon and Meta, explaining all relevant concepts and providing detailed solutions. Whether you’re preparing for your next big interview or want to strengthen your data science skills, this comprehensive guide will be invaluable.
1. Meta SQL Interview Question: Calculating the Overall Friend Request Acceptance Rate
Question:
Given two tables:
- Friend_request (requester_id, sent_to_id, time)
- Request_accepted (acceptor_id, requestor_id, time)
Find the overall acceptance rate of requests.
Understanding the Problem
We are given two tables. Friend_request logs all friend requests sent between users, and Request_accepted logs only those friend requests that have been accepted. The acceptance rate is defined as:
$$ \text{Acceptance Rate} = \frac{\text{Number of Friend Requests Accepted}}{\text{Total Number of Friend Requests Sent}} $$
Approach
- Count the total number of friend requests from Friend_request.
- Count the number of accepted requests from Request_accepted.
- Divide the number of accepted requests by the total number sent.
SQL Solution
SELECT
    (SELECT COUNT(*) FROM Request_accepted) * 1.0 /
    (SELECT COUNT(*) FROM Friend_request) AS acceptance_rate;
Explanation
The subqueries count the total records in Request_accepted and Friend_request. We multiply by 1.0 to ensure we get a decimal (floating point) result, not integer division.
Alternatively, you can use a LEFT JOIN to match requests that were accepted:
SELECT
    SUM(CASE WHEN ra.requestor_id IS NOT NULL THEN 1 ELSE 0 END) * 1.0 / COUNT(*) AS acceptance_rate
FROM Friend_request fr
LEFT JOIN Request_accepted ra
    ON fr.requester_id = ra.requestor_id AND fr.sent_to_id = ra.acceptor_id;
This query joins each request to its corresponding acceptance (if any), sums up the accepted ones, and divides by the total number of requests.
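As a quick sanity check, the subquery version can be run against a toy dataset with SQLite from Python (the rows below mirror the sample dataset at the end of this article; table and column names are taken from the question as stated):

```python
import sqlite3

# Build the two tables from the question with a couple of toy rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Friend_request (requester_id INT, sent_to_id INT, time TEXT);
    CREATE TABLE Request_accepted (acceptor_id INT, requestor_id INT, time TEXT);
    INSERT INTO Friend_request VALUES (1, 2, '2024-06-01 10:00:00'),
                                      (2, 3, '2024-06-01 11:00:00');
    INSERT INTO Request_accepted VALUES (2, 1, '2024-06-01 10:30:00');
""")

# One of two requests was accepted, so the overall rate should be 0.5.
rate = conn.execute("""
    SELECT (SELECT COUNT(*) FROM Request_accepted) * 1.0 /
           (SELECT COUNT(*) FROM Friend_request) AS acceptance_rate
""").fetchone()[0]
print(rate)  # 0.5
```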
If the interviewer asks for the acceptance rate per user instead, we can write a more detailed query that computes each user's rate, including users who never sent a request:
WITH all_users AS (
    SELECT requester_id AS user_id FROM Friend_request
    UNION
    SELECT sent_to_id AS user_id FROM Friend_request
    UNION
    SELECT acceptor_id AS user_id FROM Request_accepted
    UNION
    SELECT requestor_id AS user_id FROM Request_accepted
),
sent_stats AS (
    SELECT
        requester_id AS user_id,
        COUNT(DISTINCT sent_to_id) AS requests_sent
    FROM Friend_request
    GROUP BY requester_id
),
accepted_stats AS (
    SELECT
        requestor_id AS user_id,
        COUNT(DISTINCT acceptor_id) AS requests_accepted
    FROM Request_accepted
    GROUP BY requestor_id
)
SELECT
    au.user_id,
    COALESCE(ss.requests_sent, 0) AS total_requests_sent,
    COALESCE(asr.requests_accepted, 0) AS total_requests_accepted,
    CASE
        WHEN COALESCE(ss.requests_sent, 0) = 0 THEN 0
        ELSE ROUND(COALESCE(asr.requests_accepted, 0) * 100.0 / ss.requests_sent, 2)
    END AS acceptance_rate_percentage
FROM all_users au
LEFT JOIN sent_stats ss ON au.user_id = ss.user_id
LEFT JOIN accepted_stats asr ON au.user_id = asr.user_id
ORDER BY au.user_id;
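This per-user query can also be exercised on the same toy rows with SQLite from Python; with one request from user 1 accepted and one from user 2 pending, user 1 should show 100%, user 2 should show 0%, and user 3 (who never sent a request) should show 0:

```python
import sqlite3

# Toy data: user 1's request to user 2 was accepted; user 2's to user 3 was not.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Friend_request (requester_id INT, sent_to_id INT, time TEXT);
    CREATE TABLE Request_accepted (acceptor_id INT, requestor_id INT, time TEXT);
    INSERT INTO Friend_request VALUES (1, 2, '2024-06-01 10:00:00'),
                                      (2, 3, '2024-06-01 11:00:00');
    INSERT INTO Request_accepted VALUES (2, 1, '2024-06-01 10:30:00');
""")
rows = conn.execute("""
    WITH all_users AS (
        SELECT requester_id AS user_id FROM Friend_request
        UNION SELECT sent_to_id FROM Friend_request
        UNION SELECT acceptor_id FROM Request_accepted
        UNION SELECT requestor_id FROM Request_accepted
    ),
    sent_stats AS (
        SELECT requester_id AS user_id, COUNT(DISTINCT sent_to_id) AS requests_sent
        FROM Friend_request GROUP BY requester_id
    ),
    accepted_stats AS (
        SELECT requestor_id AS user_id, COUNT(DISTINCT acceptor_id) AS requests_accepted
        FROM Request_accepted GROUP BY requestor_id
    )
    SELECT au.user_id,
           COALESCE(ss.requests_sent, 0),
           COALESCE(asr.requests_accepted, 0),
           CASE WHEN COALESCE(ss.requests_sent, 0) = 0 THEN 0
                ELSE ROUND(COALESCE(asr.requests_accepted, 0) * 100.0 / ss.requests_sent, 2)
           END AS acceptance_rate_percentage
    FROM all_users au
    LEFT JOIN sent_stats ss ON au.user_id = ss.user_id
    LEFT JOIN accepted_stats asr ON au.user_id = asr.user_id
    ORDER BY au.user_id
""").fetchall()
print(rows)  # [(1, 1, 1, 100.0), (2, 1, 0, 0.0), (3, 0, 0, 0)]
```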
2. Meta SQL Interview Question: Who Has the Most Friends?
Question:
Given the following tables, how would you know who has the most friends?
- REQUESTS (date, sender_id, accepter_id)
- ACCEPTED (accepted_at, accepter_id, sender_id)
Understanding the Problem
Each row in ACCEPTED indicates a successful friend connection between accepter_id and sender_id. Friendship is mutual, so each accepted request makes both participants friends with each other. We need to count, for each user, the total number of unique friends, and find the user with the most friends.
Approach
- For every row in ACCEPTED, both accepter_id and sender_id gain a friend.
- Stack the accepter_id and sender_id columns together, then group by user to count unique friends.
- Find the user(s) with the maximum count.
SQL Solution
WITH friends AS (
SELECT accepter_id AS user_id, sender_id AS friend_id FROM ACCEPTED
UNION
SELECT sender_id AS user_id, accepter_id AS friend_id FROM ACCEPTED
)
SELECT user_id, COUNT(DISTINCT friend_id) AS num_friends
FROM friends
GROUP BY user_id
ORDER BY num_friends DESC
LIMIT 1;
Explanation
- The WITH friends AS ... CTE stacks both directions of the friendship.
- We count DISTINCT friend_id for each user_id to avoid double-counting.
- ORDER BY num_friends DESC LIMIT 1 returns the user with the most friends. Note that if several users tie for the maximum, LIMIT 1 returns only one of them; use RANK() or compare against the maximum count if all tied users should be returned.
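Here is a quick run of the UNION + COUNT(DISTINCT) approach on toy data with SQLite from Python. A user_id tiebreak is added to the ORDER BY (not in the original query) so the result is deterministic:

```python
import sqlite3

# Four accepted requests; user 1 ends up with three distinct friends.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ACCEPTED (accepted_at TEXT, accepter_id INT, sender_id INT)")
conn.executemany("INSERT INTO ACCEPTED VALUES (?, ?, ?)", [
    ("2024-06-01", 1, 2),  # users 1 and 2 become friends
    ("2024-06-02", 1, 3),  # users 1 and 3 become friends
    ("2024-06-03", 4, 1),  # users 4 and 1 become friends
    ("2024-06-04", 2, 3),  # users 2 and 3 become friends
])

top = conn.execute("""
    WITH friends AS (
        SELECT accepter_id AS user_id, sender_id AS friend_id FROM ACCEPTED
        UNION
        SELECT sender_id AS user_id, accepter_id AS friend_id FROM ACCEPTED
    )
    SELECT user_id, COUNT(DISTINCT friend_id) AS num_friends
    FROM friends
    GROUP BY user_id
    ORDER BY num_friends DESC, user_id
    LIMIT 1
""").fetchone()
print(top)  # (1, 3) -- user 1 has three friends: 2, 3, and 4
```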
3. Amazon SQL Interview Question: LEFT JOIN with WHERE Filter
Question:
You have:
- Table 1: 1 million records, columns: ID, AGE
- Table 2: 100 records, columns: ID, SALARY
Given SQL:
SELECT A.ID, A.AGE, B.SALARY
FROM TABLE1 A
LEFT JOIN TABLE2 B
ON A.ID = B.ID
WHERE B.SALARY > 50000
How many records would be returned? (The interviewer then asks you to modify the query.)
Understanding the Problem
A LEFT JOIN returns all rows from TABLE1, and the matching rows from TABLE2. If there's no match, B.SALARY is NULL.
However, the WHERE B.SALARY > 50000 filter removes all rows where B.SALARY is NULL or less than or equal to 50,000. In effect, this turns the LEFT JOIN into an INNER JOIN on the condition that B.SALARY > 50000.
Number of Records Returned
Since TABLE2 has only 100 records, at most 100 rows can survive the filter. The result is the number of TABLE1 rows whose ID matches a TABLE2 row with SALARY > 50000.
- If TABLE1 contains every ID in TABLE2, and all 100 rows have SALARY > 50000, then 100 rows are returned.
- If only 10 rows in TABLE2 have SALARY > 50000, only the matching rows from TABLE1 are returned (up to 10).
So, unless otherwise specified, the answer is: the number of rows in TABLE2 where SALARY > 50000 (assuming each ID is unique and has a match in TABLE1).
Modification to the Query
The interviewer likely wants to include rows from TABLE1 even when there's no matching TABLE2 row (i.e., keep B.SALARY as NULL), while still filtering on salary where available.
This can be achieved by moving the B.SALARY > 50000 into the ON clause:
SELECT A.ID, A.AGE, B.SALARY
FROM TABLE1 A
LEFT JOIN TABLE2 B
ON A.ID = B.ID AND B.SALARY > 50000
Now, all 1 million rows from TABLE1 are returned. If there's a matching B.ID with SALARY > 50000, B.SALARY is filled; otherwise, it's NULL.
Key Takeaways
- Filtering on the right table in the WHERE clause after a LEFT JOIN can turn it into an INNER JOIN.
- Putting the condition in the ON clause preserves LEFT JOIN behavior, returning all rows from the left table.
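The difference is easy to demonstrate with SQLite from Python; the toy tables below are tiny stand-ins for TABLE1 and TABLE2 (the real ones hold 1 million and 100 rows):

```python
import sqlite3

# Three left rows; only ID 1 has a matching salary above 50,000.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TABLE1 (ID INT, AGE INT);
    CREATE TABLE TABLE2 (ID INT, SALARY INT);
    INSERT INTO TABLE1 VALUES (1, 30), (2, 40), (3, 50);
    INSERT INTO TABLE2 VALUES (1, 60000), (2, 40000);
""")

# Condition in WHERE: NULL and low salaries are filtered out -> effectively an INNER JOIN.
where_rows = conn.execute("""
    SELECT A.ID, A.AGE, B.SALARY
    FROM TABLE1 A
    LEFT JOIN TABLE2 B ON A.ID = B.ID
    WHERE B.SALARY > 50000
""").fetchall()

# Condition in ON: every left row survives; SALARY is NULL where the condition fails.
on_rows = conn.execute("""
    SELECT A.ID, A.AGE, B.SALARY
    FROM TABLE1 A
    LEFT JOIN TABLE2 B ON A.ID = B.ID AND B.SALARY > 50000
""").fetchall()

print(len(where_rows), len(on_rows))  # 1 3
```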
4. Amazon Coding Interview: Aggregating Data from a Large CSV File
Question:
Given a CSV file with columns ID and Quantity (50 million records, 2GB file size), write a program in any language to aggregate (sum) the Quantity column.
Understanding the Problem
We need to efficiently process a large CSV file and compute the sum of the Quantity column. This is a classic case for streaming the file line by line to keep memory usage low.
Python Solution
import csv

def aggregate_quantity(file_path):
    total_quantity = 0
    with open(file_path, 'r', newline='') as csvfile:
        reader = csv.DictReader(csvfile)
        for row in reader:
            total_quantity += int(row['Quantity'])
    return total_quantity

# Example usage:
# print(aggregate_quantity('large_file.csv'))
Explanation
- csv.DictReader reads the file row by row, so the whole file is never loaded into memory.
- We sum the Quantity field as integers, keeping a running total.
- Only a small, constant amount of memory is used regardless of file size.
Optimization Tips
- If the file has no header, use csv.reader and access the column by index.
- For even faster performance, consider pandas.read_csv with usecols and chunksize for chunked processing.
Pandas Chunked Solution
import pandas as pd

def aggregate_quantity_pandas(file_path):
    total = 0
    for chunk in pd.read_csv(file_path, usecols=['Quantity'], chunksize=10**6):
        total += chunk['Quantity'].sum()
    return total
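To see the streaming approach work end to end, you can generate a small CSV and run the csv-based function on it (the function is repeated here so the snippet is self-contained; the file name, row count, and values are arbitrary):

```python
import csv
import os
import tempfile

def aggregate_quantity(file_path):
    # Stream the file row by row; memory use stays constant.
    total = 0
    with open(file_path, newline='') as f:
        for row in csv.DictReader(f):
            total += int(row['Quantity'])
    return total

# Write a small test file: 1,000 rows whose Quantity cycles through 0..9.
fd, path = tempfile.mkstemp(suffix=".csv")
with os.fdopen(fd, "w", newline='') as f:
    writer = csv.writer(f)
    writer.writerow(["ID", "Quantity"])
    writer.writerows((i, i % 10) for i in range(1000))

result = aggregate_quantity(path)
print(result)  # 100 full 0..9 cycles sum to 4500
os.remove(path)
```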
5. Meta SQL Interview Question: Longest User Visit Streak as of Yesterday
Question:
Given a table with columns user_id and date (the dates users visited the platform), find the top 100 users with the longest continuous streak of visiting the platform as of yesterday.
Understanding the Problem
We need to compute, for each user, the length of their most recent continuous streak of daily visits up to (and including) yesterday.
Key Concepts
- Streak: A sequence of consecutive days of activity.
- We are interested in streaks ending on yesterday’s date.
SQL Solution
Let’s assume the table is called user_visits (user_id, visit_date).
-- PostgreSQL-style date arithmetic; adapt the INTERVAL syntax for your dialect.
WITH sorted_visits AS (
    SELECT
        user_id,
        visit_date,
        ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY visit_date DESC) AS rn
    FROM user_visits
    WHERE visit_date <= CURRENT_DATE - INTERVAL '1 day'
),
streaks AS (
    SELECT
        user_id,
        visit_date,
        -- rn increases as dates go back in time, so within a run of consecutive
        -- days visit_date + rn days is constant; a streak ending yesterday
        -- maps to CURRENT_DATE.
        visit_date + rn * INTERVAL '1 day' AS streak_group
    FROM sorted_visits
),
user_streaks AS (
    SELECT
        user_id,
        COUNT(*) AS streak_length
    FROM streaks
    WHERE streak_group = CURRENT_DATE
    GROUP BY user_id
)
SELECT user_id, streak_length
FROM user_streaks
ORDER BY streak_length DESC
LIMIT 100;
Explanation
- sorted_visits: Assigns a row number to each visit per user, ordered descending by date, considering only visits up to yesterday.
- streaks: Builds a "group" identifier by adding rn days to each visit date. Because rn increases as the dates go back in time, every day in an unbroken run ending yesterday maps to the same value: today's date.
- user_streaks: Keeps only the rows whose group equals today's date (i.e., the streak ending yesterday) and counts them per user to get the streak length.
This approach isolates the current streak and returns the 100 users with the longest streaks.
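The same idea can be tested deterministically with SQLite from Python (requires SQLite 3.25+ for window functions). "Today" is pinned to 2024-06-11 so the output is stable; real code would use date('now'). SQLite spells the date arithmetic differently, but the group check is the same: visit_date plus rn days equals today.

```python
import sqlite3

# Pin "today" so the example is deterministic.
TODAY = "2024-06-11"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user_visits (user_id INT, visit_date TEXT)")
conn.executemany("INSERT INTO user_visits VALUES (?, ?)", [
    (1, "2024-06-08"), (1, "2024-06-09"), (1, "2024-06-10"),  # 3-day streak ending yesterday
    (2, "2024-06-09"), (2, "2024-06-10"),                     # 2-day streak ending yesterday
    (3, "2024-06-07"), (3, "2024-06-08"),                     # streak broken before yesterday
])

rows = conn.execute("""
    WITH sorted_visits AS (
        SELECT user_id, visit_date,
               ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY visit_date DESC) AS rn
        FROM user_visits
        WHERE visit_date <= date(?, '-1 day')
    )
    SELECT user_id, COUNT(*) AS streak_length
    FROM sorted_visits
    -- Rows in the streak ending yesterday all satisfy: visit_date + rn days = today.
    WHERE date(visit_date, '+' || rn || ' day') = ?
    GROUP BY user_id
    ORDER BY streak_length DESC
    LIMIT 100
""", (TODAY, TODAY)).fetchall()
print(rows)  # [(1, 3), (2, 2)] -- user 3 has no streak ending yesterday
```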
6. Meta SQL Interview Question: Tracking Feature Usage Over Time
Question:
Given a table that tracks every time a user turns a feature on or off with columns: user_id, action ("on" or "off"), date, and time:
- How many users turned the feature on today?
- How many users have ever turned the feature on?
- In a table that tracks the status of every user every day, how would you add today’s data to it?
Understanding the Problem
- Turning on today: Users who performed an "on" action today (distinct users).
- Ever turned on: Users with at least one "on" action at any time.
- Daily status: Maintaining each user's end-of-day feature status for every day, and appending today's statuses to that table.
Daily Status Tracking: How to Update the Status Table
Suppose you have a table feature_usage with (user_id, action, date, time), and you maintain a table user_feature_daily_status with (user_id, date, status), where status indicates whether the feature was ON or OFF at the end of each day for every user.
1. How many users turned the feature on today?
SELECT COUNT(DISTINCT user_id) AS num_users_turned_on_today
FROM feature_usage
WHERE action = 'on' AND date = CURRENT_DATE;
This query simply counts distinct users who performed the 'on' action today.
2. How many users have ever turned the feature on?
SELECT COUNT(DISTINCT user_id) AS num_users_ever_turned_on
FROM feature_usage
WHERE action = 'on';
This query counts distinct users who have at any time turned the feature on.
3. Adding Today's Status to the Daily Status Table
To update the user_feature_daily_status table for today, determine for each user whether the feature is ON or OFF at the end of today. This typically means:
- For each user, find the latest action (by date and time) as of today.
- If the latest action is 'on', the status is ON; if 'off', the status is OFF.
WITH latest_action AS (
    SELECT
        user_id,
        action,
        ROW_NUMBER() OVER (
            PARTITION BY user_id
            ORDER BY date DESC, time DESC
        ) AS rn
    FROM feature_usage
    WHERE date = CURRENT_DATE
)
INSERT INTO user_feature_daily_status (user_id, date, status)
SELECT user_id, CURRENT_DATE, action AS status
FROM latest_action
WHERE rn = 1;
This CTE finds the most recent action per user for today and inserts that status into the daily tracking table.
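The "latest action wins" update can be sketched with SQLite from Python as well. The date is pinned for determinism, and the table names follow the discussion above; a production schema may differ:

```python
import sqlite3

TODAY = "2024-06-11"  # pinned for determinism; real code would use CURRENT_DATE

# User 1 turns the feature on, then off; user 2 turns it on and leaves it on.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE feature_usage (user_id INT, action TEXT, date TEXT, time TEXT);
    CREATE TABLE user_feature_daily_status (user_id INT, date TEXT, status TEXT);
    INSERT INTO feature_usage VALUES
        (1, 'on',  '2024-06-11', '09:00'),
        (1, 'off', '2024-06-11', '17:00'),
        (2, 'on',  '2024-06-11', '12:00');
""")

# Insert each user's end-of-day status: the action with the latest timestamp wins.
conn.execute("""
    INSERT INTO user_feature_daily_status (user_id, date, status)
    SELECT user_id, ?, action
    FROM (
        SELECT user_id, action,
               ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY date DESC, time DESC) AS rn
        FROM feature_usage
        WHERE date = ?
    )
    WHERE rn = 1
""", (TODAY, TODAY))

statuses = conn.execute(
    "SELECT user_id, status FROM user_feature_daily_status ORDER BY user_id"
).fetchall()
print(statuses)  # [(1, 'off'), (2, 'on')] -- user 1's last action today was 'off'
```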
Key Concepts Explained
1. LEFT JOIN vs INNER JOIN in SQL
A LEFT JOIN returns all records from the left table and matched records from the right table. If there's no match, the result is NULL on the right side. An INNER JOIN returns only records with matches in both tables.
| LEFT JOIN | INNER JOIN |
| --- | --- |
| All rows from the left table, plus matched rows from the right (NULL if no match) | Only rows where there is a match in both tables |
| Can return more rows (including NULLs) | Never includes rows without a match |

2. Aggregating Large Datasets
When working with very large datasets (like 50 million records), streaming the data line by line or processing it in chunks (e.g., using pandas with chunksize) is critical for keeping memory usage low and throughput high.
3. Calculating Streaks in SQL
A common technique for finding streaks is to use the difference between the date and the row number. When visits are ordered by ascending date, date - ROW_NUMBER() is constant across consecutive dates, so you can group on this value to identify streaks.
4. Maintaining Daily Status Tables
To track daily user status, always use the latest action for that day. If there are multiple actions, pick the one with the latest timestamp.
Sample Datasets for Practice
Here are a few example datasets you can use to practice these interview questions:
Friend_request

| requester_id | sent_to_id | time |
| --- | --- | --- |
| 1 | 2 | 2024-06-01 10:00:00 |
| 2 | 3 | 2024-06-01 11:00:00 |

Request_accepted

| acceptor_id | requestor_id | time |
| --- | --- | --- |
| 2 | 1 | 2024-06-01 10:30:00 |
Summary Table: Top 5 Questions and Key Concepts
| Company | Question Topic | Key SQL/Python Concept | What It Tests |
| --- | --- | --- | --- |
| Meta | Acceptance Rate | Aggregation, subqueries, LEFT JOIN | Basic aggregation, join logic |
| Meta | Most Friends | UNION, GROUP BY, COUNT(DISTINCT) | Set manipulation, counting relationships |
| Amazon | LEFT JOIN & WHERE Clause | JOIN types, WHERE vs ON clause | Effect of filtering on joins |
| Amazon | Large CSV Aggregation | File streaming, aggregation | Efficiency, Python coding |
| Meta | Longest Streak | Window functions, streak grouping | Advanced SQL, analytics |
| Meta | Feature Usage Events | Window functions, status tracking | Temporal analytics, ETL logic |
Interview Tips for Data Science Positions at Amazon & Meta
- Master SQL fundamentals. Practice subqueries, window functions, and different types of joins.
- Understand business context. Know why you’re aggregating, which denominator to use, and how to interpret results.
- Think about edge cases. Null values, duplicate rows, and non-matching keys are common pitfalls.
- Communicate clearly. Explain each step of your logic. Your approach matters as much as the final answer.
- Be efficient with code. For coding questions, show that you can process large data efficiently (streaming, chunking, etc.).
- Practice with real datasets. Use open datasets to simulate the scale and complexity of the interview environment.
Conclusion
The data scientist interview process at Amazon and Meta is rigorous and highly competitive. By mastering these top interview questions—ranging from SQL aggregations and joins to streak calculations, feature tracking, and big data processing—you’ll be well-prepared for the technical challenges ahead. Remember, these questions not only test your technical skills but also your analytical thinking and communication. Practice regularly, focus on understanding the underlying business logic, and you'll be set for success in your next data science interview.
Best of luck on your data science journey!
