
Meta Data Scientist SQL Interview: Sample Question and Solution
In this article, we will tackle a real-world Meta Data Scientist SQL interview question. We will break down the requirements, explain all relevant concepts, and craft an optimal SQL solution. By the end, you'll thoroughly understand how to approach similar SQL interview scenarios and demonstrate your analytical prowess.
Meta Data Scientist SQL Interview Question
Question Breakdown
Let's start by understanding the question in detail:
- There is a table of
time,post_id,action, andcontent.- The
actioncolumn can include "reported" if content is flagged as spam.
- The
- Another table of
time,post_id,user— of all posts that were removed manually. - The question: What percent of yesterday's content views were on content that has been reported for spam and removed yesterday?
Understanding the Data Structures
Before diving into the solution, let's model the tables based on the description:
| Table Name | Column Name | Description |
|---|---|---|
| content_actions | time | Timestamp of the action (view, report, etc.) |
| post_id | Unique identifier for the post | |
| action | Type of user action (e.g., 'view', 'report') | |
| content | Text or body of the post | |
| removed_posts | time | Timestamp when the post was removed |
| post_id | Unique identifier for the post | |
| user | User who removed the post |
We assume that content views are tracked in the content_actions table with action = 'view'. Reports are tracked in the same table with action = 'report'. The removed_posts table tracks when a post is removed and by whom.
Clarifying the Requirements
Let's restate the key requirement:
- Of all content views yesterday, what percent were on content that was reported for spam and was also removed yesterday?
In mathematical terms, we are looking for:
Step-by-Step Solution Plan
- Identify all content views that occurred yesterday.
- Filter
content_actionsforaction = 'view'andtimewithin yesterday's date.
- Filter
- Find posts that were reported for spam yesterday.
- Filter
content_actionsforaction = 'report'andtimewithin yesterday's date.
- Filter
- Find posts that were removed yesterday.
- Filter
removed_postsfortimewithin yesterday's date.
- Filter
- Find the intersection: posts that were both reported for spam and removed yesterday.
- Join the set of reported posts and removed posts on
post_id.
- Join the set of reported posts and removed posts on
- Of all content views that occurred yesterday, count how many were on posts in the intersection.
- Join or filter the views based on the intersection set.
- Calculate the percentage.
Step 1: Identify All Content Views from Yesterday
We need to filter the content_actions table for views that happened yesterday. We'll assume the database uses UTC and the time column is a timestamp.
SELECT *
FROM content_actions
WHERE action = 'view'
AND DATE(time) = CURRENT_DATE - INTERVAL '1 day'
This query returns all views that happened yesterday.
Step 2: Find Posts Reported for Spam Yesterday
We filter the content_actions table for spam reports from yesterday. Spam reports are records where action = 'report'.
SELECT DISTINCT post_id
FROM content_actions
WHERE action = 'report'
AND DATE(time) = CURRENT_DATE - INTERVAL '1 day'
We use DISTINCT in case a post was reported multiple times.
Step 3: Find Posts Removed Yesterday
We filter removed_posts for removals that happened yesterday:
SELECT DISTINCT post_id
FROM removed_posts
WHERE DATE(time) = CURRENT_DATE - INTERVAL '1 day'
Again, DISTINCT is used in case of multiple removal actions for the same post.
Step 4: Posts Reported for Spam and Removed Yesterday
We need the intersection of the two sets from Steps 2 and 3. These are the posts that were BOTH reported for spam AND removed yesterday.
SELECT r.post_id
FROM (
SELECT DISTINCT post_id
FROM content_actions
WHERE action = 'report'
AND DATE(time) = CURRENT_DATE - INTERVAL '1 day'
) AS reported
INNER JOIN (
SELECT DISTINCT post_id
FROM removed_posts
WHERE DATE(time) = CURRENT_DATE - INTERVAL '1 day'
) AS removed
ON reported.post_id = removed.post_id
This gives us the set of post_ids that satisfy both conditions.
Step 5: Count Yesterday's Views on These Posts
Now, we will count how many views yesterday were on these intersected posts.
SELECT COUNT(*) AS views_on_reported_and_removed
FROM content_actions
WHERE action = 'view'
AND DATE(time) = CURRENT_DATE - INTERVAL '1 day'
AND post_id IN (
SELECT r.post_id
FROM (
SELECT DISTINCT post_id
FROM content_actions
WHERE action = 'report'
AND DATE(time) = CURRENT_DATE - INTERVAL '1 day'
) AS reported
INNER JOIN (
SELECT DISTINCT post_id
FROM removed_posts
WHERE DATE(time) = CURRENT_DATE - INTERVAL '1 day'
) AS removed
ON reported.post_id = removed.post_id
)
This query counts all yesterday's content views that were on posts both reported for spam and removed yesterday.
Step 6: Count All Content Views from Yesterday
As obtained earlier:
SELECT COUNT(*) AS total_yesterday_views
FROM content_actions
WHERE action = 'view'
AND DATE(time) = CURRENT_DATE - INTERVAL '1 day'
Step 7: Calculate the Percentage
We now combine the two counts to compute the percentage. In SQL, you can do this in a single query using CTEs (Common Table Expressions):
WITH
reported AS (
SELECT DISTINCT post_id
FROM content_actions
WHERE action = 'report'
AND DATE(time) = CURRENT_DATE - INTERVAL '1 day'
),
removed AS (
SELECT DISTINCT post_id
FROM removed_posts
WHERE DATE(time) = CURRENT_DATE - INTERVAL '1 day'
),
reported_and_removed AS (
SELECT reported.post_id
FROM reported
INNER JOIN removed
ON reported.post_id = removed.post_id
),
yesterday_views AS (
SELECT *
FROM content_actions
WHERE action = 'view'
AND DATE(time) = CURRENT_DATE - INTERVAL '1 day'
)
SELECT
100.0 * SUM(
CASE WHEN y.post_id IN (SELECT post_id FROM reported_and_removed)
THEN 1 ELSE 0 END
) / COUNT(*) AS percent_reported_and_removed_viewed
FROM yesterday_views y
This query does the following:
- Identifies posts reported for spam yesterday.
- Identifies posts removed yesterday.
- Finds the intersection (posts both reported and removed).
- Filters all content views from yesterday.
- Calculates the percentage of these views on intersected posts.
Explanation of Concepts Involved
1. Filtering by Date
Using DATE(time) = CURRENT_DATE - INTERVAL '1 day' ensures only records from "yesterday" are included. This is a common SQL pattern for date-based filtering.
2. DISTINCT Keyword
Ensures that each post_id is counted only once in the sets of reported and removed posts, avoiding duplicates due to multiple actions.
3. INNER JOIN for Set Intersection
The INNER JOIN operation finds the intersection of reported and removed posts, ensuring only posts that appear in both sets are included.
4. Subqueries and CTEs
Common Table Expressions (CTEs) make complex queries more readable and maintainable. In our solution, we use CTEs to modularize each logical step.
5. Aggregation and CASE Statement
The SUM(CASE...END) pattern allows us to count only those views that meet the specific criteria, while COUNT(*) gives the total number of views.
6. Division and Multiplication for Percentage
The final percentage is calculated as:
Complete SQL Solution
WITH
reported AS (
SELECT DISTINCT post_id
FROM content_actions
WHERE action = 'report'
AND DATE(time) = CURRENT_DATE - INTERVAL '1 day'
),
removed AS (
SELECT DISTINCT post_id
FROM removed_posts
WHERE DATE(time) = CURRENT_DATE - INTERVAL '1 day'
),
reported_and_removed AS (
SELECT reported.post_id
FROM reported
INNER JOIN removed
ON reported.post_id = removed.post_id
),
yesterday_views AS (
SELECT *
FROM content_actions
WHERE action = 'view'
AND DATE(time) = CURRENT_DATE - INTERVAL '1 day'
)
SELECT
ROUND(
100.0 * SUM(
CASE WHEN y.post_id IN (SELECT post_id FROM reported_and_removed)
THEN 1 ELSE 0 END
) / NULLIF(COUNT(*),0),
2
) AS percent_reported_and_removed_viewed
FROM yesterday_views y
The NULLIF function ensures that division by zero is avoided if there are no views yesterday. The ROUND(..., 2) function rounds the percentage to two decimal places for presentation.
Alternative Approaches & Performance Considerations
Using JOIN Instead of IN
Using IN with a subquery can be less performant on large datasets compared to a JOIN. Here's how you might rewrite the main calculation with a JOIN:
WITH
reported AS (
SELECT DISTINCT post_id
FROM content_actions
WHERE action = 'report'
AND DATE(time) = CURRENT_DATE - INTERVAL '1 day'
),
removed AS (
SELECT DISTINCT post_id
FROM removed_posts
WHERE DATE(time) = CURRENT_DATE - INTERVAL '1 day'
),
reported_and_removed AS (
SELECT reported.post_id
FROM reported
INNER JOIN removed
ON reported.post_id = removed.post_id
),
yesterday_views AS (
SELECT *
FROM content_actions
WHERE action = 'view'
AND DATE(time) = CURRENT_DATE - INTERVAL '1 day'
)
SELECT
ROUND(
100.0 * COUNT(rar.post_id) / NULLIF(COUNT(y.post_id),0),
2
) AS percent_reported_and_removed_viewed
FROM yesterday_views y
LEFT JOIN reported_and_removed rar
ON y.post_id = rar.post_id
This approach counts the number of joined reported_and_removed post_ids (i.e., views on posts meeting both conditions), divided by total views.
Potential Edge Cases and Interview Discussion Points
- Timezone Handling: Always clarify with your interviewer if
timeis stored in UTC or a local timezone. Adjust yourDATEfiltering logic accordingly. - Multiple Reports or Removals: The use of
DISTINCTensures posts are not double-counted. Discuss with your interviewer if further deduplication is needed based on businesslogic or if posts can be reported and removed multiple times within the day. It's important to confirm that each post is counted only once per condition (reported and removed yesterday), regardless of the number of actions. - Definition of "Spam": In the absence of a specific "spam" flag, we assumed
action = 'report'corresponds to a spam report. In a real interview, clarify if there are other report types or a specific field indicating spam. - Empty Result Sets: Always account for the scenario where there are no views, no reports, or no removals yesterday. Using
NULLIFprevents division by zero errors in your calculation. - Performance on Large Datasets: For very large tables, ensure that the relevant columns (
time,action,post_id) are indexed to optimize query execution. CTEs and subqueries are easier to read but can be less efficient than temporary tables or materialized views in production environments. - Data Consistency: Confirm that the
post_idis the unique key across both tables and that there are no type mismatches or casing issues that could affect joins.
Extending the Problem: What If Questions
What if the Report and Removal Dates Differ?
If you were asked to consider posts reported for spam at any time in the past but removed yesterday, you would adjust the "reported" CTE to include all previous dates:
reported AS (
SELECT DISTINCT post_id
FROM content_actions
WHERE action = 'report'
-- No date constraint, includes all past reports
)
What if You Need the Count, Not the Percentage?
You can modify the final SELECT to return the numerator and denominator for transparency:
SELECT
COUNT(rar.post_id) AS views_on_reported_and_removed,
COUNT(y.post_id) AS total_views,
ROUND(
100.0 * COUNT(rar.post_id) / NULLIF(COUNT(y.post_id),0),
2
) AS percent_reported_and_removed_viewed
FROM yesterday_views y
LEFT JOIN reported_and_removed rar
ON y.post_id = rar.post_id
What if You Need the Breakdown by Hour?
Add a GROUP BY clause:
SELECT
EXTRACT(HOUR FROM y.time) AS hour,
COUNT(rar.post_id) AS views_on_reported_and_removed,
COUNT(y.post_id) AS total_views,
ROUND(
100.0 * COUNT(rar.post_id) / NULLIF(COUNT(y.post_id),0),
2
) AS percent_reported_and_removed_viewed
FROM yesterday_views y
LEFT JOIN reported_and_removed rar
ON y.post_id = rar.post_id
GROUP BY hour
ORDER BY hour
This provides hourly percentages, which can offer deeper insights into user behavior and moderation efficiency.
Common Mistakes to Avoid
- Not Filtering Both Actions by Date: Ensure that both the report and removal happened yesterday, as per the question's requirements.
- Double Counting Views: Only count each view once, even if the post was reported or removed multiple times.
- Ignoring Time Zones: Date filtering can be incorrect if time zones are not aligned between application logic and the database.
- Using
INwith Large Datasets: For large intersections, preferJOINoverINto improve performance. - Not Handling Division by Zero: Always use
NULLIFor equivalent to avoid errors if there are no views.
How to Explain This Solution in an Interview
When presenting your approach to the interviewer, consider the following structure:
- Clarify your understanding: Restate the question to confirm the requirements.
- Break the problem down: Outline your step-by-step plan, as we did above.
- Discuss edge cases: Mention potential pitfalls and how you address them (e.g., empty datasets, time zones).
- Write clean, modular SQL: Use CTEs for readability and maintainability.
- Optimize for performance: Point out any indexes or alternative approaches for large datasets.
- Test your logic: Suggest running the query with test data or explain how you would validate correctness.
For example, you might say:
"I'll filter for all views that occurred yesterday, then identify posts that were both reported for spam and removed yesterday. By joining these sets, I can count the number of relevant views and divide by the total number of yesterday's views to get the percentage. I'll make sure to handle cases where there are no views and will use CTEs for clarity."
Sample Data and Worked Example
Let's demonstrate the approach with a simple example.
| content_actions | time | post_id | action | content |
|---|---|---|---|---|
| 2024-06-10 10:00:00 | 101 | view | Post A | |
| 2024-06-10 10:05:00 | 102 | view | Post B | |
| 2024-06-10 11:00:00 | 101 | report | Post A | |
| 2024-06-10 12:00:00 | 103 | view | Post C | |
| 2024-06-10 13:00:00 | 104 | view | Post D | |
| 2024-06-10 14:00:00 | 101 | view | Post A |
| removed_posts | time | post_id | user |
|---|---|---|---|
| 2024-06-10 15:00:00 | 101 | mod1 | |
| 2024-06-10 16:00:00 | 105 | mod2 |
Assume today's date is 2024-06-11. "Yesterday" is 2024-06-10.
- Step 1: All views on 2024-06-10: post_ids 101 (twice), 102, 103, 104 — 5 views.
- Step 2: Posts reported for spam on 2024-06-10: post_id 101.
- Step 3: Posts removed on 2024-06-10: post_ids 101, 105.
- Step 4: Intersection: post_id 101 (both reported and removed).
- Step 5: Views on post_id 101: 2 views.
- Step 6: Percentage = (2 / 5) * 100 = 40%
This matches our expected output. The query, when run on this data, would return 40.
Best Practices for SQL Interview Questions
- Clarify ambiguous requirements with your interviewer before proceeding.
- Break the problem into logical, testable steps—use CTEs for clarity.
- Explain your reasoning out loud, including how you handle edge cases.
- Consider performance implications and discuss potential indexing or query optimization strategies.
- Validate your solution with sample data if possible.
Conclusion
SQL questions like this one are designed to test your ability to translate business requirements into precise, efficient queries. By systematically breaking down the problem, understanding the data model, and building your solution step by step, you can confidently answer even the most challenging interview scenarios.
Remember to always clarify requirements, consider edge cases, and explain your thought process. Mastering these skills will help you excel in data scientist interviews at Meta and other top tech companies.
Use the approaches and best practices detailed in this article as a template for tackling real-world data analysis questions, both in interviews and on the job. Good luck!