blog-cover-image

Meta Data Scientist SQL Interview: Sample Question and Solution

In this article, we will tackle a real-world Meta Data Scientist SQL interview question. We will break down the requirements, explain all relevant concepts, and craft an optimal SQL solution. By the end, you'll thoroughly understand how to approach similar SQL interview scenarios and demonstrate your analytical prowess.

Meta Data Scientist SQL Interview Question

Question Breakdown

Let's start by understanding the question in detail:

  • There is a table of time, post_id, action, and content.
    • The action column can include "reported" if content is flagged as spam.
  • Another table of time, post_id, user — of all posts that were removed manually.
  • The question: What percent of yesterday's content views were on content that has been reported for spam and removed yesterday?

Understanding the Data Structures

Before diving into the solution, let's model the tables based on the description:

Table Name Column Name Description
content_actions time Timestamp of the action (view, report, etc.)
post_id Unique identifier for the post
action Type of user action (e.g., 'view', 'report')
content Text or body of the post
removed_posts time Timestamp when the post was removed
post_id Unique identifier for the post
user User who removed the post

We assume that content views are tracked in the content_actions table with action = 'view'. Reports are tracked in the same table with action = 'report'. The removed_posts table tracks when a post is removed and by whom.


Clarifying the Requirements

Let's restate the key requirement:

  • Of all content views yesterday, what percent were on content that was reported for spam and was also removed yesterday?

In mathematical terms, we are looking for:

$$ \text{Percentage} = \frac{\text{Number of 'yesterday' content views on posts reported for spam and removed yesterday}}{\text{Total number of 'yesterday' content views}} \times 100 $$

Step-by-Step Solution Plan

  1. Identify all content views that occurred yesterday.
    • Filter content_actions for action = 'view' and time within yesterday's date.
  2. Find posts that were reported for spam yesterday.
    • Filter content_actions for action = 'report' and time within yesterday's date.
  3. Find posts that were removed yesterday.
    • Filter removed_posts for time within yesterday's date.
  4. Find the intersection: posts that were both reported for spam and removed yesterday.
    • Join the set of reported posts and removed posts on post_id.
  5. Of all content views that occurred yesterday, count how many were on posts in the intersection.
    • Join or filter the views based on the intersection set.
  6. Calculate the percentage.

Step 1: Identify All Content Views from Yesterday

We need to filter the content_actions table for views that happened yesterday. We'll assume the database uses UTC and the time column is a timestamp.


SELECT * 
FROM content_actions
WHERE action = 'view'
  AND DATE(time) = CURRENT_DATE - INTERVAL '1 day'

This query returns all views that happened yesterday.


Step 2: Find Posts Reported for Spam Yesterday

We filter the content_actions table for spam reports from yesterday. Spam reports are records where action = 'report'.


SELECT DISTINCT post_id
FROM content_actions
WHERE action = 'report'
  AND DATE(time) = CURRENT_DATE - INTERVAL '1 day'

We use DISTINCT in case a post was reported multiple times.


Step 3: Find Posts Removed Yesterday

We filter removed_posts for removals that happened yesterday:


SELECT DISTINCT post_id
FROM removed_posts
WHERE DATE(time) = CURRENT_DATE - INTERVAL '1 day'

Again, DISTINCT is used in case of multiple removal actions for the same post.


Step 4: Posts Reported for Spam and Removed Yesterday

We need the intersection of the two sets from Steps 2 and 3. These are the posts that were BOTH reported for spam AND removed yesterday.


SELECT r.post_id
FROM (
  SELECT DISTINCT post_id
  FROM content_actions
  WHERE action = 'report'
    AND DATE(time) = CURRENT_DATE - INTERVAL '1 day'
) AS reported
INNER JOIN (
  SELECT DISTINCT post_id
  FROM removed_posts
  WHERE DATE(time) = CURRENT_DATE - INTERVAL '1 day'
) AS removed
ON reported.post_id = removed.post_id

This gives us the set of post_ids that satisfy both conditions.


Step 5: Count Yesterday's Views on These Posts

Now, we will count how many views yesterday were on these intersected posts.


SELECT COUNT(*) AS views_on_reported_and_removed
FROM content_actions
WHERE action = 'view'
  AND DATE(time) = CURRENT_DATE - INTERVAL '1 day'
  AND post_id IN (
    SELECT r.post_id
    FROM (
      SELECT DISTINCT post_id
      FROM content_actions
      WHERE action = 'report'
        AND DATE(time) = CURRENT_DATE - INTERVAL '1 day'
    ) AS reported
    INNER JOIN (
      SELECT DISTINCT post_id
      FROM removed_posts
      WHERE DATE(time) = CURRENT_DATE - INTERVAL '1 day'
    ) AS removed
    ON reported.post_id = removed.post_id
  )

This query counts all yesterday's content views that were on posts both reported for spam and removed yesterday.


Step 6: Count All Content Views from Yesterday

As obtained earlier:


SELECT COUNT(*) AS total_yesterday_views
FROM content_actions
WHERE action = 'view'
  AND DATE(time) = CURRENT_DATE - INTERVAL '1 day'

Step 7: Calculate the Percentage

We now combine the two counts to compute the percentage. In SQL, you can do this in a single query using CTEs (Common Table Expressions):


WITH
reported AS (
  SELECT DISTINCT post_id
  FROM content_actions
  WHERE action = 'report'
    AND DATE(time) = CURRENT_DATE - INTERVAL '1 day'
),
removed AS (
  SELECT DISTINCT post_id
  FROM removed_posts
  WHERE DATE(time) = CURRENT_DATE - INTERVAL '1 day'
),
reported_and_removed AS (
  SELECT reported.post_id
  FROM reported
  INNER JOIN removed
    ON reported.post_id = removed.post_id
),
yesterday_views AS (
  SELECT *
  FROM content_actions
  WHERE action = 'view'
    AND DATE(time) = CURRENT_DATE - INTERVAL '1 day'
)
SELECT
  100.0 * SUM(
    CASE WHEN y.post_id IN (SELECT post_id FROM reported_and_removed)
         THEN 1 ELSE 0 END
  ) / COUNT(*) AS percent_reported_and_removed_viewed
FROM yesterday_views y

This query does the following:

  • Identifies posts reported for spam yesterday.
  • Identifies posts removed yesterday.
  • Finds the intersection (posts both reported and removed).
  • Filters all content views from yesterday.
  • Calculates the percentage of these views on intersected posts.

Explanation of Concepts Involved

1. Filtering by Date

Using DATE(time) = CURRENT_DATE - INTERVAL '1 day' ensures only records from "yesterday" are included. This is a common SQL pattern for date-based filtering.

2. DISTINCT Keyword

Ensures that each post_id is counted only once in the sets of reported and removed posts, avoiding duplicates due to multiple actions.

3. INNER JOIN for Set Intersection

The INNER JOIN operation finds the intersection of reported and removed posts, ensuring only posts that appear in both sets are included.

4. Subqueries and CTEs

Common Table Expressions (CTEs) make complex queries more readable and maintainable. In our solution, we use CTEs to modularize each logical step.

5. Aggregation and CASE Statement

The SUM(CASE...END) pattern allows us to count only those views that meet the specific criteria, while COUNT(*) gives the total number of views.

6. Division and Multiplication for Percentage

The final percentage is calculated as:

$$ \text{Percent} = \frac{\text{Views on reported & removed posts}}{\text{Total views}} \times 100 $$

 


Complete SQL Solution


WITH
reported AS (
  SELECT DISTINCT post_id
  FROM content_actions
  WHERE action = 'report'
    AND DATE(time) = CURRENT_DATE - INTERVAL '1 day'
),
removed AS (
  SELECT DISTINCT post_id
  FROM removed_posts
  WHERE DATE(time) = CURRENT_DATE - INTERVAL '1 day'
),
reported_and_removed AS (
  SELECT reported.post_id
  FROM reported
  INNER JOIN removed
    ON reported.post_id = removed.post_id
),
yesterday_views AS (
  SELECT *
  FROM content_actions
  WHERE action = 'view'
    AND DATE(time) = CURRENT_DATE - INTERVAL '1 day'
)
SELECT
  ROUND(
    100.0 * SUM(
      CASE WHEN y.post_id IN (SELECT post_id FROM reported_and_removed)
           THEN 1 ELSE 0 END
    ) / NULLIF(COUNT(*),0),
    2
  ) AS percent_reported_and_removed_viewed
FROM yesterday_views y

The NULLIF function ensures that division by zero is avoided if there are no views yesterday. The ROUND(..., 2) function rounds the percentage to two decimal places for presentation.


Alternative Approaches & Performance Considerations

Using JOIN Instead of IN

Using IN with a subquery can be less performant on large datasets compared to a JOIN. Here's how you might rewrite the main calculation with a JOIN:


WITH
reported AS (
  SELECT DISTINCT post_id
  FROM content_actions
  WHERE action = 'report'
    AND DATE(time) = CURRENT_DATE - INTERVAL '1 day'
),
removed AS (
  SELECT DISTINCT post_id
  FROM removed_posts
  WHERE DATE(time) = CURRENT_DATE - INTERVAL '1 day'
),
reported_and_removed AS (
  SELECT reported.post_id
  FROM reported
  INNER JOIN removed
    ON reported.post_id = removed.post_id
),
yesterday_views AS (
  SELECT *
  FROM content_actions
  WHERE action = 'view'
    AND DATE(time) = CURRENT_DATE - INTERVAL '1 day'
)
SELECT
  ROUND(
    100.0 * COUNT(rar.post_id) / NULLIF(COUNT(y.post_id),0),
    2
  ) AS percent_reported_and_removed_viewed
FROM yesterday_views y
LEFT JOIN reported_and_removed rar
  ON y.post_id = rar.post_id

This approach counts the number of joined reported_and_removed post_ids (i.e., views on posts meeting both conditions), divided by total views.


Potential Edge Cases and Interview Discussion Points

  • Timezone Handling: Always clarify with your interviewer if time is stored in UTC or a local timezone. Adjust your DATE filtering logic accordingly.
  • Multiple Reports or Removals: The use of DISTINCT ensures posts are not double-counted. Discuss with your interviewer if further deduplication is needed based on businesslogic or if posts can be reported and removed multiple times within the day. It's important to confirm that each post is counted only once per condition (reported and removed yesterday), regardless of the number of actions.
  • Definition of "Spam": In the absence of a specific "spam" flag, we assumed action = 'report' corresponds to a spam report. In a real interview, clarify if there are other report types or a specific field indicating spam.
  • Empty Result Sets: Always account for the scenario where there are no views, no reports, or no removals yesterday. Using NULLIF prevents division by zero errors in your calculation.
  • Performance on Large Datasets: For very large tables, ensure that the relevant columns (time, action, post_id) are indexed to optimize query execution. CTEs and subqueries are easier to read but can be less efficient than temporary tables or materialized views in production environments.
  • Data Consistency: Confirm that the post_id is the unique key across both tables and that there are no type mismatches or casing issues that could affect joins.

Extending the Problem: What If Questions

What if the Report and Removal Dates Differ?

If you were asked to consider posts reported for spam at any time in the past but removed yesterday, you would adjust the "reported" CTE to include all previous dates:


reported AS (
  SELECT DISTINCT post_id
  FROM content_actions
  WHERE action = 'report'
    -- No date constraint, includes all past reports
)

What if You Need the Count, Not the Percentage?

You can modify the final SELECT to return the numerator and denominator for transparency:


SELECT
  COUNT(rar.post_id) AS views_on_reported_and_removed,
  COUNT(y.post_id) AS total_views,
  ROUND(
    100.0 * COUNT(rar.post_id) / NULLIF(COUNT(y.post_id),0),
    2
  ) AS percent_reported_and_removed_viewed
FROM yesterday_views y
LEFT JOIN reported_and_removed rar
  ON y.post_id = rar.post_id

What if You Need the Breakdown by Hour?

Add a GROUP BY clause:


SELECT
  EXTRACT(HOUR FROM y.time) AS hour,
  COUNT(rar.post_id) AS views_on_reported_and_removed,
  COUNT(y.post_id) AS total_views,
  ROUND(
    100.0 * COUNT(rar.post_id) / NULLIF(COUNT(y.post_id),0),
    2
  ) AS percent_reported_and_removed_viewed
FROM yesterday_views y
LEFT JOIN reported_and_removed rar
  ON y.post_id = rar.post_id
GROUP BY hour
ORDER BY hour

This provides hourly percentages, which can offer deeper insights into user behavior and moderation efficiency.


Common Mistakes to Avoid

  • Not Filtering Both Actions by Date: Ensure that both the report and removal happened yesterday, as per the question's requirements.
  • Double Counting Views: Only count each view once, even if the post was reported or removed multiple times.
  • Ignoring Time Zones: Date filtering can be incorrect if time zones are not aligned between application logic and the database.
  • Using IN with Large Datasets: For large intersections, prefer JOIN over IN to improve performance.
  • Not Handling Division by Zero: Always use NULLIF or equivalent to avoid errors if there are no views.

How to Explain This Solution in an Interview

When presenting your approach to the interviewer, consider the following structure:

  • Clarify your understanding: Restate the question to confirm the requirements.
  • Break the problem down: Outline your step-by-step plan, as we did above.
  • Discuss edge cases: Mention potential pitfalls and how you address them (e.g., empty datasets, time zones).
  • Write clean, modular SQL: Use CTEs for readability and maintainability.
  • Optimize for performance: Point out any indexes or alternative approaches for large datasets.
  • Test your logic: Suggest running the query with test data or explain how you would validate correctness.

For example, you might say:

"I'll filter for all views that occurred yesterday, then identify posts that were both reported for spam and removed yesterday. By joining these sets, I can count the number of relevant views and divide by the total number of yesterday's views to get the percentage. I'll make sure to handle cases where there are no views and will use CTEs for clarity."

Sample Data and Worked Example

Let's demonstrate the approach with a simple example.

content_actions time post_id action content
  2024-06-10 10:00:00 101 view Post A
2024-06-10 10:05:00 102 view Post B
2024-06-10 11:00:00 101 report Post A
2024-06-10 12:00:00 103 view Post C
2024-06-10 13:00:00 104 view Post D
2024-06-10 14:00:00 101 view Post A
removed_posts time post_id user
  2024-06-10 15:00:00 101 mod1
  2024-06-10 16:00:00 105 mod2

Assume today's date is 2024-06-11. "Yesterday" is 2024-06-10.

  • Step 1: All views on 2024-06-10: post_ids 101 (twice), 102, 103, 104 — 5 views.
  • Step 2: Posts reported for spam on 2024-06-10: post_id 101.
  • Step 3: Posts removed on 2024-06-10: post_ids 101, 105.
  • Step 4: Intersection: post_id 101 (both reported and removed).
  • Step 5: Views on post_id 101: 2 views.
  • Step 6: Percentage = (2 / 5) * 100 = 40%

This matches our expected output. The query, when run on this data, would return 40.


Best Practices for SQL Interview Questions

  • Clarify ambiguous requirements with your interviewer before proceeding.
  • Break the problem into logical, testable steps—use CTEs for clarity.
  • Explain your reasoning out loud, including how you handle edge cases.
  • Consider performance implications and discuss potential indexing or query optimization strategies.
  • Validate your solution with sample data if possible.

Conclusion

SQL questions like this one are designed to test your ability to translate business requirements into precise, efficient queries. By systematically breaking down the problem, understanding the data model, and building your solution step by step, you can confidently answer even the most challenging interview scenarios.

Remember to always clarify requirements, consider edge cases, and explain your thought process. Mastering these skills will help you excel in data scientist interviews at Meta and other top tech companies.

Use the approaches and best practices detailed in this article as a template for tackling real-world data analysis questions, both in interviews and on the job. Good luck!

Related Articles