blog-cover-image

LinkedIn Data Scientist Interview: Real Streaming Data Question

In this article, we tackle two actual interview problems: one from LinkedIn, which focuses on streaming data and frequency analysis, and another from Meta (Facebook), which involves SQL logic for deducing social network relationships from multiple tables. We provide detailed explanations, code implementations, and relevant mathematical concepts to help you understand and solve these questions effectively.

LinkedIn Data Scientist Interview Question: K Most Frequent Numbers from a Stream

Problem Statement

Given an incoming stream of numbers, design an algorithm to find the k most frequent numbers at any point in the stream. This must be done "on the fly", meaning that the solution should be able to process numbers as they come in, preferably in sub-linear time and space relative to the stream size.


Understanding the Problem

Let's break down the requirements:

  • Stream: Numbers arrive one by one and are potentially unbounded in length.
  • On the fly: We must be able to provide the top-k frequent numbers at any time, not just after the stream ends.
  • Efficiency: We must minimize both the time per incoming number and the total memory usage. Storing the entire stream is not scalable.

We will discuss both the naive and optimal approaches, with Python code examples.

 

Naive Approach: Full Counting

A straightforward solution is to maintain a hash map (dictionary) that counts occurrences of each number. To get the top-k frequent numbers, we sort the hash map by value and pick the top k.


from collections import defaultdict
import heapq

class NaiveTopKStream:
    def __init__(self, k):
        self.k = k
        self.counts = defaultdict(int)

    def add(self, num):
        self.counts[num] += 1

    def get_top_k(self):
        # Sort all items by frequency, descending
        return heapq.nlargest(self.k, self.counts.items(), key=lambda x: x[1])

Drawbacks:

  • Space: O(n) where n is the number of unique elements.
  • Time: O(n log k) to get the top-k elements (using a heap), which may be slow for large n.

 

Optimal Approach: Min-Heap of Size k

To improve efficiency, we can use a min-heap of size k to track the top-k frequent elements. The heap always contains the current k most frequent elements.

The steps are:

  • Maintain a hash map of number frequencies.
  • Maintain a min-heap of size k containing tuples (frequency, number).
  • When a new number arrives, increment its count in the hash map. If it's in the heap, update its frequency.
  • If it's not in the heap and the heap has less than k elements, add it.
  • If it's not in the heap and the heap is full, compare its frequency to the root of the min-heap. If its frequency is larger, pop the root and insert the new number.

 


import heapq

class TopKFrequentStream:
    def __init__(self, k):
        self.k = k
        self.counts = {}
        self.heap = []

    def add(self, num):
        self.counts[num] = self.counts.get(num, 0) + 1
        # Check if num is already in heap
        for i, (freq, val) in enumerate(self.heap):
            if val == num:
                self.heap[i] = (self.counts[num], num)
                heapq.heapify(self.heap)
                return
        if len(self.heap) < self.k:
            heapq.heappush(self.heap, (self.counts[num], num))
        else:
            if self.counts[num] > self.heap[0][0]:
                heapq.heappop(self.heap)
                heapq.heappush(self.heap, (self.counts[num], num))

    def get_top_k(self):
        # Return sorted list
        return sorted(self.heap, key=lambda x: -x[0])

Advantages:

  • Space: O(k) for the heap, O(n) for the hash map.
  • Time: O(log k) per insertion into the heap.

 

What if the Stream is Very Large? (Count-Min Sketch)

For extremely large streams, even storing the count for every unique number is infeasible. In such cases, algorithms like the Count-Min Sketch are used, which approximate frequencies using hash functions and fixed-size arrays.

Count-Min Sketch: Mathematical Explanation

The Count-Min Sketch is a probabilistic data structure that uses multiple hash functions to map each element to several "bins". The structure is a matrix of size d \times w , where:

  • d is the number of hash functions (rows)
  • w is the width of each row (number of bins per hash function)

For each incoming number x , we increment count[i][hash_i(x)] for each row i .

 

To estimate the frequency of x , we compute:

f_x = \min_{i=1}^{d} count[i][hash_i(x)]

Count-Min Sketch allows fast approximate frequency queries with controlled error bounds.

Time and Space Complexity Analysis

Approach Space Complexity Time per Insert Top-K Query Time Accuracy
Naive Counting O(n) O(1) O(n log k) Exact
Min-Heap + Hash Map O(n + k) O(log k) O(k) Exact
Count-Min Sketch O(d w) O(d) O(k) Approximate

Interview Tips

  • Clarify if approximate answers are acceptable (for large-scale systems, they often are).
  • Discuss trade-offs between space, time, and accuracy.
  • Mention real-world applications: trending hashtags, popular posts, fraud detection.

Meta Data Scientist Interview Question: Friendship Status from Multiple Tables

Problem Statement

You are given four tables that describe friendship-related actions in a social network:

  • Requests: User A sends a friend request to User B.
  • Accepts: User B accepts a friend request from User A.
  • Rejects: User B rejects a friend request from User A.
  • Removes: User A removes User B from their friend list.

Each table has columns user_id_1 and user_id_2. You are asked: How would you generate all possible current friendships?

 

Understanding Friendship Logic

The goal is to reconstruct the current state of friendships given the history of actions.

  • Request: Not a friendship by itself; only an invitation.
  • Accept: Establishes friendship between two users.
  • Reject: Request is declined; no friendship is formed.
  • Remove: Friendship is broken between two users.

We must account for:

  • Friendship is bidirectional: if A and B are friends, both are friends with each other.
  • Actions can happen multiple times; only the latest action matters.
  • After a friendship is removed, a new request and accept could reestablish it.

 

Step-by-Step SQL Solution

We will assume the following schemas:

  • requests(user_id_1, user_id_2, timestamp)
  • accepts(user_id_1, user_id_2, timestamp)
  • rejects(user_id_1, user_id_2, timestamp)
  • removes(user_id_1, user_id_2, timestamp)

All timestamps are in UTC and indicate when the event occurred.

 

Step 1: Identify All Accepts

Each accept event establishes a friendship between user_id_1 and user_id_2 at a given time.


SELECT user_id_1, user_id_2, timestamp AS friendship_started
FROM accepts

Step 2: Remove Friendships That Were Removed

A remove event breaks a friendship. We must ensure that only accepts after the last removal are counted.

For each pair, if a friendship was accepted, we must check if it was later removed. Only keep the friendship if the latest "accept" is after the latest "remove".


WITH last_accept AS (
    SELECT user_id_1, user_id_2, MAX(timestamp) AS last_accept_time
    FROM accepts
    GROUP BY user_id_1, user_id_2
),
last_remove AS (
    SELECT user_id_1, user_id_2, MAX(timestamp) AS last_remove_time
    FROM removes
    GROUP BY user_id_1, user_id_2
)
SELECT
    la.user_id_1, la.user_id_2
FROM last_accept la
LEFT JOIN last_remove lr
    ON la.user_id_1 = lr.user_id_1
    AND la.user_id_2 = lr.user_id_2
WHERE lr.last_remove_time IS NULL OR la.last_accept_time > lr.last_remove_time

Step 3: Account for Bidirectional Friendships

Since friendship is bidirectional, we must combine (A, B) and (B, A) pairs. We can use LEAST and GREATEST functions (supported in MySQL, PostgreSQL).


WITH last_accept AS (
    SELECT LEAST(user_id_1, user_id_2) AS user_a,
           GREATEST(user_id_1, user_id_2) AS user_b,
           MAX(timestamp) AS last_accept_time
    FROM accepts
    GROUP BY LEAST(user_id_1, user_id_2), GREATEST(user_id_1, user_id_2)
),
last_remove AS (
    SELECT LEAST(user_id_1, user_id_2) AS user_a,
           GREATEST(user_id_1, user_id_2) AS user_b,
           MAX(timestamp) AS last_remove_time
    FROM removes
    GROUP BY LEAST(user_id_1, user_id_2), GREATEST(user_id_1, user_id_2)
)
SELECT
    la.user_a, la.user_b
FROM last_accept la
LEFT JOIN last_remove lr
    ON la.user_a = lr.user_a
    AND la.user_b = lr.user_b
WHERE lr.last_remove_time IS NULL OR la.last_accept_time > lr.last_remove_time

Step 4: Exclude Rejected Requests

If a request was rejected, and not subsequently accepted, that pair should not be considered friends, but since we're basing our logic on accepts and removes, this is already handled as only an accept creates a friendship.

Step 5: Final Query

The final query returns all current friendships as unique pairs.


WITH
last_accept AS (
    SELECT LEAST(user_id_1, user_id_2) AS user_a,
           GREATEST(user_id_1, user_id_2) AS user_b,
           MAX(timestamp) AS last_accept_time
    FROM accepts
    GROUP BY LEAST(user_id_1, user_id_2), GREATEST(user_id_1, user_id_2)
),
last_remove AS (
    SELECT LEAST(user_id_1, user_id_2) AS user_a,
           GREATEST(user_id_1, user_id_2) AS user_b,
           MAX(timestamp) AS last_remove_time
    FROM removes
    GROUP BY LEAST(user_id_1, user_id_2), GREATEST(user_id_1, user_id_2)
)
SELECT
    la.user_a AS user_id_1,
    la.user_b AS user_id_2
FROM last_accept la
LEFT JOIN last_remove lr
    ON la.user_a = lr.user_a
    AND la.user_b = lr.user_b
WHERE lr.last_remove_time IS NULL OR la.last_accept_time > lr.last_remove_time

Visualization Example

Action User 1 User 2 Timestamp
Request A B 2021-01-01
Accept A B 2021-01-02
Remove B A 2021-01-05
Request A B 2021-02-01
Accept A B 2021-02-02

In this example:

  • A and B first become friends on 2021-01-02.
  • B removes A on 2021-01-05, breaking the friendship.
  • A sends another request, and B accepts on 2021-02-02; they become friends again.
  • Since there is no subsequent removal, the final friendship state is that A and B are friends.

 

Extending the Solution: Handling Multiple Requests, Accepts, and Removals

The above SQL logic generalizes well because it always takes the latest accept and remove timestamps per undirected pair. To further ensure correctness, especially with complex histories, remember:

  • Friendship exists if and only if the latest accept (regardless of who initiated) is after the latest remove (if any) for that user pair.
  • Multiple requests and rejects before an accept or after a remove do not affect the friendship status unless followed by an accept.

Alternative: Using Window Functions (For Advanced SQL Engines)

If your SQL engine supports window functions, you can construct the friendship state timeline for each pair:


WITH events AS (
    SELECT LEAST(user_id_1, user_id_2) AS user_a,
           GREATEST(user_id_1, user_id_2) AS user_b,
           timestamp,
           'accept' AS event
    FROM accepts
    UNION ALL
    SELECT LEAST(user_id_1, user_id_2),
           GREATEST(user_id_1, user_id_2),
           timestamp,
           'remove'
    FROM removes
),
ranked_events AS (
    SELECT user_a, user_b, timestamp, event,
           ROW_NUMBER() OVER (PARTITION BY user_a, user_b ORDER BY timestamp DESC) AS rn
    FROM events
)
SELECT user_a, user_b
FROM ranked_events
WHERE rn = 1 AND event = 'accept'

This method considers only the most recent event per user pair and selects pairs where the last event was an accept.

Edge Cases and Interview Discussion Points

  • What if there are two removes after an accept?
    The logic still holds: only the latest accept vs the latest remove matters.
  • Multiple parallel requests and accepts?
    Friendship is established by the latest accept, regardless of multiple requests.
  • Rejected requests?
    Ignored unless followed by an accept.
  • Friendship expiry?
    If the business logic includes expiry (e.g., auto-remove after inactivity), the query can be extended to filter on timestamp.

Full Python Example: Simulating Friendship Status

Here’s how you could simulate the friendship logic in Python:


from collections import defaultdict

# Each event is (action, user1, user2, timestamp)
events = [
    ('request', 'A', 'B', '2021-01-01'),
    ('accept', 'A', 'B', '2021-01-02'),
    ('remove', 'B', 'A', '2021-01-05'),
    ('request', 'A', 'B', '2021-02-01'),
    ('accept', 'A', 'B', '2021-02-02'),
]

# Track for each pair: last accept time, last remove time
friendship = defaultdict(lambda: {'accept': None, 'remove': None})

for action, u1, u2, ts in events:
    key = tuple(sorted([u1, u2]))
    if action == 'accept':
        friendship[key]['accept'] = ts
    elif action == 'remove':
        friendship[key]['remove'] = ts

# Now, who are friends?
for pair, times in friendship.items():
    acc = times['accept']
    rem = times['remove']
    if acc and (not rem or acc > rem):
        print(f"{pair[0]} and {pair[1]} are friends (since {acc})")

Optimization and Real-World Considerations

  • Indexing: Ensure tables are indexed on user_id_1, user_id_2, and timestamp for efficient query execution.
  • Data Volume: For very large social graphs, consider denormalizing current friendship status into a separate table to speed up queries.
  • Privacy: Ensure that friendship status queries respect privacy settings and user blocks.

Summary and Key Takeaways

LinkedIn Data Scientist interview questions often probe your ability to handle large data streams and reason about real-world data relationships. Here are the essential points:

  • For k most frequent numbers from a stream, understand hash maps, heaps, and approximate counting (Count-Min Sketch). Trade-offs between accuracy, speed, and memory are critical.
  • For friendship status from multiple tables (Meta question), master event-based SQL, window functions, and the logic of reconstructing current state from event histories.
  • Always clarify requirements, edge cases, and discuss scalability and optimization in your interview answers.

By practicing these types of problems, you’ll be well-equipped to tackle technical rounds at LinkedIn, Meta, and other top tech companies.

Further Reading

Prepare diligently, understand the underlying concepts, and you’ll excel in your next LinkedIn or Meta data science interview!

Related Articles