
Advanced SQL for Data Analysis: Mastering CTEs, Window Functions, and Performance Optimization

In the age of data-driven decision-making, SQL remains a cornerstone skill for data scientists, analysts, and engineers. While basic querying is essential, mastering advanced SQL concepts—like Common Table Expressions (CTEs) and window functions—unlocks new levels of analytical power and performance. This comprehensive guide explores these techniques, demonstrates real-world applications, and reveals best practices for optimizing SQL in modern data science workflows.


1. Introduction

Complex data analysis often demands queries that are not only accurate but also maintainable, efficient, and expressive. As datasets grow and business questions become nuanced, subqueries and basic aggregations can quickly become unwieldy. Enter Common Table Expressions (CTEs) and window functions: two advanced SQL features that help streamline code, enable sophisticated analytics, and improve performance when used correctly. In this article, you'll learn how and when to use these tools, illustrated with real-world examples and actionable tips.


2. Common Table Expressions (CTEs) Explained

2.1 What are CTEs?

A Common Table Expression (CTE) is a temporary, named result set that you can reference within a SELECT, INSERT, UPDATE, or DELETE statement. CTEs are defined using the WITH clause and allow for clearer, modular SQL code—especially when working with complex transformations or recursive queries.

CTE Syntax Breakdown: WITH Clause


WITH cte_name AS (
    -- SQL query
)
SELECT * FROM cte_name;

You can define multiple CTEs by separating them with commas:


WITH first_cte AS (...),
     second_cte AS (...)
SELECT ...
FROM first_cte
JOIN second_cte ON ...

When to Use CTEs vs. Subqueries or Temp Tables

  • Readability: CTEs make queries easier to read, debug, and maintain by breaking complex logic into named blocks.
  • Reusability: Use CTEs when you need to reference the same result set multiple times in a query.
  • Alternatives: Subqueries are better for simple, inline filtering. Temporary tables are preferable when reusing intermediate results across multiple queries in a session (see the sketch after this list).
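For the temp-table route, here's a rough sketch in PostgreSQL syntax (the high_value_orders example mirrors the CTE in the next section):


-- Materialize the intermediate result once for the session...
CREATE TEMP TABLE high_value_orders AS
SELECT order_id, user_id, amount
FROM orders
WHERE amount > 1000;

-- ...then reuse it across several queries without recomputation
SELECT u.name, hvo.amount
FROM users u
JOIN high_value_orders hvo ON u.user_id = hvo.user_id;

SELECT COUNT(*) AS high_value_count, AVG(amount) AS avg_high_value_amount
FROM high_value_orders;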

2.2 Types of CTEs with Examples

Simple CTE: Organizing Complex Queries


WITH high_value_orders AS (
    SELECT order_id, user_id, amount
    FROM orders
    WHERE amount > 1000
)
SELECT u.name, hvo.amount
FROM users u
JOIN high_value_orders hvo ON u.user_id = hvo.user_id;

Recursive CTE: Hierarchical Data (Org Charts, Categories)

Recursive CTEs allow you to traverse hierarchical structures—think organization charts or product categories—without procedural code.


WITH RECURSIVE org_chart AS (
    SELECT employee_id, manager_id, name, 1 as level
    FROM employees
    WHERE manager_id IS NULL

    UNION ALL

    SELECT e.employee_id, e.manager_id, e.name, oc.level + 1
    FROM employees e
    JOIN org_chart oc ON e.manager_id = oc.employee_id
)
SELECT * FROM org_chart
ORDER BY level, name;

2.3 Real-World Data Science Example: Customer Cohort Analysis

Suppose you want to analyze customer retention by signup month, a common cohort analysis. Here's how CTEs turn tangled inline logic into clear, modular SQL.

Before: Inline Logic


SELECT
    DATE_TRUNC('month', u.signup_date) AS cohort_month,
    DATE_TRUNC('month', o.order_date) AS order_month,
    COUNT(DISTINCT o.user_id) AS active_users
FROM users u
JOIN orders o ON u.user_id = o.user_id
WHERE o.order_date IS NOT NULL
GROUP BY cohort_month, order_month
ORDER BY cohort_month, order_month;

After: Clean CTE Approach


WITH user_cohorts AS (
    SELECT
        user_id,
        DATE_TRUNC('month', signup_date) AS cohort_month
    FROM users
),
orders_by_month AS (
    SELECT
        user_id,
        DATE_TRUNC('month', order_date) AS order_month
    FROM orders
    WHERE order_date IS NOT NULL
)
SELECT
    uc.cohort_month,
    obm.order_month,
    COUNT(DISTINCT obm.user_id) AS active_users
FROM user_cohorts uc
JOIN orders_by_month obm ON uc.user_id = obm.user_id
GROUP BY uc.cohort_month, obm.order_month
ORDER BY uc.cohort_month, obm.order_month;

This structure clarifies intent, makes debugging easier, and facilitates further extension.
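As one example of such an extension, the same CTEs can feed a cohort-age calculation, which pivots cleanly into a retention matrix (a sketch reusing only the tables above):


WITH user_cohorts AS (
    SELECT user_id, DATE_TRUNC('month', signup_date) AS cohort_month
    FROM users
),
orders_by_month AS (
    SELECT user_id, DATE_TRUNC('month', order_date) AS order_month
    FROM orders
    WHERE order_date IS NOT NULL
)
SELECT
    uc.cohort_month,
    (DATE_PART('year', obm.order_month) - DATE_PART('year', uc.cohort_month)) * 12
        + DATE_PART('month', obm.order_month) - DATE_PART('month', uc.cohort_month) AS months_since_signup,
    COUNT(DISTINCT obm.user_id) AS active_users
FROM user_cohorts uc
JOIN orders_by_month obm ON uc.user_id = obm.user_id
GROUP BY uc.cohort_month, months_since_signup
ORDER BY uc.cohort_month, months_since_signup;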


3. Window Functions Deep Dive

3.1 Window Functions vs. GROUP BY

  • GROUP BY: Aggregates rows to a single summary row per group, collapsing detail.
  • Window Functions: Calculate values across related rows while preserving row-level detail.

For example, compare a per-user total with a per-user running total of sales:


-- Using GROUP BY (loses detail)
SELECT user_id, SUM(amount) AS total_spent
FROM orders
GROUP BY user_id;

-- Using Window Function (retains each order)
SELECT
    user_id,
    order_id,
    amount,
    SUM(amount) OVER (PARTITION BY user_id ORDER BY order_date) AS running_total
FROM orders;

3.2 Core Window Function Types

Ranking Functions

  • ROW_NUMBER(): Unique sequential number for each row within a partition.
  • RANK(): Similar, but ties get the same rank and next rank(s) are skipped.
  • DENSE_RANK(): Like RANK(), but no gaps in ranking sequence.

SELECT
    user_id,
    order_date,
    amount,
    ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY order_date) AS order_seq
FROM orders;
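To see how the three ranking functions treat ties, run them side by side over the same ordering; when two orders share an amount, RANK() leaves a gap after the tie while DENSE_RANK() does not:


SELECT
    order_id,
    amount,
    ROW_NUMBER() OVER (ORDER BY amount DESC) AS row_num,
    RANK() OVER (ORDER BY amount DESC) AS rank_with_gaps,
    DENSE_RANK() OVER (ORDER BY amount DESC) AS dense_rank_no_gaps
FROM orders;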

Analytic Functions

  • LAG(): Access value from previous row in partition.
  • LEAD(): Access value from next row in partition.
  • FIRST_VALUE(), LAST_VALUE(): Retrieve first/last value in window frame.

SELECT
    user_id,
    order_date,
    amount,
    LAG(amount) OVER (PARTITION BY user_id ORDER BY order_date) AS prev_order_amount
FROM orders;
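FIRST_VALUE() and LAST_VALUE() follow the same pattern, with one caveat worth showing: LAST_VALUE() needs an explicit frame, because the default frame stops at the current row:


SELECT
    user_id,
    order_date,
    amount,
    FIRST_VALUE(amount) OVER (
        PARTITION BY user_id
        ORDER BY order_date
    ) AS first_order_amount,
    LAST_VALUE(amount) OVER (
        PARTITION BY user_id
        ORDER BY order_date
        ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
    ) AS last_order_amount
FROM orders;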

Aggregate Functions with OVER()

  • SUM(), AVG(), COUNT(), etc. can be used as window functions.

SELECT
    user_id,
    amount,
    SUM(amount) OVER (PARTITION BY user_id) AS total_user_spend
FROM orders;

3.3 The OVER() Clause Explained

  • PARTITION BY: Defines groups (partitions) within the dataset.
  • ORDER BY: Specifies order of rows within each partition.
  • ROWS/RANGE: Defines a subset of rows for frame-based calculations (e.g., running totals, moving averages).

Examples:


-- Partition by user, ordered by date
SUM(amount) OVER (PARTITION BY user_id ORDER BY order_date)

-- Moving average: current row plus the 6 preceding rows
AVG(amount) OVER (
    PARTITION BY user_id
    ORDER BY order_date
    ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
)

Assuming one row per day, you can express the 7-day moving average mathematically as:

\( \text{MovingAvg}_t = \frac{1}{7} \sum_{i=t-6}^{t} x_i \)


3.4 Advanced Pattern: Running Totals & Moving Averages Example


SELECT
    user_id,
    order_date,
    amount,
    SUM(amount) OVER (
        PARTITION BY user_id
        ORDER BY order_date
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS running_total,
    AVG(amount) OVER (
        PARTITION BY user_id
        ORDER BY order_date
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ) AS moving_avg_7d
FROM orders;

4. CTEs + Window Functions: Power Combination

4.1 Solving Complex Problems

Combining CTEs with window functions lets you break down multi-step analytics, such as funnel analysis, retention curves, or rolling user engagement, into clear, modular stages.

4.2 Step-by-Step Walkthrough

Suppose you want to flag users with increasing purchase frequency month-over-month.


WITH orders_by_month AS (
    SELECT
        user_id,
        DATE_TRUNC('month', order_date) AS order_month,
        COUNT(*) AS monthly_orders
    FROM orders
    GROUP BY user_id, order_month
),
order_trends AS (
    SELECT
        user_id,
        order_month,
        monthly_orders,
        LAG(monthly_orders) OVER (PARTITION BY user_id ORDER BY order_month) AS prev_month_orders
    FROM orders_by_month
)
SELECT
    user_id,
    order_month,
    monthly_orders,
    prev_month_orders,
    CASE WHEN monthly_orders > prev_month_orders THEN 1 ELSE 0 END AS increased_frequency
FROM order_trends;

5. Performance Considerations & Best Practices

5.1 Recursive CTE Limitations

  • Recursive CTEs can be slow on deep or large hierarchies, and recursion-depth safeguards vary by database: SQL Server caps recursion at 100 levels by default (MAXRECURSION), while PostgreSQL imposes no limit at all.
  • Add an explicit termination condition in the recursive member (or an OPTION (MAXRECURSION n) hint on SQL Server) to prevent runaway queries; see the sketch after this list.
  • For very large hierarchies, prefer materialized path, nested set, or closure table models, which avoid deep recursion at query time.
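As a minimal sketch of such a termination condition, the org_chart query from section 2.2 can cap recursion depth with a predicate on level (the limit of 10 here is arbitrary):


WITH RECURSIVE org_chart AS (
    SELECT employee_id, manager_id, name, 1 AS level
    FROM employees
    WHERE manager_id IS NULL

    UNION ALL

    SELECT e.employee_id, e.manager_id, e.name, oc.level + 1
    FROM employees e
    JOIN org_chart oc ON e.manager_id = oc.employee_id
    WHERE oc.level < 10  -- stop expanding after 10 levels
)
SELECT * FROM org_chart;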

5.2 Window Function Optimization

Indexing Strategies

  • A composite index on the PARTITION BY and ORDER BY columns (in that order) improves window function efficiency by letting the planner deliver pre-sorted rows and skip an explicit sort; see the example below.
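For example, a composite index matching the running-total query in section 3.4 supports both its partitioning and its ordering (the index name is illustrative):


CREATE INDEX idx_orders_user_id_order_date
    ON orders (user_id, order_date);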

Partition Size Considerations

  • Smaller partitions sort faster and are more likely to fit in working memory. Avoid windowing over the entire dataset unless the analysis truly requires it.

Avoiding Performance Pitfalls

  • Be careful combining many window functions with different PARTITION BY / ORDER BY clauses in one query; each distinct window can require its own sort, so performance degrades quickly.
  • Materialize intermediate results using CTEs or temporary tables for large transformations.

6. Practical Applications in AI/ML Workflows

6.1 Feature Engineering with SQL

Window functions are invaluable for feature engineering: lagged values, rolling statistics, and ranking features can be computed directly in SQL before model training.


-- Example: Last 3 purchase amounts as features
SELECT
    user_id,
    order_date,
    amount,
    LAG(amount, 1) OVER (PARTITION BY user_id ORDER BY order_date) AS prev1_amount,
    LAG(amount, 2) OVER (PARTITION BY user_id ORDER BY order_date) AS prev2_amount,
    LAG(amount, 3) OVER (PARTITION BY user_id ORDER BY order_date) AS prev3_amount
FROM orders;

6.2 Data Quality Checks

  • Finding Duplicates with ROW_NUMBER():

-- Assumes a raw or staging orders feed where the same order_id can appear more than once,
-- with a created_at load timestamp identifying the most recent copy
WITH ranked_orders AS (
    SELECT
        *,
        ROW_NUMBER() OVER (
            PARTITION BY order_id
            ORDER BY created_at DESC
        ) AS rn
    FROM orders
)
SELECT *
FROM ranked_orders
WHERE rn > 1;
  • Identifying Outliers with NTILE():

SELECT
    user_id,
    amount,
    NTILE(100) OVER (ORDER BY amount) AS percentile
FROM orders;

Flagging percentile > 99 finds the top 1% outlier transactions.
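Because a window function can't be referenced directly in a WHERE clause, one way to apply that flag is to wrap the ranking in a CTE and filter afterwards; a minimal sketch:


WITH scored_orders AS (
    SELECT
        user_id,
        amount,
        NTILE(100) OVER (ORDER BY amount) AS percentile
    FROM orders
)
SELECT user_id, amount
FROM scored_orders
WHERE percentile > 99;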

6.3 A/B Test Analysis

  • Calculating cumulative metrics: Use SUM() or COUNT() as window functions for running totals by group (see the sketch at the end of this section).
  • Comparing treatment vs control: CTEs and window functions can help create user-level summaries and compare metrics between test groups.

WITH experiment_summary AS (
    SELECT
        group_id,
        user_id,
        SUM(amount) AS total_spent,
        COUNT(*) AS num_orders
    FROM experiment_orders
    GROUP BY group_id, user_id
)
SELECT
    group_id,
    AVG(total_spent) AS avg_spent,
    AVG(num_orders) AS avg_orders
FROM experiment_summary
GROUP BY group_id;
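For the cumulative-metrics bullet above, here is a rough sketch that tracks running spend per experiment group over time; it assumes experiment_orders also carries an order_date column, which the summary query does not show:


WITH daily_group_totals AS (
    SELECT
        group_id,
        order_date,
        SUM(amount) AS daily_spend
    FROM experiment_orders
    GROUP BY group_id, order_date
)
SELECT
    group_id,
    order_date,
    daily_spend,
    SUM(daily_spend) OVER (
        PARTITION BY group_id
        ORDER BY order_date
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS cumulative_spend
FROM daily_group_totals
ORDER BY group_id, order_date;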

7. Interactive Code Examples

7.1 Dataset Introduction

Let's use the following example schema:

  • users: user_id, name, signup_date
  • orders: order_id, user_id, amount, order_date
  • events: event_id, user_id, event_type, event_time

-- Create tables
CREATE TABLE users (
    user_id INT PRIMARY KEY,
    name VARCHAR(50),
    signup_date DATE
);

CREATE TABLE orders (
    order_id INT PRIMARY KEY,
    user_id INT REFERENCES users(user_id),
    amount DECIMAL(10,2),
    order_date DATE
);

CREATE TABLE events (
    event_id INT PRIMARY KEY,
    user_id INT REFERENCES users(user_id),
    event_type VARCHAR(50),
    event_time TIMESTAMP
);

-- Insert sample users
INSERT INTO users (user_id, name, signup_date) VALUES
(1, 'Alice', '2023-01-15'),
(2, 'Bob', '2023-02-12'),
(3, 'Carol', '2023-03-05'),
(4, 'Dave', '2023-03-21'),
(5, 'Eve', '2023-04-10');

-- Insert sample orders
INSERT INTO orders (order_id, user_id, amount, order_date) VALUES
(101, 1, 250.00, '2023-02-01'),
(102, 1, 130.00, '2023-03-15'),
(103, 2, 99.00, '2023-03-20'),
(104, 3, 500.00, '2023-04-05'),
(105, 4, 320.00, '2023-04-12'),
(106, 1, 210.00, '2023-04-20'),
(107, 2, 400.00, '2023-05-10'),
(108, 5, 150.00, '2023-05-15'),
(109, 3, 200.00, '2023-05-22');

-- Insert sample events
INSERT INTO events (event_id, user_id, event_type, event_time) VALUES
(201, 1, 'login', '2023-02-01 08:00:00'),
(202, 1, 'purchase', '2023-02-01 08:15:00'),
(203, 2, 'login', '2023-03-20 09:00:00'),
(204, 2, 'browse', '2023-03-20 09:05:00'),
(205, 3, 'login', '2023-04-05 10:00:00'),
(206, 3, 'purchase', '2023-04-05 10:20:00'),
(207, 4, 'login', '2023-04-12 11:00:00'),
(208, 5, 'login', '2023-05-15 12:00:00'),
(209, 5, 'purchase', '2023-05-15 12:20:00');

7.2 Exercise 1: Customer Segmentation

Segment customers into quartiles based on their total spend using NTILE(), with a CTE to preprocess order totals.


-- Preprocess total spend per customer
WITH customer_spend AS (
    SELECT user_id, SUM(amount) AS total_spent
    FROM orders
    GROUP BY user_id
)
SELECT
    user_id,
    total_spent,
    NTILE(4) OVER (ORDER BY total_spent DESC) AS spend_quartile
FROM customer_spend
ORDER BY spend_quartile, total_spent DESC;

This query lets you separate your top-spending customers (quartile 1) from lower-spending ones for targeted marketing or retention campaigns.


7.3 Exercise 2: Time-Series Analysis

7-Day Moving Averages

Calculate a 7-day moving average of order amounts for each user, using a window frame.


SELECT
    user_id,
    order_date,
    amount,
    AVG(amount) OVER (
        PARTITION BY user_id
        ORDER BY order_date
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ) AS moving_avg_7d
FROM orders
ORDER BY user_id, order_date;
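Note that ROWS BETWEEN 6 PRECEDING counts rows, not calendar days, so with sparse order history this is really a trailing 7-order average. On PostgreSQL 11 and later, a true calendar window can be written with a RANGE frame over the date column (a sketch):


SELECT
    user_id,
    order_date,
    amount,
    AVG(amount) OVER (
        PARTITION BY user_id
        ORDER BY order_date
        RANGE BETWEEN INTERVAL '6 days' PRECEDING AND CURRENT ROW
    ) AS moving_avg_7d_calendar
FROM orders
ORDER BY user_id, order_date;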

Year-over-Year (YoY) Growth Calculations

Assuming you have multiple years of data with one row per month and no gaps (LAG(..., 12) offsets by rows, not by calendar months), calculate YoY growth in monthly sales:


WITH monthly_sales AS (
    SELECT
        DATE_TRUNC('month', order_date) AS month,
        SUM(amount) AS total_sales
    FROM orders
    GROUP BY month
)
SELECT
    month,
    total_sales,
    LAG(total_sales, 12) OVER (ORDER BY month) AS sales_last_year,
    CASE
        WHEN LAG(total_sales, 12) OVER (ORDER BY month) IS NULL THEN NULL
        ELSE
            (total_sales - LAG(total_sales, 12) OVER (ORDER BY month)) /
            NULLIF(LAG(total_sales, 12) OVER (ORDER BY month), 0)
    END AS yoy_growth
FROM monthly_sales
ORDER BY month;

The YoY growth formula here is:

\( \text{YoY Growth} = \frac{\text{This Year Sales} - \text{Last Year Sales}}{\text{Last Year Sales}} \)


7.4 Exercise 3: Sessionization

Identify user sessions from event logs, using LAG() to detect time gaps and CASE to start new sessions.


WITH event_with_lag AS (
    SELECT
        user_id,
        event_time,
        event_type,
        LAG(event_time) OVER (
            PARTITION BY user_id
            ORDER BY event_time
        ) AS prev_event_time
    FROM events
),
session_flags AS (
    SELECT
        *,
        CASE
            WHEN prev_event_time IS NULL
              OR EXTRACT(EPOCH FROM (event_time - prev_event_time))/60 > 30
            THEN 1 ELSE 0
        END AS new_session
    FROM event_with_lag
),
sessionized AS (
    SELECT
        *,
        SUM(new_session) OVER (
            PARTITION BY user_id
            ORDER BY event_time
            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
        ) AS session_id
    FROM session_flags
)
SELECT
    user_id,
    event_time,
    event_type,
    session_id
FROM sessionized
ORDER BY user_id, event_time;

This logic assigns a unique session_id to events for each user, restarting whenever there's more than 30 minutes between events.


8. Common Mistakes to Avoid

  • Misunderstanding Execution Order: CTEs are evaluated as part of the overall query plan. Don’t assume they’re always materialized or executed in the order written (PostgreSQL 12+ lets you force either behavior with AS MATERIALIZED / AS NOT MATERIALIZED).
  • Over-nesting CTEs: Too many nested CTEs can harm readability and performance. Where possible, flatten logic or use temporary tables for repeated subqueries.
  • Window Function Frame Errors: With an ORDER BY, the default frame is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, which can produce surprising results. Always specify ROWS BETWEEN or RANGE when calculating running totals or moving averages (see the sketch after this list).
  • Performance Anti-patterns: Avoid window functions over huge, unpartitioned datasets, and avoid materializing large CTE results without indexing or filtering.
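To make the frame pitfall concrete, here is a sketch over the orders table: with the default frame, rows that tie on order_date are peers and receive the same running total, while an explicit ROWS frame accumulates strictly row by row:


SELECT
    user_id,
    order_date,
    amount,
    SUM(amount) OVER (
        PARTITION BY user_id
        ORDER BY order_date
    ) AS default_range_total,  -- peers (same order_date) share one value
    SUM(amount) OVER (
        PARTITION BY user_id
        ORDER BY order_date
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS explicit_rows_total   -- accumulates one row at a time
FROM orders;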

9. Advanced Topics Preview

  • Window Functions in Python (Pandas Equivalents): Most window function logic can be replicated in Pandas using groupby() combined with cumsum(), cumcount(), rolling(), shift(), and rank().
    
    import pandas as pd
    # Assumes df has user_id, order_date, and amount columns
    df = df.sort_values(['user_id', 'order_date'])                  # window order matters
    df['running_total'] = df.groupby('user_id')['amount'].cumsum()  # SUM() OVER (PARTITION BY ... ORDER BY ...)
    df['moving_avg_7d'] = df.groupby('user_id')['amount'].rolling(7, min_periods=1).mean().reset_index(0, drop=True)  # AVG() OVER (... ROWS 6 PRECEDING)
    df['lag_amount'] = df.groupby('user_id')['amount'].shift(1)     # LAG()
            
  • SQL for Graph Analysis (Recursive CTEs): Use recursive CTEs to traverse graphs, such as finding all descendants in a category tree or shortest paths in network data.
    
    WITH RECURSIVE descendants AS (
        -- anchor member: the top-level categories (or a specific starting category)
        SELECT id, parent_id FROM categories WHERE parent_id IS NULL
        UNION ALL
        -- recursive member: walk down to each node's children
        SELECT c.id, c.parent_id
        FROM categories c
        JOIN descendants d ON c.parent_id = d.id
    )
    SELECT * FROM descendants;
            
  • Dynamic SQL with CTEs: Generate flexible, parameterized queries by constructing CTEs dynamically in application code, especially for reporting or dashboard tools.
    
    -- Example (inside a PL/pgSQL function or DO block):
    EXECUTE format('
        WITH filtered_orders AS (
            SELECT * FROM orders WHERE order_date >= %L
        )
        SELECT * FROM filtered_orders
    ', start_date);
            

Conclusion: Mastering advanced SQL techniques—especially CTEs and window functions—equips data scientists with the tools to write cleaner, more efficient, and more insightful analyses. By applying these patterns, optimizing for performance, and combining SQL with modern analytics tools, you’ll unlock the full power of your data warehouse for both exploratory and production data science workflows.
