
Advanced SQL for Data Analysis: Mastering CTEs, Window Functions, and Performance Optimization
In the age of data-driven decision-making, SQL remains a cornerstone skill for data scientists, analysts, and engineers. While basic querying is essential, mastering advanced SQL concepts—like Common Table Expressions (CTEs) and window functions—unlocks new levels of analytical power and performance. This comprehensive guide explores these techniques, demonstrates real-world applications, and reveals best practices for optimizing SQL in modern data science workflows.
1. Introduction
Complex data analysis often demands queries that are not only accurate but also maintainable, efficient, and expressive. As datasets grow and business questions become nuanced, subqueries and basic aggregations can quickly become unwieldy. Enter Common Table Expressions (CTEs) and window functions: two advanced SQL features that help streamline code, enable sophisticated analytics, and improve performance when used correctly. In this article, you'll learn how and when to use these tools, illustrated with real-world examples and actionable tips.
2. Common Table Expressions (CTEs) Explained
2.1 What are CTEs?
A Common Table Expression (CTE) is a temporary, named result set that you can reference within a SELECT, INSERT, UPDATE, or DELETE statement. CTEs are defined using the WITH clause and allow for clearer, modular SQL code—especially when working with complex transformations or recursive queries.
CTE Syntax Breakdown: WITH Clause
WITH cte_name AS (
-- SQL query
)
SELECT * FROM cte_name;
You can define multiple CTEs by separating them with commas:
WITH first_cte AS (...),
second_cte AS (...)
SELECT ...
FROM first_cte
JOIN second_cte ON ...
When to Use CTEs vs. Subqueries or Temp Tables
- Readability: CTEs make queries easier to read, debug, and maintain by breaking complex logic into named blocks.
- Reusability: Use CTEs when you need to reference the same result set multiple times within a single query (see the sketch after this list).
- Alternatives: Subqueries are better for simple, inline filtering. Temporary tables are preferable when reusing intermediate results across multiple statements in the same session.
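For instance, here is a minimal sketch of the reusability point, assuming the orders table used throughout this article (the monthly_totals name is illustrative): the CTE is defined once and joined to itself to compare each month with the previous one.
WITH monthly_totals AS (
SELECT DATE_TRUNC('month', order_date) AS month, SUM(amount) AS total
FROM orders
GROUP BY DATE_TRUNC('month', order_date)
)
SELECT cur.month, cur.total, prev.total AS prev_month_total
FROM monthly_totals cur
LEFT JOIN monthly_totals prev ON prev.month = cur.month - INTERVAL '1 month'
ORDER BY cur.month;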
2.2 Types of CTEs with Examples
Simple CTE: Organizing Complex Queries
WITH high_value_orders AS (
SELECT order_id, user_id, amount
FROM orders
WHERE amount > 1000
)
SELECT u.name, hvo.amount
FROM users u
JOIN high_value_orders hvo ON u.user_id = hvo.user_id;
Recursive CTE: Hierarchical Data (Org Charts, Categories)
Recursive CTEs allow you to traverse hierarchical structures—think organization charts or product categories—without procedural code.
WITH RECURSIVE org_chart AS (
SELECT employee_id, manager_id, name, 1 as level
FROM employees
WHERE manager_id IS NULL
UNION ALL
SELECT e.employee_id, e.manager_id, e.name, oc.level + 1
FROM employees e
JOIN org_chart oc ON e.manager_id = oc.employee_id
)
SELECT * FROM org_chart
ORDER BY level, name;
2.3 Real-World Data Science Example: Customer Cohort Analysis
Suppose you want to analyze customer retention by signup month—a common cohort analysis. Here's how CTEs can turn a dense, hard-to-extend query into clear, modular SQL.
Before: Dense Single Query
SELECT
DATE_TRUNC('month', u.signup_date) AS cohort_month,
DATE_TRUNC('month', o.order_date) AS order_month,
COUNT(DISTINCT o.user_id) AS active_users
FROM users u
JOIN orders o ON u.user_id = o.user_id
WHERE o.order_date IS NOT NULL
GROUP BY cohort_month, order_month
ORDER BY cohort_month, order_month;
After: Clean CTE Approach
WITH user_cohorts AS (
SELECT
user_id,
DATE_TRUNC('month', signup_date) AS cohort_month
FROM users
),
orders_by_month AS (
SELECT
user_id,
DATE_TRUNC('month', order_date) AS order_month
FROM orders
)
SELECT
uc.cohort_month,
obm.order_month,
COUNT(DISTINCT obm.user_id) AS active_users
FROM user_cohorts uc
JOIN orders_by_month obm ON uc.user_id = obm.user_id
GROUP BY uc.cohort_month, obm.order_month
ORDER BY uc.cohort_month, obm.order_month;
This structure clarifies intent, makes debugging easier, and facilitates further extension.
3. Window Functions Deep Dive
3.1 Window Functions vs. GROUP BY
- GROUP BY: Aggregates rows to a single summary row per group, collapsing detail.
- Window Functions: Calculate values across related rows while preserving row-level detail.
For example, compare a per-user sales total with a per-user running total:
-- Using GROUP BY (loses detail)
SELECT user_id, SUM(amount) AS total_spent
FROM orders
GROUP BY user_id;
-- Using Window Function (retains each order)
SELECT
user_id,
order_id,
amount,
SUM(amount) OVER (PARTITION BY user_id ORDER BY order_date) AS running_total
FROM orders;
3.2 Core Window Function Types
Ranking Functions
- ROW_NUMBER(): Unique sequential number for each row within a partition.
- RANK(): Similar, but ties receive the same rank and the following rank(s) are skipped.
- DENSE_RANK(): Like RANK(), but with no gaps in the ranking sequence.
SELECT
user_id,
order_date,
amount,
ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY order_date) AS order_seq
FROM orders;
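To see how the three differ on ties, here is a sketch that ranks each user's orders by amount (the output column names are illustrative):
SELECT
user_id,
order_id,
amount,
ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY amount DESC) AS row_num,
RANK() OVER (PARTITION BY user_id ORDER BY amount DESC) AS rnk,
DENSE_RANK() OVER (PARTITION BY user_id ORDER BY amount DESC) AS dense_rnk
FROM orders;
If a user has two orders with the same amount, row_num still increments, rnk repeats and then skips (1, 1, 3), and dense_rnk repeats without a gap (1, 1, 2).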
Analytic Functions
- LAG(): Access a value from the previous row in the partition.
- LEAD(): Access a value from the next row in the partition.
- FIRST_VALUE(), LAST_VALUE(): Retrieve the first/last value in the window frame.
SELECT
user_id,
order_date,
amount,
LAG(amount) OVER (PARTITION BY user_id ORDER BY order_date) AS prev_order_amount
FROM orders;
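LEAD() and FIRST_VALUE()/LAST_VALUE() follow the same pattern. Note that LAST_VALUE() needs an explicit frame covering the whole partition, because the default frame stops at the current row. A sketch on the same orders table:
SELECT
user_id,
order_date,
amount,
LEAD(amount) OVER (PARTITION BY user_id ORDER BY order_date) AS next_order_amount,
FIRST_VALUE(amount) OVER (PARTITION BY user_id ORDER BY order_date) AS first_order_amount,
LAST_VALUE(amount) OVER (
PARTITION BY user_id
ORDER BY order_date
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
) AS last_order_amount
FROM orders;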
Aggregate Functions with OVER()
SUM(), AVG(), COUNT(), etc. can be used as window functions.
SELECT
user_id,
amount,
SUM(amount) OVER (PARTITION BY user_id) AS total_user_spend
FROM orders;
3.3 The OVER() Clause Explained
- PARTITION BY: Defines groups (partitions) within the dataset.
- ORDER BY: Specifies order of rows within each partition.
- ROWS/RANGE: Defines a subset of rows for frame-based calculations (e.g., running totals, moving averages).
Examples:
-- Partition by user, ordered by date
SUM(amount) OVER (PARTITION BY user_id ORDER BY order_date)
-- Moving average over the current row and the 6 preceding rows (7 days when there is one row per day)
AVG(amount) OVER (
PARTITION BY user_id
ORDER BY order_date
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
)
You can express a 7-day moving average mathematically as:
\( \text{MovingAvg}_t = \frac{1}{7} \sum_{i=t-6}^{t} x_i \)
3.4 Advanced Pattern: Running Totals & Moving Averages Example
SELECT
user_id,
order_date,
amount,
SUM(amount) OVER (
PARTITION BY user_id
ORDER BY order_date
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS running_total,
AVG(amount) OVER (
PARTITION BY user_id
ORDER BY order_date
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
) AS moving_avg_7d
FROM orders;
4. CTEs + Window Functions: Power Combination
4.1 Solving Complex Problems
Combining CTEs with window functions lets you break down multi-step analytics, such as funnel analysis, retention curves, or rolling user engagement, into clear, modular stages.
4.2 Step-by-Step Walkthrough
Suppose you want to flag users with increasing purchase frequency month-over-month.
WITH orders_by_month AS (
SELECT
user_id,
DATE_TRUNC('month', order_date) AS order_month,
COUNT(*) AS monthly_orders
FROM orders
GROUP BY user_id, order_month
),
order_trends AS (
SELECT
user_id,
order_month,
monthly_orders,
LAG(monthly_orders) OVER (PARTITION BY user_id ORDER BY order_month) AS prev_month_orders
FROM orders_by_month
)
SELECT
user_id,
order_month,
monthly_orders,
prev_month_orders,
CASE WHEN monthly_orders > prev_month_orders THEN 1 ELSE 0 END AS increased_frequency
FROM order_trends;
5. Performance Considerations & Best Practices
5.1 Recursive CTE Limitations
- Recursive CTEs can be slow on deep or large hierarchies. Some databases cap recursion depth by default (e.g., SQL Server's MAXRECURSION defaults to 100), while others, such as PostgreSQL, keep recursing until the query stops producing rows.
- Ensure termination with a WHERE clause on a depth column, or a MAXRECURSION hint where supported, to prevent runaway queries (a depth-guard sketch follows this list).
- Prefer materialized path or closure table models for very large graphs.
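As a sketch of the depth guard, the org chart query from section 2.2 can reuse its level column in the recursive member so the recursion stops at a chosen depth (10 here is an arbitrary illustrative cutoff):
WITH RECURSIVE org_chart AS (
SELECT employee_id, manager_id, name, 1 AS level
FROM employees
WHERE manager_id IS NULL
UNION ALL
SELECT e.employee_id, e.manager_id, e.name, oc.level + 1
FROM employees e
JOIN org_chart oc ON e.manager_id = oc.employee_id
WHERE oc.level < 10 -- stop descending after 10 levels
)
SELECT * FROM org_chart;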
5.2 Window Function Optimization
Indexing Strategies
- Indexes on the PARTITION BY and ORDER BY columns improve window function efficiency, often by letting the database avoid an explicit sort (see the sketch below).
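For example, for the running totals in section 3.4 (PARTITION BY user_id ORDER BY order_date), a composite index along those columns can deliver rows in the required order; the index name here is illustrative, and whether the planner uses it depends on the rest of the query, so verify with EXPLAIN.
CREATE INDEX idx_orders_user_date ON orders (user_id, order_date);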
Partition Size Considerations
- Smaller partitions run faster. Avoid using window functions over the entire dataset unless necessary.
Avoiding Performance Pitfalls
- Be careful when combining many window functions with different PARTITION BY/ORDER BY clauses in one query; each distinct window may require its own sort, and performance can degrade.
- Materialize intermediate results using CTEs or temporary tables for large transformations (a temp-table sketch follows this list).
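A minimal sketch of that approach, assuming PostgreSQL-style temporary tables (the monthly_user_orders and index names are illustrative): materialize the heavy aggregation once, index it, then run the window functions over the smaller result.
CREATE TEMPORARY TABLE monthly_user_orders AS
SELECT user_id, DATE_TRUNC('month', order_date) AS order_month, COUNT(*) AS monthly_orders
FROM orders
GROUP BY user_id, DATE_TRUNC('month', order_date);
CREATE INDEX idx_muo_user_month ON monthly_user_orders (user_id, order_month);
SELECT
user_id,
order_month,
monthly_orders,
LAG(monthly_orders) OVER (PARTITION BY user_id ORDER BY order_month) AS prev_month_orders
FROM monthly_user_orders;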
6. Practical Applications in AI/ML Workflows
6.1 Feature Engineering with SQL
Window functions are invaluable for feature engineering: lagged values, rolling statistics, and ranking features can be computed directly in SQL before model training.
-- Example: Last 3 purchase amounts as features
SELECT
user_id,
order_date,
amount,
LAG(amount, 1) OVER (PARTITION BY user_id ORDER BY order_date) AS prev1_amount,
LAG(amount, 2) OVER (PARTITION BY user_id ORDER BY order_date) AS prev2_amount,
LAG(amount, 3) OVER (PARTITION BY user_id ORDER BY order_date) AS prev3_amount
FROM orders;
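The rolling statistics and ranking features mentioned above follow the same pattern; here is a sketch adding a trailing three-order average, a cumulative order count, and a per-user spend rank (the column names are illustrative):
SELECT
user_id,
order_date,
amount,
AVG(amount) OVER (
PARTITION BY user_id
ORDER BY order_date
ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
) AS avg_last_3_orders,
COUNT(*) OVER (
PARTITION BY user_id
ORDER BY order_date
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS orders_to_date,
DENSE_RANK() OVER (PARTITION BY user_id ORDER BY amount DESC) AS amount_rank_for_user
FROM orders;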
6.2 Data Quality Checks
- Finding Duplicates with ROW_NUMBER():
WITH ranked_orders AS (
SELECT
*,
ROW_NUMBER() OVER (
PARTITION BY order_id
ORDER BY created_at DESC
) AS rn
FROM orders
)
SELECT *
FROM ranked_orders
WHERE rn > 1;
- Identifying Outliers with NTILE():
SELECT
user_id,
amount,
NTILE(100) OVER (ORDER BY amount) AS percentile
FROM orders;
Flagging percentile > 99 finds the top 1% outlier transactions.
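Putting that together, a sketch that wraps the NTILE() step in a CTE and returns only the flagged rows:
WITH order_percentiles AS (
SELECT
order_id,
user_id,
amount,
NTILE(100) OVER (ORDER BY amount) AS percentile
FROM orders
)
SELECT order_id, user_id, amount
FROM order_percentiles
WHERE percentile > 99;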
6.3 A/B Test Analysis
- Calculating cumulative metrics: Use SUM() or COUNT() as window functions for running totals by group (a sketch follows this list).
- Comparing treatment vs control: CTEs and window functions can help create user-level summaries and compare metrics between test groups.
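For the cumulative-metrics point, here is a sketch of running spend per experiment group over time; it assumes the experiment_orders table used below also carries an order_date column.
SELECT
group_id,
order_date,
amount,
SUM(amount) OVER (
PARTITION BY group_id
ORDER BY order_date
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS cumulative_spend
FROM experiment_orders;
The user-level summary below then compares average spend and order counts between groups: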
WITH experiment_summary AS (
SELECT
group_id,
user_id,
SUM(amount) AS total_spent,
COUNT(*) AS num_orders
FROM experiment_orders
GROUP BY group_id, user_id
)
SELECT
group_id,
AVG(total_spent) AS avg_spent,
AVG(num_orders) AS avg_orders
FROM experiment_summary
GROUP BY group_id;
7. Interactive Code Examples
7.1 Dataset Introduction
Let's use the following example schema:
| Table | Columns |
|---|---|
| users | user_id, name, signup_date |
| orders | order_id, user_id, amount, order_date |
| events | event_id, user_id, event_type, event_time |
-- Create tables
CREATE TABLE users (
user_id INT PRIMARY KEY,
name VARCHAR(50),
signup_date DATE
);
CREATE TABLE orders (
order_id INT PRIMARY KEY,
user_id INT REFERENCES users(user_id),
amount DECIMAL(10,2),
order_date DATE
);
CREATE TABLE events (
event_id INT PRIMARY KEY,
user_id INT REFERENCES users(user_id),
event_type VARCHAR(50),
event_time TIMESTAMP
);
-- Insert sample users
INSERT INTO users (user_id, name, signup_date) VALUES
(1, 'Alice', '2023-01-15'),
(2, 'Bob', '2023-02-12'),
(3, 'Carol', '2023-03-05'),
(4, 'Dave', '2023-03-21'),
(5, 'Eve', '2023-04-10');
-- Insert sample orders
INSERT INTO orders (order_id, user_id, amount, order_date) VALUES
(101, 1, 250.00, '2023-02-01'),
(102, 1, 130.00, '2023-03-15'),
(103, 2, 99.00, '2023-03-20'),
(104, 3, 500.00, '2023-04-05'),
(105, 4, 320.00, '2023-04-12'),
(106, 1, 210.00, '2023-04-20'),
(107, 2, 400.00, '2023-05-10'),
(108, 5, 150.00, '2023-05-15'),
(109, 3, 200.00, '2023-05-22');
-- Insert sample events
INSERT INTO events (event_id, user_id, event_type, event_time) VALUES
(201, 1, 'login', '2023-02-01 08:00:00'),
(202, 1, 'purchase', '2023-02-01 08:15:00'),
(203, 2, 'login', '2023-03-20 09:00:00'),
(204, 2, 'browse', '2023-03-20 09:05:00'),
(205, 3, 'login', '2023-04-05 10:00:00'),
(206, 3, 'purchase', '2023-04-05 10:20:00'),
(207, 4, 'login', '2023-04-12 11:00:00'),
(208, 5, 'login', '2023-05-15 12:00:00'),
(209, 5, 'purchase', '2023-05-15 12:20:00');
7.2 Exercise 1: Customer Segmentation
Segment customers into quartiles based on their total spend using NTILE(), with a CTE to preprocess order totals.
-- Preprocess total spend per customer
WITH customer_spend AS (
SELECT user_id, SUM(amount) AS total_spent
FROM orders
GROUP BY user_id
)
SELECT
user_id,
total_spent,
NTILE(4) OVER (ORDER BY total_spent DESC) AS spend_quartile
FROM customer_spend
ORDER BY spend_quartile, total_spent DESC;
This query allows you to identify your top spending (quartile 1) versus lower spending customers for targeted marketing or retention campaigns.
7.3 Exercise 2: Time-Series Analysis
7-Day Moving Averages
Calculate a 7-day moving average of order amounts for each user, using a window frame.
SELECT
user_id,
order_date,
amount,
AVG(amount) OVER (
PARTITION BY user_id
ORDER BY order_date
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
) AS moving_avg_7d
FROM orders
ORDER BY user_id, order_date;
Year-over-Year (YoY) Growth Calculations
Assuming you have multiple years, calculate YoY growth in monthly sales:
WITH monthly_sales AS (
SELECT
DATE_TRUNC('month', order_date) AS month,
SUM(amount) AS total_sales
FROM orders
GROUP BY month
)
SELECT
month,
total_sales,
LAG(total_sales, 12) OVER (ORDER BY month) AS sales_last_year,
CASE
WHEN LAG(total_sales, 12) OVER (ORDER BY month) IS NULL THEN NULL
ELSE
(total_sales - LAG(total_sales, 12) OVER (ORDER BY month)) /
NULLIF(LAG(total_sales, 12) OVER (ORDER BY month), 0)
END AS yoy_growth
FROM monthly_sales
ORDER BY month;
The YoY growth formula here is:
\( \text{YoY Growth} = \frac{\text{This Year Sales} - \text{Last Year Sales}}{\text{Last Year Sales}} \)
7.4 Exercise 3: Sessionization
Identify user sessions from event logs, using LAG() to detect time gaps and CASE to start new sessions.
WITH event_with_lag AS (
SELECT
user_id,
event_time,
event_type,
LAG(event_time) OVER (
PARTITION BY user_id
ORDER BY event_time
) AS prev_event_time
FROM events
),
session_flags AS (
SELECT
*,
CASE
WHEN prev_event_time IS NULL
OR EXTRACT(EPOCH FROM (event_time - prev_event_time))/60 > 30
THEN 1 ELSE 0
END AS new_session
FROM event_with_lag
),
sessionized AS (
SELECT
*,
SUM(new_session) OVER (
PARTITION BY user_id
ORDER BY event_time
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS session_id
FROM session_flags
)
SELECT
user_id,
event_time,
event_type,
session_id
FROM sessionized
ORDER BY user_id, event_time;
This logic assigns a unique session_id to events for each user, restarting whenever there's more than 30 minutes between events.
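Once each event carries a session_id, a small follow-up aggregation turns sessions into features such as start time, duration, and event count. For example, replacing the final SELECT above with a per-session rollup (a sketch that reuses the sessionized CTE):
SELECT
user_id,
session_id,
MIN(event_time) AS session_start,
MAX(event_time) AS session_end,
EXTRACT(EPOCH FROM (MAX(event_time) - MIN(event_time)))/60 AS session_minutes,
COUNT(*) AS events_in_session
FROM sessionized
GROUP BY user_id, session_id
ORDER BY user_id, session_start;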
8. Common Mistakes to Avoid
- Misunderstanding Execution Order: CTEs are evaluated as part of the overall query plan. Don’t assume they’re always materialized or executed in the order written.
- Over-nesting CTEs: Too many nested CTEs can harm readability and performance. Where possible, flatten logic or use temporary tables for repeated subqueries.
- Window Function Frame Errors: Default window frames can lead to unexpected results. Always specify ROWS BETWEEN or RANGE explicitly when calculating running totals or moving averages (the sketch after this list shows why).
- Performance Anti-patterns: Avoid window functions over huge, unpartitioned datasets, and avoid using CTEs to materialize large intermediate results without indexing or filtering.
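To illustrate the frame pitfall: with ORDER BY in the OVER() clause but no frame, the default is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, so rows that tie on the ordering column (here, orders placed on the same date) all receive the same "running" total. Spelling out a ROWS frame gives a true row-by-row running total (a sketch):
SELECT
user_id,
order_date,
amount,
SUM(amount) OVER (PARTITION BY user_id ORDER BY order_date) AS default_frame_total, -- ties on order_date share a value
SUM(amount) OVER (
PARTITION BY user_id
ORDER BY order_date
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS row_by_row_total
FROM orders;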
9. Advanced Topics Preview
- Window Functions in Python (Pandas Equivalents): Most window function logic can be replicated in Pandas using groupby combined with cumsum()/cumcount(), rolling(), shift(), and rank().
import pandas as pd
df = df.sort_values(['user_id', 'order_date'])
df['running_total'] = df.groupby('user_id')['amount'].cumsum()
df['moving_avg_7d'] = df.groupby('user_id')['amount'].rolling(7, min_periods=1).mean().reset_index(0, drop=True)
df['lag_amount'] = df.groupby('user_id')['amount'].shift(1)
- SQL for Graph Analysis (Recursive CTEs): Use recursive CTEs to traverse graphs, such as finding all descendants in a category tree or shortest paths in network data.
WITH RECURSIVE descendants AS (
SELECT id, parent_id
FROM categories
WHERE parent_id IS NULL -- anchor: start from the root categories
UNION ALL
SELECT c.id, c.parent_id
FROM categories c
JOIN descendants d ON c.parent_id = d.id
)
SELECT * FROM descendants;
- Dynamic SQL with CTEs: Generate flexible, parameterized queries by constructing CTEs dynamically in application code, especially for reporting or dashboard tools.
-- Example (inside a PL/pgSQL function or DO block):
EXECUTE format('
WITH filtered_orders AS (
SELECT * FROM orders
WHERE order_date >= %L
)
SELECT * FROM filtered_orders
', start_date);
Conclusion: Mastering advanced SQL techniques—especially CTEs and window functions—equips data scientists with the tools to write cleaner, more efficient, and more insightful analyses. By applying these patterns, optimizing for performance, and combining SQL with modern analytics tools, you’ll unlock the full power of your data warehouse for both exploratory and production data science workflows.
