SQL for Analytics¶

SQL is the foundational tool for data analysts, but knowing SQL syntax alone doesn't make you an analyst - analytical thinking about what to measure and why is the real skill. This entry covers SQL patterns specifically relevant to BI and product analytics.

Key Facts¶

SQL execution order (important for debugging): FROM -> WHERE -> GROUP BY -> HAVING -> SELECT -> ORDER BY -> LIMIT
Window functions compute over a "window" of rows without collapsing rows
CTEs (Common Table Expressions) are cleaner than nested subqueries
Analytics-specific patterns: cohort retention, funnel analysis, DAU/MAU stickiness
Connecting Python to SQL databases enables combining SQL querying with pandas analysis

Patterns¶

Core Query Structure¶

SELECT column1, column2, COUNT(*) as count
FROM table_name
WHERE condition
GROUP BY column1, column2
HAVING aggregate_condition
ORDER BY column1 DESC
LIMIT 100;

Filtering¶

WHERE status = 'active'
WHERE city IN ('Moscow', 'SPb', 'Kazan')
WHERE created_at BETWEEN '2024-01-01' AND '2024-12-31'
WHERE name LIKE 'Ivan%'          -- starts with
WHERE email LIKE '%@mail.ru'     -- ends with
WHERE phone IS NULL
WHERE status = 'active' AND city = 'Moscow'
WHERE NOT status = 'cancelled'

JOINs¶

-- INNER: only matching rows
SELECT u.name, o.order_id, o.amount
FROM users u
INNER JOIN orders o ON u.user_id = o.user_id;

-- LEFT: all left + matching right (NULLs where no match)
SELECT u.name, COUNT(o.order_id) as order_count
FROM users u
LEFT JOIN orders o ON u.user_id = o.user_id
GROUP BY u.name;

-- Multiple JOINs
SELECT u.name, o.order_id, p.product_name
FROM users u
JOIN orders o ON u.user_id = o.user_id
JOIN order_items oi ON o.order_id = oi.order_id
JOIN products p ON oi.product_id = p.product_id;

CTEs (Common Table Expressions)¶

WITH active_users AS (
    SELECT user_id, name
    FROM users
    WHERE last_activity > CURRENT_DATE - INTERVAL '30 days'
),
user_orders AS (
    SELECT user_id, SUM(amount) as total
    FROM orders
    GROUP BY user_id
)
SELECT a.name, COALESCE(u.total, 0) as total_spent
FROM active_users a
LEFT JOIN user_orders u ON a.user_id = u.user_id;

Window Functions¶

-- Ranking
ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY order_date) as order_num
RANK() OVER (ORDER BY amount DESC) as rank_by_amount
DENSE_RANK() OVER (ORDER BY amount DESC) as dense_rank

-- Running totals and rolling averages
SUM(amount) OVER (ORDER BY order_date) as running_total
AVG(amount) OVER (
    ORDER BY order_date
    ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
) as rolling_7day_avg

-- Lag/Lead (previous/next row access)
LAG(amount, 1) OVER (ORDER BY order_date) as prev_day
LEAD(amount, 1) OVER (ORDER BY order_date) as next_day
amount - LAG(amount, 1) OVER (ORDER BY order_date) as day_change

-- Percentiles
NTILE(10) OVER (ORDER BY total_spent) as decile
PERCENT_RANK() OVER (ORDER BY total_spent) as percentile

Analytics SQL Patterns¶

Cohort retention - see cohort retention analysis for full pattern.

Funnel analysis:

SELECT
    COUNT(CASE WHEN event = 'page_view' THEN user_id END) as step1,
    COUNT(CASE WHEN event = 'add_to_cart' THEN user_id END) as step2,
    COUNT(CASE WHEN event = 'checkout_start' THEN user_id END) as step3,
    COUNT(CASE WHEN event = 'purchase' THEN user_id END) as step4
FROM events
WHERE event_date = CURRENT_DATE;

DAU/MAU stickiness - see product metrics framework for full pattern.

Python + SQL Integration¶

import pandas as pd
import psycopg2

conn = psycopg2.connect(
    host='localhost', database='mydb',
    user='analyst', password='pass'
)

df = pd.read_sql("""
    SELECT user_id, COUNT(*) as orders
    FROM orders
    WHERE created_at >= '2024-01-01'
    GROUP BY user_id
    ORDER BY orders DESC
    LIMIT 100
""", conn)

# Jupyter magic
%load_ext sql
%sql postgresql://user:pass@host/db
result = %sql SELECT COUNT(*) FROM orders

Gotchas¶

Execution order matters: you can't use a column alias from SELECT in WHERE clause (WHERE runs before SELECT)
NULL comparisons: WHERE col = NULL never matches - use WHERE col IS NULL
Window functions don't reduce row count - use them when you need per-row calculations alongside aggregates
Correlated subqueries run once per outer row and can be extremely slow - prefer JOINs or CTEs
HAVING filters after aggregation, WHERE filters before - confusing them is a common bug