Aggregate Functions and GROUP BY¶

★★★★★ Intermediate

Aggregate functions collapse multiple rows into single summary values. GROUP BY partitions rows into groups before aggregation, while HAVING filters the resulting groups.

Key Facts¶

COUNT(*) counts all rows including NULLs; COUNT(column) counts non-NULL values only
SUM, AVG, MIN, MAX all ignore NULLs
GROUP BY treats all NULLs as one group
HAVING filters groups after aggregation; WHERE filters rows before grouping
Cannot use aggregate functions in WHERE clause

Patterns¶

Basic Aggregation¶

SELECT COUNT(*) FROM users;
SELECT COUNT(DISTINCT city) FROM users;
SELECT SUM(amount), AVG(amount), MIN(amount), MAX(amount) FROM orders;

GROUP BY¶

SELECT city, COUNT(*) AS user_count
FROM users
GROUP BY city
ORDER BY user_count DESC;

-- Multiple grouping columns
SELECT country_code, district, COUNT(*) AS city_count
FROM city
GROUP BY country_code, district;

HAVING (Filter on Aggregated Results)¶

SELECT genre, COUNT(*) AS cnt
FROM album
GROUP BY genre
HAVING COUNT(*) > 5
ORDER BY cnt DESC;

-- Combine WHERE and HAVING
SELECT continent, AVG(life_expectancy) AS avg_life
FROM country
WHERE population > 1000000        -- filter rows before grouping
GROUP BY continent
HAVING avg_life > 70              -- filter groups after aggregation
ORDER BY avg_life DESC;

COUNT Performance on Large Tables¶

SELECT COUNT(*) FROM large_table requires full table/index scan - MVCC means different transactions see different row counts.

Alternatives for approximate counts:

-- PostgreSQL: approximate from statistics
SELECT reltuples::bigint FROM pg_class WHERE relname = 'large_table';

-- MySQL: approximate from information_schema
SELECT table_rows FROM information_schema.tables
WHERE table_name = 'large_table';

Other strategies: materialized count table updated by triggers, Redis counter incremented on INSERT/DELETE, HyperLogLog for distinct count approximation.

Gotchas¶

AVG ignores NULLs - AVG of [10, NULL, 20] is 15, not 10
Columns in SELECT must either appear in GROUP BY or be inside aggregate functions
COUNT(DISTINCT col) is slower than COUNT(*) - requires deduplication
MySQL allows non-aggregated columns in SELECT without GROUP BY (sql_mode dependent) - PostgreSQL does not