Skip to content

Aggregate Functions and GROUP BY

Aggregate functions collapse multiple rows into single summary values. GROUP BY partitions rows into groups before aggregation, while HAVING filters the resulting groups.

Key Facts

  • COUNT(*) counts all rows including NULLs; COUNT(column) counts non-NULL values only
  • SUM, AVG, MIN, MAX all ignore NULLs
  • GROUP BY treats all NULLs as one group
  • HAVING filters groups after aggregation; WHERE filters rows before grouping
  • Cannot use aggregate functions in WHERE clause

Patterns

Basic Aggregation

SELECT COUNT(*) FROM users;
SELECT COUNT(DISTINCT city) FROM users;
SELECT SUM(amount), AVG(amount), MIN(amount), MAX(amount) FROM orders;

GROUP BY

SELECT city, COUNT(*) AS user_count
FROM users
GROUP BY city
ORDER BY user_count DESC;

-- Multiple grouping columns
SELECT country_code, district, COUNT(*) AS city_count
FROM city
GROUP BY country_code, district;

HAVING (Filter on Aggregated Results)

SELECT genre, COUNT(*) AS cnt
FROM album
GROUP BY genre
HAVING COUNT(*) > 5
ORDER BY cnt DESC;

-- Combine WHERE and HAVING
SELECT continent, AVG(life_expectancy) AS avg_life
FROM country
WHERE population > 1000000        -- filter rows before grouping
GROUP BY continent
HAVING avg_life > 70              -- filter groups after aggregation
ORDER BY avg_life DESC;

COUNT Performance on Large Tables

SELECT COUNT(*) FROM large_table requires full table/index scan - MVCC means different transactions see different row counts.

Alternatives for approximate counts:

-- PostgreSQL: approximate from statistics
SELECT reltuples::bigint FROM pg_class WHERE relname = 'large_table';

-- MySQL: approximate from information_schema
SELECT table_rows FROM information_schema.tables
WHERE table_name = 'large_table';

Other strategies: materialized count table updated by triggers, Redis counter incremented on INSERT/DELETE, HyperLogLog for distinct count approximation.

Gotchas

  • AVG ignores NULLs - AVG of [10, NULL, 20] is 15, not 10
  • Columns in SELECT must either appear in GROUP BY or be inside aggregate functions
  • COUNT(DISTINCT col) is slower than COUNT(*) - requires deduplication
  • MySQL allows non-aggregated columns in SELECT without GROUP BY (sql_mode dependent) - PostgreSQL does not

See Also