Skip to content

Greenplum and MPP Architecture

Greenplum is an MPP (Massive Parallel Processing) analytical database based on PostgreSQL. It uses shared-nothing architecture where each node has its own CPU, memory, and disk.

Architecture

Client -> Master (QD: Query Dispatcher)
              |
    +---------+---------+
    Segment1  Segment2  Segment3  (QE: Query Executors)
    (mirror)  (mirror)  (mirror)
  • Master (QD): accepts queries, builds plan, dispatches to segments, aggregates results. Does NOT store user data
  • Segments (QE): store data, execute in parallel. Each is a PostgreSQL instance
  • Mirrors: standby copies for fault tolerance

Motion Operators

Operator What It Does Cost
Redistribute Motion Reshuffle data by hash key Expensive
Broadcast Motion Full copy of small table to all segments Medium
Gather Motion Collect results to master Light

Distribution Strategies

-- Hash distribution (best for JOIN keys)
CREATE TABLE orders (...) DISTRIBUTED BY (order_id);

-- Random (default)
CREATE TABLE logs (...) DISTRIBUTED RANDOMLY;

-- Replicated (small dimension tables)
CREATE TABLE countries (...) DISTRIBUTED REPLICATED;

Key Optimization

Distribute both tables in a JOIN by the join key to eliminate Redistribute Motion:

ALTER TABLE lineitem SET DISTRIBUTED BY (l_orderkey);
ALTER TABLE orders SET DISTRIBUTED BY (o_orderkey);
-- Redistribute Motion disappears from EXPLAIN plan

ANALYZE (Critical)

ANALYZE table_name;

Always run ANALYZE after bulk data changes. Greenplum optimizer depends on statistics. Without fresh stats after INSERT/UPDATE, optimizer uses stale estimates leading to suboptimal plans.

PostgreSQL auto-analyzes reasonably. Greenplum does NOT - explicit ANALYZE required.

Greenplum vs PostgreSQL: Indexes

  • PostgreSQL: create index -> optimizer uses it -> faster queries
  • Greenplum: create index -> optimizer ignores it (prefers sequential scan across segments)
-- Force index usage (rarely needed)
SET seq_page_cost = 100;
SET random_page_cost = 1;
SET optimizer = off;  -- GPORCA ignores hints

In MPP, sequential scan across many segments is often faster than index lookup.

Compression (TPC-H lineitem, scale=10)

Storage Size
Row-oriented (default) 34 GB
Row + compression 11 GB
Columnar + compression 7.5 GB

Table Types

Type Persistence Logged Visible
Permanent Survives restart Yes (WAL) All sessions
Temporary Session only No Current session
Unlogged Survives session, lost on crash No All sessions
External Data outside DB (S3, HDFS via PXF) N/A All sessions

Gotchas

  • Indexes are rarely useful in Greenplum - sequential scan dominates
  • ANALYZE must be run manually after bulk operations
  • External tables have no statistics - optimizer estimates heuristically
  • Network latency kills cross-datacenter cluster performance
  • GPORCA optimizer ignores hints - must disable for index forcing

See Also