MapReduce¶
MapReduce is a programming model for distributed data processing. It is built around two user-defined operations: Map (transform/filter) and Reduce (aggregate). The framework guarantees that all records with the same key arrive at the same Reducer.
Paradigm¶
Map: (K1, V1) -> list(K2, V2) -- transform each record
Reduce: (K2, list(V2)) -> list(K3, V3) -- aggregate by key
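The two signatures can be illustrated with the canonical word-count job, sketched here in plain Python (the names `map_fn`, `reduce_fn`, and `run` are illustrative, not framework API):

```python
from collections import defaultdict

def map_fn(_offset, line):
    # Map: (K1=offset, V1=line) -> list of (K2=word, V2=1)
    return [(word, 1) for word in line.split()]

def reduce_fn(word, counts):
    # Reduce: (K2=word, list(V2)) -> list of (K3=word, V3=total)
    return [(word, sum(counts))]

def run(lines):
    # Simulate the framework: map every record, group by key, reduce per key
    grouped = defaultdict(list)
    for offset, line in enumerate(lines):
        for k, v in map_fn(offset, line):
            grouped[k].append(v)
    out = []
    for k in sorted(grouped):          # Reducers see keys in sorted order
        out.extend(reduce_fn(k, grouped[k]))
    return out

print(run(["to be or not", "to be"]))
# -> [('be', 2), ('not', 1), ('or', 1), ('to', 2)]
```

The grouping step in `run` is exactly the guarantee from above: every `(word, 1)` pair with the same word reaches the same `reduce_fn` call.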
Execution Pipeline¶
- Split: InputFormat splits data into chunks. Each split -> one Mapper
- Map: Each Mapper processes split via RecordReader. Output to circular buffer
- Spill: Buffer fills -> data spilled to local disk (NOT HDFS). Partitioned by hash(key) % numReducers
- Sort/Merge: Mapper sorts and merge-sorts spill files per Reducer partition
- Shuffle: Reducers pull data from Mappers (v2 reversed direction from v1)
- Reduce: Merge-sort received files, process sorted key-value stream
- Output: One output file per Reducer on HDFS
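The partitioning in the Spill step can be sketched as follows (Python's built-in `hash` stands in for the key's Java `hashCode`; `abs()` plays the role of Hadoop's sign-bit mask):

```python
def partition(key, num_reducers):
    # Mimics Hadoop's default HashPartitioner: hash(key) % numReducers.
    # abs() keeps the result non-negative, as the sign-bit mask does in Java.
    return abs(hash(key)) % num_reducers

# The same key always maps to the same partition, so all of its
# records end up on the same Reducer.
assert partition("apple", 3) == partition("apple", 3)
```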
Hadoop Streaming (Python)¶
hadoop jar $HADOOP_MAPRED_HOME/hadoop-streaming.jar \
    -mapper mapper.py \
    -reducer reducer.py \
    -file /local/path/mapper.py \
    -file /local/path/reducer.py \
    -input /hdfs/input/ \
    -output /hdfs/output/
Key Details¶
- Split != HDFS block: Splits are logical (can be from DB, S3); blocks are physical
- InputFormat types: TextInputFormat (default), NLineInputFormat (N lines per mapper), DBInputFormat
- Number of Mappers: determined by InputFormat split logic, not directly configurable
- Number of Reducers: configurable via mapreduce.job.reduces
- _SUCCESS file: empty marker indicating job completion. Files starting with _ or . are ignored by subsequent jobs
- Output directory must NOT exist before job launch (safety against overwriting)
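With Streaming, the reducer count can be passed as a generic -D option, which must come before the streaming-specific options (the value 4 is just an example):

```shell
hadoop jar $HADOOP_MAPRED_HOME/hadoop-streaming.jar \
    -D mapreduce.job.reduces=4 \
    -mapper mapper.py \
    -reducer reducer.py \
    -input /hdfs/input/ \
    -output /hdfs/output/
```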
MapReduce v1 vs v2¶
- v1: Mappers pushed data to Reducers -> DDoS on Reducers
- v2: Reducers pull from Mappers (reversed direction)
Gotchas¶
- Container initialization overhead dominates for small datasets (40s for 100KB with 31 mappers vs 9s with 3)
- Split is NOT the same as HDFS block - different abstraction layers
- Hadoop Streaming (Python) lacks some Java MR features (no setup/cleanup hooks)
- MapReduce is legacy - prefer Spark for new workloads
See Also¶
- hadoop hdfs - storage layer
- yarn resource management - resource allocation
- apache hive - SQL interface to MapReduce
- apache spark core - modern replacement