File Formats for Big Data

Choosing the right file format significantly impacts storage efficiency, query performance, and compatibility across the Hadoop/Spark ecosystem.

Format Comparison

Format	Type	Splittable	Schema	Compression	Best For
Text/CSV	Row	Yes (uncompressed)	No	External	Simple interchange
SequenceFile	Row (binary KV)	Yes	No	Block/Record	MR intermediate data
Avro	Row (binary)	Yes	Embedded	Block	Schema evolution, streaming
ORC	Columnar	Yes	Embedded	Column-level	Hive-optimized analytics
Parquet	Columnar	Yes	Embedded	Column-level	Cross-ecosystem analytics
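
The "Splittable" column matters because parallel readers are each handed a byte range of the file. As a rough illustration for uncompressed text, the sketch below shows one way a reader can start at an arbitrary offset and still emit every line exactly once. It is a toy in plain Python, similar in spirit to how line-oriented input is usually handled, not the code of any particular engine; `read_split` and the sample data are hypothetical names chosen for this example.

```python
def read_split(data: bytes, start: int, end: int) -> list:
    """Return the lines owned by the byte range [start, end).

    Convention (an assumption of this sketch): a split owns every line
    that begins after `start` and at or before `end`; the first split
    also owns the line at offset 0.
    """
    pos = start
    if start != 0:
        # Skip ahead past the next newline: whatever precedes it is
        # emitted by the previous split.
        newline = data.find(b"\n", start)
        if newline == -1:
            return []
        pos = newline + 1

    lines = []
    while pos <= end and pos < len(data):
        newline = data.find(b"\n", pos)
        if newline == -1:
            newline = len(data)
        lines.append(data[pos:newline].decode())
        pos = newline + 1
    return lines


data = b"a,1\nb,2\nc,3\nd,4\n"
splits = [(0, 6), (6, 12), (12, 16)]  # arbitrary byte ranges, not line-aligned
per_split = [read_split(data, start, end) for start, end in splits]
print(per_split)
```

This only works because a reader can seek to any offset and find the next record boundary. A gzip-compressed text file offers no such entry points, which is why the table marks Text/CSV as splittable only when uncompressed.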

Format	Use Case
Parquet	Default for analytics. Works with Spark, Presto, Hive, Impala
ORC	Hive-optimized workloads. Often compresses better than Parquet in Hive
Avro	Schema evolution, Kafka messages, streaming ingest
CSV/JSON	Data exchange, APIs, human-readable staging
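
To make "schema evolution" concrete, here is a minimal sketch of the idea behind Avro's schema resolution: a reader with a newer schema fills fields the writer did not know about from declared defaults, and ignores fields it does not declare. This is plain Python over dictionaries, not the Avro library; the `resolve` helper, the `User` schema, and the sample record are assumptions made for illustration.

```python
def resolve(record: dict, reader_schema: dict) -> dict:
    """Project a record written with an older schema onto the reader's schema."""
    resolved = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in record:
            resolved[name] = record[name]
        elif "default" in field:
            # Field added after the record was written: use the default.
            resolved[name] = field["default"]
        else:
            # A new field without a default cannot be resolved.
            raise ValueError(f"no value or default for field {name!r}")
    # Fields present in the record but absent from the reader schema are dropped.
    return resolved


# Hypothetical reader schema, written in the JSON shape Avro schemas use.
reader_schema_v2 = {
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "email", "type": "string"},
        {"name": "country", "type": "string", "default": "unknown"},
    ],
}

old_record = {"id": 42, "email": "a@example.com", "legacy_flag": True}
new_view = resolve(old_record, reader_schema_v2)
print(new_view)
```

The practical consequence is that adding a field with a default lets new readers consume old data, while adding a field without one does not.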

Columnar formats (Parquet, ORC) read only the columns a query needs, which is critical for wide analytical tables.
Predicate pushdown: the query engine pushes filters down to the storage layer, which uses per-chunk statistics (such as min/max values) to skip row groups (Parquet) or stripes (ORC) that cannot contain matching rows.
Parquet and ORC both support nested data structures (structs, lists, maps).
Avro is the most common choice for Kafka message serialization with Schema Registry.
Prefer Parquet or ORC for large analytical datasets; use CSV only for interchange.
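
Column pruning and predicate pushdown can be sketched with a toy model. The code below is not Parquet or ORC; it is a plain-Python stand-in where each `RowGroup` stores values per column and keeps min/max statistics, loosely mimicking the metadata a columnar file carries. The class, the `scan` function, and the sample data are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class RowGroup:
    """Toy row group: column-wise values plus per-column min/max statistics."""
    columns: dict
    stats: dict = field(init=False)

    def __post_init__(self):
        # Computed once up front, standing in for statistics that a real
        # writer would store in file metadata at write time.
        self.stats = {
            name: (min(values), max(values))
            for name, values in self.columns.items()
        }


def scan(row_groups, select, filter_col, lower_bound):
    """Return `select` columns for rows where filter_col >= lower_bound."""
    rows, groups_read = [], 0
    for group in row_groups:
        _, col_max = group.stats[filter_col]
        if col_max < lower_bound:
            continue  # pushdown: statistics rule this group out, skip its data
        groups_read += 1
        for i, value in enumerate(group.columns[filter_col]):
            if value >= lower_bound:
                # pruning: only the requested columns are ever touched
                rows.append({name: group.columns[name][i] for name in select})
    return rows, groups_read


groups = [
    RowGroup({"user_id": [1, 2, 3], "amount": [5, 12, 7], "country": ["DE", "US", "FR"]}),
    RowGroup({"user_id": [4, 5, 6], "amount": [150, 220, 90], "country": ["US", "JP", "DE"]}),
    RowGroup({"user_id": [7, 8, 9], "amount": [30, 45, 10], "country": ["BR", "US", "IN"]}),
]

rows, groups_read = scan(groups, select=["user_id"], filter_col="amount", lower_bound=100)
print(rows, groups_read)
```

Here only one of the three groups is read, and the `country` column is never accessed. Skipping works best when data is sorted or clustered on the filter column, since overlapping min/max ranges force more groups to be read.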