Zero-Copy and Disk I/O

Kafka achieves high throughput through sequential disk I/O, heavy use of the OS page cache, and zero-copy transfer via sendfile(), so the CPU is rarely the bottleneck.

Key Facts

Kafka uses the sendfile() system call (zero-copy) to transfer data from disk to the network socket, bypassing userspace buffers
Data on disk is stored in the same binary format that is sent over the wire, so the broker does not need to transform it before sending
Sequential disk I/O can, in some cases, match or beat random memory access in throughput
Data flows: disk -> kernel page cache -> NIC (network card) via DMA; the CPU does not copy the payload
Typical bottleneck order: the network saturates first, then disk; CPU is almost never the limit
A modern 10 Gbps datacenter network (about 1.25 GB/s) can serve dozens of independent readers per topic, depending on the topic's write rate
Multiple consumers are very cheap because recently written data is served from the page cache
Zero-copy works ONLY when SSL/TLS is NOT enabled on the broker listener
Messages are batched and compressed at the batch level, stored compressed on disk, and decompressed by consumers
Batch headers store the largest timestamp (the latest CreateTime) of the contained messages
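
As a rough check on the fan-out claim above, the arithmetic can be sketched in Python. The function name, the 50 MB/s write rate, and the 70% usable-bandwidth factor are assumptions chosen for illustration, not measured values.

```python
def max_independent_readers(link_gbps, topic_write_mb_per_s, usable_fraction=0.7):
    """Estimate how many independent readers one broker NIC can feed.

    Every reader pulls the full topic stream, so egress grows linearly
    with the number of readers. usable_fraction is a hypothetical
    allowance for replication traffic and protocol overhead.
    """
    link_mb_per_s = link_gbps * 1000 / 8              # 10 Gbps -> 1250 MB/s
    usable_mb_per_s = link_mb_per_s * usable_fraction
    return int(usable_mb_per_s // topic_write_mb_per_s)
```

Under these assumptions, max_independent_readers(10, 50) gives 17 readers for a topic written at 50 MB/s, or 25 if the whole link were usable (usable_fraction=1.0).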

Patterns

How Zero-Copy Works

Traditional (without zero-copy):
  Disk -> Kernel Buffer -> User Space Buffer -> Socket Buffer -> NIC
  (4 copies, 4 context switches)

Kafka (with zero-copy / sendfile):
  Disk -> Kernel Page Cache -> NIC (via DMA)
  (2 DMA copies with a scatter-gather capable NIC, 2 context switches for the single sendfile() call, no CPU copy of the payload)
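
The two paths above can be compared with a minimal Python sketch. It is not Kafka's implementation (Kafka is written in Java); it only illustrates the same operating-system mechanism. The helper names are hypothetical, and socket.sendfile() uses os.sendfile() where the platform supports it and otherwise falls back to plain send().

```python
import os
import socket
import tempfile
import threading


def send_with_copy(path, sock, chunk_size=64 * 1024):
    """Traditional path: read() into a user-space buffer, then send() it."""
    sent = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)   # kernel -> user-space copy
            if not chunk:
                break
            sock.sendall(chunk)          # user-space -> socket buffer copy
            sent += len(chunk)
    return sent


def send_zero_copy(path, sock):
    """Zero-copy path: the kernel moves file pages to the socket directly."""
    with open(path, "rb") as f:
        return sock.sendfile(f)


def transfer(sender, path):
    """Send a file over a local socket pair; return (bytes_sent, bytes_received)."""
    left, right = socket.socketpair()
    received = bytearray()

    def drain():
        while True:
            data = right.recv(65536)
            if not data:
                break
            received.extend(data)

    reader = threading.Thread(target=drain)
    reader.start()
    try:
        sent = sender(path, left)
    finally:
        left.close()                     # signals EOF to the reader thread
    reader.join()
    right.close()
    return sent, bytes(received)


def demo():
    """Run both senders on the same temporary file and check the payload."""
    payload = os.urandom(256 * 1024)
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(payload)
        path = tmp.name
    try:
        results = {}
        for name, sender in (("copy", send_with_copy), ("sendfile", send_zero_copy)):
            sent, received = transfer(sender, path)
            results[name] = (sent, received == payload)
        return results
    finally:
        os.remove(path)
```

Both senders deliver identical bytes; the difference is that send_with_copy pulls every chunk through the Python process, while send_zero_copy lets the kernel hand the data to the socket.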

Why Kafka Is Fast

1. Sequential I/O:
   - Append-only writes (no random seeks)
   - Sequential reads (consumers read in order)
   - HDD sequential: roughly 200-300 MB/s vs random: roughly 0.1-1 MB/s (order-of-magnitude figures)

2. Page Cache:
   - OS caches recently written/read data
   - Hot data served from memory without Kafka knowing
   - JVM heap stays clean (no GC pressure from data)

3. Batching:
   - Producer accumulates messages into batches
   - One batch = one disk write, one network send
   - Example: 100 messages in one batch = 1 Kafka write

4. Compression:
   - Batch-level compression (lz4/zstd/snappy/gzip)
   - Stored compressed, transferred compressed
   - Consumer decompresses on read
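
The effect of batching plus batch-level compression can be sketched as follows. This is an illustration under stated assumptions: zlib stands in for the codecs listed above because it ships with Python, the order-event payloads are made up, and the length-prefix framing is hypothetical rather than Kafka's record batch format.

```python
import json
import zlib


def make_messages(count=100):
    """Build small, similar JSON payloads (hypothetical order events)."""
    return [
        json.dumps({"order_id": i, "status": "CREATED", "currency": "USD"}).encode()
        for i in range(count)
    ]


def pack_batch(messages):
    """Frame messages with a 4-byte length prefix and compress them together.

    The framing is illustrative only; it is not Kafka's record batch format.
    """
    framed = b"".join(len(m).to_bytes(4, "big") + m for m in messages)
    return zlib.compress(framed)


def unpack_batch(blob):
    """Consumer side: decompress once, then split the frames back out."""
    framed = zlib.decompress(blob)
    messages, pos = [], 0
    while pos < len(framed):
        size = int.from_bytes(framed[pos:pos + 4], "big")
        messages.append(framed[pos + 4:pos + 4 + size])
        pos += 4 + size
    return messages


def compare_sizes(messages):
    """Return (bytes if each message is compressed alone, bytes for one batch)."""
    individually = sum(len(zlib.compress(m)) for m in messages)
    batched = len(pack_batch(messages))
    return individually, batched
```

Calling compare_sizes(make_messages()) shows the batch compressing to far fewer bytes than the same messages compressed one by one, because the compressor can exploit repetition across messages. The compressed blob is what would be written and sent once, and unpack_batch() plays the role of the consumer.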

Segment File Structure

/var/kafka/data/
  orders-0/                          # topic "orders", partition 0
    00000000000000000000.log         # segment file (records)
    00000000000000000000.index       # offset-to-position index
    00000000000000000000.timeindex   # timestamp-to-offset index
    00000000000000123456.log         # next segment (starts at offset 123456)
    leader-epoch-checkpoint

.log - actual message data (batch-compressed)
.index - sparse offset-to-byte-position mapping
.timeindex - sparse timestamp-to-offset mapping
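
A minimal sketch of how this layout supports lookups, assuming the 20-digit zero-padded naming shown above. The function names are hypothetical, and the index is modeled as a list of absolute (offset, byte position) pairs for simplicity; the on-disk index format differs in detail.

```python
import bisect


def segment_filename(base_offset, suffix="log"):
    """Name a segment file after its base offset, zero-padded to 20 digits."""
    return f"{base_offset:020d}.{suffix}"


def find_segment(base_offsets, target_offset):
    """Pick the segment whose base offset is the largest one <= target_offset.

    base_offsets must be sorted in ascending order.
    """
    i = bisect.bisect_right(base_offsets, target_offset) - 1
    if i < 0:
        raise ValueError("offset is older than the first segment")
    return base_offsets[i]


def lookup_position(index_entries, target_offset):
    """Return the byte position to start scanning from in the .log file.

    index_entries is a sorted list of (offset, byte_position) pairs. The
    index is sparse, so the result is the position of the nearest indexed
    offset at or below the target; the reader scans forward from there.
    """
    offsets = [offset for offset, _ in index_entries]
    i = bisect.bisect_right(offsets, target_offset) - 1
    if i < 0:
        return 0
    return index_entries[i][1]
```

For example, with segments starting at offsets 0 and 123456, a read of offset 200000 resolves to the file 00000000000000123456.log, and the sparse index then narrows the scan to a nearby byte position instead of the start of the file.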

Gotchas

SSL disables zero-copy - with SSL/TLS enabled, data must be encrypted in userspace, which adds CPU overhead and eliminates the sendfile() optimization
JVM GC is NOT a concern for data path - Kafka stores data in page cache (outside JVM heap); GC pressure comes from connection handling and metadata, not data
Page cache thrashing - if the broker serves too many different partitions that don't fit in RAM, page cache evictions cause disk reads; monitor cache hit ratio
Slow writers keep the active segment open for a long time - actual data retention can be much longer than configured because retention applies only to closed segments; for low-volume topics, consider lowering segment.ms or segment.bytes
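
The retention gotcha can be made concrete with a simplified model. The function name is hypothetical, the model ignores how often the broker checks for deletable segments, and the example values (1 day retention, 7 day segment.ms, 1 GiB segment.bytes) are assumptions for illustration.

```python
def worst_case_record_age_ms(retention_ms, segment_ms, segment_bytes, write_bytes_per_s):
    """Upper bound on how long the first record of a segment can survive.

    Simplified model: the active segment rolls when it reaches segment_bytes
    or segment_ms of age, whichever comes first. Only a closed segment is
    eligible for deletion, and it is kept until its newest record is older
    than retention_ms.
    """
    if write_bytes_per_s > 0:
        fill_ms = segment_bytes / write_bytes_per_s * 1000
    else:
        fill_ms = float("inf")           # an idle topic never fills the segment
    roll_ms = min(segment_ms, fill_ms)
    return roll_ms + retention_ms
```

With a writer producing 100 bytes per second, the segment rolls on time rather than size, so a record can live for about 8 days despite a 1 day retention setting. A writer producing 10 MB/s fills the segment in under two minutes, and the effective retention stays close to the configured value.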
