Spark Streaming

★★★★★ Intermediate

Spark Streaming processes real-time data through micro-batching: the incoming stream is split into small batches, and each batch runs as a regular Spark job. Structured Streaming (recommended) builds on Spark SQL and treats the stream as an unbounded table that is queried with the DataFrame API.
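To make the micro-batch idea concrete, here is a minimal pure-Python sketch (not Spark code) that groups timestamped events into fixed-interval batches. The function name `micro_batches` and the sample events are hypothetical, and grouping by arrival time is a simplification of what the engine does.

```python
from collections import defaultdict


def micro_batches(events, interval):
    """Group (arrival_time_seconds, payload) pairs into fixed-interval batches."""
    batches = defaultdict(list)
    for arrival_time, payload in events:
        # Every event that arrives within the same interval lands in the same batch
        batch_index = int(arrival_time // interval)
        batches[batch_index].append(payload)
    # One list per interval that received data, in arrival order
    return [batches[i] for i in sorted(batches)]


# Hypothetical sample stream: (arrival time in seconds, payload)
events = [(0.2, "a"), (0.7, "b"), (1.1, "c"), (2.9, "d")]
batches = micro_batches(events, interval=1.0)
print(batches)
```

Each inner list stands for one small batch job; a record is never processed alone, only together with the rest of its batch.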

Two Approaches

	Classic DStreams	Structured Streaming
Abstraction	Stream as micro-batch RDDs	Stream as infinite DataFrame
Side effects	Allowed	Input -> transform -> output only
API	Lower-level	Same DataFrame API as batch
Status	Legacy	Recommended

Structured Streaming Pattern

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("streaming").getOrCreate()

# Source (Kafka); needs the spark-sql-kafka-0-10 connector package on the classpath
df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "host:9092") \
    .option("subscribe", "topic_name") \
    .load()

# Process: Kafka delivers key and value as binary, so cast the payload to a string
result = df.selectExpr("CAST(value AS STRING) AS value")

# Sink
query = result.writeStream \
    .outputMode("append") \
    .format("console") \
    .start()

query.awaitTermination()
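The query above runs with the default trigger, which starts the next micro-batch as soon as the previous one has finished. As a hedged sketch, the helper below maps a few descriptive names to keyword arguments accepted by PySpark's `DataStreamWriter.trigger`. The names `trigger_kwargs` and `start_console_query` are hypothetical, and `availableNow` assumes Spark 3.3 or later.

```python
def trigger_kwargs(mode, interval=None):
    """Return keyword arguments for DataStreamWriter.trigger()."""
    if mode == "available_now":
        # Process everything currently available, then stop the query
        return {"availableNow": True}
    if mode in ("fixed_interval", "continuous"):
        if interval is None:
            raise ValueError(f"{mode} needs an interval such as '10 seconds'")
        # processingTime: micro-batch interval
        # continuous: checkpoint interval of the experimental continuous mode
        key = "processingTime" if mode == "fixed_interval" else "continuous"
        return {key: interval}
    raise ValueError(f"unknown trigger mode: {mode}")


def start_console_query(df, mode, interval=None):
    """Start a console query on a streaming DataFrame with the chosen trigger."""
    return (
        df.writeStream
        .outputMode("append")
        .format("console")
        .trigger(**trigger_kwargs(mode, interval))
        .start()
    )


print(trigger_kwargs("fixed_interval", "10 seconds"))
```

With the `result` DataFrame from the pattern above, a call such as `start_console_query(result, "fixed_interval", "10 seconds")` would run one micro-batch roughly every ten seconds.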

DStreams API (Legacy)

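from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="dstreams")
ssc = StreamingContext(sc, batchDuration=1)  # 1-second micro-batches

# Define source and transformations; at least one output operation must be registered before start()
lines = ssc.socketTextStream("localhost", 9999)  # placeholder host and port
lines.pprint()

ssc.start()
ssc.awaitTermination()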

Delivery Guarantees

Guarantee	Meaning	Trade-off
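At-most-once	Messages may be lost, never duplicated	Lowest overhead, but data loss is possible
At-least-once	Messages never lost, may be duplicated	Requires deduplication or idempotent writes downstream
Exactly-once	Each message affects the result exactly once	Most complex; needs a replayable source, checkpointing, and an idempotent or transactional sink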
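One common way to get an exactly-once effect on top of at-least-once delivery is an idempotent sink that remembers which batches it has already committed. The sketch below is pure Python with hypothetical names (`IdempotentSink`, `write_batch`). In Structured Streaming the same idea is usually applied inside a `foreachBatch` function, which receives the batch DataFrame together with a batch ID.

```python
class IdempotentSink:
    """In-memory stand-in for a sink that ignores replayed batches."""

    def __init__(self):
        self.rows = []
        self.committed_batch_ids = set()

    def write_batch(self, batch_id, rows):
        """Store rows once per batch ID; return False if the batch was a replay."""
        if batch_id in self.committed_batch_ids:
            return False
        # A real sink should commit the rows and the batch ID in one transaction
        self.rows.extend(rows)
        self.committed_batch_ids.add(batch_id)
        return True


sink = IdempotentSink()
sink.write_batch(0, ["a", "b"])
sink.write_batch(1, ["c"])
replayed = sink.write_batch(1, ["c"])  # redelivery after a simulated failure
print(sink.rows, replayed)
```

The redelivered batch is dropped, so the stored rows look as if every message had been delivered exactly once, even though the transport only promised at-least-once.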

Typical Pipeline

Kafka -> Spark Structured Streaming -> Database/DWH

Key Facts

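DataFrame actions such as count(), collect(), and show() cannot be called on a streaming DataFrame; results leave the pipeline only through writeStream
Windowed aggregations group rows into time-based windows, typically on an event-time column
Streaming data can be joined with static (optionally cached) DataFrames
Triggers: default (next micro-batch starts as soon as the previous one finishes), fixed processing-time interval, one-time/available-now, or continuous (experimental). Batch size is limited through source options such as Kafka's maxOffsetsPerTrigger, not through a trigger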
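As a sketch of the time-window point, the function below counts rows per tumbling event-time window and sets a watermark for late data. It assumes a streaming DataFrame with a timestamp column named `timestamp`; the column name, the durations, and both function names are illustrative. The small pure-Python helper shows the bucket arithmetic for windows aligned to the Unix epoch, which is how `window` aligns tumbling windows when no start offset is given.

```python
def window_start(epoch_seconds, window_seconds):
    """Start of the epoch-aligned tumbling window that contains the timestamp."""
    return epoch_seconds - (epoch_seconds % window_seconds)


def windowed_counts(events, window="5 minutes", lateness="10 minutes"):
    """Count rows per event-time window on a streaming DataFrame."""
    # Imported here so window_start() can be used without PySpark installed
    from pyspark.sql import functions as F

    return (
        events
        .withWatermark("timestamp", lateness)    # bound how long late rows are accepted
        .groupBy(F.window("timestamp", window))  # tumbling windows on event time
        .count()
    )


# An event at epoch second 1_700_000_123 falls into the 5-minute window starting here
print(window_start(1_700_000_123, 300))
```

The DataFrame returned by `windowed_counts` is still a streaming DataFrame, so it has to be written out with `writeStream` like any other query.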

Gotchas

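Spark Streaming is NOT true record-at-a-time streaming: by default it is micro-batching, so end-to-end latency includes batch scheduling overhead (continuous processing is an experimental exception)
Development and debugging are harder than for batch jobs; use Jupyter for prototyping
Structured Streaming output modes: append (only new rows), complete (entire result table), update (only changed rows); not every mode is valid for every query
State management adds complexity for aggregations over time windows; watermarks bound the state and a checkpoint location lets it survive restarts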
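The three output modes differ in which rows of the result table reach the sink after each trigger. The following pure-Python sketch illustrates the difference, with result tables modeled as dicts from key to aggregate value. The function `emitted_rows` and the sample counts are hypothetical, and the model is a simplification that ignores watermarks and the rules for when a row counts as final.

```python
def emitted_rows(previous, current, mode):
    """Rows written to the sink for one trigger, given old and new result tables."""
    if mode == "complete":
        # The whole result table is written every time
        return dict(current)
    if mode == "update":
        # Only rows that are new or whose value changed since the last trigger
        return {
            key: value
            for key, value in current.items()
            if key not in previous or previous[key] != value
        }
    if mode == "append":
        # Only rows added since the last trigger; existing rows are never rewritten
        return {key: value for key, value in current.items() if key not in previous}
    raise ValueError(f"unknown output mode: {mode}")


# Hypothetical running counts before and after one trigger
previous = {"spark": 2, "kafka": 1}
current = {"spark": 3, "kafka": 1, "flink": 1}

for mode in ("complete", "update", "append"):
    print(mode, emitted_rows(previous, current, mode))
```

In this model only `complete` rewrites unchanged rows, which is why it suits small aggregated tables, while `append` fits sinks that cannot update what they have already written.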