Inside DrMark’s Lab

Inside Spark RDDs: From Logical Handles to Physical Bytes

A Microaggression-Free Dive into Spark’s Actual Memory Layout

Oct 17, 2025

Spark looks simple on the surface, yet under the hood it weaves together a compact set of data structures with careful execution rules. This article focuses on what an RDD is, how it becomes physical data in HDFS, databases, or RAM, how broadcasts and accumulators share state safely, and how Spark optimizes the work without drowning you in boilerplate.

What an RDD Really Is

A Resilient Distributed Dataset (RDD) is an immutable handle over a distributed collection. It keeps just enough metadata to answer four questions: how many partitions exist, where each partition prefers to run, how to compute a given partition, and how to rebuild lost data from its parents. That last item is lineage, the compact recipe that makes RDDs resilient. You can chain transformations freely, Spark delays materialization until you ask for an action, and when a node fails Spark recomputes only the missing partitions.
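To make the laziness concrete, here is a minimal sketch using the standard Spark API; the object name, app name, and data sizes are illustrative. The map and filter calls only record lineage, and nothing executes until the reduce action fires:

```scala
import org.apache.spark.sql.SparkSession

object LineageDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("lineage-demo")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // Transformations build lineage; no work happens yet.
    val nums    = sc.parallelize(1 to 1000000, numSlices = 8)
    val squared = nums.map(n => n.toLong * n)
    val evens   = squared.filter(_ % 2 == 0)

    // The action forces materialization, partition by partition.
    val total = evens.reduce(_ + _)
    println(s"sum of even squares: $total")

    // toDebugString prints the lineage Spark would replay after a failure.
    println(evens.toDebugString)

    spark.stop()
  }
}
```

If a node holding some partitions of evens dies, Spark does not restart the job; it walks this recorded lineage backwards and recomputes only the lost partitions.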

At the implementation level, an RDD is a small Scala object with three essential hooks: it can return the array of partition descriptors, it can produce an iterator for a given partition, and it can suggest preferred locations for that partition. Everything else builds around these hooks, including caching, scheduling, and recovery.
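A toy subclass makes the three hooks visible. The following sketch is illustrative, not Spark source code: RangeRDD and RangePartition are hypothetical names, and the class simply generates integer ranges so each hook stays obvious:

```scala
import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// Hypothetical partition descriptor: each partition covers a range of integers.
case class RangePartition(index: Int, start: Int, end: Int) extends Partition

// A toy RDD that generates integer ranges, implementing the three hooks
// every RDD provides: partitions, per-partition compute, preferred locations.
class RangeRDD(sc: SparkContext, upTo: Int, slices: Int)
    extends RDD[Int](sc, Nil) { // Nil: no parent RDDs, so lineage stops here

  // Hook 1: return the array of partition descriptors.
  override protected def getPartitions: Array[Partition] = {
    val step = math.ceil(upTo.toDouble / slices).toInt
    (0 until slices).map { i =>
      RangePartition(i, i * step, math.min((i + 1) * step, upTo))
    }.toArray
  }

  // Hook 2: compute one partition as an iterator, on demand.
  override def compute(split: Partition, context: TaskContext): Iterator[Int] = {
    val p = split.asInstanceOf[RangePartition]
    (p.start until p.end).iterator
  }

  // Hook 3: locality hints; a generated range has no preferred hosts.
  override protected def getPreferredLocations(split: Partition): Seq[String] = Nil
}
```

Notice that caching and recovery never appear in the subclass; Spark layers both on top of these same three hooks.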
