Introduction
In this blog, we’ll explore the journey of code execution in Spark, breaking down each step of the process to help you understand how Spark translates your instructions into actionable tasks across a cluster.
Spark follows a master-slave architecture where the driver coordinates the execution, and the executors run the actual tasks.
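To make this concrete, here is a minimal PySpark sketch of the driver side of that architecture. The app name is hypothetical, and local mode is used only so the example is self-contained; in a real cluster the executors would be launched by a cluster manager such as YARN or Kubernetes.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("execution-walkthrough")  # hypothetical application name
    .master("local[4]")                # local mode: 4 threads stand in for executors
    .getOrCreate()
)

# The driver owns the SparkContext and coordinates all work sent to executors.
sc = spark.sparkContext
print(sc.defaultParallelism)           # parallelism available to this application
```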
The Making of a Spark Job
In Spark, a job is created when an action such as collect(), count(), or saveAsTextFile() is called on a transformation chain. Transformations (e.g., map(), filter()) are lazy operations that define how data should be processed but don’t trigger execution on their own.
Once an action is called, Spark creates a DAG (Directed Acyclic Graph) of stages that represents the series of operations needed to compute the result.
The job is created only when the action is called, and Spark starts breaking the job into stages.
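A short sketch of this lazy behavior, continuing with the `spark` session from the earlier example (the data and lambdas are illustrative only):

```python
rdd = spark.sparkContext.parallelize(range(1, 101), numSlices=4)

# Transformations are lazy: they only record the lineage; nothing runs yet.
squared = rdd.map(lambda x: x * x)
evens = squared.filter(lambda x: x % 2 == 0)

# The action triggers a job: Spark now builds the DAG of stages and executes it.
print(evens.count())  # 50
```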
Behind the Division: How Spark Splits Jobs into Stages
The DAG Scheduler is responsible for dividing a job into stages.
Criteria for Dividing a Job into Stages
Spark breaks a job into stages based on narrow and wide transformations.
Narrow transformations (e.g., map(), filter()) do not require shuffling data between partitions and can be executed in the same stage.
Wide transformations (e.g., groupByKey(), reduceByKey(), join()) require shuffling data between nodes and, therefore, trigger a stage boundary.
Each stage consists of a set of transformations that can be pipelined together (i.e., no shuffling is required within a stage).
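The sketch below shows both kinds of transformations on a small pair RDD (the sample data is made up for illustration). The narrow map() and filter() are pipelined into one stage, while reduceByKey() requires a shuffle and therefore starts a new stage.

```python
pairs = spark.sparkContext.parallelize(
    [("a", 1), ("b", 2), ("a", 3), ("b", 4)], 2
)

# Narrow: each output partition depends on exactly one input partition,
# so these two transformations are pipelined within the same stage.
filtered = pairs.filter(lambda kv: kv[1] > 1)
mapped = filtered.map(lambda kv: (kv[0], kv[1] * 10))

# Wide: values for the same key must be gathered from all partitions,
# so reduceByKey introduces a shuffle and ends the current stage.
summed = mapped.reduceByKey(lambda a, b: a + b)
print(summed.collect())  # e.g., [('b', 60), ('a', 30)]
```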
Steps for Dividing into Stages
The DAG Scheduler looks at the logical plan and identifies the shuffle boundaries (i.e., wide transformations like join, distinct, groupBy, etc.).
Each shuffle boundary signifies a stage break. The DAG Scheduler creates multiple stages, each representing a pipeline of transformations that can be executed without shuffling.
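You can inspect these shuffle boundaries yourself. As a rough sketch: for RDDs, toDebugString() prints the lineage with extra indentation at each shuffle dependency, and for DataFrames, explain() shows Exchange operators where stages will be split (the example data below is arbitrary).

```python
rdd = spark.sparkContext.parallelize(range(100), 4)
grouped = rdd.map(lambda x: (x % 3, x)).groupByKey()
print(grouped.toDebugString().decode("utf-8"))  # lineage with shuffle boundaries

df = spark.range(1000)
df.groupBy((df.id % 5).alias("bucket")).count().explain()  # look for Exchange
```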
From Stages to Tasks: How Does Spark Do It?
The DAG Scheduler submits stages to the Task Scheduler, which divides each stage into tasks based on the number of partitions in the RDD or DataFrame; each partition is processed by exactly one task.
For example, if a stage has an RDD with 10 partitions, Spark will create 10 tasks, one for each partition.
The Task Scheduler creates a task for every partition in the stage and assigns it to an executor for execution.
Key Points:
Number of Tasks = Number of Partitions.
Tasks are independent units of work executed on different data partitions.
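A quick sketch of this one-task-per-partition rule (partition counts here are chosen arbitrarily for illustration):

```python
rdd = spark.sparkContext.parallelize(range(1000), numSlices=10)
print(rdd.getNumPartitions())        # 10 -> Spark schedules 10 tasks for this stage

# Changing the partition count changes how many tasks the next stage gets.
coalesced = rdd.coalesce(4)
print(coalesced.getNumPartitions())  # 4 -> 4 tasks for the stage that reads it
```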
Task Execution
The Task Scheduler is responsible for dispatching tasks to the cluster, where executors run them. Executors are launched on worker nodes by the cluster manager, and the Task Scheduler assigns tasks to these executors.
Once tasks are assigned to executors, they are executed in parallel on worker nodes.
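How much parallelism you get depends on the executor resources requested from the cluster manager. The sketch below uses standard Spark configuration properties, but the values are illustrative only and the app name is hypothetical; actual sizing depends on your cluster.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("task-execution-demo")           # hypothetical app name
    .config("spark.executor.instances", "4")  # number of executors (e.g., on YARN/K8s)
    .config("spark.executor.cores", "2")      # concurrent tasks per executor
    .config("spark.executor.memory", "2g")    # heap memory per executor
    .getOrCreate()
)

# With 4 executors x 2 cores each, up to 8 tasks can run in parallel at a time.
```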
Summary of Spark Execution Architecture
Job Creation: A Spark job is created when an action is called on an RDD/DataFrame, triggering the creation of a DAG of stages.
Stages Division: The DAG Scheduler divides the job into stages based on shuffle boundaries (wide transformations like groupByKey).
Tasks Division: The Task Scheduler splits each stage into tasks, one for each data partition in the RDD/DataFrame.
Task Scheduling: The Task Scheduler assigns tasks to executors based on available resources and data locality.
Task Execution: Executors run tasks, process data, store intermediate results, and send final results back to the driver.
Thank you for reading! Please share your thoughts or questions in the comment section below.