Understanding Big Data Solutions #1: Apache Spark

Anna C S Medeiros
9 min read · May 8, 2024


If you are here, you are probably interested in improving your development environment with Spark. Well, sit tight: I am here to show you how Spark works and how it compares to the alternatives, so you can make your choice wisely. Let's begin.

Photo by Morgan Sessions on Unsplash

Table of Contents

  • Origins
  • What is Spark?
  • Spark Components
  • Spark Software Stack
  • Spark RDD (Resilient Distributed Dataset)
  • Actions and Transformations
  • Catalyst Optimizer
  • MapReduce vs Spark

Origins

Spark started as a research project at the UC Berkeley AMPLab in 2009, became open source in 2010, and was donated to the Apache Software Foundation in 2013. [source]

The initial motivation came from the need to support distributed machine learning applications. [source]

The main author, Matei Zaharia, who was working on Apache Mesos at the time, recognized the potential for a new framework that could leverage Mesos to facilitate the rapid and efficient development of distributed applications, particularly for machine learning.

This led to the development of Spark, which combined the goals of proving Mesos's capabilities and addressing the practical needs of building scalable machine learning applications.

Spark eventually evolved into a powerful, general-purpose distributed data processing engine widely used across various industries.

What is Spark?

Apache Spark is an open-source distributed computing framework for efficiently processing large amounts of data. It is also officially described as a unified engine for large-scale data analytics. It has been widely adopted as a flagship project of the Apache Software Foundation.

Some characteristics are:

  • Speed: Spark provides an efficient framework for in-memory data processing, which can be significantly faster than disk-based processing (e.g., MapReduce) for certain applications, particularly the iterative algorithms common in machine learning and data mining.
  • Ease of Use: Spark offers high-level APIs in Java, Scala, Python, and R, making it accessible to a wide range of users. It also supports SQL queries, streaming data, machine learning, and graph processing, which can all be used seamlessly within the same application.
  • Modularity: Spark is designed as a modular system with various components like Spark SQL for structured data processing, MLlib for machine learning, GraphX for graph processing, and Structured Streaming for real-time analytics.
  • Scalability: Designed to scale up from a single server to thousands of machines, each offering local computation and storage. This allows Spark to efficiently handle both small and large-scale data processing tasks.
  • Fault Tolerance: Spark is highly fault-tolerant thanks to its advanced directed acyclic graph (DAG) execution engine and in-memory computing model. If any part of the DAG processing fails, Spark can re-run the failed operations on a different node to ensure the processing completes.
  • Real-time Processing: Spark Streaming enables real-time data processing. It incorporates streaming data into the same application alongside batch and interactive queries.
  • Interoperability: Spark can easily integrate with Hadoop and its data formats, including HDFS, HBase, and Hive. It can also read data from other data sources like Cassandra, AWS S3, and more.
  • Cluster Management: Spark can run on various resource managers, including Hadoop YARN, Apache Mesos, Kubernetes, or Spark’s own cluster manager.

Spark Components

Apache Spark’s architecture consists of several major components. Specifically, they are the Driver Program (commonly known as the Driver), the Executors running inside the Worker Nodes, and the Cluster Manager.

Source: https://spark.apache.org/docs/latest/cluster-overview.html#components

Driver Program

The Driver runs the main function of the user’s program and manages the Executors.

  • The Driver creates the SparkContext (or SparkSession) and runs and coordinates the Spark application.
  • The Driver converts the user program into Spark’s units of execution, called tasks. A task is the smallest unit of execution in Spark (described later).
  • The Driver schedules tasks on the Executors. Each Executor process registers itself with the Driver at startup, which is what allows the Driver to manage it. (A minimal driver-side sketch follows this list.)
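As a concrete illustration, here is a minimal PySpark sketch of the Driver side: the application creates a SparkSession, and each action it triggers is broken into tasks that the Driver schedules on the Executors. The app name and master URL below are placeholders, not recommendations.

from pyspark.sql import SparkSession

# Minimal sketch of a driver program (app name and master URL are placeholders).
spark = (
    SparkSession.builder
    .appName("driver-demo")
    .master("local[*]")      # local mode for illustration; a real cluster would use YARN, Kubernetes, etc.
    .getOrCreate()
)

df = spark.range(1_000)      # a tiny DataFrame, just to have something to compute
print(df.count())            # an action: the Driver schedules tasks on the Executors

spark.stop()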

Executor

The Executor is the process running inside each Worker Node, shown on the right side of the diagram above. It executes tasks and holds data in memory.

  • A worker process responsible for executing individual tasks in a Spark job; it runs each task and returns the result to the Driver.
  • Executors also provide in-memory storage for cached data through a service called the Block Manager. (A short configuration sketch follows this list.)
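How much memory and how many task slots each Executor gets is decided when the application is configured. Below is a hedged sketch using standard Spark configuration keys; the values are arbitrary examples, not recommendations.

from pyspark.sql import SparkSession

# Sketch: sizing Executors via standard Spark configuration keys
# (values are arbitrary examples; spark.executor.instances applies on YARN/Kubernetes).
spark = (
    SparkSession.builder
    .appName("executor-config-demo")
    .config("spark.executor.instances", "4")   # number of Executor processes
    .config("spark.executor.cores", "2")       # task slots per Executor
    .config("spark.executor.memory", "4g")     # heap shared by tasks and cached blocks
    .getOrCreate()
)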

Cluster Manager

The Cluster Manager is in the center of the diagram above. It manages the resource status of each Worker Node and assigns the necessary resources in response to the Driver's requests.

Spark Software Stack

Apache Spark consists of multiple software layers. The most basic part is an execution engine called Spark Core; on top of this engine, various higher-level features are provided.

Overview of Spark Stack
  1. Spark Core: This is the heart of Spark. It manages task dispatching, scheduling, and basic input/output operations, crucial for memory management and fault recovery. Spark Core supports programming across multiple languages with a comprehensive API and enables in-memory computations on large clusters through resilient distributed datasets (RDDs). It also optimizes data distribution and aggregation via broadcasting variables and accumulators.
  2. Catalyst Optimizer: It processes user code written with DataFrames and Datasets (built on RDDs) to create optimized execution plans. By transforming logical plans and applying techniques such as predicate pushdown and projection pruning, it minimizes data shuffling and computational overhead.
  3. Standalone/YARN/Mesos/Kubernetes: Spark supports several cluster managers for different operational needs. ‘Local’ mode runs Spark on a single machine and is mostly used for initial testing. The Spark Standalone scheduler is an easy-to-use cluster manager included within Spark itself, suitable for managing smaller clusters. YARN is widely used for larger clusters and offers deep integration with Hadoop ecosystems, providing a more dynamic allocation of resources. Mesos, which was a flexible option compatible with various data services, is deprecated as of Spark 3.2 and is no longer recommended. Kubernetes has gained popularity for managing containerized Spark applications, allowing scalable and efficient orchestration of Spark jobs within cloud environments.
  4. Spark SQL/Streaming/MLlib/GraphX: Specialized libraries for diverse data processing tasks. Spark SQL manages structured data. Spark Streaming facilitates real-time, fault-tolerant data processing. MLlib offers scalable machine learning tools for classification, regression, and clustering algorithms. GraphX leverages RDDs for efficient graph computations, useful in network analysis and ranking algorithms. (A short Spark SQL example follows this list.)
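To make the top layer of the stack concrete, here is a small sketch showing the same query expressed through Spark SQL and through the DataFrame API; both paths are optimized by Catalyst before execution. The data, column names, and app name are made up for illustration.

from pyspark.sql import SparkSession

# Sketch: the same query via Spark SQL and via the DataFrame API (made-up data).
spark = SparkSession.builder.appName("stack-demo").getOrCreate()

df = spark.createDataFrame(
    [("alice", 34), ("bob", 45), ("carol", 29)],
    ["name", "age"],
)
df.createOrReplaceTempView("people")

spark.sql("SELECT name FROM people WHERE age > 30").show()   # SQL path
df.filter(df.age > 30).select("name").show()                 # DataFrame path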

Here we emphasize again that Spark is a versatile solution for big data analytics, real-time processing, machine learning, and graph analysis. As you might imagine, there are distributed computing frameworks that focus on specific types of data, maybe even offering optimized performance for those particular use cases.

Spark RDD (Resilient Distributed Dataset)

RDDs are a collection of partitions. Partitions and tasks have a 1:1 relationship.
  1. RDD (Resilient Distributed Dataset): RDDs are Spark’s core abstraction! An RDD is a collection of partitions, and higher-level APIs like Datasets and DataFrames are built on top of RDDs.
  2. Partitions are the smallest unit of data to be processed in Spark. RDDs are distributed as partitions across the Worker Nodes, allowing data to be processed in parallel for faster computation. Each partition is processed by a task within an Executor, so partitions and tasks have a 1:1 relationship. For file-based sources the default partition size is 128 MiB, but it can be changed. (See the snippet after this list.)
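As a minimal sketch (assuming an existing SparkSession named spark), this is how to inspect partitioning from PySpark; the 128 MiB default mentioned above corresponds to the spark.sql.files.maxPartitionBytes setting used when reading files.

# Sketch: inspecting partitions (assumes an existing SparkSession named `spark`).
df = spark.range(1_000_000)
print(df.rdd.getNumPartitions())   # one task per partition in each stage

# Default maximum partition size for file-based sources (128 MiB unless overridden).
print(spark.conf.get("spark.sql.files.maxPartitionBytes"))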

Partitioning during shuffle

In Apache Spark, certain operations require a shuffle, where data must be redistributed across different partitions. This process involves extensive data movement and can significantly impact performance, so it should be kept to a minimum where possible.

Examples of operations with and without shuffling
  • Shuffling Operations: These operations require data from potentially all partitions to be rearranged, causing a redistribution of data across the cluster. This is necessary for tasks that aggregate data, such as joining datasets on a key or grouping data to compute aggregations.
  • Non-Shuffling Operations: These operations (for example, per-row maps and filters) are generally more efficient because they do not move data across partitions; they can be performed independently within each partition of the dataset. (A small sketch contrasting the two follows this list.)
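The difference is easy to see in a query plan. Here is a hedged sketch (assuming an existing SparkSession named spark): a narrow transformation like filter stays within partitions, while groupBy forces an Exchange (shuffle) step.

# Sketch: narrow vs. wide operations (assumes an existing SparkSession named `spark`).
df = spark.range(100)

# Narrow transformation: each output partition depends on a single input partition.
narrow = df.filter(df.id % 2 == 0)

# Wide transformation: grouping forces a shuffle so equal keys land in the same partition.
wide = narrow.groupBy((narrow.id % 10).alias("bucket")).count()

wide.explain()   # the physical plan shows an Exchange (shuffle) before the aggregation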

Understanding and minimizing shuffling can lead to more efficient Spark applications, especially for large-scale data processing tasks.

How to explicitly split

If the data is highly skewed and some tasks take much longer than others, or if you want to increase the number of tasks running in parallel for faster processing, you can explicitly rearrange the partitions.

df.coalesce(2)

coalesce merges data into fewer partitions, keeping data where it already is as much as possible. No full shuffle occurs.

Use coalesce() when reducing the number of partitions to avoid the cost of shuffling and when the existing partition distribution is already adequate.

df.repartition(2)

repartition redistributes the data evenly across the requested number of partitions. A full shuffle occurs.

Use repartition() when you need to thoroughly shuffle your data and optimize for operations that might benefit from a uniform distribution of data across partitions or when adjusting the number of partitions significantly. Remember that all repartition does is call coalesce with the shuffle parameter set to True [source].
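Putting the two together, here is a small sketch (again assuming an existing SparkSession named spark) showing how each call changes the partition count.

# Sketch: coalesce vs. repartition (assumes an existing SparkSession named `spark`).
df = spark.range(1_000_000).repartition(8)
print(df.rdd.getNumPartitions())              # 8

fewer = df.coalesce(2)                        # merges partitions, no full shuffle
print(fewer.rdd.getNumPartitions())           # 2

rebalanced = df.repartition(2)                # full shuffle, evenly redistributed
print(rebalanced.rdd.getNumPartitions())      # 2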

Actions and Transformations

Examples of transformations and action functions
  • Transformation: operations that create new RDDs from existing ones without immediately computing the results. Instead, Spark constructs a Directed Acyclic Graph (DAG) of these transformations, which it uses to track the lineage of each RDD. This process, known as lazy evaluation, means that the transformations are not executed when they are defined but are deferred until an action necessitates their execution. This approach enhances fault tolerance by enabling Spark to efficiently rebuild parts of the data lineage if any segment of the data process is lost or corrupted, as it only needs to recompute the affected transformations.
  • Actions: Actions in Spark trigger the computation of the transformations accumulated in the DAG and return a result to the driver program or write output data externally. Actions are the point at which Spark’s lazy evaluation model becomes eager, as they force the execution of the series of transformations required to produce the requested data. When an action is called, Spark evaluates the DAG and processes the data through the transformation chain. If a failure occurs, such as a node going offline, Spark can recover by re-evaluating the DAG from the point of failure, leveraging the resilient nature of RDDs to restore the computation without starting from scratch. (A minimal lazy-evaluation sketch follows this list.)
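Here is a minimal lazy-evaluation sketch (assuming an existing SparkSession named spark): the transformations only describe the computation, and nothing runs until an action is called.

# Sketch: transformations are lazy, actions trigger execution
# (assumes an existing SparkSession named `spark`).
rdd = spark.sparkContext.parallelize(range(10))

squared = rdd.map(lambda x: x * x)              # transformation: nothing runs yet
evens = squared.filter(lambda x: x % 2 == 0)    # still only building the lineage/DAG

print(evens.collect())   # action: the whole chain executes now
print(evens.count())     # another action: the chain is re-evaluated (unless cached)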

Catalyst Optimizer

This is a core feature of Spark SQL. Several phases take place before query execution, including logical optimization, physical planning, and Java bytecode generation. Catalyst supports both rule-based and cost-based optimization, and it is designed with the following objectives in mind:

  • Make it easy to add new optimization techniques and features to Spark SQL.
  • Allow external developers to extend the optimizer (e.g., rules specific to data sources, support for new data types).

Source: https://www.databricks.com/glossary/catalyst-optimizer

Analysis Step

  • Generates an abstract syntax tree (AST) from the SQL query or DataFrame code.
  • At this step, column and table names are resolved by referring to the internal catalog.

Logical Optimization Step

  • Consists of two internal stages.
  • Build multiple candidate plans using a standard rule-based optimization approach.
  • Assign costs to each plan using the CBO (Cost-Based Optimizer).

Physical Planning Step

  • Generate the optimal Physical Plan for the selected Logical Plan.

Code Generation Step

  • Generate efficient Java bytecode to run on each machine.
  • Generate whole-stage code*1 using Project Tungsten.

*1 Whole-stage code generation: a physical query optimization that collapses the entire query into a single function, eliminates virtual function calls, and uses CPU registers for intermediate data.
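To see these phases in practice, PySpark can print the plans Catalyst produces for a query. A hedged sketch (assuming an existing SparkSession named spark); operators covered by whole-stage code generation appear in the physical plan with a leading * marker.

# Sketch: inspecting Catalyst's plans (assumes an existing SparkSession named `spark`).
df = spark.range(1_000).filter("id % 2 = 0").selectExpr("id * 10 AS scaled")

# Prints the parsed, analyzed, and optimized logical plans plus the physical plan.
df.explain(True)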

MapReduce vs Spark

MapReduce Issues

  • Performance: MapReduce stores intermediate data on disk, so disk I/O can become a bottleneck in processing speed, especially when you are dealing with huge datasets.
  • Flexibility: You get your Map, you get your Reduce, and that's mostly it; custom operations can be a pain. Contrast that with Spark, which ships full libraries for different needs like SQL, streaming, machine learning (MLlib), and graph processing (GraphX).
  • Learning cost: Jobs need to be written in Java (or in Python via Hadoop Streaming), and the learning curve is steep for beginners. (The word-count sketch after this list shows how concise the Spark equivalent is.)
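As an illustration of the flexibility and learning-cost gap, the canonical word count, which takes a full mapper/reducer class pair in classic MapReduce, fits in a few lines of PySpark. A hedged sketch with an in-memory input instead of a real file path, assuming an existing SparkSession named spark.

# Sketch: word count in a few lines of PySpark
# (in-memory input for illustration; assumes an existing SparkSession named `spark`).
lines = spark.sparkContext.parallelize(["to be or not to be", "that is the question"])

counts = (
    lines.flatMap(lambda line: line.split())   # split lines into words
         .map(lambda word: (word, 1))          # emit (word, 1) pairs
         .reduceByKey(lambda a, b: a + b)      # sum counts per word (this step shuffles)
)

print(counts.collect())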

Spark was used in the winning solution of the 2014 Daytona Gray Sort 100TB Benchmark. The previous world record was set by a Hadoop MapReduce cluster.

This master’s thesis (2015) compares the performance of MapReduce 1.0 with Spark 1.3, showing the speed-up Spark provided even in its early versions.

Thanks for reading!

Hope I helped you better understand Spark in the Big Data Ops scenario.

Give it some claps to help others find it, too! Also, make sure you follow me on Medium so you don't miss anything. Let's connect on LinkedIn.

Written by Anna C S Medeiros

Senior Data Scientist @ Vsoft | GenAI | Computer Vision | NLP | LLM
