Spark 2.0 Concepts
Tue Oct 11 2016
5 min read
Cumilla, Bangladesh
What is spark: Data processing engine for ease of use, speed and advanced analytics on big data.
Spark key terms:
- Spark Cluster
- Spark Master
- Spark Worker
- Spark Executor
- Spark Driver
- Spark Sessions and Spark Contexts
- Spark Apps, Jobs, Stages and Tasks
Brief Overview:
- A cluster consists of the above stated elements.
- Master controls the workers.
- Workers register with Masters.
- Workers have executors. Executors execute Tasks.
- Driver is the main man of the Master. It learns from the Master about the Workers. Then distributes tasks to workers and receives results. Driver does all the dirty works for the Master.
- Contexts contain all the related information about managing communication between spark components.
- Spark Session is a simpler API for Spark Contexts.
- Apps are spark applications.
- Jobs are actions by user or apps.
- Jobs are divided into stages.
- Stages are divided into tasks.
- Driver distributes tasks to Executors on Worker nodes.
- Parallel execution of multiple task is possible on the same Executor on different partitioned datasets.
Spark Core: Deferred for later.
Dataframes, Datasets, Spark SQL, RDD:
- RDDs are Resilient Distributed Datasets.
- Dataframes and Datasets are high level APIs on top of RDDs for structured or semi-structured data.
- Datasets are safer and have compile time check for syntax and analysis errors.
- For structured or semi-structured datasets, Datasets or Dataframes are the way to go. They are faster and efficient than RDDs. Also can use a wide range of language on them where for RDDs you have to use scala.
- Datasets and Dataframes can be easily converted to RDDs.
Graph processing with GraphFrames:
- Graphs can be processed with GraphFrames or GraphX.
- GraphFrames are more developer friendly.
- You can create graphframes from dataframes that contain vertices and edges.
- A wide range of analytics and algorithms can be applied on GraphFrames.
Stuctured Streaming with infinite Dataframes:
- Streams can be pipelined to existing models.
Machine Learning:
- With MLlib a wide range of models can be used with ease.
Summarised from here