Apache Spark Overview


Apache Spark is an open-source cluster computing framework with a fast in-memory data processing engine. It is significantly faster than Hadoop MapReduce for many workloads, largely because it can keep intermediate data in memory, and it provides libraries for development in R, Python, Scala and Java. It offers streaming, SQL, machine learning and graph processing capabilities. It can run on Hadoop, Mesos, standalone or in the cloud, and it also supports data access from a variety of sources such as HDFS, S3, Cassandra and HBase.

Here is an illustration showing the components inside Spark and its various deployment modes:

[Figure: Spark components and deployment modes]

An RDD, or Resilient Distributed Dataset, is the fundamental data structure of Spark. It can be considered Spark's main programming abstraction and lives in the Spark Core component.
An RDD is a collection of items distributed across the nodes of a cluster that can be operated on in parallel.
Also note that by default, Spark recomputes an RDD each time you run an action on it; an RDD can be persisted in memory to avoid this.
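To make this concrete, here is a minimal sketch of the RDD workflow, assuming a `SparkContext` named `sc` is already available (as it is in `spark-shell`); the collection and numbers are purely illustrative:

```scala
// Create an RDD by distributing a local collection across 4 partitions
val nums = sc.parallelize(1 to 1000, numSlices = 4)

// Transformations are lazy: nothing executes yet
val squares = nums.map(n => n * n)
val evens   = squares.filter(_ % 2 == 0)

// An action triggers the actual distributed computation
val total = evens.reduce(_ + _)

// By default the lineage above is recomputed on every action;
// cache() keeps the computed partitions in memory for reuse
evens.cache()
val count = evens.count() // this action can now reuse the cached partitions
```

Note the split between lazy transformations (`map`, `filter`), which only build up a lineage of operations, and actions (`reduce`, `count`), which actually run the job on the cluster.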

In the upcoming blogs we will cover the components of Spark and RDDs in greater detail, with examples.

