Apache Spark is an open-source cluster computing framework built around a fast, in-memory data processing engine. It can be significantly faster than Hadoop MapReduce, particularly for iterative and in-memory workloads, and provides APIs for development in R, Python, Scala, and Java. On top of its core engine it offers streaming, SQL, machine learning, and graph processing capabilities. Spark can run on Hadoop YARN, Apache Mesos, in standalone mode, or in the cloud, and supports data access from a variety of sources such as HDFS, S3, Cassandra, and HBase.
Here is an illustration of the components inside Spark and its various deployment modes:
The RDD, or Resilient Distributed Dataset, is the fundamental data structure of Spark. It is Spark's main programming abstraction and lives in the Spark Core component. An RDD is a collection of items partitioned across the nodes of a cluster that can be operated on in parallel.
Also note that by default Spark recomputes an RDD each time you run an action on it; if you want to reuse an RDD across multiple actions, you can ask Spark to keep it in memory with persist() or cache().
In upcoming blogs we will cover the components of Spark and RDDs in greater detail, with examples.