RDD or Resilient Distributed Dataset is the fundamental data structure of Spark. It can be considered Spark’s main programming abstraction and resides in Spark Core component. RDD is a collection of items distributed across many cluster nodes that can be manipulated in parallel. Also note that Spark’s RDDs are by default recomputed each time you run an action on them.
In order to see RDD in action, we need to have Spark component install either standalone on our machine or on a cluster. For this tutorial, I am using Spark installation on my Hortonworks sandbox. How to do this installation can be found here.
Once you have installed HDP sandbox, we can get started with exploring RDDs. There are 2 types of operations we can perform on RDDs – transformations and actions. Transformations are operations which will act on RDD and return a new RDD. Actions on the other hand will run the computation on dataset and return value to the driver program. Note that transformations are lazyily evaluated, means no operation actually takes place unless an ‘action’ takes place.
Some of the transformations that can be performed on RDD are below:
Map(func) – maps each item in RDD to a different item based on the input function.
Flatmap(func) – similar to map, but each element can be mapped to 0 or more output items.
Filter(func) – removes some items based on condition inside the input function
Distinct() – removes duplicates
Some examples of actions are:
Collect() – returns all the elements of dataset in an array. Good if the dataset is small enough to fit into memory, good for testing purposes
Count() – returns the count of total number of elements in the RDD
Reduce(func) – aggregates the elements of RDD using a function which is passed as parameter
First() – returns the first element of dataset.
Take(n) – returns an array with first n elements of dataset. N is given as input parameter.
SaveAsTextFile(path) – saves the resultant dataset into a text file at path given as input parameter.
In order to try these transformations and actions out on Spark, follow below steps. We will use python for this tutorial, but as mentioned in previous blog, Spark supports R, Java, Scala and Python for development. However shell feature is provided only for python (PySpark) and Scala (spark-shell) languages.
SSH into your HDP sandbox from your terminal. I have installed HDP sandbox on my local MacOS machine.
$ssh firstname.lastname@example.org -p 2222
Locate your spark program and ‘cd’ into it. For me, it’s located at /usr/hdp/126.96.36.199-8/spark2
Start program pyspark
This should open pyspark program and you’ll see ‘pyspark>>’ on console.
Create SparkContext using below steps :
1 2 3
from pyspark import SparkContext, SparkConf conf = SparkConf().setAppName("RDD Tutorial").setMaster(master) sc = SparkContext(conf=conf)
Now we will create a collection and created a distributed dataset of that collection using SparkContext’s parallelize method.
data = [1, 2, 3, 4, 5] distData = sc.parallelize(data)
Since distData is a parallel collection (means it can be operated upon in parallel), we can also define how many partitions it needs to be cut in to. This can be the second parameter to parallelize method, like sc.parallelize(data, 10) for 10 partitions. If not provided, Spark automatically calculates the number of partitions. Ideally there should be 2-4 partitions per CPU of the cluster.
Try out more operations on this distData RDD.
distData = distData.map(lambda a: a + 3) distData.collect()
This will map each element and increment it by 3. Resultant RDD is [4, 5, 6, 7, 8]. Remember that currently above transformation will not be called, and will only occur when we execute second line, the ‘collect()’ action because RDD transformations are lazily evaluated. Second line should print the resultant RDD on console.
Similarly, try other transformations and actions.
distData = distData.filter(lambda a : a > 5) distData.collect()
This prints [6, 7, 8]
Sum = distData.reduce(lambda a, b : a + b)
This should print sum as 21 on the console (sum of 6 + 7 + 8 = 21)
Prints [6, 7]
Saves result at specified path
Similarly transformations can also be performed on pair RDDs, which are RDDs having key, value pair as elements of the dataset. An example of a pair RDD would be [(a, 1), (a, 2), (b, 3), (b, 4), (c, 5)].
Sample transformations on one pair RDD of elements (K, V):
ReduceByKey(func) – returns RDD of (K, V) pairs where V is the value calculated using the func aggregate method for each key.
GroupByKey() – returns RDD of (K, Iterable
SortByKey() – returns RDD where elements are sorted by keys K.
MapValues(func) – map values V for each key K
Keys() – returns RDD of just the keys
Values() – returns RDD of just the values
In order the try then on console, follow steps 1 to 4 as mentioned above, then follow below steps.
1 2 3 4
data = [(a, 1), (a, 2), (b, 3), (b, 4), (c, 5)] distData = sc.parallelize(data) Result1 = distData.reduceByKey(lambda a, b : a + b) Result1.collect()
This should return [(a, 3), (b, 7), (c, 5)]
Result2 = distData.groupByKey() Result2.collect()
This should return [(a, [1, 2]), (b, [3, 4]), (c, )]
Result3 = distData.sortByKey() Result3.collect()
This should return distData as is because keys are already sorted.
Result4 = distData.mapValues(lambda a : a + 1) Result4.collect()
This should return dataset where all values are incremented by 1, that is, it returns [(a, 2), (a, 3), (b, 4), (b, 5), (c, 6)]
Result5 = distData.keys() Result5.collect()
This should return [a, b, c].
Result6 = distData.values() Result6.collect()
This should return [2, 3, 4, 5, 6].
There is more to RDD, which we will cover in further blogs of Spark series.