I’ve been working heavily with the duo of Apache Spark (for computing) and HDFS (for distributed storage) to process very large volumes of data. My customers want to get insights from their data, and Apache Spark is a great tool for that. Using Apache Spark, I can crunch big data, compute statistics and generate predictive models with a number of machine learning libraries.
In this post, I want to show how simple it is to compute statistics using Apache Spark and Scala.
The Apache Spark programming model is very simple. Our data is encapsulated in an RDD (Resilient Distributed Dataset). An RDD is a monad that can be transformed by passing functions to its higher-order functions (map, filter, foreach, etc.). Monads, functions and higher-order functions are key concepts of functional programming, and they’re well supported by Scala. It’s worth noting that an RDD splits the data into partitions (data chunks) and distributes them across the nodes of the cluster, processing the objects within a partition sequentially and multiple partitions in parallel. When we create an RDD, no execution takes place on the cluster yet, because an RDD only defines a graph of logical transformations over the dataset; the real execution happens when we call an action (collect, foreach, count, countByValue, etc.).
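To make the lazy-evaluation point concrete, here is a minimal sketch (assuming a SparkContext named sc is already available, as it is in spark-shell; the dataset and names are illustrative, not from the listings below):

```scala
// map and filter are lazy transformations: nothing executes yet,
// they only extend the RDD's logical graph.
val numbers = sc.parallelize(1 to 1000000)
val evens   = numbers.filter(_ % 2 == 0)
val squares = evens.map(n => n.toLong * n)

// count is an action: only now does Spark schedule the whole chain
// on the cluster and return a result to the driver.
val total = squares.count()
```

Until `count()` is called, Spark has done no work beyond recording the transformation graph.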
Let’s suppose that we have a big data file (approx. 1 GB) with the structure shown in listing 01. The last attribute (is_match) is categorical and is the target; the remaining attributes are the features. We’re going to process the first (integer), second (integer) and seventh (optional double) ones.
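The listing itself is not reproduced here, but a file with that structure might look something like the following hypothetical lines (comma-separated, with “?” marking a missing value and the boolean is_match as the last field):

```
37291,53113,0.833,?,1,?,1,1,1,1,0,TRUE
39086,47614,1,?,1,?,1,1,1,1,1,TRUE
70031,70237,1,?,1,?,?,1,1,1,1,FALSE
```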
Now let’s start writing code to process this data file. First of all, we need a representation of our data using a case class in Scala, as shown in listing 02. In this case, the data point has three features and its target. We also need a way to convert an optional double to an internal representation, using the toDouble function.
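A sketch of what such a case class and conversion could look like (the names DataPoint, fea01–fea03 and the “?” missing-value marker are assumptions, not taken from the original listing):

```scala
// A data point with three features and its categorical target.
// fea03 is optional in the file, so we keep it as a Double and
// use NaN to represent a missing value.
case class DataPoint(fea01: Int, fea02: Int, fea03: Double, target: Boolean)

// "?" marks a missing value in the file; map it to NaN so the
// feature keeps a uniform Double representation.
def toDouble(s: String): Double =
  if (s == "?") Double.NaN else s.toDouble
```

Representing the missing value as NaN keeps the field a plain Double, which plays well with Spark’s numeric RDD operations later on.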
The next step is to process the data file and compute some statistics.
The very first step is to create a SparkContext object, which represents the connection to the cluster, as shown in listing 03.
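A sketch of creating the context (the application name and master URL are placeholders; note that in spark-shell a SparkContext is already available as sc):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Describe the application and where the cluster master lives.
// "local[*]" runs Spark locally with one worker thread per core;
// replace it with your cluster's master URL in a real deployment.
val conf = new SparkConf()
  .setAppName("StatsExample")
  .setMaster("local[*]")

val sc = new SparkContext(conf)
```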
Then we need to create an RDD with the SparkContext.textFile method, pointing to the data file. Apache Spark can access the local file system, distributed storage such as HDFS and Amazon S3, as well as databases such as Cassandra and HBase. Once the RDD is created, we can operate on it with transformations (which turn the RDD into a new RDD) and actions (which return a result to the driver program and trigger the real execution).
In our example, we filter out the header and, for each remaining line, split on the field separator and create an instance of the DataPoint case class. Finally, we persist the RDD in memory so it can be reused later. The code is shown in listing 04.
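A self-contained sketch of that pipeline might read as follows (the DataPoint and toDouble definitions are repeated here so the sketch stands alone; the comma separator, column positions and header test are assumptions about the file layout):

```scala
case class DataPoint(fea01: Int, fea02: Int, fea03: Double, target: Boolean)

def toDouble(s: String): Double =
  if (s == "?") Double.NaN else s.toDouble

// Parse one CSV line: fields 1 and 2 are integers, field 7 is an
// optional double ("?" when missing), the last field is the target.
def parse(line: String): DataPoint = {
  val f = line.split(',')
  DataPoint(f(0).toInt, f(1).toInt, toDouble(f(6)), f.last.toBoolean)
}

// With a SparkContext sc in scope:
// val points = sc.textFile("hdfs:///data/input.csv")
//   .filter(line => !line.startsWith("id"))  // hypothetical header test
//   .map(parse)
// points.cache()  // persist in memory for reuse
```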
The next step is to create a histogram of the target attribute (a categorical one). We’re going to use the countByValue action with a projection on the target property. This is an efficient implementation of the aggregation: when aggregating a large dataset distributed across a cluster, the network is the key bottleneck, so we need an efficient mechanism for moving data over the network (serializing it, sending it over the wire and deserializing it). The result is returned as a Map[T, Long]. The code is shown in listing 05.
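A sketch of this step, assuming the parsed, cached RDD from the previous step is named points:

```scala
// Project the target and count each distinct value. countByValue
// aggregates within each partition first, so only small per-partition
// maps cross the network before being merged on the driver.
val histogram: scala.collection.Map[Boolean, Long] =
  points.map(_.target).countByValue()

histogram.foreach { case (value, count) => println(s"$value -> $count") }
```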
If we want summary statistics over a continuous variable, we need a different approach. We take a projection on the fea03 property, filter out the missing values, and finally call the stats method, as shown in listing 06.
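A sketch, again assuming the parsed RDD is named points and missing values are encoded as NaN:

```scala
// Project the continuous feature, drop missing values, and let Spark
// compute count, mean, stdev, max and min in a single distributed pass.
// stats() is provided by DoubleRDDFunctions, available through an
// implicit conversion on RDD[Double].
val fea03Stats = points
  .map(_.fea03)
  .filter(!_.isNaN)
  .stats()

println(fea03Stats)  // prints a StatCounter summary
```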
In this post, I’ve shown how we can use Apache Spark to compute statistics over a very large dataset. You can apply these ideas to your own scenarios.