Installing a standalone Apache Spark cluster

In this post, I’d like to share my experience architecting and provisioning a standalone Apache Spark cluster via scripts.

These days, I’m working with customers to provision clusters and run machine learning algorithms in order to discover the most accurate predictive model on hundreds of billions of data rows. Because of that volume of data, the cluster tends to have several nodes, so we need scripts (in my case, bash scripts for Ubuntu 14.04) to automate and streamline the creation of the cluster and reduce the working time, given the number of nodes involved.

Conceptually, a standalone Apache Spark cluster is similar to other distributed architectures that I’ve designed with different technologies: Apache Mesos, load balancer + app/HTTP servers, MPICH, Hadoop file system/YARN, etc.

I mean we have a master node (sometimes with a secondary node for service availability, the two coordinated with each other through ZooKeeper) and several worker nodes.

The master is the orchestra leader: it coordinates the work to be done, deals with possible failures and recovery, and finally gathers the results.

The worker nodes do the real hard work: data crunching and/or predictive model creation.

Communication between the nodes in the cluster takes place over SSH.

A common deployment topology is like this:

  • Master node:
    • Apache Spark viewpoint: We have the cluster manager, which launches the driver programs (the programs with the data processing logic) and coordinates the computing resources for their execution. It’s like an OS scheduler, but at the cluster level
    • Hadoop FS viewpoint: We have the name node with the file system metadata. It holds the directory tree and inode information, and it knows where each data block physically resides in the cluster (on which data node)
  • Worker nodes:
    • Apache Spark viewpoint: We have the executors. These are the runtime environments (computing resource containers) where the tasks associated with the driver program logic are actually executed. The tasks run in parallel (several tasks on the same node) and distributed (tasks on different nodes)
    • Hadoop FS viewpoint: We have the data nodes. This is where the data blocks (each file is divided into several data blocks) physically reside. Data blocks are distributed across the cluster for scalability/high performance and replicated, by default on 3 nodes, for availability (of course, we need at least 3 nodes in the cluster for this). Each data block is 128 MB by default in Hadoop 2.x (it was 64 MB in Hadoop 1.x). You can see where the blocks of a file live with the sketch just after this list.
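To make the name node / data node relationship concrete, once HDFS is up you can ask the name node where the blocks of a given file (and their replicas) physically live. A minimal sketch; the path /data/input.csv is only an illustrative example:

#!/bin/bash
# Ask the name node for the block layout of a file: how many blocks it has,
# their size, and on which data nodes each block and its replicas reside.
# NOTE: /data/input.csv is a hypothetical example path.
hdfs fsck /data/input.csv -files -blocks -locations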

The deployment topology described above can be visualized with the diagram shown in Figure 01.

Figure 01

These are the steps that I follow to create a cluster:

Step 01. In all the nodes, I configure name resolution (/etc/hosts) so that each hostname is associated with the right IP address, as sketched below.
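As an illustrative sketch (the host names and IP addresses are assumptions, not a real cluster), the entries added to /etc/hosts on every node could look like this:

#!/bin/bash
# Append hostname/IP mappings to /etc/hosts on every node.
# Addresses and host names are hypothetical examples; adapt them to your network.
cat >> /etc/hosts <<EOF
192.168.10.10   spark-master
192.168.10.11   spark-worker01
192.168.10.12   spark-worker02
192.168.10.13   spark-worker03
EOF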

Step 02. In all the nodes, I create a user for running Spark and Hadoop under its security context; let’s say sparkuser. I also install the dependencies: Java and Scala (see the sketch below).
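A possible sketch of this step on Ubuntu 14.04. The PPA and package names here are assumptions on my part (chosen because they match the /usr/lib/jvm/java-8-oracle path used by the later scripts), so swap in whatever Java 8 and Scala packages you prefer:

#!/bin/bash
# Create the service user that will run Hadoop and Spark.
useradd -m -s /bin/bash sparkuser

# Install the dependencies: Oracle Java 8 and Scala.
# ppa:webupd8team/java and oracle-java8-installer are assumed package sources.
apt-get install -y software-properties-common
add-apt-repository -y ppa:webupd8team/java
apt-get update
apt-get install -y oracle-java8-installer scala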

Step 03. In the master node, I configure passwordless SSH login for root and sparkuser onto the worker nodes. For root, it makes automated provisioning easier; for sparkuser, it is required for running the Spark cluster. This is done with the usual public/private key (PKI) mechanism, which is widely documented and sketched below.
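The key distribution itself can be sketched like this (run on the master node as sparkuser and repeat the idea for root; the worker host names are the hypothetical ones from the /etc/hosts sketch above):

#!/bin/bash
# Generate a key pair without a passphrase if one does not exist yet.
[ -f ~/.ssh/id_rsa ] || ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa

# Copy the public key to every worker node so SSH no longer asks for a password.
for worker in spark-worker01 spark-worker02 spark-worker03
do
  ssh-copy-id sparkuser@$worker
done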

Step 04. In all the nodes, I install and configure the base Hadoop as shown in Listing 01.

#!/bin/bash

# Parameters for settings
SPARKUSER=$(cat conf/spark.user.name.conf)
SPARKMASTER=$(cat conf/spark.master.node.conf)
HADOOPBASEDIR=$(cat conf/deploy.dir.conf)
HADOOPCONF=$HADOOPBASEDIR/hadoop/etc/hadoop

echo "----- Downloading and installing Hadoop 2.7.1 in the directory $HADOOPBASEDIR/hadoop"
wget http://apache.mirrors.tds.net/hadoop/common/hadoop-2.7.1/hadoop-2.7.1.tar.gz
tar zxvf hadoop-2.7.1.tar.gz
rm -f hadoop-2.7.1.tar.gz
mv hadoop-2.7.1 $HADOOPBASEDIR
mv $HADOOPBASEDIR/hadoop-2.7.1 $HADOOPBASEDIR/hadoop

echo "----- Setting environment variables for Hadoop 2.7.1"
echo -e "\n# Setting environment variables for Hadoop 2.7.1" >> /etc/profile
echo "export HADOOP_HOME=$HADOOPBASEDIR/hadoop" >> /etc/profile
echo 'export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin' >> /etc/profile
echo 'export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop' >> /etc/profile

echo "----- Reloading environment variables for Hadoop 2.7.1"
source /etc/profile

echo "----- Configuring Hadoop 2.7.1"

echo "----- Setting the hadoop-env.sh file"
# Replace the ${JAVA_HOME} placeholder with the actual JDK path
sed -i 's/${JAVA_HOME}/\/usr\/lib\/jvm\/java-8-oracle/g' $HADOOPCONF/hadoop-env.sh

echo "----- Setting the core-site.xml file"
# Strip the closing </configuration> tag so the new property can be appended
sed -i 's/<\/configuration>//g' $HADOOPCONF/core-site.xml
sed -i '$ d' $HADOOPCONF/core-site.xml
cat >> $HADOOPCONF/core-site.xml <<EOF
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://$SPARKMASTER:9000</value>
</property>
</configuration>
EOF

echo "----- Granting rights for the user $SPARKUSER on the Hadoop directory $HADOOPBASEDIR/hadoop"
chown -R $SPARKUSER $HADOOPBASEDIR/hadoop

Listing 01
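All of these provisioning scripts read their settings from small plain-text files under conf/. As an illustrative sketch (the host names and the deployment directory are assumptions), those files could be created like this:

#!/bin/bash
# Illustrative contents of the conf/ files read by the provisioning scripts.
# Host names and the deployment directory are hypothetical examples.
mkdir -p conf
echo "sparkuser"    > conf/spark.user.name.conf
echo "spark-master" > conf/spark.master.node.conf
echo "/opt"         > conf/deploy.dir.conf
cat > conf/spark.worker.nodes.conf <<EOF
spark-worker01
spark-worker02
spark-worker03
EOF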

Step 04a. In the master node, I configure the name node as shown in Listing 02.

#!/bin/bash

# Parameters for settings
SPARKUSER=$(cat conf/spark.user.name.conf)
SPARKMASTER=$(cat conf/spark.master.node.conf)
HADOOPBASEDIR=$(cat conf/deploy.dir.conf)
HADOOPCONF=$HADOOPBASEDIR/hadoop/etc/hadoop
SPARKDATADIR=$HADOOPBASEDIR/hadoop/hadoop_data/hdfs/namenode
#WORKERNODES

echo "----- Configuring Hadoop platform on the master node $SPARKMASTER"

echo "----- Creating directory"
mkdir -p $SPARKDATADIR

echo "----- Granting rights for the user $SPARKUSER on the Hadoop directory $HADOOPBASEDIR/hadoop"
chown -R $SPARKUSER $HADOOPBASEDIR/hadoop

echo "----- Setting the hdfs-site.xml file"
# Strip the closing </configuration> tag so the new properties can be appended
sed -i 's/<\/configuration>//g' $HADOOPCONF/hdfs-site.xml
sed -i '$ d' $HADOOPCONF/hdfs-site.xml
cat >> $HADOOPCONF/hdfs-site.xml <<EOF
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
<property>
  <name>dfs.namenode.name.dir</name>
  <value>file://$SPARKDATADIR</value>
</property>
</configuration>
EOF

echo "----- Setting the masters nodes"
echo "$SPARKMASTER" >> $HADOOPCONF/masters

echo "----- Setting the slaves nodes"
# Remove the default "localhost" entry and add each worker node
sed -i '$ d' $HADOOPCONF/slaves
for worker in $(cat conf/spark.worker.nodes.conf)
do
  echo "$worker" >> $HADOOPCONF/slaves
done

Listing 02

Step 04b. In the worker nodes, I configure the data nodes, pointing them to the name node, as shown in Listing 03.

#!/bin/bash

# Parameters for settings
SPARKUSER=$(cat conf/spark.user.name.conf)
HADOOPBASEDIR=$(cat conf/deploy.dir.conf)
HADOOPCONF=$HADOOPBASEDIR/hadoop/etc/hadoop
SPARKDATADIR=$HADOOPBASEDIR/hadoop/hadoop_data/hdfs/datanode

echo "----- Configuring Hadoop platform on the worker nodes"

echo "----- Creating directory"
mkdir -p $SPARKDATADIR

echo "----- Granting rights for the user $SPARKUSER on the Hadoop directory $HADOOPBASEDIR/hadoop"
chown -R $SPARKUSER $HADOOPBASEDIR/hadoop

echo "----- Setting the hdfs-site.xml file"
# Strip the closing </configuration> tag so the new properties can be appended
sed -i 's/<\/configuration>//g' $HADOOPCONF/hdfs-site.xml
sed -i '$ d' $HADOOPCONF/hdfs-site.xml
cat >> $HADOOPCONF/hdfs-site.xml <<EOF
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>file://$SPARKDATADIR</value>
</property>
</configuration>
EOF

Listing 03

Step 05. In all the nodes, I install and configure the base Spark as shown in Listing 04.

#!/bin/bash

# Parameters for settings
SPARKUSER=$(cat conf/spark.user.name.conf)
SPARKMASTER=$(cat conf/spark.master.node.conf)
SPARKBASEDIR=$(cat conf/deploy.dir.conf)

echo "----- Downloading and installing Spark 1.4.1 in the directory $SPARKBASEDIR/spark"
wget http://apache.mirrors.tds.net/spark/spark-1.4.1/spark-1.4.1-bin-hadoop2.4.tgz
tar zxvf spark-*
rm -f spark-1.4.1-bin-hadoop2.4.tgz
mv spark-* $SPARKBASEDIR
mv $SPARKBASEDIR/spark-* $SPARKBASEDIR/spark

echo "----- Setting environment variables for Spark 1.4.1"
echo -e "\n# Setting Environment Variables for Spark 1.4.1" >> /etc/profile
echo "export SPARK_HOME=$SPARKBASEDIR/spark" >> /etc/profile
echo 'export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin' >> /etc/profile

echo "----- Reloading environment variables for Spark 1.4.1"
source /etc/profile

echo "----- Granting rights for the user $SPARKUSER on the Spark directory $SPARKBASEDIR/spark"
chown -R $SPARKUSER $SPARKBASEDIR/spark

echo "----- Configuring Spark 1.4.1 - File: spark-env.sh"
cp $SPARKBASEDIR/spark/conf/spark-env.sh.template $SPARKBASEDIR/spark/conf/spark-env.sh
cat <<EOF >> $SPARKBASEDIR/spark/conf/spark-env.sh
export JAVA_HOME=/usr/lib/jvm/java-8-oracle
export SPARK_PUBLIC_DNS=$SPARKMASTER
export SPARK_WORKER_CORES=6
EOF

Listing 04

Step 05a. And finally, in the master node, I add the configuration that points to the worker nodes, which will do the actual tasks/jobs, as shown in Listing 05.

#!/bin/bash

# Parameters for settings
SPARKMASTER=$(cat conf/spark.master.node.conf)
SPARKBASEDIR=$(cat conf/deploy.dir.conf)
#WORKERNODES

echo "----- Configuring Spark platform on the master node $SPARKMASTER"

echo "----- Setting the slaves nodes"
for worker in $(cat conf/spark.worker.nodes.conf)
do
  echo "$worker" >> $SPARKBASEDIR/spark/conf/slaves
done

Listing 05

Using the scripts above, I can automatically provision a standalone Apache Spark cluster with any number of worker nodes.
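Once everything is provisioned, bringing the cluster up from the master node can be sketched as follows, assuming the environment variables set by the scripts above and running as sparkuser (formatting the name node is only needed the very first time, since it erases the HDFS metadata):

#!/bin/bash
# Format HDFS the first time only (this erases the name node metadata).
$HADOOP_HOME/bin/hdfs namenode -format

# Start the HDFS daemons: the name node here and the data nodes on the slaves.
$HADOOP_HOME/sbin/start-dfs.sh

# Start the standalone Spark master and the workers listed in conf/slaves.
$SPARK_HOME/sbin/start-all.sh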

You can use them for your own provisioning and configuration, and I would really appreciate your thoughts/feedback/comments so I can improve them for your own scenarios.
