Cognitive Computing and Math for Increasing your Sales

Reading time: 20 minutes


In this post, I'd like to share some ideas, concepts, and a product I've been developing for several months, in order to get feedback, comments, and thoughts from you.
I've been spending a lot of time thinking about how to combine new technology trends (in particular cognitive computing, data science, big data, and stream-based processing) with math in order to help businesses increase their sales. This is the holy grail of every business.
So, if this proposal resonates with you, please feel free to let me know your thoughts.

Every business should collect feedback from its customers in order to improve its processes and products and make those customers happier. Happy customers lead to more sales (new sales, cross-selling, and up-selling) and to referrals to other potential customers.
At the end of the day, if you know what's really important to your customers and you measure their experience (quality, service, environment, etc.), then you can increase customer engagement and ultimately increase your own sales.
A rule of thumb could be: "better feedback leads to adapting your business and products, which increases customer loyalty, which finally leads to more sales". One step further is capturing and measuring feedback in real time, so you can solve issues while the customer is still interacting with your business.

For example, the hospitality and lodging industry lives and dies by customer perception and reputation. It's well known that increasing guest loyalty increases referrals and rankings on social networks such as TripAdvisor, which positively impacts revenue. So it's very important to hear and understand what guests think about your business, whether you're the CEO, the revenue manager, the marketing manager, or the front-desk clerk.

So, we have key questions such as:

  • How do I reach my customers in an easy and cost-effective way? Today, there are a lot of channels to reach customers. Some are cheap (your web site, email, Twitter, etc.) while others are very expensive (TV ads, Super Bowl ads, etc.).
  • How do I reach my customers in near real time so I can act proactively? I don't want to wait several days for a phone interview; I'd like to know what the customer is thinking now, or at least in the short term.
  • How do I gather and objectively measure what the customer thinks? I need to use math and cognitive computing to measure the customer's experience objectively by filtering out some subjectivity. I also need to make different measurements/experiences comparable in order to get correct insights.
  • How do I influence my customers to send feedback? For most customers, filling in old-fashioned paper comment cards is not appealing, so we need to design sexy and easy-to-understand comment cards that motivate customers to fill them in.

The proposed solution is an information technology (reducing the cost and time needed to reach the customer) that displays an intuitive and sexy digital comment card (reducing paper costs and being eco-friendly), helps gather surveys easily, automatically computes the captured data (reducing the time needed to get a result), and finally provides insights about your business for increasing sales.

First of all, the product enables defining a questionnaire (a list of questions) for the digital comment card. When the digital comment card is sent to the customer, five questions are randomly selected from the questionnaire and displayed to the customer in a sexy and easy-to-understand way, as shown in Figure 01.

figure01
Figure 01

Then the customer selects a face according to his or her experience. The product internally uses a math algorithm to convert the survey responses into an actionable and objective index.
The idea is to assign a weight to each face and multiply it by that face's share of the responses, so mathematically index = sum(face_dist * face_weight). A minimal code sketch of this calculation follows the scenarios below.
Let's suppose that we assign the weights as bad_face_weight = 0, neutral_face_weight = 50, and happy_face_weight = 100, and look at some scenarios:

  • If every response is bad, then the distribution is bad_face_dist = 100%, neutral_face_dist = 0%, happy_face_dist = 0%, and the index is 0*1 + 50*0 + 100*0 = 0 (the lowest index value)
  • If every response is happy, then the distribution is bad_face_dist = 0%, neutral_face_dist = 0%, happy_face_dist = 100%, and the index is 0*0 + 50*0 + 100*1 = 100 (the highest index value)
  • If every response is neutral, then the distribution is bad_face_dist = 0%, neutral_face_dist = 100%, happy_face_dist = 0%, and the index is 0*0 + 50*1 + 100*0 = 50 (a medium index value)
  • In the real world, we get mixed distributions and index values between 0 and 100: close to 0 is bad, close to 100 is good, and around 50 is neutral
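Here is a minimal Scala sketch of the index calculation, assuming the three weights above and a distribution expressed as fractions that sum to 1.0 (the names are illustrative, not the product's actual code):

val weights = Map("bad" -> 0.0, "neutral" -> 50.0, "happy" -> 100.0)

// index = sum over faces of (distribution fraction * face weight)
def index(distribution: Map[String, Double]): Double =
  distribution.map { case (face, dist) => dist * weights(face) }.sum

println(index(Map("bad" -> 0.0, "neutral" -> 0.0, "happy" -> 1.0)))  // 100.0
println(index(Map("bad" -> 0.2, "neutral" -> 0.3, "happy" -> 0.5)))  // 65.0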

A key idea is to present a dashboard showing the number of responses to the digital comment cards and the underlying index calculation. The product also shows the calculated index for each question in the questionnaire. This is important for taking the right actions given the feedback per question and per whole comment card. You can see an example dashboard in Figure 02.

figure02
Figure 02

It's worth noting that the digital comment card may be accessed in different ways:

  • A link embedded in an email, newsletter, Twitter message, Facebook post, etc
  • Any device such as a laptop, desktop, tablet, or smartphone, because it's designed to be responsive (adaptable to any display device). With this method, you can place a tablet at the front desk so the customer can respond to the survey while still at your business
  • A QR code posted on your billboard, printed in a paper/newspaper/magazine, or just sent by any digital means

Finally, the product uses cognitive computing to discover insights from the customer comments. In particular, it uses a technique named sentiment analysis, part of the natural language processing discipline. Sentiment analysis takes a human-written text/comment and returns the polarity of the text: positive (good for your business), negative (bad for your business), or neutral (neither good nor bad), together with a set of concepts related to the comment.
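As a toy illustration of polarity scoring (not the product's actual engine, which relies on cognitive computing services), a comment can be scored against small positive and negative word lists; the word lists and names below are purely hypothetical:

val positiveWords = Set("great", "excellent", "friendly", "clean", "amazing")
val negativeWords = Set("dirty", "rude", "slow", "terrible", "noisy")

// Count positive and negative hits and compare them to decide the polarity.
def polarity(comment: String): String = {
  val words = comment.toLowerCase.split("\\W+").toSeq
  val score = words.count(positiveWords) - words.count(negativeWords)
  if (score > 0) "positive" else if (score < 0) "negative" else "neutral"
}

println(polarity("The room was clean and the staff was friendly"))  // positive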

Let's look at the comment shown in Figure 03.

figure03
Figure 03

We can discover several things in this comment, such as the most important related words, as shown in Figure 04.

figure04
Figure 04

And the comment's sentiment is shown in Figure 05.

figure05
Figure 05

The product lets you configure an email address so that, when a negative comment is received, an instant alert is sent and you can act as fast as possible to solve the problem and improve the customer's satisfaction.

I'm also thinking about adding the capability to discover customer segments using the comments as input. I plan to integrate the product with IBM Watson Personality Insights, a technology that extracts and analyzes a spectrum of personality attributes to discover a person's personality. It's based on the IBM system that famously beat humans at Jeopardy!.

Well, after sharing the key concepts and ideas of a product I'm developing, I'd like to hear whether the problem/solution resonates with you, and whether you'd like to participate in the beta testing phase.

You can reach me at hello@nubisera.com; feel free to let me know your thoughts.

Juan Carlos Olamendy

CEO and Founder Nubisera

Nubisera Web Site

Reach Juan at juanc.olamendy@nubisera.com


A strategy for avoiding overfitting


By: Juan Carlos Olamendy Turruellas

When we build a predictive model, the key objective is accuracy. Accuracy means that the model applies correctly not only to the training dataset but also to unseen data. When a model fits the training dataset very well (say, 100% accuracy) but fails on unseen data (say, 50% accuracy), we say that the model has overfit. Overfitting is a phenomenon that we need to deal with because all machine-learning algorithms tend to overfit to some extent.

After researching several books and the Internet, I've discovered some patterns that allow me to formulate a strategy for avoiding overfitting. I want to share this strategy in order to receive feedback and improve it. To make it clear, I'll illustrate the concepts with some code in Apache Spark, using the Spark shell to run interactive commands.

The strategy can be visualized in the roadmap shown in Figure 01.

01

Figure 01 

– Randomly split the dataset

The first step is to randomly split the original dataset into three non-overlapping datasets:

  • Training dataset: used to create the predictive models
  • Cross-validation dataset: used to evaluate the performance of the parameters of the predictive models created with the training dataset
  • Holdout dataset: used to evaluate the performance of the best predictive models and to measure how well they generalize. Its true values are hidden from the model creation process and are compared against the predicted values

We can see an example using Scala in Spark, as shown in Figure 02.

02

Figure 02
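A minimal sketch of the split, assuming the full dataset is already loaded into an RDD of LabeledPoint named data (the 60/20/20 proportions and the seed are assumptions):

// Split the data into three non-overlapping datasets.
val Array(training, crossValidation, holdout) =
  data.randomSplit(Array(0.6, 0.2, 0.2), seed = 42L)
training.cache()
crossValidation.cache()
holdout.cache()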

– Build models

After that, we need to build predictive models using different values for the underlying parameters. In our example, we're building a classifier using the decision tree algorithm. We need to evaluate different options (for this kind of algorithm) for the impurity measure (e.g., gini or entropy), the tree depth, and the number of bins (for more details, you can read the implementation notes in the Spark documentation). The output is a list of tuples (model, precision, impurity, depth, bins).

The precision indicates how well the model generalizes (rather than overfits) by measuring true positives versus false positives. The higher the value, the better the accuracy.

Note that these models are built using the training and cross-validation datasets.

I illustrate the concepts using the code shown in Figure 03.

03

Figure 03
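A minimal sketch of this parameter sweep, reusing the training and crossValidation RDDs from the split above and assuming a binary label (numClasses = 2):

import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.tree.model.DecisionTreeModel
import org.apache.spark.mllib.evaluation.MulticlassMetrics
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Precision of a model over a labeled dataset.
def precision(model: DecisionTreeModel, data: RDD[LabeledPoint]): Double = {
  val predictionsAndLabels = data.map(p => (model.predict(p.features), p.label))
  new MulticlassMetrics(predictionsAndLabels).precision
}

// Train one model per parameter combination and evaluate it on the
// cross-validation dataset.
val evaluations =
  for (impurity <- Array("gini", "entropy");
       depth    <- Array(1, 10, 20);
       bins     <- Array(10, 100))
  yield {
    val model = DecisionTree.trainClassifier(
      training, 2, Map[Int, Int](), impurity, depth, bins)
    (model, precision(model, crossValidation), impurity, depth, bins)
  }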

– Get top performance parameters

The next step is to look at the parameters ordered by precision, as shown in Figure 04.

04

Figure 04

We can see that, for decision trees, depth is the parameter that most affects the predictive model's accuracy: the lower the depth, the lower the accuracy.

Then we get and print the top five parameter combinations, as shown in Figure 05.

05

Figure 05
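A minimal sketch of this step, reusing the evaluations list from the previous sketch:

// Order the (model, precision, impurity, depth, bins) tuples by precision
// and keep the five best combinations.
val topParams = evaluations.sortBy(_._2).reverse.take(5)
topParams.foreach { case (_, prec, impurity, depth, bins) =>
  println(f"precision=$prec%.4f impurity=$impurity depth=$depth bins=$bins")
}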

– Build and evaluate model with top performance parameters

The next step is to evaluate how accurately the best predictive models perform on the holdout/testing dataset. This step shows whether the predictive models are really overfitting (memorizing the training dataset) or not.

We create a new training dataset comprising the former training and cross-validation datasets, and we use the best parameters found above to tune the model creation process.

This step is illustrated in Figure 06.

06

Figure 06
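A minimal sketch of this step, again reusing the names from the previous sketches (training, crossValidation, holdout, topParams, and the precision helper):

// Retrain with the best parameters on the combined training + cross-validation
// data and measure precision on the holdout dataset.
val combined = training.union(crossValidation).cache()
val finalResults = topParams.map { case (_, _, impurity, depth, bins) =>
  val model = DecisionTree.trainClassifier(
    combined, 2, Map[Int, Int](), impurity, depth, bins)
  (precision(model, holdout), impurity, depth, bins)
}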

– Print the best model

And finally, we can print how the best parameters perform on the holdout dataset, as shown in Figure 07.

07

Figure 07 

Note that these parameters build accurate models which don't overfit. We can take any of these parameter combinations to build the final predictive model, which can then be used to accurately predict unseen data points.

In this article, I've explained a strategy for avoiding overfitting in predictive models and illustrated the concepts with real-world code in Scala/Apache Spark.

I look forward to hearing from you with your thoughts/comments/new ideas.

Installing a standalone Apache Spark cluster


In this post, I’d like to share my experiences about architecting and provisioning an Apache Spark cluster via scripts.

These days, I'm working hard with customers on provisioning clusters and running machine learning algorithms in order to discover the most accurate predictive model over hundreds of billions of data rows. Because of the large number of rows, the cluster tends to have several nodes, so we need scripts (in my case, bash scripts for Ubuntu 14.04) to automate and streamline (reduce the working time of) the creation of the cluster.

Conceptually, a standalone Apache Spark cluster is similar to other distributed architectures that I've designed with different technologies: Apache Mesos, load balancer + app/HTTP servers, MPICH, Hadoop file system/YARN, etc.

I mean that we have a master node (sometimes with a secondary node for service availability, the two coordinated with each other via ZooKeeper) and several worker nodes.

The master is the orchestra leader that coordinates the work to get done, deals with some possible failures and recovers, and finally gathers the results.

The worker nodes do the real hard work, data crunching and/or predictive model creation.

The communication channel between the nodes in the cluster is done using SSH.

A common deployment topology is like this:

  • Master node:
    • Apache Spark viewpoint: we have the cluster manager, which launches the driver programs (the programs with the data processing logic) and coordinates the computing resources for their execution. It's like an OS scheduler, but at the cluster level
    • Hadoop FS viewpoint: we have the name node with the file system metadata. It contains the tree structure and inode information, and it knows where each data block physically resides in the cluster (in which data node)
  • Worker nodes:
    • Apache Spark viewpoint: we have the executors. These are the runtime environments (computing resource containers) where the tasks associated with the driver program logic are actually executed. The tasks are executed both in parallel (several tasks on the same node) and distributed (tasks on different nodes)
    • Hadoop FS viewpoint: we have the data nodes. This is where the data blocks (each file is divided into several data blocks) physically reside. Data blocks are distributed across the cluster for scalability/high performance and replicated on 3 nodes by default for availability (of course, we need at least 3 nodes in the cluster for this). Each data block is 128 MB by default in Hadoop 2.x (64 MB in older versions).

The deployment topology described above is visualized in the diagram shown in Figure 01.

02

Figure 01

These are the steps that I follow to create a cluster:

Step 01. On all the nodes, I configure name resolution (/etc/hosts) to associate hostnames and IPs correctly

Step 02. On all the nodes, I create a user for running Spark and Hadoop under its security context; let's say sparkuser. I also install the dependencies: Java and Scala

Step 03. On the master node, I configure passwordless SSH login to the worker nodes for root and sparkuser. For root, it makes the automated provisioning easy; for sparkuser, it's required for running the Spark cluster. This is done using public-key authentication, and there are a lot of docs about it

Step 04. On all the nodes, I install and configure the Hadoop base as shown in Listing 01.

#!/bin/bash
# Parameters for settings
SPARKUSER=$(cat conf/spark.user.name.conf)
SPARKMASTER=$(cat conf/spark.master.node.conf)
HADOOPBASEDIR=$(cat conf/deploy.dir.conf)
HADOOPCONF=$HADOOPBASEDIR/hadoop/etc/hadoop

echo "----- Downloading and installing Hadoop 2.7.1 in the directory $HADOOPBASEDIR/hadoop"
wget http://apache.mirrors.tds.net/hadoop/common/hadoop-2.7.1/hadoop-2.7.1.tar.gz
tar zxvf hadoop-2.7.1.tar.gz
rm -fr hadoop-2.7.1.tar.gz
mv hadoop-2.7.1 $HADOOPBASEDIR
mv $HADOOPBASEDIR/hadoop-2.7.1 $HADOOPBASEDIR/hadoop

echo "----- Setting environment variables for Hadoop 2.7.1"
echo -e "\n# Setting environment variables for Hadoop 2.7.1" >> /etc/profile
echo "export HADOOP_HOME=$HADOOPBASEDIR/hadoop" >> /etc/profile
echo 'export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin' >> /etc/profile
echo 'export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop' >> /etc/profile

echo "----- Reloading environment variables for Hadoop 2.7.1"
source /etc/profile

echo "----- Configuring Hadoop 2.7.1"
echo "----- Setting the hadoop-env.sh file"
sed -i 's/${JAVA_HOME}/\/usr\/lib\/jvm\/java-8-oracle/g' $HADOOPCONF/hadoop-env.sh

echo "----- Setting the core-site.xml file"
sed -i 's/<\/configuration>//g' $HADOOPCONF/core-site.xml
sed -i '$ d' $HADOOPCONF/core-site.xml
cat >> $HADOOPCONF/core-site.xml <<EOF
<property>
<name>fs.defaultFS</name>
<value>hdfs://$SPARKMASTER:9000</value>
</property>
</configuration>
EOF

echo "----- Granting rights for the user $SPARKUSER on the Hadoop directory $HADOOPBASEDIR/hadoop"
chown -R $SPARKUSER $HADOOPBASEDIR/hadoop

Listing 01

Step 04a. On the master node, I configure the name node as shown in Listing 02.

#!/bin/bash
# Parameters for settings
SPARKUSER=$(cat conf/spark.user.name.conf)
SPARKMASTER=$(cat conf/spark.master.node.conf)
HADOOPBASEDIR=$(cat conf/deploy.dir.conf)
HADOOPCONF=$HADOOPBASEDIR/hadoop/etc/hadoop
SPARKDATADIR=$HADOOPBASEDIR/hadoop/hadoop_data/hdfs/namenode
#WORKERNODES

echo "----- Configuring Hadoop platform on the master node $SPARKMASTER"
echo "----- Creating directory"
mkdir -p $SPARKDATADIR

echo "----- Granting rights for the user $SPARKUSER on the Hadoop directory $HADOOPBASEDIR/hadoop"
chown -R $SPARKUSER $HADOOPBASEDIR/hadoop

echo "----- Setting the hdfs-site.xml file"
sed -i 's/<\/configuration>//g' $HADOOPCONF/hdfs-site.xml
sed -i '$ d' $HADOOPCONF/hdfs-site.xml
cat >> $HADOOPCONF/hdfs-site.xml <<EOF
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file://$SPARKDATADIR</value>
</property>
</configuration>
EOF

echo "----- Setting the masters nodes"
echo "$SPARKMASTER" >> $HADOOPCONF/masters

echo "----- Setting the slaves nodes"
sed -i '$ d' $HADOOPCONF/slaves
for worker in $(cat conf/spark.worker.nodes.conf)
do
  echo "$worker" >> $HADOOPCONF/slaves
done

Listing 02

Step 04b. On the worker nodes, I configure the data nodes, pointing them correctly to the name node, as shown in Listing 03.

#!/bin/bash
# Parameters for settings
SPARKUSER=$(cat conf/spark.user.name.conf)
HADOOPBASEDIR=$(cat conf/deploy.dir.conf)
HADOOPCONF=$HADOOPBASEDIR/hadoop/etc/hadoop
SPARKDATADIR=$HADOOPBASEDIR/hadoop/hadoop_data/hdfs/datanode

echo "----- Configuring Hadoop platform on the worker nodes"
echo "----- Creating directory"
mkdir -p $SPARKDATADIR

echo "----- Granting rights for the user $SPARKUSER on the Hadoop directory $HADOOPBASEDIR/hadoop"
chown -R $SPARKUSER $HADOOPBASEDIR/hadoop

echo "----- Setting the hdfs-site.xml file"
sed -i 's/<\/configuration>//g' $HADOOPCONF/hdfs-site.xml
sed -i '$ d' $HADOOPCONF/hdfs-site.xml
cat >> $HADOOPCONF/hdfs-site.xml <<EOF
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file://$SPARKDATADIR</value>
</property>
</configuration>
EOF

Listing 03

Step 05. On all the nodes, I install and configure the Spark base as shown in Listing 04.

#!/bin/bash
# Parameters for settings
SPARKUSER=$(cat conf/spark.user.name.conf)
SPARKMASTER=$(cat conf/spark.master.node.conf)
SPARKBASEDIR=$(cat conf/deploy.dir.conf)

echo "----- Downloading and installing Spark 1.4.1 in the directory $SPARKBASEDIR/spark"
wget http://apache.mirrors.tds.net/spark/spark-1.4.1/spark-1.4.1-bin-hadoop2.4.tgz
tar zxvf spark-*
rm -fr spark-1.4.1-bin-hadoop2.4.tgz
mv spark-* $SPARKBASEDIR
mv $SPARKBASEDIR/spark-* $SPARKBASEDIR/spark

echo "----- Setting environment variables for Spark 1.4.1"
echo -e "\n# Setting Environment Variables for Spark 1.4.1" >> /etc/profile
echo "export SPARK_HOME=$SPARKBASEDIR/spark" >> /etc/profile
echo 'export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin' >> /etc/profile

echo "----- Reloading environment variables for Spark 1.4.1"
source /etc/profile

echo "----- Granting rights for the user $SPARKUSER on the Spark directory $SPARKBASEDIR/spark"
chown -R $SPARKUSER $SPARKBASEDIR/spark

echo "----- Configuring Spark 1.4.1 - File: spark-env.sh"
cp $SPARKBASEDIR/spark/conf/spark-env.sh.template $SPARKBASEDIR/spark/conf/spark-env.sh
cat <<EOF >>$SPARKBASEDIR/spark/conf/spark-env.sh
export JAVA_HOME=/usr/lib/jvm/java-8-oracle
export SPARK_PUBLIC_DNS=$SPARKMASTER
export SPARK_WORKER_CORES=6
EOF

Listing 04

Step 05a. And finally, on the master node, I add the configuration that points to the worker nodes, which will do the actual tasks/jobs, as shown in Listing 05.

#!/bin/bash
# Parameters for settings
SPARKMASTER=$(cat conf/spark.master.node.conf)
SPARKBASEDIR=$(cat conf/deploy.dir.conf)
#WORKERNODES

echo "----- Configuring Spark platform on the master node $SPARKMASTER"
echo "----- Setting the slaves nodes"
for worker in $(cat conf/spark.worker.nodes.conf)
do
  echo "$worker" >> $SPARKBASEDIR/spark/conf/slaves
done

Listing 05

Using the former scripts, I can automatically provision a standalone Apache Spark cluster with any number of worker nodes.

You can use these scripts in your own provisioning and configuration, and I'd really appreciate it if you let me know your thoughts/feedback/comments so I can improve them for your own scenario.

Collaborative Filtering in Apache Spark


One of the most important applications of machine learning in a business context is the development of a recommender system. The best-known recommender engine is Amazon's. There are two approaches to implementing a recommender engine:

  • Collaborative filtering: Base the predictions on users' past behavior (rated, ranked, or purchased items) in order to recommend additional items to users who made similar decisions. We don't need to know the features (content) of the users or the items
  • Content-based filtering: Base the predictions on the features of the items in order to recommend additional items with similar features

For our example, we're going to use the collaborative filtering approach, because our dataset is in the format (user_id, item_id (movie), rating, timestamp), to recommend a list of movies to a user without knowing anything about the characteristics of the users or the movies.

The driver for collaborative filtering is the assumption that people often get the best recommendations from someone with tastes and opinions similar to their own. Apache Spark implements the Alternating Least Squares (ALS) algorithm as part of its MLlib library.

As the sample dataset, we're going to use MovieLens from http://grouplens.org/datasets/movielens/. The dataset summary is:

  • The data file contains the full set of ratings, recording when a user rated a movie. Each row contains (user_id, movie (item)_id, rating, timestamp), separated by tabs
  • user file contains the details of the users
  • item file contains the details of the movies
  • 100,000 ratings (from 1 to 5)
  • 943 users
  • 1682 movies (items)

The first step is to prepare the environment: read the input file path from an environment variable and start a SparkContext, as shown in Listing 01.

02

Listing 01
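A minimal sketch of this step, assuming the data file path is exposed through a hypothetical INPUT_DATAFILE environment variable:

import org.apache.spark.{SparkConf, SparkContext}

// Read the input path from the environment and start the Spark context.
val inputFile = sys.env("INPUT_DATAFILE")
val conf = new SparkConf().setAppName("MovieRecommender")
val sc = new SparkContext(conf)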

The next step is to load the external data into an internal structure, using an RDD and the Rating case class, by splitting each line on the tab separator and mapping the fields to Rating. Finally, we cache the RDD in memory because the ALS algorithm is iterative and needs to access the data several times; without caching, the RDD would have to be recomputed on each ALS iteration. The code is shown in Listing 02.

03

Listing 02
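A minimal sketch of this step, reusing sc and inputFile from the previous sketch and MLlib's Rating case class:

import org.apache.spark.mllib.recommendation.Rating

// Parse each tab-separated line into a Rating and cache the RDD for ALS.
val ratings = sc.textFile(inputFile).map { line =>
  val Array(user, movie, rating, _) = line.split("\t")
  Rating(user.toInt, movie.toInt, rating.toDouble)
}.cache()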

The next step is to build the recommendation model.

Note that there are two types of user preferences:

  1. Explicit preference. Each entry in the user-item matrix is treated as an explicit preference given by the user to the item, for example a rating given to an item by a user
  2. Implicit preference. Derived from implicit feedback, for example views, clicks, or purchase history

In this case, we're using the ALS.train method, as shown in Listing 03.

04

Listing 03
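A minimal sketch of this step; the rank, number of iterations, and lambda values are assumptions to be tuned for the real dataset:

import org.apache.spark.mllib.recommendation.ALS

val rank = 10           // number of latent factors
val numIterations = 10
val lambda = 0.01       // regularization parameter
val model = ALS.train(ratings, rank, numIterations, lambda)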

The next step is to evaluate the performance of the recommendation model, as shown in Listing 04.

05

Listing 04
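A minimal sketch of one common evaluation, the mean squared error between the actual ratings and the ratings predicted by the model (the exact metric used in the original listing may differ):

// Predict a rating for every (user, product) pair we already know the truth for.
val userProducts = ratings.map { case Rating(user, product, _) => (user, product) }
val predictions = model.predict(userProducts)
  .map { case Rating(user, product, rating) => ((user, product), rating) }

// Join actual and predicted ratings and compute the mean squared error.
val ratesAndPreds = ratings
  .map { case Rating(user, product, rating) => ((user, product), rating) }
  .join(predictions)
val mse = ratesAndPreds.map { case (_, (actual, predicted)) =>
  math.pow(actual - predicted, 2)
}.mean()
println(s"Mean Squared Error = $mse")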

Finally, if the performance is good enough, we can start making recommendations for our users based on the recommendation model. Let's suppose that we want to make five movie recommendations for the user with id 196; we then write the code shown in Listing 05. Of course, to see the real names of the movies, we need to look up the returned values in the u.item file.

06

Listing 05
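A minimal sketch of this step, using MatrixFactorizationModel.recommendProducts:

// Top five movie recommendations for user 196.
val recommendations = model.recommendProducts(196, 5)
recommendations.foreach(println)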

Now you can apply these principles, knowledge, and examples to your own solutions.

Computing statistics using Apache Spark

I've been working heavily with the duo of Apache Spark (for computing) and HDFS (as the distributed storage) to process very large volumes of data. My customers want to get insights from their data, and Apache Spark is a great tool for that. Using Apache Spark, I can crunch big data, compute statistics, and generate predictive models using a bunch of machine learning libraries.

In this post, I want to show how simple it is to compute statistics using Apache Spark and Scala.

The Apache Spark programming model is very simple. Our data is encapsulated in an RDD (Resilient Distributed Dataset). An RDD is a monad that can be transformed by passing functions to its higher-order functions (map, filter, foreach, countByValue, etc.). Monads, functions, and higher-order functions are key concepts of functional programming, and they're well supported by Scala. Note that an RDD splits the data into partitions (data chunks) and distributes them across the nodes of the cluster, processing the objects within a partition in sequence and multiple partitions in parallel. When we create an RDD, no execution takes place on the cluster yet, because an RDD only defines a graph of logical transformations over the dataset; the real execution takes place when we call an action (collect, foreach, count, etc.).

Let's suppose that we have a big data file (approx. 1 GB) with the structure shown in Listing 01. The last attribute (is_match) is categorical and is the target; the remaining attributes are the features. We're going to process the first (integer), second (integer), and seventh (optional double) ones.

"id","fea_1","fea_2","fea_3","fea_4","fea_5","fea_6","fea_7","fea_8","fea_9","fea_10","is_match"
37291,53113,1,?,1,1,?,1,1,1,0,TRUE
39086,47614,1,?,1,1,1.3,1,1,1,1,FALSE
70031,70237,1,?,1,1,?,1,1,1,1,TRUE
84795,97439,1,?,1,1,3.2,1,1,1,1,FALSE

Listing 01

Now let's start writing code to process this data file. First of all, we need a representation of our data using a Scala case class, as shown in Listing 02. In this case, the data point has three features and its target. We also need a way to convert the optional double to an internal representation, using a toDouble function.

02

Listing 02
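A minimal sketch of this step; the names DataPoint, fea01/fea02/fea03, and target are assumptions, and a missing value ("?") is mapped to Double.NaN:

// Internal representation of one record: two integer features,
// one optional double feature, and the boolean target.
case class DataPoint(fea01: Int, fea02: Int, fea03: Double, target: Boolean)

// "?" marks a missing value in the raw file.
def toDouble(s: String): Double =
  if (s == "?") Double.NaN else s.toDouble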

The next step is to process the data file and get some statistics.

The very first step is to create a SparkContext object, representing the communication with the cluster, as shown in Listing 03.

03

Listing 03
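A minimal sketch of this step; the application name is arbitrary (in the Spark shell the context is already available as sc):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("StatsApp"))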

Then we need to create an RDD with the SparkContext.textFile method, pointing at the data file. Apache Spark can access the local file system, distributed file systems such as HDFS and Amazon S3, and databases such as Cassandra and HBase. Once the RDD is created, we can operate on it with transformations (which turn the RDD into a new RDD) and actions (which return a result to the driver program and trigger the real execution).

In our example, we filter out the header and, for each line, split on the field separator and create an instance of the DataPoint case class. Finally, we persist the RDD in memory for future reuse. The code is shown in Listing 04.

04

Listing 04 
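A minimal sketch of this step, reusing sc and the DataPoint/toDouble definitions above; the field indices and the input path variable are assumptions based on the description:

val inputFile = sys.env("INPUT_DATAFILE")
val dataPoints = sc.textFile(inputFile)
  .filter(line => !line.startsWith("\"id\""))   // drop the header row
  .map { line =>
    val fields = line.split(',')
    // fields(1), fields(2): integer features; fields(7): optional double;
    // fields(11): the is_match target.
    DataPoint(fields(1).toInt, fields(2).toInt,
              toDouble(fields(7)), fields(11).toBoolean)
  }
  .cache()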

The next step is to create a histogram of the target attribute (a categorical one). We're going to use the countByValue method on a projection of the target property. This is an efficient implementation of the aggregation: when aggregating a large dataset distributed across a cluster, the network is the key bottleneck, so we need an efficient mechanism for moving data over the network (serializing, sending over the wire, and deserializing). The output is represented as a Map[T, Long]. The code is shown in Listing 05.

05

Listing 05
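A minimal sketch of this step:

// Histogram of the categorical target: how many TRUE vs. FALSE records.
val targetHistogram: scala.collection.Map[Boolean, Long] =
  dataPoints.map(_.target).countByValue()
targetHistogram.foreach(println)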

If we want summary statistics over a continuous variable, we need another approach: take a projection on the fea03 property, filter out the missing values, and finally use the stats method, as shown in Listing 06.

06

Listing 06
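A minimal sketch of this step:

// Summary statistics (count, mean, stdev, max, min) for the continuous fea03,
// ignoring missing values.
val fea03Stats = dataPoints
  .map(_.fea03)
  .filter(!_.isNaN)
  .stats()
println(fea03Stats)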

In this post, I've shown how we can use Apache Spark to compute statistics over a very large dataset. You can apply these ideas and this knowledge to your own scenarios.

Back to basics. Algorithm complexity


Lately, I've been thinking about the basic courses I took at university while studying computer science. There are some basics that IT professionals tend to forget because they think the lessons learned are far away from everyday work. Paradoxically, a lot of those lessons are very close to the reality of today's IT challenges. One of these forgotten but important lessons is algorithm complexity.

Today, we (as software/solution/information architects and application designers) are regularly faced with the need to crunch a big volume of data in near real time, in the most cost-effective way, in order to support business decisions right now.

Knowledge of algorithm complexity is mandatory in order to solve this kind of problem and make the right decisions that positively impact the application and the business. For example, if we have the ability to understand and choose the right data structure and algorithm, we can scale our software application (grow without significantly increasing cost and effort) without degradation when the volume of data or transactions increases; otherwise, we end up adding unnecessary computing resources.

Even today, when computing resources tend to be cheap thanks to cloud computing, from the business point of view it's always necessary to reduce costs in order to save money and be more efficient and competitive. In other words, it's not the same to process a chunk of data with a cluster of 15 nodes as with a cluster of 1,000 nodes when we/our customers have to pay the bills.

Let me explain the principles behind algorithm complexity and how it can help the business save money. Time complexity is used to represent how long an algorithm will take for a given amount of data. The mathematical expression used to represent this notion is known as big O notation. What's that? Well, big O notation is a mathematical function describing how many operations an algorithm needs for a given amount of input data. Depending on the function, the number of operations may or may not increase as the amount of input data increases.

The most commonly used complexity functions are:

  • O(1). Constant complexity: it costs one operation no matter the amount of input data. For example, looking up a data item in a hash table or looking up a data item by its position in an array
  • O(log(n)). The number of operations stays low even with a big volume of data. For example, searching for a data item in a well-balanced binary tree using a binary search
  • O(n). The number of operations increases linearly as the amount of data increases. For example, searching for a data item in an array
  • O(log(n)*n). The number of operations increases significantly as the amount of data increases. For example, sorting data items using efficient sorting algorithms such as QuickSort, HeapSort, and MergeSort
  • O(n²). The number of operations explodes quickly as the amount of data increases
  • O(n³), O(2ⁿ), O(n!), O(nⁿ). There are many worse complexities where the number of operations explodes dramatically. These are bad time complexities; you should never design an algorithm with a complexity above O(n!), otherwise you should ask yourself whether designing and implementing applications is really your field

We can visualize the complexity of the different functions (amount of input data → number of operations) as shown below.

Figure: number of operations versus amount of input data for each complexity class

Now let’s see the concepts with concrete examples.

For example, when the amount of data is low, there is no significant difference between an O(1) and an O(n²) algorithm, because current processors can handle hundreds of billions of operations per second.

Let's suppose that we want to process 10,000 data items. We have the following scenario:

  • O(1). It costs 1 operation
  • O(log(n)). It costs 13 operations
  • O(n). It costs 10,000 operations
  • O(log(n)*n). It costs around 130,000 operations
  • O(n²). It costs 100,000,000 operations

As we can see, the O(n²) algorithm costs 100,000,000 operations, which may still complete in a fraction of a second on current processors.

Now let's suppose that we want to process 10,000,000 data items. That's really not a big database today. We have the following scenario:

  • O(1). It costs 1 operation
  • O(log(n)). It costs 23 operations
  • O(n). It costs 10,000,000 operations
  • O(log(n)*n). It costs around 230,000,000 operations
  • O(n²). It costs 100,000,000,000,000 operations

As we can see, the O(n²) algorithm costs 100,000,000,000,000 operations. We can go take a long nap while the processing runs.
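As a quick sanity check, these counts can be reproduced with a few lines of Scala (a rough sketch; the logarithm is taken in base 2):

// Print the approximate operation counts for each complexity class.
def operations(n: Double): Unit = {
  val log2n = math.log(n) / math.log(2)
  println(f"n=$n%.0f: O(log n)=$log2n%.0f, O(n)=$n%.0f, " +
          f"O(n log n)=${n * log2n}%.0f, O(n^2)=${n * n}%.0f")
}

operations(10000)
operations(10000000)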

Now we can see why we, as architects and application designers, must deeply understand the problems and select the right data structures and algorithms for the underlying solutions. If we select the right algorithm/data structure with a low complexity, we're choosing a way to scale the application appropriately, in other words, to add the right computing resources when the volume of data and/or transactions increases, which saves our customers a lot of money.

Getting the most frequent words in a text


In recent days, I've been involved in designing high-performance, scalable architectures for data science solutions.

So, I would like to share some interesting insights, cookbook recipes and best practices that I’ve discovered and embraced.

In this post, I want to show you how to get the most frequent words in a text using simple data science techniques.

First of all, I want to say that sometimes, when I design a data science solution (and of course it depends on the particular scenario), I try to express and validate the proposed algorithm using Unix commands connected by a pipeline before implementing the final logic on a distributed platform such as Apache Spark. At the end of the day, this is the fastest way to express and tune the business logic before moving to a more scalable solution.

The common use case for this particular problem is to learn the real main theme and thought process of a person delivering a text/speech/book to an audience. For example, when politicians deliver a speech, we can see what the real purpose of their words is by reading between the lines, getting insights, and leaving out what's not important.

In this post, we're going to examine the book "The Adventures of Sherlock Holmes" from Project Gutenberg.

The solution logic is expressed in the following steps:

  1. Scan the text and produce a list of lines
  2. For each line, we extract the words, and produce a list of words
  3. For each word, we standardize by converting each word to lower case
  4. We group the words and count them
  5. We sort the words by frequency (the most frequent on top)
  6. And finally, we print the 10 most frequent words

The Unix command version is expressed as shown in Listing 01.

> cat pg1661.txt |                    # Step 01
>   grep -oE '\w+' |                  # Step 02
>   tr '[:upper:]' '[:lower:]' |      # Step 03
>   sort | uniq -c |                  # Step 04
>   sort -nr |                        # Step 05
>   head -n 10                        # Step 06

Listing 01

Now we need to translate this logic to the Apache Spark platform using the Scala language, as shown below in Listing 02.

import org.apache.spark.SparkContext 
import org.apache.spark.SparkContext._ 
import org.apache.spark.SparkConf 
object App { 
   val Input_Datafile_VarName = "INPUT_DATAFILE" 
   def getEnvVar(name: String): Option[String] = Option(System.getenv(name))
   def processDataFile(input: String) {
       val conf = new SparkConf().setAppName("App")
       val sc = new SparkContext(conf)
       println("Reading the data file %s in memory cache".format(input))
       val ds = sc.textFile(input).cache() //Step01
       val result = ds.flatMap(x => x.split("\\W+")).filter(!_.isEmpty) //Step02
                      .map(_.toLowerCase) //Step03
                      .map(w=>(w,1)).reduceByKey(_+_) //Step04
                      .sortBy(_._2,false) //Step05
                      .take(10) //Step06
       result.foreach(println)
   }
   def main(args: Array[String]) {
       val inputFile = getEnvVar(Input_Datafile_VarName)
       for {
           input <- inputFile
       } processDataFile(input)
       println("Ending the Spark application")
   }
}

Listing 02

After building the package, we can run it using the commands shown below in Listing 03.

> export INPUT_DATAFILE=pg1661.txt
> spark-submit --class "App" --master local[2] target/scala-2.11/my-project-assembly-1.0.jar

Listing 03

Conclusion

Now you can add this solution to your own data science toolbox.