Getting the most frequent words in a text


Recently, I’ve been involved in designing high-performance, scalable architectures for data science solutions.

So, I would like to share some interesting insights, cookbook recipes, and best practices that I’ve discovered and embraced.

In this post, I want to show you how to get the most frequent words in a text using simple data science techniques.

First of all, when I design a data science solution (depending, of course, on the particular scenario), I often try to express and validate the proposed algorithm using Unix commands connected in a pipeline before implementing the final logic on a distributed platform such as Apache Spark. At the end of the day, this is the fastest way to express and tune the business logic before moving to a more scalable solution.

The common use case for this particular problem is to learn the real main themes and thought process of a person delivering a text, speech, or book to an audience. For example, when politicians deliver a speech, we can see the real purpose behind their words by reading between the lines, extracting insights, and leaving out what’s not important.

In this post, we’re going to examine the book “The Adventures of Sherlock Holmes”, downloaded from Project Gutenberg.

The solution logic is expressed in the following steps:

  1. Scan the text and produce a list of lines
  2. For each line, we extract the words, producing a list of words
  3. We standardize each word by converting it to lower case
  4. We group the words and count them
  5. We sort the words by frequency (the most frequent at the top)
  6. And finally, we print the 10 most frequent words
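Before wiring up the pipeline, the six steps can be sketched in a few lines of plain Python on a toy text (an illustrative sketch, not part of the original solution):

```python
import re
from collections import Counter

text = "It was the best of times,\nit was the worst of times."

lines = text.splitlines()                                        # Step 01: list of lines
words = [w for line in lines for w in re.findall(r"\w+", line)]  # Step 02: extract words
words = [w.lower() for w in words]                               # Step 03: lower case
counts = Counter(words)                                          # Step 04: group and count
top = counts.most_common(10)                                     # Steps 05-06: sort by frequency, take 10
print(top)
```

`Counter.most_common` already sorts by descending count, so it collapses steps 05 and 06 into a single call.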

The Unix command version is expressed as shown in Listing 01.

> cat pg1661.txt |               # Step01
> grep -oE '\w+' |               # Step02
> tr '[:upper:]' '[:lower:]' |   # Step03
> sort | uniq -c |               # Step04
> sort -nr |                     # Step05
> head -n 10                     # Step06

Listing 01
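One subtlety worth noting: `grep -oE '\w+'` treats every non-word character as a separator, so a contraction like “don’t” is counted as the two tokens “don” and “t”. The same tokenization behavior can be checked with Python’s `re` module (a quick illustration, not from the original listing):

```python
import re

# \w+ matches runs of word characters [A-Za-z0-9_], so the apostrophe
# splits "Don't" into two separate tokens
tokens = re.findall(r"\w+", "Don't judge a book by its cover")
print(tokens)  # ['Don', 't', 'judge', 'a', 'book', 'by', 'its', 'cover']
```

Whether that is acceptable depends on your use case; for a quick frequency analysis of a book, it is usually fine.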

Now we need to translate this logic to the Apache Spark platform, using the Scala language, as shown in Listing 02.

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object App {
    val Input_Datafile_VarName = "INPUT_DATAFILE"

    def getEnvVar(name: String): Option[String] = Option(System.getenv(name))

    def processDataFile(input: String): Unit = {
        val conf = new SparkConf().setAppName("App")
        val sc = new SparkContext(conf)
        println("Reading the data file %s into the memory cache".format(input))
        val result = sc.textFile(input).cache()                           //Step01
                       .flatMap(x => x.split("\\W+")).filter(!_.isEmpty)  //Step02
                       .map(_.toLowerCase)                                //Step03
                       .map(w => (w, 1)).reduceByKey(_ + _)               //Step04
                       .sortBy(_._2, false)                               //Step05
                       .take(10)                                          //Step06
        result.foreach { case (word, count) => println(s"$word: $count") }
        sc.stop()
    }

    def main(args: Array[String]): Unit = {
        val inputFile = getEnvVar(Input_Datafile_VarName)
        for {
            input <- inputFile
        } processDataFile(input)
        println("Ending the Spark application")
    }
}

Listing 02
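The key step in the Spark version is `reduceByKey(_ + _)`, which merges the per-word `(word, 1)` pairs by adding the values that share a key. Its semantics can be emulated in plain Python with a dictionary (a sketch for intuition only; Spark performs this merge in parallel across partitions):

```python
# A small sample of (word, 1) pairs, as produced by the map step
pairs = [("the", 1), ("cat", 1), ("the", 1), ("sat", 1), ("the", 1)]

# reduceByKey(_ + _): combine the values that share a key
counts = {}
for word, n in pairs:
    counts[word] = counts.get(word, 0) + n

# sortBy(_._2, false) followed by take(10)
top = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:10]
print(top)  # [('the', 3), ('cat', 1), ('sat', 1)]
```

The difference, of course, is that Spark distributes both the pair generation and the merging, so the same logic scales to inputs far larger than a single machine’s memory.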

After building the package, we can run it using the commands shown in Listing 03.

> export INPUT_DATAFILE=pg1661.txt
> spark-submit --class "App" --master local[2] target/scala-2.11/my-project-assembly-1.0.jar

Listing 03
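Note how the `getEnvVar` helper wraps a possibly missing environment variable in an `Option`, so the for-comprehension in `main` simply does nothing when `INPUT_DATAFILE` is unset, instead of crashing on a null. The same defensive pattern looks like this in Python (an illustrative sketch; `SOME_UNSET_VAR` is a hypothetical name):

```python
import os

def get_env_var(name):
    """Return the variable's value, or None when it is unset."""
    return os.environ.get(name)

os.environ["INPUT_DATAFILE"] = "pg1661.txt"

input_file = get_env_var("INPUT_DATAFILE")
if input_file is not None:          # mirrors: for { input <- inputFile } processDataFile(input)
    print(f"Processing {input_file}")

assert get_env_var("SOME_UNSET_VAR") is None  # missing variable, nothing to process
```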


Now you can add this solution to your own data science toolbox.

