Recently, I’ve been involved in designing high-performance, scalable architectures for data science solutions.
So, I’d like to share some interesting insights, cookbook recipes, and best practices that I’ve discovered and embraced along the way.
In this post, I want to show you how to find the most frequent words in a text using simple data science techniques.
First of all, when I design a data science solution (depending on the particular scenario, of course), I often try to express and validate the proposed algorithm as a pipeline of Unix commands before implementing the final logic on a distributed platform such as Apache Spark. At the end of the day, this is the fastest way to express and tune the business logic before moving to a more scalable solution.
The common use case for this particular problem is to learn the real main theme and thought process of a person delivering a text, speech, or book to an audience. For example, when politicians deliver a speech, we can see the real purpose behind their words by reading between the lines, extracting insights, and leaving out what’s not important.
In this post, we’re going to examine the book “The Adventures of Sherlock Holmes” from Project Gutenberg.
The solution logic is expressed in the following steps:
- Scan the text and produce a list of lines
- Extract the words from each line, producing a list of words
- Standardize the words by converting each one to lower case
- Group the words and count them
- Sort the words by frequency (the most frequent at the top)
- And finally, print the 10 most frequent words
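The steps above can be sketched as a Unix pipeline. This is a minimal sketch, not necessarily the exact commands used here; the filename `pg1661.txt` is an assumed name for the downloaded Project Gutenberg text:

```shell
# Split into one word per line, lower-case each word, strip punctuation,
# then group, count, sort by frequency, and take the top 10.
# pg1661.txt is an assumed filename for the downloaded book.
tr -s '[:space:]' '\n' < pg1661.txt |
  tr '[:upper:]' '[:lower:]' |
  tr -d '[:punct:]' |
  sort |
  uniq -c |
  sort -rn |
  head -10
```

Each stage maps directly to one step of the logic: `tr` handles the scanning and standardization, `sort | uniq -c` does the grouping and counting, and `sort -rn | head` produces the ranked top 10.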
The Unix command version is expressed in listing 01.
Now we can translate this logic to the Apache Spark platform using the Scala language, as shown in listing 02.
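As a rough sketch of what such a Spark job looks like in Scala (the object name `TopWords`, the input path argument, and the Spark session setup are illustrative assumptions, not the original listing):

```scala
// A minimal sketch of the word-count logic on Spark.
// TopWords and the args(0) input path are assumed names for illustration.
import org.apache.spark.sql.SparkSession

object TopWords {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("TopWords").getOrCreate()
    val lines = spark.sparkContext.textFile(args(0)) // scan the text into lines
    lines
      .flatMap(_.split("""\W+"""))                   // extract the words
      .filter(_.nonEmpty)
      .map(_.toLowerCase)                            // standardize to lower case
      .map(word => (word, 1))
      .reduceByKey(_ + _)                            // group the words and count them
      .sortBy(_._2, ascending = false)               // most frequent first
      .take(10)                                      // the 10 most frequent words
      .foreach { case (word, count) => println(s"$word: $count") }
    spark.stop()
  }
}
```

Note how each RDD transformation mirrors one stage of the Unix pipeline: `flatMap` plays the role of `tr`, `reduceByKey` replaces `sort | uniq -c`, and `sortBy` plus `take` replace `sort -rn | head`.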
After building the package, we can run it with the commands shown in listing 03.
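A typical invocation looks something like the following sketch; the jar name, main class, and input filename are assumptions for illustration and will differ from the actual listing:

```shell
# Hypothetical spark-submit invocation: class name, jar path, and
# input file are assumed for illustration.
spark-submit \
  --class TopWords \
  --master local[*] \
  target/scala-2.12/top-words_2.12-0.1.jar \
  pg1661.txt
```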
Now you can add this solution to your own data science toolbox.