A strategy for avoiding overfitting


By: Juan Carlos Olamendy Turruellas

When we build a predictive model, the key objective is accuracy. Here, accuracy means that the model fits not only the training dataset but also unseen data. When a model fits the training dataset very well (say, 100% accuracy) but performs poorly on unseen data (say, 50% accuracy), we say that the model has overfit. Overfitting is a phenomenon we need to deal with because all machine-learning algorithms tend to overfit to some extent.
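To make the definition concrete, here is a minimal sketch in plain Scala that compares accuracy on the training data with accuracy on unseen data. The names `generalizationGap` and `looksOverfit`, and the `tolerance` cut-off, are my own assumptions for illustration; they are not part of Spark.

```scala
object OverfitGapSketch {
  // Difference between accuracy on the training data and accuracy on unseen data.
  def generalizationGap(trainAccuracy: Double, unseenAccuracy: Double): Double =
    trainAccuracy - unseenAccuracy

  // `tolerance` is a hypothetical cut-off chosen only for illustration.
  def looksOverfit(trainAccuracy: Double,
                   unseenAccuracy: Double,
                   tolerance: Double = 0.05): Boolean =
    generalizationGap(trainAccuracy, unseenAccuracy) > tolerance
}
```

With the numbers from the paragraph above, `OverfitGapSketch.looksOverfit(1.0, 0.5)` flags the model, because the gap is 0.5.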

After researching several books and online sources, I’ve found some patterns that let me formulate a strategy for avoiding overfitting. I want to share this strategy to receive feedback and improve it. To make it concrete, I’ll illustrate the concepts with code in Apache Spark, using the spark-repl to run interactive commands.

The strategy is summarized in the roadmap shown in Figure 01.


Figure 01 

– Randomly split the dataset

The first step is to randomly split the original dataset into three non-overlapping datasets:

  • Training dataset: to create predictive models
  • Cross-validation dataset: to evaluate how each parameter combination performs for the predictive models created using the training dataset
  • Holdout dataset: to evaluate the performance of the best predictive models, and to measure how well these models generalize. Its true values are hidden from the model creation process and are only compared against the predicted values at the end

An example using Scala in Spark is shown in Figure 02.


Figure 02
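The three-way split can be sketched as follows. This is a minimal version in plain Scala, with a `Seq` standing in for a Spark RDD; the object name, the 80/10/10 fractions and the fixed seed are assumptions of mine. On a real RDD, `randomSplit` with an array of weights plays the same role.

```scala
import scala.util.Random

object SplitSketch {
  // Shuffle once with a fixed seed, then cut the result into three
  // non-overlapping slices: training, cross-validation and holdout.
  // The default fractions (80% / 10% / 10%) are illustrative assumptions.
  def threeWaySplit[A](data: Seq[A],
                       trainFrac: Double = 0.8,
                       cvFrac: Double = 0.1,
                       seed: Long = 42L): (Seq[A], Seq[A], Seq[A]) = {
    require(trainFrac > 0 && cvFrac > 0 && trainFrac + cvFrac < 1.0,
      "fractions must be positive and leave room for the holdout slice")
    val shuffled = new Random(seed).shuffle(data.toVector)
    val nTrain = math.round(data.size * trainFrac).toInt
    val nCv = math.round(data.size * cvFrac).toInt
    val (train, rest) = shuffled.splitAt(nTrain)
    val (cv, holdout) = rest.splitAt(nCv)
    (train, cv, holdout)
  }
}
```

Because every record lands in exactly one slice, the holdout records never take part in building or tuning the models.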

– Build models

After that, we need to build predictive models using different values for the underlying parameters. In our example, we’re building a classifier using the decision tree algorithm. For this kind of algorithm, we need to evaluate different options for the impurity measure (for example, Gini or entropy), the tree depth and the number of bins (for more details, you can read about the implementation in the Spark documentation). The output is a list of tuples (model, precision, impurity, depth, bins).

Precision measures the share of true positives among all positive predictions (true positives plus false positives). Computed on data that the model was not trained on, it indicates whether the model is generalizing or overfitting. The higher the value, the better the model.
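As an illustration of the metric, here is a minimal sketch in plain Scala that computes precision for one class, plus overall accuracy, from (prediction, label) pairs. The object and method names are my own assumptions; MLlib ships its own evaluation classes, which this sketch deliberately does not use.

```scala
object PrecisionSketch {
  // Precision for one class: TP / (TP + FP), that is, the share of
  // predictions of `positive` whose true label is also `positive`.
  // Returns 0.0 when the class is never predicted (a convention chosen here).
  def precision(pairs: Seq[(Double, Double)], positive: Double): Double = {
    val predictedPositive = pairs.filter { case (prediction, _) => prediction == positive }
    if (predictedPositive.isEmpty) 0.0
    else predictedPositive.count { case (_, label) => label == positive }.toDouble /
      predictedPositive.size
  }

  // Overall accuracy: the fraction of pairs where prediction and label agree.
  def accuracy(pairs: Seq[(Double, Double)]): Double = {
    require(pairs.nonEmpty, "need at least one (prediction, label) pair")
    pairs.count { case (prediction, label) => prediction == label }.toDouble / pairs.size
  }
}
```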

It’s worth noting that these models are built using the training dataset and evaluated using the cross-validation dataset.

I illustrate the concepts with the code shown in Figure 03.


Figure 03
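The model-building step can be sketched as a small grid search. In this minimal version, `Params`, `Evaluation`, `candidates` and `evaluateAll` are hypothetical names of mine, and the `train` and `score` functions are placeholders: in Spark, `train` would wrap a decision tree trainer (for instance MLlib’s `DecisionTree.trainClassifier`) applied to the training dataset, and `score` would compute the precision on the cross-validation dataset.

```scala
object GridSearchSketch {
  // One combination of the three parameters explored in the text.
  final case class Params(impurity: String, depth: Int, bins: Int)

  // A trained model together with its precision and the parameters used.
  final case class Evaluation[M](model: M, precision: Double, params: Params)

  // Every combination of the candidate values (a Cartesian product).
  def candidates(impurities: Seq[String], depths: Seq[Int], bins: Seq[Int]): Seq[Params] =
    for {
      impurity <- impurities
      depth <- depths
      bin <- bins
    } yield Params(impurity, depth, bin)

  // Train one model per combination and score it.
  // `train` should only see the training data; `score` should only see
  // the cross-validation data. Both are supplied by the caller.
  def evaluateAll[M](grid: Seq[Params])(train: Params => M)(score: M => Double): Seq[Evaluation[M]] =
    grid.map { params =>
      val model = train(params)
      Evaluation(model, score(model), params)
    }
}
```

Keeping `train` and `score` as parameters keeps the search loop independent of the algorithm, so the same loop works for other classifiers.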

– Get the top-performing parameters

The next step is to list the parameters ordered by precision, best first, as shown in Figure 04.


Figure 04

We can see that, for decision trees, depth is the parameter that hurts the accuracy of the predictive model: the shallower the tree, the lower the accuracy.
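One way to check which parameter drives the results is to average the precision per value of that parameter. Here is a minimal sketch in plain Scala; the row layout (impurity, depth, bins, precision) and the function name are assumptions of mine.

```scala
object DepthEffectSketch {
  // Rows are (impurity, depth, bins, precision): a hypothetical layout.
  // Returns the mean precision observed for each tree depth.
  def meanPrecisionByDepth(rows: Seq[(String, Int, Int, Double)]): Map[Int, Double] =
    rows.groupBy(_._2).map { case (depth, group) =>
      depth -> group.map(_._4).sum / group.size
    }
}
```

If the averages differ much more between depths than between impurity measures or bin counts, depth is the parameter worth tuning first.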

Then we get and print the top 5 parameter combinations, as shown in Figure 05.


Figure 05
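Ranking the results and keeping the best five can be sketched like this. `Result` and `topK` are hypothetical names, and plain Scala collections stand in for the evaluation output.

```scala
object TopParamsSketch {
  // One evaluated parameter combination.
  final case class Result(impurity: String, depth: Int, bins: Int, precision: Double)

  // Sort by precision, best first, and keep the first k entries.
  def topK(results: Seq[Result], k: Int = 5): Seq[Result] =
    results.sortBy(r => -r.precision).take(k)
}
```

Calling `TopParamsSketch.topK(results).foreach(println)` prints the five best combinations, one per line.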

– Build and evaluate models with the top-performing parameters

The next step is to evaluate how accurately the best predictive models perform on the holdout (testing) dataset. This step shows whether the predictive models are really overfitting (memorizing) the training dataset or not.

We need to create a new training dataset that combines the former training and cross-validation datasets. We then build the models on it with the best parameters found in the previous step.

This step is illustrated in Figure 06.


Figure 06
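A minimal sketch of this step in plain Scala: combine the training and cross-validation data, refit one model per top parameter combination, and score each model on the holdout data only. The function name and the `fit` and `score` parameters are assumptions of mine; in Spark, the combination would be a union of the two RDDs, and `fit` would wrap the decision tree trainer.

```scala
object HoldoutSketch {
  // A: record type, P: parameter type, M: model type.
  // Returns each parameter combination paired with its holdout score.
  def evaluateOnHoldout[A, P, M](train: Seq[A],
                                 cv: Seq[A],
                                 holdout: Seq[A],
                                 topParams: Seq[P])
                                (fit: (Seq[A], P) => M)
                                (score: (M, Seq[A]) => Double): Seq[(P, Double)] = {
    // The new training dataset: former training plus cross-validation data.
    val combined = train ++ cv
    topParams.map { params =>
      val model = fit(combined, params)
      // The holdout data is used for scoring only, never for fitting.
      (params, score(model, holdout))
    }
  }
}
```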

– Print the best model

Finally, we can print how the best parameters perform on the holdout dataset, as shown in Figure 07.


Figure 07 

It’s worth noting that these parameters produce accurate models that don’t overfit. We can take any of these parameter combinations to build the final predictive model, which can then be used to predict unseen data points accurately.

In this article, I’ve explained a strategy for avoiding overfitting in predictive models and illustrated the concepts with real-world code in Scala/Apache Spark.

I look forward to hearing your thoughts, comments and new ideas.

