By: Juan Carlos Olamendy Turruellas
When we build a predictive model, the key objective is accuracy. Accuracy means that the model performs well not only on the training dataset but also on unseen data. When a model fits the training dataset very well (say, 100% accuracy) but fails on unseen data (say, 50% accuracy), we say that the model has overfit. Overfitting is a phenomenon we need to deal with, because all machine-learning algorithms have a tendency to overfit to some extent.
Researching several books and online resources, I've discovered some patterns that allowed me to formulate a strategy to avoid overfitting. I want to share this strategy to receive feedback and improve it. To make it clear, I'll illustrate the concepts with code in Apache Spark, using the Spark REPL to run interactive commands.
The strategy can be visualized in the roadmap shown in figure 01.
– Randomly split the dataset
The first step is to randomly split the original dataset into three non-overlapping datasets:
- Training dataset: to create predictive models
- Cross-validation dataset: to evaluate the performance of the parameter combinations used by the predictive models created from the training dataset
- Holdout dataset: to evaluate the performance of the best predictive models and to measure how well they generalize. The true values in this dataset are hidden from the model-creation process, so the predictions can be compared against data the models have never seen
We can see an example using Scala in Spark, as shown in figure 02.
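A split like this can be sketched with Spark's RDD.randomSplit. The dataset name `data`, the 80/10/10 proportions, and the seed below are illustrative assumptions, not necessarily the values used in the figure:

```scala
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Assumes `data: RDD[LabeledPoint]` has already been loaded,
// e.g. with MLUtils.loadLibSVMFile.
val Array(trainData, cvData, holdoutData) =
  data.randomSplit(Array(0.8, 0.1, 0.1), seed = 42L)

// Cache the splits, since we will iterate over them repeatedly.
trainData.cache(); cvData.cache(); holdoutData.cache()
```

The fixed seed makes the split reproducible across runs of the REPL.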
– Build models
After that, we need to build predictive models using different values for the underlying parameters. In our example, we're building a classifier using the decision tree algorithm. We need to evaluate different options (for this kind of algorithm) for the impurity measure (gini or entropy), the tree depth, and the number of bins (for more details, you can read the implementation notes in the Spark documentation). The output is a list of tuples (model, precision, impurity, depth, bins).
The precision indicates how much the model is overfitting or generalizing by measuring true positives versus false positives. The higher the value, the better the accuracy.
It's worth noting that these models are built using the training and cross-validation datasets.
I illustrate the concepts using the code shown in figure 03.
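A grid search over those parameters can be sketched as follows. The specific grid values and the two-class, no-categorical-features assumptions are illustrative:

```scala
import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.evaluation.MulticlassMetrics

// Assumes trainData and cvData are RDD[LabeledPoint] splits with
// two label classes and no categorical features.
val evaluations =
  for (impurity <- Array("gini", "entropy");
       depth    <- Array(1, 10, 20);
       bins     <- Array(10, 100, 300))
  yield {
    // Train one candidate model per parameter combination.
    val model = DecisionTree.trainClassifier(
      trainData, 2, Map[Int, Int](), impurity, depth, bins)
    // Score it on the cross-validation set, not the training set.
    val predictionsAndLabels = cvData.map(example =>
      (model.predict(example.features), example.label))
    val precision = new MulticlassMetrics(predictionsAndLabels).precision
    (model, precision, impurity, depth, bins)
  }
```

Scoring against the cross-validation set (rather than the training set) is what exposes parameter combinations that merely memorize the training data.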
– Get top performance parameters
The next step is to list the parameters ordered by their precision, as shown in figure 04.
We can see that, for decision trees, depth is the parameter that most affects the predictive model's accuracy: the shallower the tree, the lower the accuracy.
Then, we get and print the top 5 performing parameter combinations, as shown in figure 05.
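Assuming `evaluations` is the list of (model, precision, impurity, depth, bins) tuples described earlier, ordering them and printing the top 5 can be sketched as:

```scala
// Sort by precision, highest first, and keep the five best combinations.
val top5 = evaluations.sortBy(_._2).reverse.take(5)
top5.foreach { case (_, precision, impurity, depth, bins) =>
  println(f"precision=$precision%.4f impurity=$impurity depth=$depth bins=$bins")
}
```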
– Build and evaluate model with top performance parameters
The next step is to evaluate how accurately the best predictive models perform against the holdout/testing dataset. This step shows whether the predictive models are really overfitting/memorizing the training dataset or not.
We need to create a new training dataset comprising the former training and cross-validation datasets. We use the best parameters found above to tune the model-creation process.
This step is illustrated in figure 06.
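This retraining and holdout evaluation can be sketched as follows. The winning parameter values here are illustrative placeholders, since the real ones come out of the previous step:

```scala
import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.evaluation.MulticlassMetrics

// Merge the former training and cross-validation splits.
val fullTrainData = trainData.union(cvData)

// Illustrative best parameters; in practice, take them from the top-5 list.
val (bestImpurity, bestDepth, bestBins) = ("entropy", 20, 300)
val finalModel = DecisionTree.trainClassifier(
  fullTrainData, 2, Map[Int, Int](), bestImpurity, bestDepth, bestBins)

// Score the retrained model against data it has never seen.
val holdoutMetrics = new MulticlassMetrics(
  holdoutData.map(ex => (finalModel.predict(ex.features), ex.label)))
println(s"holdout precision = ${holdoutMetrics.precision}")
```

If the holdout precision stays close to the cross-validation precision, the chosen parameters generalize rather than memorize.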
– Print the best model
And finally, we can print how the best parameters perform on the holdout dataset, as shown in figure 07.
It's worth noting that these parameters build accurate models which don't overfit. We can take any of these parameter combinations to build the final predictive model, which can then be used to predict unseen data points accurately.
In this article, I've explained a strategy for avoiding overfitting in predictive models and illustrated the concepts with real-world code in Scala/Apache Spark.
I look forward to hearing from you with your thoughts/comments/new ideas.