Cross Validation: Training, Validation and Testing Splits

Cross validation is a resampling technique that uses training, validation, and/or testing splits to chunk up the data for resampling.

"Splits" in this context refers to dividing up the data. For instance, in a dataset of 100,000 images, we might select 90% for training, 9% for validation, and 1% for testing. But what do these splits mean, and how can we make them in code?

Differences between Train, Valid, Test

Train/Training Splits

As the name implies, the training split is the portion of the data we use to train the model.

Valid/Validation Splits

The validation split is used in the process of fitting the model to the training data. The validation set provides an unbiased estimate of model fit (unbiased because it was not used during training), and is often used to tweak hyperparameters during training and/or to implement early stopping (halting training when validation metrics stop improving, to prevent overfitting).

Many machine learning packages do not have validation splits built in, either in the model itself (for instance, many unsupervised learning models: if we don't have a way to evaluate performance, then what use is a validation set?) or in the toolkits themselves, whose splitting utilities often only produce train and test splits (although there are some simple workarounds to create a validation set with them, as shown below).
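
For example, scikit-learn's train_test_split only returns two pieces, but calling it twice is a simple workaround. A minimal sketch, assuming scikit-learn and using its built-in iris dataset purely as a stand-in:

from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

# first, hold out 10% of the data as the test set
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.10, random_state=42)

# then split the remaining 90% again to carve out a validation set
# (1/9 of the remainder is 10% of the original data, giving roughly 80/10/10 overall)
X_train, X_valid, y_train, y_valid = train_test_split(X_rest, y_rest, test_size=1/9, random_state=42)

print(len(X_train), len(X_valid), len(X_test))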

A validation set is optional, but for some modeling techniques it is highly recommended. A great example is neural networks: there are often so many hyperparameters involved that modern machine learning packages "learn" from their own learning, for instance adjusting the learning rate (alpha) during backpropagation based on validation performance, or stopping training early once the validation score plateaus.
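
As one concrete illustration, scikit-learn's MLPClassifier can carve a validation set out of the training data on its own and stop training early when the validation score stops improving. A minimal sketch, using the built-in digits dataset as a stand-in:

from sklearn.neural_network import MLPClassifier
from sklearn.datasets import load_digits

X, y = load_digits(return_X_y=True)

model = MLPClassifier(
    hidden_layer_sizes=(64,),
    early_stopping=True,       # hold out part of the training data as a validation set
    validation_fraction=0.1,   # 10% of the training data is used for validation
    n_iter_no_change=10,       # stop once the validation score stops improving for 10 epochs
    max_iter=500,
    random_state=0,
)
model.fit(X, y)
print("epochs run before stopping:", model.n_iter_)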

Test/Testing Splits

Test splits are not involved in training a model at all. They are used to assess the generalizability of the model. For instance, after training a neural network model, we want to use our test data to see how well the model predicts the labels in the test set. If the model was trained well, the predictions should be pretty close to the actual test values; this suggests that the model has generalized well, that is, it is not overfitting too much.

Unless you have unlabeled data, you will probably be using a test split to assess the performance of your trained model.
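
A minimal sketch of fitting on the training split and assessing on the untouched test split, assuming scikit-learn (logistic regression and the iris dataset are just placeholders for whatever model and data you actually have):

from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)                            # the test split is never seen here
print("test accuracy:", model.score(X_test, y_test))   # estimate of generalization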

Train/Valid/Test Ratios

The eternal question about splits is: what is the correct ratio of splits? Is it 70/25/5? 90/9/1? 85/15?

There are no hard and fast rules about the correct split ratio. Oftentimes, the answer revolves around whatever gives you the desired outcome for your model. For instance, maybe you are developing critical software that must be 99.99% accurate in its predictions; finding a split ratio that reaches that bar is the goal, and what the exact ratio is might not matter.

However, most models gain accuracy with larger training datasets. As such, a common ratio is something like 80/15/5 or 80/10/10. But if you have a dataset of millions, then maybe you are OK with training on 60/35/5; a larger validation set here gives a more accurate picture when fine tuning the model at every training cycle (or epoch). If you have a smaller dataset, then putting more samples in the training set will give it more variety during training and help prevent underfitting.

It is common practice to start with a training set between 70-90%, a validation set between 5-25%, and a test set between 5-25%. For train/valid/test, try 90/5/5 or 80/10/10. For train/test, try 90/10.

Make Sure Your Splits Are Random!

One very important thing to remember about splits is to make sure they are random. For instance, if you have an alphabetically ordered dataset of names and make an 80/15/5 split in order, your training set would contain only names from the beginning of the alphabet, your validation set only names from the middle, and your test set only names near the end. Be sure to shuffle your dataset before organizing it into splits so that the splits are random. Most packages that do cross validation have functionality to shuffle the data and to set a random seed so the shuffled splits are reproducible.
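
A small sketch of the problem and the fix, using NumPy and a made-up list of names:

import numpy as np

# an alphabetically ordered dataset of names
names = np.array(["Alice", "Bob", "Carol", "Dave", "Erin", "Frank",
                  "Grace", "Heidi", "Ivan", "Judy", "Mallory", "Zoe"])

# an in-order slice keeps the ordering: the training set only sees early-alphabet names
train_in_order = names[: int(0.8 * len(names))]

# shuffle first (with a seed so the shuffle is reproducible), then slice
rng = np.random.default_rng(seed=42)
shuffled = rng.permutation(names)
train_shuffled = shuffled[: int(0.8 * len(names))]

print(train_in_order)
print(train_shuffled)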

Challenges With Splits

Splits have two primary issues.

  1. The validation estimate can be highly variable, depending on which observations are put into which split. But how will you know which data should go into which split?

  2. Since only the training data is used to fit the model, and models generally improve when trained on more data, the validation error rate will tend to overestimate the error of a model fit on the entire data set.

One way to address these challenges is to use leave-one-out cross validation (LOOCV) or k-fold cross validation (k-fold CV).
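
A minimal sketch of both, assuming scikit-learn (logistic regression and the iris dataset are placeholders):

from sklearn.model_selection import cross_val_score, KFold, LeaveOneOut
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# k-fold CV: 5 folds, each fold takes one turn as the held-out set
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=kfold)
print("5-fold accuracies:", scores, "mean:", scores.mean())

# LOOCV: every single observation takes a turn as a held-out set of size one
loo_scores = cross_val_score(model, X, y, cv=LeaveOneOut())
print("LOOCV mean accuracy:", loo_scores.mean())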

How Can I Train On Unlabeled Data?

When we get neat, structured data that isn't missing any values, we rejoice! But often, especially today, we get unstructured data with no labels, and lots of it. One of the great challenges is how to label data at scale. Here are a few ways to resolve your data labeling issue.

Manually Label It Yourself

The brute force method might make sense in some cases. If you can label 10,000 data points and then train a model on them that achieves ideal test (read: not train or valid) metrics, then maybe this is a decent approach.

The challenge here is that it’s often expensive and time consuming to label data. Usually there are edge cases that are tough to figure out. And on top of all this, how do you know how many data points it will take before you have a model that can achieve acceptable metrics?

Use a Pretrained Model

In some cases, you could use a pretrained model, make predictions on data that you know is easier to predict, and use those predictions as labels to train a new model; or you could build off of the pretrained model directly. However, in this situation, you have to trust that the pretrained model's predictions are right.
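
A minimal sketch of that idea (often called pseudo-labeling), assuming scikit-learn; here a model fit on a small labeled subset of the digits dataset stands in for the pretrained model:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)

# pretend only 10% of the data is labeled; the rest plays the role of unlabeled data
X_labeled, X_unlabeled, y_labeled, _ = train_test_split(X, y, train_size=0.1, random_state=0)

# a model trained on the small labeled set stands in for the "pretrained" model
pretrained = LogisticRegression(max_iter=1000).fit(X_labeled, y_labeled)

# keep only the predictions the model is confident about, and use them as labels
preds = pretrained.predict(X_unlabeled)
confident = pretrained.predict_proba(X_unlabeled).max(axis=1) > 0.95

# train a new model on the original labels plus the confident pseudo-labels
X_new = np.vstack([X_labeled, X_unlabeled[confident]])
y_new = np.concatenate([y_labeled, preds[confident]])
new_model = LogisticRegression(max_iter=1000).fit(X_new, y_new)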

Unsupervised Learning

Unsupervised learning models can be built without labeled data. For example, clustering could be used to assign a label directly to each cluster. Then, once you have labeled data, you could train a model on it. The key issue with this approach is: how well did the unsupervised approach do in categorizing/clustering/labeling the data? There might even be room to oscillate back and forth between using an unsupervised approach to label, training a supervised model, getting new data to test both of them on and see how accurate they are, playing them off each other, and so on.
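
A minimal sketch of that workflow, assuming scikit-learn; k-means assigns a cluster id to each sample, and those ids are then used as stand-in labels:

from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris

X, _ = load_iris(return_X_y=True)      # pretend the true labels are unavailable

# cluster the unlabeled data and treat the cluster assignments as labels
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
cluster_labels = kmeans.labels_

# train a supervised model on the cluster-derived labels; how useful this is depends
# entirely on how well the clusters line up with the real (unknown) categories
classifier = LogisticRegression(max_iter=1000).fit(X, cluster_labels)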

Code Examples

All of the code examples are written in Python, unless otherwise noted.

Containers

These are code examples in the form of Jupyter notebooks running in a container that comes with all the data, libraries, and code you'll need to run them. Click here to learn why you should be using containers, along with how to do so.
Quickstart: Download Docker, then run the commands below in a terminal.
#pull container, only needs to be run once
docker pull ghcr.io/thedatamine/starter-guides:train-valid-test

#run container
docker run -p 8888:8888 -it ghcr.io/thedatamine/starter-guides:train-valid-test

Need help implementing any of this code? Feel free to reach out to datamine-help@purdue.edu and we can help!