Cross-Validation (a.k.a. Rotation Estimation)


Cross-validation is a pretty cool method to compare the different models that you’re trying to fit to your data, and to see which models yield the most robust results. The most robust models are, inherently, the ones least prone to overfitting your data.



Theory

Types

Leave-One-Out Cross-Validation

Hold out a single datapoint, train the model on the remaining \(N-1\) datapoints, test it on the held-out point, and repeat for every datapoint.

Leave-p-Out Cross-Validation

The same idea, except that every possible subset of \(p\) datapoints is held out in turn, giving \(\binom{N}{p}\) train/test splits.

k-Fold Cross-validation

Partition the data into \(k\) parts (folds), hold out one fold at a time as the test set, and train on the remaining \(k-1\) folds.
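To make the bookkeeping concrete, here is a minimal sketch (using a toy \(N = 6\), separate from the examples below) of how many train/test splits each scheme produces:

from itertools import combinations

N = 6
indices = list(range(N))

# Leave-One-Out: one split per datapoint
loo_splits = [([i for i in indices if i != j], [j]) for j in indices]
print(len(loo_splits))                  # 6 splits

# Leave-p-Out (p = 2): one split per size-p subset of the data
p = 2
lpo_splits = [([i for i in indices if i not in test], list(test))
              for test in combinations(indices, p)]
print(len(lpo_splits))                  # C(6,2) = 15 splits

# k-Fold (k = 3): each datapoint lands in exactly one test fold
k = 3
folds = [indices[i::k] for i in range(k)]
print(folds)                            # [[0, 3], [1, 4], [2, 5]]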



Examples

So here are two examples of using it. For these examples, we will be comparing different polynomial orders, e.g.

\[y = \beta_0 + \beta_1 x\] \[y = \beta_0 + \beta_1 x + \beta_2 x^2\] \[y = \beta_0 + \beta_1 x + \beta_2 x^2 + \cdots + \beta_k x^k\]

and we will be using the mean squared error to quantify our errors.
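As a quick reminder, for \(n\) test points, the mean squared error is

\[MSE = \frac{1}{n} \sum_{i=1}^{n} \left( y_{true,i} - y_{pred,i} \right)^2\]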


Big Picture First

Import the Libraries

import numpy as np
import matplotlib.pyplot as plt

Set the Highest Order to be used

high_order = 20
polyOrders = np.array(range(high_order))

Generate the Numbers

We will first set the number of datapoints \(N\), and for this case, I just decided to have

\[\vec{y} = \tfrac{1}{4}\left(\vec{y_1} + \vec{y_2} + \vec{y_3} + \vec{y_4}\right) \quad \text{ where}\] \[\begin{cases} \vec{y_1} = \vec{x} + \text{randomness} \\ \vec{y_2} = \vec{x}^2 + \text{randomness} \\ \vec{y_3} = \vec{x}^3 + \text{randomness} \\ \vec{y_4} = 30 \sin(\vec{x}) + \text{randomness} \\ \end{cases}\]

because why not?

N = 40
x = np.linspace(0,10,N)
rand_mag = 100

y1 = x      + np.random.randn(x.shape[0])*rand_mag
y2 = x**2   + np.random.randn(x.shape[0])*rand_mag
y3 = x**3   + np.random.randn(x.shape[0])*rand_mag
y4 = 30*np.sin(x) + np.random.randn(x.shape[0])*rand_mag
y = (y1+y2+y3+y4)/4

When plotting it, we MAY have the following data (don’t forget the randomness that I added, so your plot will look a bit different)

plt.figure()
plt.plot(x,y,'bo',label='Random Numbers')
plt.xlabel('x')
plt.ylabel('y')

[INSERT FIGURE]

What’s our main question?

So, the question is: which polynomial order best describes this data?

We can fit the data to a polynomial by using numpy.polyfit(), e.g.

ORDER = 2
coeffs2 = np.polyfit(x, y, ORDER)
polyFn2 = np.poly1d(coeffs2)
plt.figure()
plt.plot(x,y,'bo'); plt.plot(x,polyFn2(x),'r-')

[INSERT FIGURE]

But the question is:
We can fit polynomial models of many different orders to our data, but which one is the best? How do we decide?

[INSERT MULTIPLE FIGURES - 6 orders in 2x3 or 3x2 fashion]

Below, we will see only two of these types of cross-validation being applied: Leave-One-Out and k-Fold.


Leave-One-Out Cross-Validation (LOOCV)

As explained above, the Leave-One-Out method looks at every possibility of removing a single datapoint: train the model on the remaining \(N-1\) datapoints, and then see how robust the model is when predicting the datapoint that was left out.
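In standard notation (my notation, not from the original script), the LOOCV estimate of the prediction error is

\[\widehat{PE}_{LOOCV} = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - \hat{f}_{-i}(x_i) \right)^2\]

where \(\hat{f}_{-i}\) denotes the model trained with datapoint \(i\) removed.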

Again, these scripts may not be the most efficient way of doing cross-validation, but they lay out a good framework for how it works.

Initiate the Results

We can first create a results array, where we will store the results. This array will be in \(\mathbb{R}^{\text{Number of Orders} \times N}\), where \(N\) is the number of datapoints
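Matching the complete script further below:

# PARTIAL
# Rows are polynomial orders, columns are datapoints
results = np.zeros((high_order,N))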

Validate it for Every Datapoint

So, as stated before, we’re going to iterate through every datapoint; for each one, we remove it and train the model on the rest.

        # PARTIAL
        # Remove this datapoint
        x_mod = np.delete(x,iter)
        y_mod = np.delete(y,iter)

        # Train the model
        coeffs = np.polyfit(x_mod,y_mod,order)
        polyFn = np.poly1d(coeffs)

We will then predict the label \(y_{pred}\) for the excluded point, and quantify the error using the squared residual (we will take the mean over all points later on to get the mean squared error)

\[\text{Residual}^2 = (y_{true} - y_{pred})^2\]
        # PARTIAL
        # Predict the results
        y_pred = polyFn(x[iter])
        y_true = y[iter]

        # Get the residual squared on that single data point that was excluded
        thisError = (y_true - y_pred)**2

Then we will store that squared residual in the results array, and we can plot the fitted polynomial for funsies too

        # PARTIAL
        # Store on Results
        results[order,iter] = thisError

        # (OPTIONAL) Plot those predictions
        plt.plot(x,polyFn(x),'r-',label='Predicted Points')

Iterate through different polynomial orders

All that is left to do is to iterate through the different polynomial orders

for order in range(high_order):

Complete Validation of Every Point for Every Polynomial Order

Your complete cross-validation will be the following:

results = np.zeros((high_order,N))

for order in range(high_order):
    # For each polynomial order...
    plt.figure(order)
    plt.plot(x,y,'bo',label='Modified Points')

    for iter in range(len(x)):
        # For every data point...

        # Remove this datapoint
        x_mod = np.delete(x,iter)
        y_mod = np.delete(y,iter)

        # Train the model
        coeffs = np.polyfit(x_mod,y_mod,order)
        polyFn = np.poly1d(coeffs)

        # Predict the results
        y_pred = polyFn(x[iter])
        y_true = y[iter]

        # Get the residual squared on that single data point that was excluded
        thisError = (y_true - y_pred)**2

        # Store on Results
        results[order,iter] = thisError

        # Plot those predictions
        plt.plot(x,polyFn(x),'r-',label='Predicted Points')
        #plt.plot(x[iter],y[iter],'g*',markersize=20,label='Removed Point')
    plt.xlabel('x'); plt.ylabel('y'); plt.title('Polynomial (Order = %d)' % order)

[INSERT A FEW OF THE FIGURES 3x2 or 2x3]

Show Your Results

All you have left to do is to plot your results. Remember how I computed the squared residuals but never took their mean? I am now going to compute that mean

# Remove zeroth order
results = np.delete(results,0,0)

# Calculate Estimated Prediction Error
PE = np.mean(results,axis=1)

# Show the Estimated Mean Prediction Errors for Each Polynomial
# (plot against orders 1 and up, since the zeroth-order row was removed)
plt.figure()
plt.plot(polyOrders[1:],PE,'bo-')
plt.xlabel('Polynomial Order')
plt.ylabel('Estimated Mean Prediction Error')

[INSERT THAT LAST FIGURE!]
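If you want to read off the single best order from this curve programmatically (a small addition of mine, assuming the code above has been run), argmin does the trick:

# Pick the order with the lowest estimated prediction error
best_order = polyOrders[1:][np.argmin(PE)]
print('Lowest estimated prediction error at order %d' % best_order)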


k-Fold Cross-validation

As explained above, k-Fold cross-validation requires one to separate the data into k parts (which may be randomized or not), hold out one of the parts, train the model on the rest, test it on the held-out part, and iterate through the different parts.
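In the same notation as before (again mine, not the original’s), the k-Fold estimate is just the average of the per-fold mean squared errors:

\[\widehat{PE}_{kFold} = \frac{1}{k} \sum_{j=1}^{k} MSE_j\]

where \(MSE_j\) is the mean squared error of the model when tested on fold \(j\).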

To do so, you can choose to build your partitioning algorithm from scratch (I highly suggest it, as it sharpens your coding kung fu), or you can use scikit-learn’s KFold
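If you go the from-scratch route, here is a minimal sketch of what a shuffled partitioner might look like (the helper name kfold_indices is my own; it mimics what KFold does):

import numpy as np

def kfold_indices(n, k, shuffle=True):
    # Yield (train_index, test_index) pairs for k folds over n datapoints
    idx = np.arange(n)
    if shuffle:
        np.random.shuffle(idx)
    folds = np.array_split(idx, k)
    for i in range(k):
        test_index = folds[i]
        train_index = np.concatenate(folds[:i] + folds[i+1:])
        yield train_index, test_index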

Import that Function

from sklearn.model_selection import KFold

Initialize Some Constants

In this case, I’m setting things up so that I’ll repeat the experiment for 20 rounds, and I’ll partition the data into 5 parts. I am also initializing a list, results_list, where I’ll store the results of each round

rounds = 20
K = 5
results_list = []               # Each element will be for one round

Start Up the KFold Class

So, we gotta start up our KFold class to get access to its set of tools, and then we can split up our data, which is stored in x

    # Initiate the KFold Class
    kf = KFold(n_splits=K, shuffle=True)

    # Report the number of splitting iterations (the actual split happens in kf.split below)
    kf.get_n_splits(x)

If you’d like to check the status, we can write

>>> kf
KFold(n_splits=5, random_state=None, shuffle=True)
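As an aside, if you want the shuffled folds to be reproducible between runs, KFold also accepts a random_state argument:

kf = KFold(n_splits=K, shuffle=True, random_state=0)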

Start Validating Your Parts/Folds

For each experiment (i.e. round), we will use an array results to store the validation results.

    results = np.zeros((high_order,K))

For each part/fold, we will separate the data into the training set (x_train, y_train) and testing set (x_test, y_test)

    # PARTIAL
    for iter, (train_index, test_index) in enumerate(kf.split(x)):
        # For each split...
        # Now we have the indices for the training set and the test set
        x_train, x_test = x[train_index], x[test_index]
        y_train, y_test = y[train_index], y[test_index]

and we can plot it for funsies

        # PARTIAL
        # Plot the training set and the test set together
        plt.figure()
        plt.plot(x_train,y_train,'bo')
        plt.plot(x_test,y_test,'rx')
        plt.xlabel('x'); plt.ylabel('y')

Now for each polynomial order, we can train the model, and create a polynomial function

        # PARTIAL
        for order in range(high_order):
            # For each polynomial order...
            # Train the model based on the training set and create a polynomial function
            coeffs = np.polyfit(x_train,y_train, order )
            polyFn = np.poly1d(coeffs)

Next, we calculate the predictions \(y_{pred}\) on the test set, compute the mean squared error from the residuals, and store it in the results (numpy array).

            # Calculate the predictions
            y_pred = polyFn(x_test)

            # Calculate the estimated prediction error
            thisError = np.mean((y_test - y_pred)**2)

            # Store on Results
            results[order,iter] = thisError

Remember the way that we are storing things: each row of results is a polynomial order and each column is a part/fold. If we had 3 parts/folds and were testing on two polynomial orders

\[y = \beta_0 + \beta_1 x\] \[y = \beta_0 + \beta_1 x + \beta_2 x^2\]

then results would be the \(2 \times 3\) array

\[\text{results} = \begin{bmatrix} MSE_{1,1} & MSE_{1,2} & MSE_{1,3} \\ MSE_{2,1} & MSE_{2,2} & MSE_{2,3} \end{bmatrix}\]

where \(MSE_{i,j}\) is the mean squared error of the order-\(i\) model tested on fold \(j\).

All that’s left to do is to average the results across the folds and store that onto the results_list

    # PARTIAL
    # Average across the K folds for this round
    thisResult = np.mean(results,axis=1)

    # Put it on the list!
    results_list.append(thisResult)

Complete Validation

Your complete validation for the K-Fold CV will now iterate through different experiments (i.e. rounds): in each experiment, you pull some random elements out of the bag and validate on them; in the next experiment, you put everything back into the bag, shake it, and repeat.

from sklearn.model_selection import KFold
rounds = 20
K = 5
results_list = []               # Each element will be for one round
for round in range(rounds):
    # For each round of results...
    # Initiate the KFold Class
    kf = KFold(n_splits=K, shuffle=True)

    # Separate Data into K parts
    kf.get_n_splits(x)

    # To check the status
    # print(kf)    # KFold(n_splits=5, random_state=None, shuffle=True)

    results = np.zeros((high_order,K))

    for iter, (train_index, test_index) in enumerate(kf.split(x)):
        # For each split...
        # Now we have the indices for the training set and the test set
        x_train, x_test = x[train_index], x[test_index]
        y_train, y_test = y[train_index], y[test_index]

        # Plot the training set and the test set together
        # plt.figure()
        # plt.plot(x_train,y_train,'bo')
        # plt.plot(x_test,y_test,'rx')
        # plt.xlabel('x'); plt.ylabel('y')

        for order in range(high_order):
            # For each polynomial order...
            # Train the model based on the training set and create a polynomial function
            coeffs = np.polyfit(x_train,y_train, order )
            polyFn = np.poly1d(coeffs)

            # Calculate the predictions
            y_pred = polyFn(x_test)

            # Calculate the estimated prediction error
            thisError = np.mean((y_test - y_pred)**2)

            # Store on Results
            results[order,iter] = thisError

    # Average across the K folds for this round
    thisResult = np.mean(results,axis=1)

    # Put it on the list!
    results_list.append(thisResult)

Showing Your Results

Now, all you have to do is show your results, assuming you ran everything in the “Big Picture” section.

for result_item in results_list:
    plt.plot(polyOrders[1:],result_item[1:],'o-')
plt.ylim((1000,10000))
plt.xlabel('Polynomial Order')
plt.ylabel('PE Value')
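As a final touch (my addition, assuming everything above has been run), you can average the PE curves across the rounds to smooth out the randomness of the partitioning:

# Average the estimated prediction errors across all rounds
mean_PE = np.mean(np.array(results_list), axis=0)
plt.plot(polyOrders[1:], mean_PE[1:], 'k^-', linewidth=2, label='Mean over rounds')
plt.legend()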