Cross-validation is a pretty cool method for comparing the different models you’re trying to fit your data to, and seeing which models yield the most robust results. The most robust models, in this sense, are the ones least prone to overfitting your data.

# Theory

## Types

### Leave-One-Out Cross-Validation

Remove a single datapoint, train the model on the remaining \(N-1\) datapoints, and test the model on the point that was left out; repeat this for every datapoint.

### Leave-p-Out Cross-Validation

A generalization of the above: remove \(p\) datapoints at a time, train on the remaining \(N-p\), and test on the removed \(p\); repeat for the possible subsets of size \(p\).

### k-Fold Cross-validation

Split the data into \(k\) parts (folds), hold one fold out as the test set, train on the other \(k-1\) folds, and iterate so that each fold serves as the test set once.

# Examples

So here are two examples of using it. For these examples, we will be comparing polynomials of different orders, e.g.

\[y = \beta_0 + \beta_1 x\]

\[y = \beta_0 + \beta_1 x + \beta_2 x^2\]

\[y = \beta_0 + \beta_1 x + \beta_2 x^2 + \cdots + \beta_k x^k\]

and we will be using the mean squared error (MSE) to quantify our errors.
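For concreteness, the mean squared error over a set of predictions is just the average of the squared residuals (numbers below are made up):

```python
import numpy as np

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.5, 1.5, 3.0])
mse = np.mean((y_true - y_pred)**2)  # (0.25 + 0.25 + 0.0) / 3
```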

## Big Picture First

### Import the Libraries

```
import numpy as np
import matplotlib.pyplot as plt
```

### Set the Highest Order to be used

```
high_order = 20
polyOrders = np.array(range(high_order))
```

### Generate the Numbers

We will first set the number of datapoints \(N\), and for this case, I just decided to have

\[\vec{y} = \tfrac{1}{4}\left(\vec{y_1} + \vec{y_2} + \vec{y_3} + \vec{y_4}\right) \quad \text{where}\]

\[\begin{cases} \vec{y_1} = \vec{x} + \text{randomness} \\ \vec{y_2} = \vec{x}^2 + \text{randomness} \\ \vec{y_3} = \vec{x}^3 + \text{randomness} \\ \vec{y_4} = 30 \sin(\vec{x}) + \text{randomness} \\ \end{cases}\]

because why not?

```
N = 40
x = np.linspace(0,10,N)
rand_mag = 100
y1 = x + np.random.randn(x.shape[0])*rand_mag
y2 = x**2 + np.random.randn(x.shape[0])*rand_mag
y3 = x**3 + np.random.randn(x.shape[0])*rand_mag
y4 = 30*np.sin(x) + np.random.randn(x.shape[0])*rand_mag
y = (y1+y2+y3+y4)/4
```

When plotting it, we **may** get something like the following (don’t forget the randomness I added to it):

```
plt.figure()
plt.plot(x,y,'bo',label='Random Numbers')
plt.xlabel('x')
plt.ylabel('y')
```

[INSERT FIGURE]

### What’s our main question?

So, the question is: **what model best describes this data?**

We can fit it to a polynomial by using `numpy.polyfit()`, e.g.

```
ORDER = 2
coeffs2 = np.polyfit(x, y, ORDER)
polyFn2 = np.poly1d(coeffs2)
plt.figure()
plt.plot(x,y,'bo'); plt.plot(x,polyFn2(x),'r-')
```

[INSERT FIGURE]
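As a quick sanity check on what `np.polyfit` returns (the coefficients, highest degree first), here is a fit on exact, noiseless quadratic data:

```python
import numpy as np

x = np.linspace(0, 10, 20)
y = 2*x**2 + 3*x + 1
coeffs = np.polyfit(x, y, 2)  # highest degree first: ~[2., 3., 1.]
```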

But the question is:

**We can fit multiple polynomial models of different orders to our data, but which one is the best? How can we decide?**

[INSERT MULTIPLE FIGURES - 6 orders in 2x3 or 3x2 fashion]

Below, we will see only two types of cross-validation being applied.

## Leave-One-Out Cross-Validation (LOOCV)

As explained above, the Leave-One-Out method looks at every possibility where you remove a single datapoint: train the model on the remaining \(N-1\) datapoints, then see how robust the model is on the datapoint that was left out.

*Again, these scripts may not be the most efficient way of doing cross-validation, but they lay out a clear framework for how it works.*
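For reference, scikit-learn also ships a `LeaveOneOut` splitter that hands you the same train/test index pairs we are about to build by hand; a minimal sketch:

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut

x = np.arange(5.0)
loo = LeaveOneOut()
# One (train_index, test_index) pair per datapoint
splits = list(loo.split(x))
```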

### Initiate the Results

We can first create a `results` array, where we will store the results. This array will be in \(\mathbb{R}^{\text{Number of Orders} \times N}\), where \(N\) is the number of datapoints.
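Assuming the `high_order` and `N` from the setup above, that initialization is just:

```python
import numpy as np

high_order = 20  # from the setup above
N = 40           # number of datapoints
results = np.zeros((high_order, N))
```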

### Validate it for Every Datapoint

So as stated before, we’re going to iterate through every datapoint, and for each datapoint, we’re going to remove it, and train the model based on the rest.

```
# PARTIAL
# Remove this datapoint
x_mod = np.delete(x,iter)
y_mod = np.delete(y,iter)
# Train the model
coeffs = np.polyfit(x_mod,y_mod,order)
polyFn = np.poly1d(coeffs)
```

We will then predict the labels \(y_{pred}\) and quantify the error using the squared residual (we will take the mean later on to get the MSE):

\[\text{Residual}^2 = (y_{true} - y_{pred})^2\]

```
# PARTIAL
# Predict the results
y_pred = polyFn(x[iter])
y_true = y[iter]
# Get the residual squared on that single data point that was excluded
thisError = (y_true - y_pred)**2
```

Then we will store that squared residual in the `results` array, and we can plot it for funsies too.

```
# PARTIAL
# Store on Results
results[order,iter] = thisError
# (OPTIONAL) Plot those predictions
plt.plot(x,polyFn(x),'r-',label='Predicted Points')
```

### Iterate through different polynomial orders

All that is left to do is to let it iterate through the different orders:

```
for order in range(high_order):
```

### Complete Validation of Every Point for Every Polynomial Order

Your complete cross-validation will be the following:

```
results = np.zeros((high_order,N))
for order in range(high_order):
    # For each polynomial order...
    plt.figure(order)
    plt.plot(x,y,'bo',label='Modified Points')
    for iter in range(len(x)):
        # For every data point...
        # Remove this datapoint
        x_mod = np.delete(x,iter)
        y_mod = np.delete(y,iter)
        # Train the model
        coeffs = np.polyfit(x_mod,y_mod,order)
        polyFn = np.poly1d(coeffs)
        # Predict the results
        y_pred = polyFn(x[iter])
        y_true = y[iter]
        # Get the residual squared on that single data point that was excluded
        thisError = (y_true - y_pred)**2
        # Store on Results
        results[order,iter] = thisError
        # Plot those predictions
        plt.plot(x,polyFn(x),'r-',label='Predicted Points')
        #plt.plot(x[iter],y[iter],'g*',markersize=20,label='Removed Point')
    plt.xlabel('x'); plt.ylabel('y'); plt.title('Polynomial (Order = %d)' % order)
```

[INSERT A FEW OF THE FIGURES 3x2 or 2x3]

### Show Your Results

All that is left to do is to plot your results. Remember how I only stored the squared residuals? Now I am going to take their mean to get the estimated prediction error.

```
# Remove zeroth order
results = np.delete(results,0,0)
# Calculate the estimated prediction error (mean across all left-out points)
PE = np.mean(results,axis=1)
# Show the estimated mean prediction error for each polynomial order
plt.figure()
plt.plot(polyOrders[1:],PE,'bo-')
plt.xlabel('Polynomial Order')
plt.ylabel('Estimated Mean Prediction Error')
```
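If you want to pick the winning order programmatically (a small convenience not in the script above), take the argmin of `PE`, remembering that the zeroth-order row was deleted, so indices are shifted by one. With hypothetical error values:

```python
import numpy as np

PE = np.array([40.0, 12.0, 3.5, 3.7, 5.0])  # hypothetical prediction errors
best_order = int(np.argmin(PE)) + 1  # +1 because order 0 was removed
```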

[INSERT THAT LAST FIGURE!]

## k-Fold Cross-validation

As explained above, k-Fold cross-validation requires you to separate your data into k parts (which may or may not be randomized), remove one of the parts, train the model on the rest, and iterate through the different parts.

To do so, you can choose to build your partitioning algorithm from scratch (I highly suggest it, as it cranks up your coding kung fu), or you can use scikit-learn’s `KFold`.
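If you do want to roll your own partitioning, one minimal sketch (the function name is mine, not from any library) is to shuffle the indices and slice them into K nearly equal folds:

```python
import numpy as np

def partition_indices(n, K, seed=None):
    """Shuffle indices 0..n-1 and split them into K nearly equal folds."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    return np.array_split(idx, K)

folds = partition_indices(40, 5)  # five folds of 8 indices each
```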

### Import that Function

```
from sklearn.model_selection import KFold
```

### Initialize Some Constants

In this case, I’m repeating the experiment for 20 rounds, and I would like to partition the data into 5 parts. I am also initializing a list `results_list`, where I’ll store my results.

```
rounds = 20
K = 5
results_list = [] # Each element will be for one round
```

### Start Up the KFold Class

So, we gotta start up our `KFold` class to use its set of tools, and then we can split up our data, which is stored in `x`:

```
# Initiate the KFold Class
kf = KFold(n_splits=K, shuffle=True)
# Separate Data into K parts
kf.get_n_splits(x)
```

If you’d like to check the splitter’s settings, we can write

```
>>> kf
KFold(n_splits=5, random_state=None, shuffle=True)
```

### Start Validating Your Parts/Folds

For each experiment (i.e. `round`), we will use an array `results` to store the validation results.

```
results = np.zeros((high_order,K))
```

For each part/fold, we will separate the data into the training set (`x_train`, `y_train`) and testing set (`x_test`, `y_test`):

```
# PARTIAL
for iter, (train_index, test_index) in enumerate(kf.split(x)):
    # For each split...
    # Now we have the indices for the training set and the test set
    x_train, x_test = x[train_index], x[test_index]
    y_train, y_test = y[train_index], y[test_index]
```

which we can plot for funsies:

```
# PARTIAL
# Plot the training set and the test set together
plt.figure()
plt.plot(x_train,y_train,'bo')
plt.plot(x_test,y_test,'rx')
plt.xlabel('x'); plt.ylabel('y')
```

Now for each polynomial order, we can train the model, and create a polynomial function

```
# PARTIAL
for order in range(high_order):
    # For each polynomial order...
    # Train the model based on the training set and create a polynomial function
    coeffs = np.polyfit(x_train,y_train,order)
    polyFn = np.poly1d(coeffs)
```

Next, we calculate the predictions \(y_{pred}\), compute the mean squared error from the residuals, and store it in the `results` array.

```
# PARTIAL
# Calculate the predictions
y_pred = polyFn(x_test)
# Calculate the estimated prediction error
thisError = np.mean((y_test - y_pred)**2)
# Store on Results
results[order,iter] = thisError
```

Remember how we are storing things: if we had 3 parts/folds and were testing two polynomial orders,

\[y = \beta_0 + \beta_1 x\]

\[y = \beta_0 + \beta_1 x + \beta_2 x^2\]

then `results` would be a \(2 \times 3\) array, with one row per order and one column per fold. All that’s left to do is to average across the folds and store it in `results_list`:

```
# PARTIAL
# Average across the folds for each order
thisResult = np.mean(results,axis=1)
# Put it on the list!
results_list.append(thisResult)
```
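As a sketch of that storage layout (with made-up error values), the `results` array for 2 orders and 3 folds would look like this:

```python
import numpy as np

# Rows = polynomial orders, columns = folds
results = np.array([[8.1, 7.9, 9.2],   # order-1 MSE per fold
                    [3.2, 2.8, 3.5]])  # order-2 MSE per fold
thisResult = np.mean(results, axis=1)  # one mean PE per order
```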

### Complete Validation

Your complete validation for the k-Fold CV will now iterate through the different experiments (i.e. `rounds`): in each experiment, you pull out some random elements and validate, then in the next experiment, you put everything back into the bag, shake it, and repeat.

```
from sklearn.model_selection import KFold

rounds = 20
K = 5
results_list = [] # Each element will be for one round
for round in range(rounds):
    # For each round of results...
    # Initiate the KFold Class
    kf = KFold(n_splits=K, shuffle=True)
    # Separate Data into K parts
    kf.get_n_splits(x)
    results = np.zeros((high_order,K))
    for iter, (train_index, test_index) in enumerate(kf.split(x)):
        # For each split...
        # Now we have the indices for the training set and the test set
        x_train, x_test = x[train_index], x[test_index]
        y_train, y_test = y[train_index], y[test_index]
        # (OPTIONAL) Plot the training set and the test set together
        # plt.figure()
        # plt.plot(x_train,y_train,'bo')
        # plt.plot(x_test,y_test,'rx')
        # plt.xlabel('x'); plt.ylabel('y')
        for order in range(high_order):
            # For each polynomial order...
            # Train the model based on the training set and create a polynomial function
            coeffs = np.polyfit(x_train,y_train,order)
            polyFn = np.poly1d(coeffs)
            # Calculate the predictions
            y_pred = polyFn(x_test)
            # Calculate the estimated prediction error
            thisError = np.mean((y_test - y_pred)**2)
            # Store on Results
            results[order,iter] = thisError
    # Average across the folds for this round
    thisResult = np.mean(results,axis=1)
    # Put it on the list!
    results_list.append(thisResult)
```

### Showing Your Results

Now, all you have to do is show your results, assuming you ran everything in the “**Big Picture**” section.

```
for result_item in results_list:
    plt.plot(polyOrders[1:],result_item[1:],'o-')
plt.ylim((1000,10000))
plt.xlabel('Polynomial Order')
plt.ylabel('PE Value')
```