Cross-validation

This notebook shows how to use cross-validation techniques from Scikit-learn to tune parameters.

Note that the pandas and seaborn packages are required to run this notebook.

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn import datasets

import pysensors as ps

Setup

First we’ll load some training data: in this case, images of the handwritten digits 0 through 5.

digits = datasets.load_digits(n_class=6)
X = digits.data
y = digits.target

n_samples, n_features = X.shape
print(n_samples, n_features)

1083 64

# Plot some digits
n_img_per_row = 10
img = np.zeros((10 * n_img_per_row, 10 * n_img_per_row))
for i in range(n_img_per_row):
    ix = 10 * i + 1
    for j in range(n_img_per_row):
        iy = 10 * j + 1
        img[ix:ix + 8, iy:iy + 8] = X[i * n_img_per_row + j].reshape((8, 8))

plt.imshow(img, cmap=plt.cm.binary)
plt.xticks([])
plt.yticks([])
plt.title('A selection from the 64-dimensional digits dataset')
plt.show()

Grid search

Here we specify a set of parameter values we’d like to test. The GridSearchCV object will try out every possible combination and tell us which gave the best performance.

from sklearn.model_selection import GridSearchCV

model = ps.reconstruction.SSPOR()

param_grid = {
    "basis": [ps.basis.Identity(), ps.basis.SVD(), ps.basis.RandomProjection()],
    "basis__n_basis_modes": [20, 30, 40, 50],
    "n_sensors": [10, 20, 30, 40]
}

search = GridSearchCV(model, param_grid)
search.fit(X, quiet=True)
print("Best parameters:")
for k, v in search.best_params_.items():
    print(f"{k}:  {v}")

Best parameters:
basis:  SVD(n_basis_modes=40)
basis__n_basis_modes:  40
n_sensors:  40
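
To get a feel for the size of this search, the enumeration of candidates can be sketched with the standard library alone. This is only an illustration, not how GridSearchCV is implemented: the basis objects are replaced by plain strings, and the fold count assumes scikit-learn's default of 5-fold cross-validation, since no cv argument is passed above.

```python
from itertools import product

# String stand-ins for the basis objects, so the sketch has no dependencies
param_grid = {
    "basis": ["Identity", "SVD", "RandomProjection"],
    "basis__n_basis_modes": [20, 30, 40, 50],
    "n_sensors": [10, 20, 30, 40],
}

# Every combination of one value per parameter is one candidate model
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

n_folds = 5  # assumption: the default number of cross-validation folds
n_fits = len(candidates) * n_folds

print(f"{len(candidates)} candidates, {n_fits} model fits")
```

The cost grows multiplicatively with each parameter added to the grid, which is the motivation for the randomized search later in this notebook.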

We can also visualize the performance of each candidate model to get a better idea of how the different parameters interact. The pandas and seaborn packages help simplify the task.

def rename_bases(s):
    if 'Identity' in str(s):
        return 'Identity'
    elif 'SVD' in str(s):
        return 'SVD'
    elif 'RandomProjection' in str(s):
        return 'RandomProjection'
    else:
        return "N/A"

def create_dataframe(results):
    results_df = pd.DataFrame.from_dict(results)

    # Rename the basis objects for plotting purposes
    results_df['basis'] = results_df['param_basis'].apply(rename_bases)

    # The test scores are negative mean-square errors, so flip the sign to recover the error
    results_df['neg_mean_test_score'] = -results_df['mean_test_score']

    return results_df

def generate_summary_plots(results):
    results = create_dataframe(results)

    fig, axs = plt.subplots(1, 2, figsize=(14, 6))
    sns.barplot(data=results, x='param_basis__n_basis_modes', y='neg_mean_test_score', hue='basis', ax=axs[0])
    sns.barplot(data=results, x='param_n_sensors', y='neg_mean_test_score', hue='basis', ax=axs[1])
    axs[0].set(xlabel='n_basis_modes', ylabel='Mean square error')
    axs[1].set(xlabel='n_sensors', ylabel='Mean square error');

generate_summary_plots(search.cv_results_)

The error bars in the left subplot come from different choices of n_sensors. Those on the right come from different values of n_basis_modes. Note that the performance of the SVD basis is more sensitive to low sensor counts than the other two bases, but it does very well once enough sensors are in place.
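
The grouping behind those bars can be sketched without pandas or seaborn. The rows below are made-up placeholder values, not results from this notebook, and the summary reports a simple mean and min/max range, whereas seaborn by default draws a bootstrapped confidence interval around the mean.

```python
from collections import defaultdict
from statistics import mean

# Made-up placeholder rows (NOT results from this notebook)
rows = [
    {"basis": "SVD", "n_basis_modes": 20, "n_sensors": 10, "mean_test_score": -8.0},
    {"basis": "SVD", "n_basis_modes": 20, "n_sensors": 20, "mean_test_score": -4.0},
    {"basis": "SVD", "n_basis_modes": 20, "n_sensors": 30, "mean_test_score": -3.0},
    {"basis": "Identity", "n_basis_modes": 20, "n_sensors": 10, "mean_test_score": -6.0},
    {"basis": "Identity", "n_basis_modes": 20, "n_sensors": 20, "mean_test_score": -5.0},
    {"basis": "Identity", "n_basis_modes": 20, "n_sensors": 30, "mean_test_score": -4.0},
]

# One bar per (basis, n_basis_modes); each bar pools the rows for all n_sensors
groups = defaultdict(list)
for row in rows:
    mse = -row["mean_test_score"]  # flip the sign of the negative mean-square error
    groups[(row["basis"], row["n_basis_modes"])].append(mse)

# The spread within each group is what an error bar summarizes
summary = {
    key: {"mean": mean(values), "min": min(values), "max": max(values)}
    for key, values in groups.items()
}
for key, stats in summary.items():
    print(key, stats)
```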

We can also look more closely at how the different bases are affected by differing numbers of sensors.

# Rename basis objects
results = create_dataframe(search.cv_results_)

# Initialize a grid with a plot for each number of basis modes
grid = sns.FacetGrid(results, col="param_basis__n_basis_modes", hue="basis", legend_out=False)

# Create a plot for each instance
grid.map(plt.plot, "param_n_sensors", "neg_mean_test_score", marker="o").add_legend()
grid.fig.tight_layout()
plt.show()

SVD - Performance generally improves as the number of sensors increases. For low sensor counts, this basis is beaten by the other two.
Identity - Works best when lots of modes are retained, even for low values of n_sensors.
RandomProjection - Similar performance to the Identity basis on this example.

Randomized search

Now suppose we have a fixed sensor budget. We’d like to determine the best basis in which to represent the data to make the most of these sensors. Suppose further that we want to test out a larger number of candidates for n_basis_modes. We can obtain a good approximation to the optimal parameter configuration for a fraction of the cost using Scikit-learn’s RandomizedSearchCV object.

from scipy.stats import randint
from sklearn.model_selection import RandomizedSearchCV

model = ps.reconstruction.SSPOR(n_sensors=30)

param_distributions = {
    "basis": [ps.basis.Identity(), ps.basis.SVD(), ps.basis.RandomProjection()],
    "basis__n_basis_modes": randint(10, 50)
}

search = RandomizedSearchCV(model, param_distributions, n_iter=30)
search.fit(X, quiet=True)
for k, v in search.best_params_.items():
    print(f"{k}:  {v}")

basis:  SVD(n_basis_modes=22)
basis__n_basis_modes:  22
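
The sampling step can be sketched as follows. This is a rough standard-library imitation, not scikit-learn's implementation: random.randrange stands in for scipy's randint, which likewise draws integers from the half-open range [10, 50), and the bases are represented by plain strings.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

bases = ["Identity", "SVD", "RandomProjection"]  # string stand-ins
n_iter = 30

# Draw each candidate independently: a basis, and a mode count in 10..49
candidates = [
    {"basis": rng.choice(bases), "basis__n_basis_modes": rng.randrange(10, 50)}
    for _ in range(n_iter)
]

modes = [c["basis__n_basis_modes"] for c in candidates]
n_possible = len(bases) * len(range(10, 50))

print(f"Sampled {len(candidates)} of {n_possible} possible configurations")
```

Only a quarter of the full grid is evaluated here. Because the candidates are drawn independently, the same configuration can in principle be sampled more than once.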

# Reuse the helper defined above to label the bases and flip the sign of the scores
results = create_dataframe(search.cv_results_)

fig, ax = plt.subplots(1, 1, figsize=(8, 6))
sns.lineplot(
    markers=True,
    data=results,
    x='param_basis__n_basis_modes',
    y='neg_mean_test_score',
    hue='basis',
    ax=ax
)
ax.set(xlabel='n_basis_modes', ylabel='Mean square error', title='30 sensors');

The Identity and RandomProjection bases appear to benefit from increasing the number of basis modes. The SVD basis, by contrast, attains good scores with few basis modes, which shows that it is adept at compressing information, but its performance drops off once the number of basis modes exceeds the number of sensors (30 here).

Note that the basis is recomputed for each value of n_basis_modes, which may create misleading results for the RandomProjection basis. Normally, to increase the size of the basis, one would simply add a new random projection vector to the existing basis, in which case we would expect to see something much closer to a monotonic decrease in mean-square error as n_basis_modes is increased.
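
The nested construction described above can be illustrated with a small sketch. It is not how ps.basis.RandomProjection is implemented: the helper names are hypothetical, and the Gaussian vectors are an assumption chosen for illustration. The idea is to draw the largest set of projection vectors once, then take the first k of them for each basis size.

```python
import random


def make_projection_vectors(n_features, max_modes, seed=0):
    """Draw max_modes random Gaussian vectors once (hypothetical helper)."""
    rng = random.Random(seed)
    return [
        [rng.gauss(0.0, 1.0) for _ in range(n_features)]
        for _ in range(max_modes)
    ]


def nested_basis(vectors, n_basis_modes):
    """Use the first n_basis_modes vectors, so smaller bases are prefixes of larger ones."""
    return vectors[:n_basis_modes]


vectors = make_projection_vectors(n_features=64, max_modes=50)
small = nested_basis(vectors, 20)
large = nested_basis(vectors, 30)

# Growing the basis only appends vectors; the existing ones are unchanged
is_nested = large[:20] == small
print(is_nested)
```

With bases built this way, each larger basis contains every smaller one, so a comparison across n_basis_modes would no longer be confounded by a fresh random draw at each size.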

See this notebook for a more in-depth comparison of the bases.
