Regression Splines

1 minute read

What I Learned: The Basics of Regression Splines

A project I’m working on at the moment models the relationship between tenure, pay, age, and personality (along with some control variables) and job satisfaction for public sector employees. It’s a longitudinal setup, and we’re particularly interested in looking into nonlinear relationships that might be present in the data. The option of using splines has been brought up, which I haven’t used before. Splines also briefly came up in lecture 4 of Rethinking Statistics, a Bayesian statistics course I’ve been working through.

So, today I spent some time looking into them.

There are many kinds of splines, but essentially they aim to fit a smooth curve through irregular data points. Unlike with polynomial regression where a polynomial is fitted globally across the entire range of \(X\), splines divide \(X\) into continuous intervals, each of which is represented by a separate polynomial function. This is known as piecewise polynomials.

The “cut-off” points for the different ranges of \(X\) are known as “knots,” the ideal number of which is chosen through cross-validation. Discontinuity at the knots can be avoided through mathematical constraints to enforce qualities of smoothness. Cubic splines, for example, must be continuous and have continuous first and second derivatives, which eliminates the possibility for jumps at the knots.

One problem with cubic splines is that the boundary regions (below the lowest knot and above the highest knot) can be highly variable, with larger standard errors for confidence intervals. Natural cubic splines improve in this area by requiring polynomials that are linear in the boundary regions, thus leading to lower variability.

Another term that comes up often, “smoothing splines,” basically refers to a spline that also incorporates regularization by adding a roughness penalty term to the least squares criterion.

Splines can be customized and adapted in a variety of ways, and have numerous advantages over polynomial regression, which tends to overfit and may be overly complex in models with many features.