Sequenced Regression and Airflow XComs
This weekend I’ve been working through a short Kaggle project to predict student math and writing grades using a small dataset.
Since the goal was to predict both math and writing scores, I could either build two separate models or take a multivariate regression approach, where a single model predicts both targets. Because the scores are correlated, I wanted to experiment with the latter.
There are several sklearn models that natively support multiple targets (LinearRegression, KNeighborsRegressor, and RandomForestRegressor, for example). I tried those out, but I also wanted to try an SVM, which only handles a single target, so I looked into sequenced regression. This is basically where a first model predicts the first target, then that prediction is passed along with the original features to a second model for the second target, and so on. This can be done pretty simply in sklearn with sklearn.multioutput.RegressorChain. A sequenced LinearSVR model turned out to be the most successful, so it was a worthwhile thing to play around with.
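To make the chaining idea concrete, here is a minimal sketch of a sequenced LinearSVR. It is not the project's actual code: the data is synthetic (make_regression stands in for the student dataset), and the hyperparameters and the target order are assumptions picked only for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.multioutput import RegressorChain
from sklearn.svm import LinearSVR

# Synthetic stand-in for the real data: 5 features, 2 targets
# (think of column 0 as "math" and column 1 as "writing").
X, y = make_regression(
    n_samples=200, n_features=5, n_targets=2, noise=5.0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# LinearSVR fits one target at a time. RegressorChain clones it once per
# target and fits the copies in the given order, feeding earlier targets
# to later models as extra features.
chain = RegressorChain(LinearSVR(max_iter=10000, random_state=0), order=[0, 1])
chain.fit(X_train, y_train)

# At predict time, the second model sees the first model's prediction.
preds = chain.predict(X_test)
print(preds.shape)
```

One detail worth knowing: with the default cv=None, the later models are trained on the true values of the earlier targets and only see predicted values at predict time. Passing a cv value makes the chain train on cross-validated predictions instead.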
I also set up a little Airflow pipeline to prepare the data, tune the hyperparameters, train the model, predict on the test data, and evaluate model performance. I use Airflow almost every day at work, so this was pretty par for the course, but I did get to dig into XComs, Airflow's mechanism for passing small pieces of data between tasks, which I hadn't used before. So that was fun.
All the code from this project is here.