New Article on Win-Vector: Trimming the Fat from glm models in R
May 30, 2014
I have a new article up on the Win-Vector blog, about trimming down the inordinately large models that are produced by R’s
glm() function. As with many of our articles, this one was inspired by snags we hit during client work.
One of the attractive aspects of logistic regression models (and linear models in general) is their compactness: the size of the model grows with the number of coefficients, not with the size of the training data. With R, though, glm models are not so concise; we noticed this to our dismay when we tried to automate fitting a moderate number of models (about 500 models, each with on the order of 50 coefficients) to data sets of moderate size (several tens of thousands of rows). A workspace save of the models alone ran to tens of gigabytes! How is this possible? We decided to find out.
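The short version of the answer: a glm object carries along copies of the training data in several of its components (`$data`, `$model`, `$y`, `$fitted.values`, and so on), plus environments attached to `$terms` and `$formula` that can capture the whole workspace. Here is a minimal sketch (not necessarily the article's exact code) of the kind of trimming function we describe; the component names come from inspecting `str()` of a fitted model, and `predict(model, newdata = ...)` still works afterwards because it only needs the coefficients, terms, contrasts, and factor levels:

```r
# Sketch: strip the data-sized components from a fitted glm model,
# keeping just enough to call predict() on new data.
stripGlm <- function(model) {
  model$y <- c()                  # response vector (length = nrow(training data))
  model$model <- c()              # the full model frame: a copy of the training data
  model$residuals <- c()
  model$fitted.values <- c()
  model$effects <- c()
  model$qr$qr <- c()              # QR decomposition matrix; keep $qr$rank etc.
  model$linear.predictors <- c()
  model$weights <- c()
  model$prior.weights <- c()
  model$data <- c()               # reference to the original data frame
  # The terms and formula objects drag along the environment they were
  # created in, which can silently retain the entire training data set.
  attr(model$terms, ".Environment") <- c()
  attr(model$formula, ".Environment") <- c()
  model
}
```

Note that some downstream functions (for example, `summary()` or `anova()`) need the components removed here, so this trimming is only appropriate when the model's remaining job is prediction.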
You can read the article here.
My business partner John Mount had an amusing take on our
glm epiphany, borrowed from The Six Stages of Debugging:
1) That can’t happen.
2) That doesn’t happen on my machine.
3) That shouldn’t happen.
4) Why does that happen?
5) Oh, I see.
6) How did that ever work?
Sometimes, you really wonder.