New Article on Win-Vector: Trimming the Fat from glm models in R

I have a new article up on the Win-Vector blog, about trimming down the inordinately large models that are produced by R’s glm() function. As with many of our articles, this one was inspired by snags we hit during client work.

One of the attractive aspects of logistic regression models (and linear models in general) is their compactness: the size of the model grows with the number of coefficients, not with the size of the training data. With R, though, glm models are not so concise; we noticed this to our dismay when we tried to automate fitting a moderate number of models (about 500 models, each with on the order of 50 coefficients) to data sets of moderate size (several tens of thousands of rows). A workspace save of the models alone ran to tens of gigabytes! How is this possible? We decided to find out.
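To give a flavor of the issue, here is a rough sketch in base R, on made-up data (the component names are standard fields of a glm object, but see the article for the careful treatment of what is and isn't safe to drop, and why):

# Sketch only: fit a glm on synthetic data and see which components
# of the returned object carry the weight.
set.seed(2014)
N <- 100000
d <- data.frame(x1 = rnorm(N), x2 = rnorm(N))
d$y <- rbinom(N, size = 1, prob = plogis(0.5 * d$x1 - 0.25 * d$x2))

model <- glm(y ~ x1 + x2, data = d, family = binomial(link = "logit"))

print(object.size(model), units = "Mb")  # far larger than three coefficients warrant
sort(sapply(model, object.size), decreasing = TRUE)[1:5]  # data-sized fields dominate

# Illustrative trimming: drop data-sized fields that predict() on new data
# does not need. (summary(), residual plots, etc. will no longer work on the
# stripped object, so keep the full model around if you need those.)
stripped <- model
for (f in c("model", "data", "y", "residuals", "fitted.values",
            "linear.predictors", "weights", "prior.weights", "effects")) {
  stripped[[f]] <- NULL
}
print(object.size(stripped), units = "Mb")

In short, much of the bulk comes from components that scale with the training data rather than with the number of coefficients (and environments captured along with the model can add more when you save to disk).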

You can read the article here.

My business partner John Mount had an amusing comment to make about our glm epiphany, borrowed from The Six Stages of Debugging:

1) That can’t happen.

2) That doesn’t happen on my machine.

3) That shouldn’t happen.

4) Why does that happen?

5) Oh, I see.

6) How did that ever work?

Sometimes, you really wonder.


