Balancing Classes Before Training Classifiers – Addressing a Folk Theorem

We’ve been wanting to get more into training over at Win-Vector, but I don’t want to completely give up client work, because clients and their problems are often the inspiration for cool solutions — and good blog articles. Working on the video course for the last couple of months has given me some good ideas, too.

A lot of my recreational writing revolves around folklore and superstition — the ghosty, monster-laden kind. Engineers and statisticians have their own folk beliefs, too: things we wish were true, totemistic practices we believe help. Sometimes there’s a rational basis for those beliefs; sometimes there isn’t. My latest Win-Vector blog post is about one such folk theorem.

It’s a folk theorem I sometimes hear from colleagues and clients: that you must balance the class prevalence before training a classifier. Certainly, I believe that classification tends to be easier when the classes are nearly balanced, especially when the class you are actually interested in is the rarer one. But I have always been skeptical of the claim that artificially balancing the classes (through resampling, for instance) always helps, when the model is to be run on a population with the native class prevalences.
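To make the concern concrete, here is a minimal sketch (my own synthetic example, not from the post) of what artificially balancing via oversampling does to a probability model. The data, feature, and prevalences below are all hypothetical; the point is only that a model trained on a rebalanced sample reports inflated scores when the deployment population keeps the native class prevalence.

```python
# Hypothetical illustration: oversampling the rare class before training
# shifts a logistic model's predicted probabilities upward relative to
# the native prevalence. All data here is synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Imbalanced synthetic data: ~10% positives, one informative feature.
n = 5000
y = (rng.random(n) < 0.10).astype(int)
X = (y + rng.normal(0.0, 1.0, n)).reshape(-1, 1)

# Model trained at the native class prevalence.
native = LogisticRegression().fit(X, y)

# Oversample the rare class to balance the training set, then retrain.
pos = np.where(y == 1)[0]
neg = np.where(y == 0)[0]
pos_up = rng.choice(pos, size=len(neg), replace=True)
idx = np.concatenate([neg, pos_up])
balanced = LogisticRegression().fit(X[idx], y[idx])

# Score the same point with both models.
x0 = np.array([[0.5]])
p_native = native.predict_proba(x0)[0, 1]
p_balanced = balanced.predict_proba(x0)[0, 1]

# For logistic regression, rebalancing mostly shifts the intercept
# (roughly a log-odds offset of the prevalence ratio), so the balanced
# model's scores no longer match the population's true base rate.
```

Whether that shift helps or hurts depends on the classifier and on how the scores are used downstream, which is exactly the question the post examines.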

For some problems, with some classifiers, it does help — but for others, it doesn’t. I’ve already gotten a great, thoughtful comment on the post that helps articulate possible reasons behind my results. It’s good for us to introspect sometimes about our techniques and practices, rather than just blindly asserting that “this is how we do it.” Because even when we’re right, sometimes we’re right for the wrong reasons, which to me is worse than simply being wrong.

Read the post here.

Recent post on Win-Vector blog, plus some musings on audience



I put a new post up on Win-Vector a couple of days ago called “The Geometry of Classifiers”, a follow-up to a recent paper by Fernandez-Delgado et al. that investigates several classifiers against a body of data sets, mostly from the UCI Machine Learning Repository. Our article follows up the study with seven additional classifier implementations from scikit-learn and an interactive Shiny app to explore the results.

As you might guess, we did our little study not only because we were interested in the questions of classifier performance and classifier similarity, but because we wanted an excuse to play with scikit-learn and Shiny. We’re proud of the results (the app is cool!), but we didn’t consider this an especially ground-breaking post. Much to our surprise, the article got over 2000 views the day we posted it (a huge number, for us), rising to nearly 3000 as I write this. It’s already our eighth most popular post of this year (an earlier post by John on the Fernandez-Delgado paper, a comment about some of their data treatment, is also doing quite well: #2 for the month and #21 for the year).
