vtreat library up on CRAN


Our R variable treatment library vtreat has been accepted by CRAN!

The purpose of the vtreat library is to reliably prepare data for supervised machine learning. We try to leave as much as possible to the machine learning algorithms themselves, but cover most of the truly necessary (and typically ignored) precautions. The library is designed to produce a data.frame that is entirely numeric, and takes common precautions to guard against the following real-world data issues…
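For a sense of the workflow, here is a minimal sketch using the library's designTreatmentsC()/prepare() pattern for a binary outcome; the data frame and column names below are made up for illustration:

    # Build a treatment plan on training data, then produce an
    # all-numeric frame safe to hand to a modeling algorithm.
    library(vtreat)

    d <- data.frame(
      x = c("a", "a", "b", "b", NA, "c"),           # categorical column with an NA
      n = c(1, NA, 3, 4, 5, 6),                     # numeric column with a missing value
      y = c(TRUE, TRUE, FALSE, FALSE, TRUE, FALSE)  # binary outcome
    )

    plan <- designTreatmentsC(d, varlist = c("x", "n"),
                              outcomename = "y", outcometarget = TRUE)
    dTreated <- prepare(plan, d, pruneSig = 0.99)
    str(dTreated)  # all columns numeric; NAs replaced, with indicator columns added

The same plan can later be applied to new data with prepare(), so test or production data gets exactly the same treatment as the training data.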

Read more details about the library, including how to install it from CRAN or from GitHub, at the Win-Vector blog, here.

Design, Problem Solving, and Good Taste


Image: A Case for Spaceships (Jure Triglav)

I ran across this essay recently on the role of design standards for scientific data visualization. The author, Jure Triglav, draws his inspiration from the creation and continued use of the NYCTA Graphics Standards, which were instituted in the late 1960s to unify the signage for the New York City subway system. As the author puts it, the Graphics Standards Manual is “a timeless example of great design elegantly solving a real problem.” Thanks to the unified iconography, a traveler on the New York subway knows exactly what to look for to navigate the subway system, no matter which station they may be in. And the iconography is beautiful, too.


Unimark, the design firm that created the Graphics Standards.
Aren’t they a hip, mod-looking group? And I’m jealous of those lab coats.
Image: A Case for Spaceships (Jure Triglav)

What works to clarify subway travel will work to clarify the morass of graphs and charts that pass for scientific visualization, Triglav argues. And we should start with the work of the Joint Committee on Standards for Graphical Presentation, a group of statisticians, engineers, scientists, and mathematicians who first adopted a set of standards in 1914 and revised them in 1936, 1938, and 1960.

I agree with him — mostly.


New article up on Win-Vector — Vtreat: a package for variable treatment

We are writing an R package to implement some of the data treatment practices that we discuss in Chapters 4 and 6 of Practical Data Science with R. There’s an article describing the package up on the Win-Vector blog:

When you apply machine learning algorithms on a regular basis, on a wide variety of data sets, you find that certain data issues come up again and again:

  • Missing values (NA or blanks)
  • Problematic numerical values (Inf, NaN, sentinel values like 999999999 or -1)
  • Valid categorical levels that don’t appear in the training data (especially when there are rare levels, or a large number of levels)
  • Invalid values

Of course, you should examine the data to understand the nature of the data issues: are the missing values missing at random, or are they systematic? What are the valid ranges for the numerical data? Are there sentinel values, what are they, and what do they mean? What are the valid values for text fields? Do we know all the valid values for a categorical variable, and are there any missing? Is there any principled way to roll up category levels? In the end though, the steps you take to deal with these issues will often be the same from data set to data set, so having a package of ready-to-go functions for data treatment is useful. In this article, we will discuss some of our usual data treatment procedures, and describe a prototype R package that implements them.
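To make the issues in that list concrete, here is a small made-up example (not from the article) of how they show up in a data frame, along with quick base-R checks for them:

    train <- data.frame(
      income = c(50000, NA, 999999999, 62000),  # a missing value and a sentinel value
      rate   = c(0.10, Inf, NaN, 0.30),         # problematic numerical values
      state  = factor(c("CA", "NY", "CA", "OR"))
    )
    test <- data.frame(income = 58000, rate = 0.20,
                       state = factor("WA"))    # level that never appears in training

    colSums(is.na(train))                       # missing values per column
    sapply(train[sapply(train, is.numeric)],
           function(x) sum(!is.finite(x)))      # Inf/NaN/NA in the numeric columns
    setdiff(levels(test$state), levels(train$state))  # novel categorical levels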

You can read the article here. Enjoy.

New Article on Win-Vector: Trimming the Fat from glm models in R

I have a new article up on the Win-Vector blog, about trimming down the inordinately large models that are produced by R’s glm() function. As with many of our articles, this one was inspired by snags we hit during client work.

One of the attractive aspects of logistic regression models (and linear models in general) is their compactness: the size of the model grows in the number of coefficients, not in the size of the training data. With R, though, glm models are not so concise; we noticed this to our dismay when we tried to automate fitting a moderate number of models (about 500 models, with on the order of 50 coefficients) to data sets of moderate size (several tens of thousands of rows). A workspace save of the models alone was in the tens of gigabytes! How is this possible? We decided to find out.
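A rough sketch of the general idea (not the article's exact code): fit a logistic regression, measure its size, then drop the per-row components that predict() on new data does not need. The data and sizes here are invented for illustration.

    set.seed(2014)
    N <- 10000
    d <- data.frame(x1 = rnorm(N), x2 = rnorm(N))
    d$y <- with(d, x1 + 2 * x2 + rnorm(N) > 0)

    model <- glm(y ~ x1 + x2, data = d, family = binomial(link = "logit"),
                 y = FALSE, model = FALSE)  # already omits two row-sized pieces
    print(object.size(model))               # still large: many fields scale with N

    slim <- model
    slim$residuals         <- NULL
    slim$fitted.values     <- NULL
    slim$effects           <- NULL
    slim$linear.predictors <- NULL
    slim$weights           <- NULL
    slim$prior.weights     <- NULL
    slim$data              <- NULL
    slim$qr$qr             <- NULL  # keep slim$qr$pivot and slim$rank; predict() uses them
    print(object.size(slim))

    # Predictions on new data are unchanged
    newd <- data.frame(x1 = rnorm(5), x2 = rnorm(5))
    all.equal(predict(model, newdata = newd, type = "response"),
              predict(slim,  newdata = newd, type = "response"))

The catch, as the article discusses in more detail, is that the fitted object carries several copies of row-level data, so the model's footprint grows with the training set rather than with the number of coefficients.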

You can read the article here.

My business partner John Mount had an amusing comment to make about our glm epiphany, borrowed from The Six Stages of Debugging.

1) That can’t happen.

2) That doesn’t happen on my machine.

3) That shouldn’t happen.

4) Why does that happen?

5) Oh, I see.

6) How did that ever work?

Sometimes, you really wonder.