Categories
Data Science Statistics

VTREAT library up on CRAN

Our R variable treatment library vtreat has been accepted by CRAN! The purpose of the vtreat library is to reliably prepare data for supervised machine learning. We try to leave as much as possible to the machine learning algorithms themselves, but cover most of the truly necessary, typically ignored precautions. The library is designed to […]

Categories
Data Science Musings Science Statistics

Design, Problem Solving, and Good Taste

I ran across this essay recently on the role of design standards for scientific data visualization. The author, Jure Triglav, draws his inspiration from the creation and continued use of the NYCTA Graphics Standards, which were instituted in the late 1960s to unify the signage for the New York City subway system. As the author […]

Categories
Data Science Statistics Writing

New article up on Win-Vector — Vtreat: a package for variable treatment

We are writing an R package to implement some of the data treatment practices that we discuss in Chapters 4 and 6 of Practical Data Science with R. There’s an article describing the package up on the Win-Vector blog: When you apply machine learning algorithms on a regular basis, on a wide variety of data […]

Categories
Data Science Statistics

New Article on Win-Vector: Trimming the Fat from glm models in R

I have a new article up on the Win-Vector blog, about trimming down the inordinately large models that are produced by R’s glm() function. As with many of our articles, this one was inspired by snags we hit during client work. One of the attractive aspects of logistic regression models (and linear models in general) […]