vtreat library up on CRAN

Our R variable treatment library vtreat has been accepted by CRAN!

The purpose of the vtreat library is to reliably prepare data for supervised machine learning. We try to leave as much as possible to the machine learning algorithms themselves, but we cover most of the truly necessary yet typically ignored precautions. The library is designed to produce a data.frame that is entirely numeric, and it takes common precautions to guard against the following real-world data issues…
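For concreteness, here is a minimal sketch of the intended workflow using vtreat's designTreatmentsC()/prepare() interface for a binary-outcome problem; the frame and column names are hypothetical placeholders:

    # Hypothetical example: dTrain is a training data.frame with a
    # logical outcome column 'y' and messy input columns 'x1', 'x2'.
    library(vtreat)

    # Learn the treatment plan from the training data.
    treatments <- designTreatmentsC(dTrain,
                                    varlist = c('x1', 'x2'),
                                    outcomename = 'y',
                                    outcometarget = TRUE)

    # prepare() returns an entirely numeric data.frame, with NAs,
    # novel categorical levels, and similar issues safely encoded.
    dTrainTreated <- prepare(treatments, dTrain, pruneSig = 0.05)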

Read more details about the library, including how to install it from CRAN or from GitHub, at the Win-Vector blog, here.

New on Win-Vector: Checking your Data for Signal

I have a new article up on the Win-Vector Blog, on checking your input variables for signal:

An all too common approach to modeling in data science is to throw all possible variables at a modeling procedure and “let the algorithm sort it out.” This is tempting when you are not sure which variables are the true causes or predictors of the phenomenon you are interested in, but it presents dangers, too. Very wide data sets are computationally difficult for some modeling procedures; more importantly, they can lead to overfit models that generalize poorly to new data. In extreme cases, wide data can fool modeling procedures into finding models that look good on training data even when that data has no signal. We showed some examples of this previously in our “Bad Bayes” blog post.

In this latest “Statistics as it should be” article, we will look at a heuristic to help determine which of your input variables have signal.

Read the article here.

Another underlying motivation for this article is to encourage giving empirical intuition for common statistical procedures, like testing for significance; in this case, testing your model against the null hypothesis that you are fitting to pure noise. As a data scientist, you may or may not use my suggested heuristic for variable selection, but it’s good to get in the habit of thinking about the things you measure, and not just how to take the measurements, but why.
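The article develops the heuristic itself; purely as an illustration of the underlying idea (and not the article's exact procedure), here is one way to test a single variable against the pure-noise null hypothesis with a permutation test in R, assuming a hypothetical input x and binary outcome y:

    # Permutation-test sketch: estimate how often a variable looks
    # this good when the outcome is pure noise.
    set.seed(2015)

    score <- function(x, y) {
      # deviance improvement of a one-variable logistic regression
      # over the null model
      model <- glm(y ~ x, family = binomial)
      model$null.deviance - model$deviance
    }

    observed <- score(x, y)
    # null distribution: score the variable against permuted outcomes
    nullScores <- replicate(200, score(x, sample(y)))

    # empirical significance: fraction of null scores at least as large
    sigEst <- mean(nullScores >= observed)

A small sigEst is evidence the variable carries signal; a large one says its apparent fit is consistent with noise.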

New on Win-Vector: Variable Selection for Sessionized Data

Illustration: Boris Artzybasheff
Photo: James Vaughan, some rights reserved

I’ve just put up the next installment of the new “Working with Sessionized Data” series on Win-Vector. As I mentioned in the previous installment, sessionizing log data can potentially lead to very wide data sets, with possibly more variables than there are rows in the training data. In this post, we look at variable selection strategies for this situation, or for any very wide data situation, really.

In the previous installment, we built a regularized (ridge) logistic regression model over all 132 features. This model didn’t perform too badly, but in general there is more danger of overfitting when working with very wide data sets; in addition, it is quite expensive to analyze a large number of variables with standard implementations of logistic regression. In this installment, we will look for potentially more robust and less expensive ways of analyzing this data.
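One standard inexpensive approach, given here as a sketch of the general idea rather than the post's exact method, is to score each variable with its own single-variable logistic regression and keep only the variables whose deviance improvement is significant under a chi-squared test. The frame and column names below are hypothetical:

    # dTrain: hypothetical training data.frame with binary outcome 'y'.
    varNames <- setdiff(colnames(dTrain), 'y')

    scoreVar <- function(v) {
      # significance of the deviance improvement of a one-variable
      # logistic regression, via a chi-squared test
      f <- as.formula(paste('y', v, sep = ' ~ '))
      model <- glm(f, family = binomial, data = dTrain)
      improvement <- model$null.deviance - model$deviance
      pchisq(improvement,
             df = model$df.null - model$df.residual,
             lower.tail = FALSE)
    }

    sigs <- vapply(varNames, scoreVar, numeric(1))
    selectedVars <- names(sigs)[sigs < 0.05]  # threshold is illustrative

Each fit involves only one variable, so this scales linearly in the number of candidate variables and is far cheaper than fitting them all jointly.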

Read the rest of the post here.

Enjoy.