Categories
Data Science Statistics

VTREAT library up on CRAN

Our R variable treatment library vtreat has been accepted by CRAN! The purpose of the vtreat library is to reliably prepare data for supervised machine learning. We try to leave as much as possible to the machine learning algorithms themselves, but cover most of the truly necessary yet typically ignored precautions. The library is designed to […]

Categories
Data Science Statistics Writing

New on Win-Vector: Checking your Data for Signal

I have a new article up on the Win-Vector Blog, on checking your input variables for signal: An all too common approach to modeling in data science is to throw all possible variables at a modeling procedure and “let the algorithm sort it out.” This is tempting when you are not sure what are the […]

Categories
Data Science Statistics Writing

New on Win-Vector: Variable Selection for Sessionized Data

Illustration: Boris Artzybasheff; photo: James Vaughan, some rights reserved

I’ve just put up the next installment of the new “Working with Sessionized Data” series on Win-Vector. As I mentioned in the previous installment, sessionizing log data can potentially lead to very wide data sets, with possibly more variables than there are rows in the […]