If you follow the Win-Vector blog, you know that we have developed a number of R packages that encapsulate our data science working process and philosophy. The biggest package, of course, is our data preparation package, `vtreat`, which implements many of the data treatment principles that I describe in my white-paper, here. We’ve also got packages for managing non-standard evaluation environments, such as `dplyr`; for reporting model summaries and statistics; for conveniently generating common statistical visualizations (at least, the ones we use a lot); and for step-debugging.

Now, all these packages are up on CRAN. Hopefully, you will find this functionality as useful as we do.

Please see this post on the Win-Vector blog for links to descriptions of all our packages.

Using `dplyr` with a specific data frame, where all the columns are known, is an effective and pleasant way to execute declarative (SQL-like) operations on dataframes and dataframe-like objects in R. It also has the advantage of working not only on local data, but also on `dplyr`-supported remote data stores, like SQL databases or Spark.

However, once we no longer know the column names, the pleasure quickly fades. The currently recommended way to handle `dplyr`’s non-standard evaluation is via the `lazyeval` package. This is not pretty. I never want to write anything like the following, ever again.

    # target is a moving target, so to speak
    target = "column_I_want"

    library(lazyeval)

    # return all the rows where target column is NA
    dframe %>%
      filter_(interp(~ is.na(col), col=as.name(target)) )

This example is fairly simple, but the more complex the `dplyr` expression, and the more columns involved, the more unwieldy the `lazyeval` solution becomes.

The difficulty of parameterizing `dplyr` expressions is part of the motivation for Win-Vector’s new package, `replyr`. I’ve just posted an article to the Win-Vector blog, on the function `replyr::let`, which lets us parametrize `dplyr` expressions without `lazyeval`.

    target = "column_I_want"

    library(replyr)

    # return all the rows where target column is NA
    let(alias = list(col=target),
        expr = dframe %>% filter(is.na(col)) )()

The `dplyr` expression is no more complicated than the equivalent expression when the columns are known, and using multiple columns only involves additional entries in the `alias` mapping.
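To make the idea of an alias mapping with several entries concrete, here is a minimal base R sketch. It is not `replyr`'s implementation and does not use `dplyr`: the helper `with_alias`, the data frame, and the column names are all hypothetical, chosen only to illustrate substituting stand-in names with real column names before evaluation.

```r
# Hypothetical helper for illustration only; this is NOT replyr's implementation.
# It replaces placeholder symbols in a quoted expression with the column names
# given in `alias`, then evaluates the rewritten expression.
with_alias <- function(alias, expr, env = parent.frame()) {
  force(env)
  replacements <- lapply(alias, as.name)
  rewritten <- do.call(substitute, list(expr, replacements))
  eval(rewritten, env)
}

# Made-up example data
dframe <- data.frame(
  height = c(1.5, NA, 1.8, NA),
  weight = c(60, 70, NA, 80)
)

# The column names are only known at run time
target1 <- "height"
target2 <- "weight"

# Two entries in the alias mapping: rows where col1 is NA and col2 is not
result <- with_alias(
  alias = list(col1 = target1, col2 = target2),
  expr  = quote(subset(dframe, is.na(col1) & !is.na(col2)))
)
print(result)
```

The expression is written once against the stand-in names `col1` and `col2`; changing which columns are used only means changing the values in the mapping.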

`replyr` is a new package, and it is still going through growing pains as we figure out the best ways to implement desired functionality. We welcome suggestions for new functions, and more efficient or more general ways to implement the functionality that we supply.

The talk is called *Improving Prediction using Nested Models and Simulated Out-of-Sample Data*.

In this talk I will discuss nested predictive models. These are models that predict an outcome or dependent variable (called y) using additional submodels that have also been built with knowledge of y. Practical applications of nested models include “the wisdom of crowds”, prediction markets, variable re-encoding, ensemble learning, stacked learning, and superlearners.

Nested models can improve prediction performance relative to single models, but they introduce a number of undesirable biases and operational issues, and when they are improperly used, they are statistically unsound. However, modern practitioners have made effective, correct use of these techniques. In my talk I will give concrete examples of nested models, how they can fail, and how to fix failures. The solutions we will discuss include advanced data partitioning, simulated out-of-sample data, and ideas from differential privacy. The theme of the talk is that with proper techniques, these powerful methods can be safely used.
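As a rough illustration of how a nested model can fail and how simulated out-of-sample data can help, here is a small base R sketch. Everything in it is an assumption made for the example: the data is synthetic, the categorical variable is pure noise, and the re-encoding is a simple per-level mean (a bare-bones stand-in for variable re-encoding, not the method from the talk).

```r
set.seed(1)
n <- 1000

# Synthetic data: a many-level categorical variable that carries no signal
d <- data.frame(
  g = sample(paste0("lev", 1:200), n, replace = TRUE),
  y = rnorm(n),
  stringsAsFactors = FALSE
)

# Naive re-encoding: per-level mean of y, computed on the same rows
# that the downstream model will be fit on
naive_code <- ave(d$y, d$g) - mean(d$y)

# Simulated out-of-sample re-encoding: each row is coded using
# level means computed only from the other folds
k <- 5
fold <- sample(rep(1:k, length.out = n))
cross_code <- numeric(n)
for (i in 1:k) {
  train <- d[fold != i, ]
  level_means <- tapply(train$y, train$g, mean) - mean(train$y)
  coded <- unname(level_means[as.character(d$g[fold == i])])
  coded[is.na(coded)] <- 0  # levels unseen in the other folds get a neutral code
  cross_code[fold == i] <- coded
}

# Apparent fit of the downstream model on its own training data
r2_naive <- summary(lm(d$y ~ naive_code))$r.squared
r2_cross <- summary(lm(d$y ~ cross_code))$r.squared
print(c(naive = r2_naive, cross = r2_cross))
```

Since `g` is noise by construction, any apparent fit from the naive encoding comes from the submodel having already seen `y`; the cross-coded version should show much less of this.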

John Mount and I will also be giving a workshop called *A Unified View of Model Evaluation* at **ODSC West 2016 on November 4** (the premium workshop sessions), and **November 5** (the general workshop sessions).

We will present a unified framework for predictive model construction and evaluation. Using this perspective we will work through crucial issues from classical statistical methodology, large data treatment, variable selection, ensemble methods, and all the way through stacking/super-learning. We will present R code demonstrating principled techniques for preparing data, scoring models, estimating model reliability, and producing decisive visualizations. In this workshop we will share example data, methods, graphics, and code.

I’m looking forward to these talks, and to seeing those of you who can attend.

We can’t read it, of course, but it’s cool (and a bit intimidating) to see what our work looks like in another language and character set. Here are a couple of peeks inside, just for fun.

I wonder if Manning is planning any other translated editions? I’ll keep you posted.

- Part 1: A review of standard “x-only” PCR, with a worked example. I also show some issues that can arise with the standard approach.
- Part 2: An introduction to y-aware scaling to guide PCA in identifying principal components most relevant to the outcome of interest. Y-aware PCA helps alleviate the issues that came up in Part 1.
- Part 3: How to pick the appropriate number of principal components.
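The y-aware scaling mentioned in Part 2 can be sketched roughly as follows. This is a simplified reading, assuming that each variable is rescaled by the slope of a single-variable regression of y on that variable, so that a unit change in the rescaled variable corresponds to a unit change in y. The data is synthetic and the helper name `y_aware_scale` is hypothetical.

```r
set.seed(42)
n <- 200

# Synthetic data: x1 has large variance but is unrelated to y;
# x2 has small variance but drives y
x1 <- rnorm(n, sd = 10)
x2 <- rnorm(n, sd = 1)
y  <- 3 * x2 + rnorm(n, sd = 0.5)
X  <- data.frame(x1 = x1, x2 = x2)

# Hypothetical helper: center x, then multiply by the slope of lm(y ~ x)
y_aware_scale <- function(x, y) {
  slope <- unname(coef(lm(y ~ x))[2])
  slope * (x - mean(x))
}

X_scaled <- as.data.frame(lapply(X, y_aware_scale, y = y))

# PCA on the y-aware scaled variables (already centered, so no further scaling)
pca_y <- prcomp(X_scaled, center = FALSE, scale. = FALSE)
loadings_pc1 <- pca_y$rotation[, "PC1"]
print(loadings_pc1)
```

In this setup the first principal component should load mostly on `x2`, the variable that actually relates to y, even though `x1` has far more raw variance.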

I will also be giving a short talk on y-aware principal components analysis in R at the **August Bay Area useR Group meetup** on August 9, along with talks by consultant Allan Miller and Jocelyn Barker from Microsoft. It promises to be an interesting evening.

The meetup will be at Guardant Health in Redwood City. Hope to see you there.

I’m kicking off a two-part series on Principal Components Regression on the Win-Vector blog today. The first article demonstrates some of the pitfalls of using standard Principal Components Analysis in a predictive modeling context. John Mount has posted an introduction to my first article on the Revolutions blog, explaining our motivation in developing this series.

The second article will demonstrate some *y*-aware approaches that alleviate the issues that we point out in Part 1.

In principal components regression (PCR), we use principal components analysis (PCA) to decompose the independent (x) variables into an orthogonal basis (the principal components), and select a subset of those components as the variables to predict y. PCR and PCA are useful techniques for dimensionality reduction when modeling, and are especially useful when the independent variables are highly collinear.

Generally, one selects the principal components with the highest variance — that is, the components with the largest singular values — because the subspace defined by these principal components captures most of the variation in the data, and thus represents a smaller space that we believe captures most of the qualities of the data. Note, however, that standard PCA is an “x-only” decomposition, and as Jolliffe (1982) shows through examples from the literature, sometimes lower-variance components can be critical for predicting y, and conversely, high variance components are sometimes not important.
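A minimal base R sketch of the point above, on synthetic data constructed for this purpose (an assumption, not data from the article). The variables are deliberately left unscaled so that the component ordering reflects the raw variance of x; the highest-variance component is then compared with the lower-variance one as a predictor of y.

```r
set.seed(2016)
n <- 200

# Synthetic data: x1 has large variance but is unrelated to y;
# x2 has small variance but drives y
x1 <- rnorm(n, sd = 10)
x2 <- rnorm(n, sd = 1)
y  <- 3 * x2 + rnorm(n, sd = 0.5)
X  <- cbind(x1 = x1, x2 = x2)

# Standard "x-only" PCA: y plays no role in the decomposition
pca <- prcomp(X, center = TRUE, scale. = FALSE)
scores <- as.data.frame(pca$x)
scores$y <- y

# PCR using only the highest-variance component
fit_pc1 <- lm(y ~ PC1, data = scores)

# For comparison, a regression on the lower-variance component
fit_pc2 <- lm(y ~ PC2, data = scores)

r2_pc1 <- summary(fit_pc1)$r.squared
r2_pc2 <- summary(fit_pc2)$r.squared
print(c(sdev_pc1 = pca$sdev[1], sdev_pc2 = pca$sdev[2]))
print(c(r2_pc1 = r2_pc1, r2_pc2 = r2_pc2))
```

Selecting components by variance alone would keep `PC1` here, yet by construction almost all of the information about y sits in `PC2`.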

Enjoy.

Also, I love the “TODD Talks” skit at the end.

**Data Preparation with R**

Thursday, March 17, 2016 10:00 A.M. – 11:00 A.M. (Pacific time)

Data quality is the single most important item to the success of your data science project. Preparing data for analysis is one of the most important, laborious, and yet neglected aspects of data science. Many of the routine steps can be automated in a principled manner. This webinar will lay out the statistical fundamentals of preparing data. Our speaker, Nina Zumel, principal consultant and co-founder of Win-Vector, LLC, will cover what goes wrong with data and how you can detect the problems and fix them.

Details and registration here. I’m looking forward to it!

We had a busy January here at Win-Vector, and it shows no sign of abating. John and I had the pleasure of attending the first Shiny Developers Conference, held by RStudio and hosted at Stanford University (see here for a review of the conference, by a fellow attendee). The event energized us to resharpen our Shiny skills, and I’ve put together a little gallery of the Shiny apps that we’ve developed and featured on the Win-Vector blog. It’s a small gallery at the moment, but I expect it will grow.

In addition, I gave a repeat presentation of the Differential Privacy talk that I gave to the Bay Area Women in Data Science and Machine Learning Meetup last December, and am gearing up for a planned webinar on Prepping Data for Analysis in R (the webinar has not yet been announced by the hosts — more details soon).

And I’ve managed to slip in a couple of Win-Vector blog posts, too:

**Using PostgreSQL in R: A quick how-to**

**Finding the K in K-means by Parametric Bootstrap** (with Shiny app!)

We are also looking forward to giving a presentation at the ODSC San Francisco Meetup on March 31, and participating in the R Day all-day tutorial at Strata/Hadoop World Santa Clara on March 29.

2016 is shaping up to be a good year.

Image: World War II era poster by J. Howard Miller. Source: Wikipedia

**Workshop at ODSC, San Francisco – November 14**

John and I will be giving a two-hour workshop called *Preparing Data for Analysis using R: Basic through Advanced Techniques*. We will cover key issues in this important but often neglected aspect of data science, what can go wrong, and how to fix it. This is part of the Open Data Science Conference (ODSC) at the Marriott Waterfront in Burlingame, California, November 14-15. If you are attending this conference, we look forward to seeing you there!

You can find an abstract for the workshop, along with links to software and code you can download ahead of time, here.

**An Introduction to Differential Privacy as Applied to Machine Learning: Women in ML/DS – December 2**

I will give a talk to the Bay Area Women in Machine Learning & Data Science Meetup group, on applying differential privacy for reusable hold-out sets in machine learning. The talk will also cover the use of differential privacy in effects coding (what we’ve been calling “impact coding”) to reduce the bias that can arise from the use of nested models. Information about the talk, and the meetup group, can be found here.

I’m looking forward to these upcoming appearances, and I hope you can make one or both of them.
