Practical Data Science with R, 2nd Edition — New Chapters!

We have two new chapters of Practical Data Science with R, Second Edition online and available for review! This brings the total to six chapters available to MEAP subscribers.

The newly available chapters cover:

Data Engineering And Data Shaping – Explores how to use R to organize or wrangle data into a shape useful for analysis. The chapter covers applying data transforms, data manipulation packages, and more.

Choosing and Evaluating Models – The chapter starts by exploring machine learning approaches and then moves on to key model evaluation topics such as mapping business problems to machine learning tasks, evaluating model quality, and explaining model predictions.

If you haven’t signed up for our book’s MEAP (Manning Early Access Program), we encourage you to do so. The MEAP includes a free copy of Practical Data Science with R, First Edition, as well as early access to chapter drafts of the second edition as we complete them.

For those of you who have already subscribed — thank you! We hope you enjoy the new chapters, and we look forward to your feedback.

I’ve told a few people privately, but now I can announce it publicly: we are working on the second edition of Practical Data Science with R!

Manning Publications has just launched the MEAP for the second edition. The MEAP (Manning Early Access Program) allows you to subscribe to drafts of chapters as they become available, and to give us feedback before the book goes into print. Currently, drafts of the first three chapters are available.

If you’ve been contemplating buying the first edition, and haven’t yet, don’t worry. If you subscribe to the MEAP for the second edition, an eBook copy of the previous edition, Practical Data Science with R (First Edition), is included at no additional cost.

In addition to the topics that we covered in the first edition, we plan to add: additional material on using the vtreat package for data preparation; a discussion of LIME for model explanation; and sections on modeling techniques that we didn’t cover in the first edition, such as gradient boosting, regularized regression, and auto-encoders.

Please subscribe to our book; your support now will help us improve it. Please also forward this offer to your friends and colleagues (and please ask them to also subscribe and forward).

A Trunkful of Win-Vector R Packages


If you follow the Win-Vector blog, you know that we have developed a number of R packages that encapsulate our data science working process and philosophy. The biggest package, of course, is our data preparation package, vtreat, which implements many of the data treatment principles that I describe in my white-paper, here.

New Win-Vector Package replyr: for easier dplyr

Using dplyr with a specific data frame, where all the columns are known, is an effective and pleasant way to execute declarative (SQL-like) operations on dataframes and dataframe-like objects in R. It also has the advantage of working not only on local data, but also on dplyr-supported remote data stores, like SQL databases or Spark.

However, once we no longer know the column names, the pleasure quickly fades. The currently recommended way to handle dplyr’s non-standard evaluation is via the lazyeval package. This is not pretty. I never want to write anything like the following, ever again.

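# target is a moving target, so to speak
target = "column_I_want"

library(dplyr)     # for %>% and filter_()
library(lazyeval)  # for interp()

# return all the rows where target column is NA
dframe %>%
  filter_(interp(~ is.na(col), col = as.name(target)))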

This example is fairly simple, but the more complex the dplyr expression, and the more columns involved, the more unwieldy the lazyeval solution becomes.

The difficulty of parameterizing dplyr expressions is part of the motivation for Win-Vector’s new package, replyr. I’ve just posted an article to the Win-Vector blog on the function replyr::let, which lets us parameterize dplyr expressions without lazyeval.
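To make the idea concrete, here is a minimal sketch of the same NA filter written with a let-style substitution. This post does not show the interface of replyr::let, so the sketch uses wrapr::let as a stand-in; that swap is an assumption, as is the toy data frame. COL is just a placeholder name chosen for illustration.

```r
library(dplyr)   # provides %>% and filter()

# Hypothetical toy data, for illustration only
dframe <- data.frame(
  id = 1:4,
  column_I_want = c(1.5, NA, 3.0, NA)
)

# the column name arrives as a string
target <- "column_I_want"

# Assumption: wrapr is installed, and wrapr::let stands in for replyr::let.
# let() replaces the placeholder COL with the name stored in target,
# then evaluates the resulting ordinary dplyr expression.
na_rows <- wrapr::let(
  alias = c(COL = target),
  expr = dframe %>% filter(is.na(COL))
)

print(na_rows)
```

The dplyr expression inside let() reads as if the column name were known in advance, which is the point: the more columns and verbs involved, the more this helps compared with building formulas through interp().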

Upcoming Talks

I will be speaking at the Women who Code Silicon Valley meetup on Thursday, October 27.

The talk is called Improving Prediction using Nested Models and Simulated Out-of-Sample Data.

In this talk I will discuss nested predictive models. These are models that predict an outcome or dependent variable (called y) using additional submodels that have also been built with knowledge of y. Practical applications of nested models include “the wisdom of crowds”, prediction markets, variable re-encoding, ensemble learning, stacked learning, and superlearners.

Nested models can improve prediction performance relative to single models, but they introduce a number of undesirable biases and operational issues, and they are statistically unsound when used improperly. However, modern practitioners have made effective, correct use of these techniques. In my talk I will give concrete examples of nested models, how they can fail, and how to fix failures. The solutions we will discuss include advanced data partitioning, simulated out-of-sample data, and ideas from differential privacy. The theme of the talk is that with proper techniques, these powerful methods can be safely used.
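As a rough illustration of the failure mode and of simulated out-of-sample data, here is a small base R sketch. It is not taken from the talk. The data, the helper names fit_impact and apply_impact, and the choice of five folds are all assumptions made for the example. The submodel re-encodes a many-level categorical variable as the per-level mean of y, and the outcome is pure noise, so any apparent signal is an artifact of nesting.

```r
set.seed(2016)

# Hypothetical toy data: 50 levels, and an outcome unrelated to x
n <- 200
d <- data.frame(
  x = sample(paste0("lev", 1:50), n, replace = TRUE),
  y = rnorm(n),
  stringsAsFactors = FALSE
)

# Submodel (illustrative helper): per-level mean of y, relative to the grand mean
fit_impact <- function(x, y) {
  list(grand = mean(y), means = vapply(split(y, x), mean, numeric(1)))
}

# Apply the submodel; levels not seen during fitting get a neutral code of 0
apply_impact <- function(model, x) {
  est <- unname(model$means[x]) - model$grand
  est[is.na(est)] <- 0
  est
}

# Naive nesting: the submodel is fit and applied on the same rows
naive <- apply_impact(fit_impact(d$x, d$y), d$x)

# Simulated out-of-sample: each row is encoded by a submodel
# that was fit without that row's fold
k <- 5
fold <- sample(rep(seq_len(k), length.out = n))
cross <- numeric(n)
for (i in seq_len(k)) {
  held <- fold == i
  m <- fit_impact(d$x[!held], d$y[!held])
  cross[held] <- apply_impact(m, d$x[held])
}

# Compare how "predictive" each encoding looks on the training data
print(cor(naive, d$y))
print(cor(cross, d$y))
```

Because y is noise, we would expect the naive encoding to show a clearly positive correlation with y (it has memorized each level's training rows), while the cross-fitted encoding should sit much closer to zero, which is the honest answer for this data.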

John Mount and I will also be giving a workshop called A Unified View of Model Evaluation at ODSC West 2016 on November 4 (the premium workshop sessions), and November 5 (the general workshop sessions).

We will present a unified framework for predictive model construction and evaluation. Using this perspective we will work through crucial issues from classical statistical methodology, large data treatment, variable selection, ensemble methods, and all the way through stacking/super-learning. We will present R code demonstrating principled techniques for preparing data, scoring models, estimating model reliability, and producing decisive visualizations. In this workshop we will share example data, methods, graphics, and code.

I’m looking forward to these talks, and to seeing those of you who can attend.