Announcing Practical Data Science with R, 2nd Edition

I’ve told a few people privately, but now I can announce it publicly: we are working on the second edition of Practical Data Science with R!

Practical Data Science with R, 2nd edition

Manning Publications has just launched the MEAP for the second edition. The MEAP (Manning Early Access Program) allows you to subscribe to drafts of chapters as they become available, and give us feedback before the book goes into print. Currently, drafts of the first three chapters are available.

If you’ve been contemplating buying the first edition, and haven’t yet, don’t worry. If you subscribe to the MEAP for the second edition, an eBook copy of the previous edition, Practical Data Science with R (First Edition), is included at no additional cost.

In addition to the topics that we covered in the first edition, we plan to add: additional material on using the vtreat package for data preparation; a discussion of LIME for model explanation; and sections on modeling techniques that we didn’t cover in the first edition, such as gradient boosting, regularized regression, and auto-encoders.

Please subscribe to our book; your support now will help us improve it. Please also forward this offer to your friends and colleagues (and please ask them to also subscribe and forward).

Manning is sharing a 50% off promotion code active until August 23, 2018: mlzumel3.

A Trunkful of Win-Vector R Packages


If you follow the Win-Vector blog, you know that we have developed a number of R packages that encapsulate our data science working process and philosophy. The biggest package, of course, is our data preparation package, vtreat, which implements many of the data treatment principles that I describe in my white-paper, here.
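As a quick illustration of the workflow (the data frame and column names below are hypothetical, made up for this sketch), a typical vtreat session designs a treatment plan on training data, then uses it to prepare a clean, all-numeric frame for modeling:

```r
library(vtreat)

# Hypothetical training frame: two messy inputs and a numeric outcome.
dTrain <- data.frame(
  x1 = c(1, 2, NA, 4, 5),
  x2 = c("a", "b", "a", NA, "b"),
  y  = c(1.1, 2.2, 1.9, 4.3, 5.0)
)

# Design a treatment plan for a numeric outcome...
treatplan <- designTreatmentsN(dTrain,
                               varlist = c("x1", "x2"),
                               outcomename = "y")

# ...then prepare a treated frame: no NAs, no novel levels,
# everything numeric and safe to hand to a modeling function.
dTrainTreated <- prepare(treatplan, dTrain)
```

The same `prepare()` call is then applied to test or future data, so the treatment learned on training data is reused consistently.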

New Win-Vector Package replyr: for easier dplyr

Using dplyr with a specific data frame, where all the columns are known, is an effective and pleasant way to execute declarative (SQL-like) operations on dataframes and dataframe-like objects in R. It also has the advantage of working not only on local data, but also on dplyr-supported remote data stores, like SQL databases or Spark.

However, once we no longer know the column names, the pleasure quickly fades. The currently recommended way to handle dplyr's non-standard evaluation is via the lazyeval package. This is not pretty. I never want to write anything like the following, ever again.

library(dplyr)
library(lazyeval)

# target is a moving target, so to speak
target <- "column_I_want"

# return all the rows where the target column is NA
dframe %>%
  filter_(interp(~ is.na(col), col = as.name(target)))

This example is fairly simple, but the more complex the dplyr expression, and the more columns involved, the more unwieldy the lazyeval solution becomes.

The difficulty of parameterizing dplyr expressions is part of the motivation for Win-Vector’s new package, replyr. I’ve just posted an article to the Win-Vector blog, on the function replyr::let, which lets us parametrize dplyr expressions without lazyeval.
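Roughly, the same query written with replyr::let looks something like the sketch below (reusing the `target` and `dframe` names from the lazyeval example above; the small data frame is made up for illustration):

```r
library(dplyr)
library(replyr)

target <- "column_I_want"
dframe <- data.frame(column_I_want = c(1, NA, 3))

# let() maps the placeholder COL to the actual column name before
# the dplyr pipeline is evaluated, so the pipeline itself reads
# exactly like ordinary, non-parameterized dplyr code.
replyr::let(
  alias = list(COL = target),
  dframe %>% filter(is.na(COL))
)
```

The dplyr expression inside `let()` stays readable no matter how many columns are parameterized; only the alias list grows.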


Principal Components Regression: A Three-Part Series and Upcoming Talk

Well, since the last time I posted here, the Y-Aware PCR series has grown to three parts! I’m pleased with how it came out. The three parts are as follows:

  • Part 1: A review of standard “x-only” PCR, with a worked example. I also show some issues that can arise with the standard approach.
  • Part 2: An introduction to y-aware scaling to guide PCA in identifying principal components most relevant to the outcome of interest. Y-aware PCA helps alleviate the issues that came up in Part 1.
  • Part 3: How to pick the appropriate number of principal components.
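The core of the y-aware idea can be sketched in a few lines (a minimal illustration of my own, not the code from the series): rescale each x variable by the slope of a one-variable regression of y on it, so that a unit change in any scaled variable corresponds to the same expected change in the outcome, before running PCA.

```r
# Minimal y-aware scaling sketch (illustrative only).
# Each column of X is mean-centered and multiplied by the slope of
# lm(y ~ x), so variables irrelevant to y are shrunk toward zero
# before PCA ever sees them.
y_aware_scale <- function(X, y) {
  scaled <- lapply(X, function(x) {
    slope <- coef(lm(y ~ x))[2]
    (x - mean(x)) * slope
  })
  as.data.frame(scaled)
}
```

PCA on the result then favors directions that matter for predicting y, rather than directions that merely have large variance in x.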


I will also be giving a short talk on y-aware principal components analysis in R at the August Bay Area useR Group meetup on August 9, along with talks by consultant Allan Miller and Jocelyn Barker from Microsoft. It promises to be an interesting evening.

The meetup will be at Guardant Health in Redwood City. Hope to see you there.

Principal Components Regression: A Two-Part Series


I’m kicking off a two-part series on Principal Components Regression on the Win-Vector blog today. The first article demonstrates some of the pitfalls of using standard Principal Components Analysis in a predictive modeling context. John Mount has posted an introduction to my first article on the Revolutions blog, explaining our motivation in developing this series.

The second article will demonstrate some y-aware approaches that alleviate the issues that we point out in Part 1.

In principal components regression (PCR), we use principal components analysis (PCA) to decompose the independent (x) variables into an orthogonal basis (the principal components), and select a subset of those components as the variables to predict y. PCR and PCA are useful techniques for dimensionality reduction when modeling, and are especially useful when the independent variables are highly collinear.

Generally, one selects the principal components with the highest variance — that is, the components with the largest singular values — because the subspace defined by these principal components captures most of the variation in the data, and thus represents a smaller space that we believe captures most of the qualities of the data. Note, however, that standard PCA is an “x-only” decomposition, and as Jolliffe (1982) shows through examples from the literature, sometimes lower-variance components can be critical for predicting y, and conversely, high-variance components are sometimes not important.
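For concreteness, here is a minimal sketch of standard x-only PCR in R (synthetic data and variable names of my own, not from the articles):

```r
set.seed(2016)

# Synthetic data: five inputs, two of which are nearly collinear.
X <- as.data.frame(matrix(rnorm(100 * 4), ncol = 4))
X$V5 <- X$V1 + rnorm(100, sd = 0.01)   # V5 is almost a copy of V1
y <- X$V1 - 2 * X$V2 + rnorm(100)

# x-only decomposition: PCA sees only the inputs, never the outcome y.
pca <- prcomp(X, center = TRUE, scale. = TRUE)

# Keep the k highest-variance components, then regress y on them.
k <- 2
pcs <- as.data.frame(pca$x[, 1:k, drop = FALSE])
model <- lm(y ~ ., data = pcs)
```

The decomposition step never consults y, which is exactly the property that Jolliffe's examples, and the issues explored in Part 1, turn on.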

Read more here.

Enjoy.