Principal Components Regression: A Three-Part Series and Upcoming Talk

Well, since the last time I posted here, the Y-Aware PCR series has grown to three parts! I’m pleased with how it came out. The three parts are as follows:

  • Part 1: A review of standard “x-only” PCR, with a worked example. I also show some issues that can arise with the standard approach.
  • Part 2: An introduction to y-aware scaling to guide PCA in identifying principal components most relevant to the outcome of interest. Y-aware PCA helps alleviate the issues that came up in Part 1.
  • Part 3: How to pick the appropriate number of principal components.


I will also be giving a short talk on y-aware principal components analysis in R at the August Bay Area useR Group meetup on August 9, along with talks by consultant Allan Miller and Jocelyn Barker from Microsoft. It promises to be an interesting evening.

The meetup will be at Guardant Health in Redwood City. Hope to see you there.

Principal Components Regression: A Two-Part Series


I’m kicking off a two-part series on Principal Components Regression on the Win-Vector blog today. The first article demonstrates some of the pitfalls of using standard Principal Components Analysis in a predictive modeling context. John Mount has posted an introduction to my first article on the Revolutions blog, explaining our motivation in developing this series.

The second article will demonstrate some y-aware approaches that alleviate the issues we point out in Part 1.

In principal components regression (PCR), we use principal components analysis (PCA) to decompose the independent (x) variables into an orthogonal basis (the principal components), and select a subset of those components as the variables with which to predict y. PCR and PCA are useful techniques for dimensionality reduction when modeling, and are especially useful when the independent variables are highly collinear.

Generally, one selects the principal components with the highest variance — that is, the components with the largest singular values — because the subspace defined by these principal components captures most of the variation in the data, and thus represents a smaller space that we believe captures most of the qualities of the data. Note, however, that standard PCA is an “x-only” decomposition, and as Jolliffe (1982) shows through examples from the literature, sometimes lower-variance components can be critical for predicting y, and conversely, high variance components are sometimes not important.
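The x-only procedure described above can be sketched in a few lines. This is a minimal illustration (not code from the Win-Vector series, which is in R): it builds synthetic collinear x-variables, decomposes the centered x-matrix via SVD, keeps the highest-variance components, and regresses y on their scores. All variable names and the choice of k are assumptions for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: two latent signals, each observed twice (so the
# four x-variables are highly collinear), plus a linear outcome y.
n = 200
z = rng.normal(size=(n, 2))
X = np.column_stack([
    z[:, 0], z[:, 0] + 0.1 * rng.normal(size=n),
    z[:, 1], z[:, 1] + 0.1 * rng.normal(size=n),
])
y = X @ np.array([1.0, 0.5, -1.0, 0.25]) + rng.normal(size=n)

# Standard "x-only" PCA: center the x-matrix, then take its SVD.
# The rows of Vt are the principal components; the singular values s
# come out in decreasing order, so the leading components have the
# highest variance.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Keep the k highest-variance components and project x onto them.
k = 2
scores = Xc @ Vt[:k].T

# Regress y on the component scores (ordinary least squares,
# with an intercept column).
design = np.column_stack([np.ones(n), scores])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
```

Note that the selection step here is purely variance-based: y plays no role in choosing which components to keep, which is exactly the weakness the series explores.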

Read more here.

Enjoy.

Upcoming Appearances

We have two public appearances coming up in the next few weeks:

Workshop at ODSC, San Francisco – November 14

John and I will be giving a two-hour workshop called Preparing Data for Analysis using R: Basic through Advanced Techniques. We will cover key issues in this important but often neglected aspect of data science, what can go wrong, and how to fix it. This is part of the Open Data Science Conference (ODSC) at the Marriott Waterfront in Burlingame, California, November 14-15. If you are attending this conference, we look forward to seeing you there!

You can find an abstract for the workshop, along with links to software and code you can download ahead of time, here.

An Introduction to Differential Privacy as Applied to Machine Learning: Women in ML/DS – December 2

I will give a talk to the Bay Area Women in Machine Learning & Data Science Meetup group, on applying differential privacy for reusable hold-out sets in machine learning. The talk will also cover the use of differential privacy in effects coding (what we’ve been calling “impact coding”) to reduce the bias that can arise from the use of nested models. Information about the talk, and the meetup group, can be found here.

I’m looking forward to these upcoming appearances, and I hope you can make one or both of them.

Popular Articles on Win-Vector


John has just put up an article on the Win-Vector blog, highlighting some of our popular series of articles, as well as our more popular posts. If you like the articles that I point to on this blog, check out some of the other posts written by John, too.

As readers have surely noticed, the Win-Vector LLC blog isn’t a stream of short notes, but rather a collection of long technical articles. It is the only way we can properly treat topics of consequence.

What not everybody may have noticed is that a number of these articles are organized into series for deeper comprehension.

Check out the original article for details about these series, and for a pointer to our page of popular posts.

We’ve also updated the company website, so please do visit that, too.

New on Win-Vector: A Simpler Explanation of Differential Privacy


I have a new article up on Win-Vector, discussing differential privacy and the recent results on applying it to enable reuse of holdout data in machine learning.

Differential privacy was originally developed to facilitate secure analysis over sensitive data, with mixed success. It’s back in the news again now, with exciting results from Cynthia Dwork et al. (see references at the end of the article) that apply results from differential privacy to machine learning.

In this article we’ll work through the definition of differential privacy and demonstrate how Dwork et al.’s recent results can be used to improve the model fitting process.
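For readers who want the formal statement up front, the standard definition is as follows. A randomized algorithm $A$ is $\varepsilon$-differentially private if, for all pairs of datasets $D, D'$ differing in a single record, and for all sets of outcomes $S$:

```latex
\Pr[A(D) \in S] \;\le\; e^{\varepsilon}\,\Pr[A(D') \in S]
```

Informally: no single individual's data can change the distribution of the algorithm's output by more than a factor of $e^{\varepsilon}$, which is what makes differentially private queries safe to reuse against the same holdout set.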

Read the article here.