Principal Components Regression: A Three-Part Series and Upcoming Talk

Well, since the last time I posted here, the Y-Aware PCR series has grown to three parts! I’m pleased with how it came out. The three parts are as follows:

  • Part 1: A review of standard “x-only” PCR, with a worked example. I also show some issues that can arise with the standard approach.
  • Part 2: An introduction to y-aware scaling to guide PCA in identifying principal components most relevant to the outcome of interest. Y-aware PCA helps alleviate the issues that came up in Part 1.
  • Part 3: How to pick the appropriate number of principal components.


I will also be giving a short talk on y-aware principal components analysis in R at the August Bay Area useR Group meetup on August 9, along with talks by consultant Allan Miller and Jocelyn Barker from Microsoft. It promises to be an interesting evening.

The meetup will be at Guardant Health in Redwood City. Hope to see you there.

Principal Components Regression: A Two-Part Series


I’m kicking off a two-part series on Principal Components Regression on the Win-Vector blog today. The first article demonstrates some of the pitfalls of using standard Principal Components Analysis in a predictive modeling context. John Mount has posted an introduction to my first article on the Revolutions blog, explaining our motivation in developing this series.

The second article will demonstrate some y-aware approaches that alleviate the issues that we point out in Part 1.

In principal components regression (PCR), we use principal components analysis (PCA) to decompose the independent (x) variables into an orthogonal basis (the principal components), and select a subset of those components as the variables to predict y. PCR and PCA are useful techniques for dimensionality reduction when modeling, and are especially useful when the independent variables are highly collinear.

Generally, one selects the principal components with the highest variance (that is, the components with the largest singular values) because the subspace defined by these principal components captures most of the variation in the data, and thus represents a smaller space that we believe captures most of the qualities of the data. Note, however, that standard PCA is an “x-only” decomposition, and as Jolliffe (1982) shows through examples from the literature, sometimes lower-variance components can be critical for predicting y, and conversely, high-variance components are sometimes not important.
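
To make the point about low-variance components concrete, here is a minimal toy sketch. It is written in Python with only the standard library rather than R, so that it runs as-is. It is not taken from the articles: the data are made up, and the names (s, d, unit_eigenvector, and so on) are illustrative assumptions. Two x variables share a large common signal, and y depends only on their small difference.

```python
import math

def mean(v):
    return sum(v) / len(v)

def cov(u, v):
    # Sample covariance (n - 1 denominator).
    mu, mv = mean(u), mean(v)
    return sum((p - mu) * (q - mv) for p, q in zip(u, v)) / (len(u) - 1)

def corr(u, v):
    return cov(u, v) / math.sqrt(cov(u, u) * cov(v, v))

# Made-up data: x1 and x2 share a large common signal s and differ only by
# a small term d. The outcome y depends only on that small difference.
s = [i - 3.5 for i in range(8)]
d = [0.1 * k for k in (1, -1, -1, 1, 1, -1, -1, 1)]
x1 = [p + q for p, q in zip(s, d)]
x2 = [p - q for p, q in zip(s, d)]
y = [p - q for p, q in zip(x1, x2)]

# PCA on two variables: closed-form eigen-decomposition of the 2x2
# covariance matrix [[a, b], [b, c]].
a, b, c = cov(x1, x1), cov(x1, x2), cov(x2, x2)
half_gap = math.sqrt(((a - c) / 2) ** 2 + b ** 2)
variances = [(a + c) / 2 + half_gap, (a + c) / 2 - half_gap]  # descending

def unit_eigenvector(lam):
    # (b, lam - a) is an eigenvector for eigenvalue lam; valid since b != 0 here.
    vx, vy = b, lam - a
    norm = math.hypot(vx, vy)
    return vx / norm, vy / norm

# Project the centered data onto each principal component.
m1, m2 = mean(x1), mean(x2)
scores = []
for lam in variances:
    vx, vy = unit_eigenvector(lam)
    scores.append([(p - m1) * vx + (q - m2) * vy for p, q in zip(x1, x2)])

share = [v / sum(variances) for v in variances]
corr_with_y = [corr(sc, y) for sc in scores]
for k in range(2):
    print(f"PC{k + 1}: variance share {share[k]:.4f}, "
          f"|corr with y| {abs(corr_with_y[k]):.3f}")
```

In this deliberately rigged example, the first component carries more than 99% of the variance in x yet is essentially uncorrelated with y, while the second, low-variance component predicts y almost perfectly. Keeping only the top component by variance would discard all of the useful signal.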

Read more here.

Enjoy.

John Oliver on Scientific Studies

An excellent rant from John Oliver on the way science stories are handled in the media, on the need for some healthy skepticism, and on the need to track down sources for the studies yourself, to the extent that this is possible.

Also, I love the “TODD Talks” skit at the end.

Upcoming Webinar: Data Preparation with R

I’m happy to announce my upcoming webinar, sponsored by Microsoft Data Science:

Data Preparation with R
Thursday, March 17, 2016 10:00 A.M. – 11:00 A.M. (Pacific time)

Data quality is the single most important item to the success of your data science project. Preparing data for analysis is one of the most important, laborious, and yet neglected aspects of data science. Many of the routine steps can be automated in a principled manner. This webinar will lay out the statistical fundamentals of preparing data. Our speaker, Nina Zumel, principal consultant and co-founder of Win-Vector, LLC, will cover what goes wrong with data and how you can detect the problems and fix them.

Details and registration here. I’m looking forward to it!

Starting Strong in 2016

[Image: “We Can Do It!” poster]

We had a busy January here at Win-Vector, and it shows no sign of abating. John and I had the pleasure of attending the first Shiny Developers Conference, held by RStudio and hosted at Stanford University (see here for a review of the conference, by a fellow attendee). The event energized us to resharpen our Shiny skills, and I’ve put together a little gallery of the Shiny apps that we’ve developed and featured on the Win-Vector blog. It’s a small gallery at the moment, but I expect it will grow.

In addition, I gave a repeat presentation of the Differential Privacy talk that I gave to the Bay Area Women in Data Science and Machine Learning Meetup last December, and am gearing up for a planned webinar on Prepping Data for Analysis in R (the webinar has not yet been announced by the hosts — more details soon).

And I’ve managed to slip in a couple of Win-Vector blog posts, too:

Using PostgreSQL in R: A quick how-to

Finding the K in K-means by Parametric Bootstrap (with Shiny app!)

We are also looking forward to giving a presentation at the ODSC San Francisco Meetup on March 31, and participating in the R Day all-day tutorial at Strata/Hadoop World Santa Clara on March 29.

2016 is shaping up to be a good year.


Image: World War II era poster by J. Howard Miller. Source: Wikipedia