New on Win-Vector: A Simpler Explanation of Differential Privacy


I have a new article up on Win-Vector, discussing differential privacy and the recent results on applying differential privacy to enable reuse of holdout data in machine learning.

Differential privacy was originally developed to facilitate secure analysis over sensitive data, with mixed success. It’s back in the news now, with exciting results from Cynthia Dwork et al. (see references at the end of the article) that apply ideas from differential privacy to machine learning.

In this article we’ll work through the definition of differential privacy and demonstrate how Dwork et al.’s recent results can be used to improve the model fitting process.

Read the article here.

What is Verification by Multiplicity?


There’s been a buzz the last few days about the 715 new planets that NASA has verified using data from the Kepler Space Telescope. This discovery doubles the number of known planets and includes four new planets that could possibly support life.

Beyond the sheer joy of the discovery, one of the interesting aspects of this announcement is the statistical technique that NASA scientists used to winnow out so many planets from the data in bulk: verification by multiplicity. Using this technique, scientists can verify the presence of suspected planets around a star sooner, without having to wait for additional measurements and observations.

I got curious: what is verification by multiplicity? I’m no astronomer, but it’s not too difficult to grasp the basic statistical reasoning behind the method, as described in Lissauer et al., “Almost All of Kepler’s Multiple Planet Candidates Are Planets,” to be published in The Astrophysical Journal on March 10 (a preprint is available at arxiv.org). My discussion isn’t exactly what the researchers did; I stick to a simple case and avoid the actual astrophysics, but it gets the idea across. I’ll use R to work the example, but you should be able to follow the discussion even if you’re not familiar with that programming language.

The need for statistical verification

From what I understand of the introduction to the paper, there are two ways to determine whether or not a planet candidate is really a planet. The first is to confirm it with additional measurements, either of the target star’s gravitational wobble or of the transit times of the apparent planets across the face of the star; getting sufficient measurements can take time. The other way is to “validate” the planet by showing that it’s highly unlikely that the sighting was a false positive. Specifically, the probability that the observed signal was caused by a planet must be at least 100 times larger than the probability that the signal is a false positive. The validation analysis is a Bayesian approach: it considers the various mechanisms that produce false positives, determines the probability that each of these mechanisms could have produced the signal in question, and compares those probabilities to the probability that a planet produced the signal.
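To make that “100 times” criterion concrete, here’s a toy version of the odds comparison in R. The priors and likelihoods below are invented placeholders chosen only to show the shape of the calculation; they are not values from the paper.

```r
# Toy Bayesian "validation" of a single planet candidate.
# This is an illustrative sketch, not the paper's actual computation;
# every number below is an invented placeholder.

prior_planet <- 0.4              # assumed prior probability that a candidate is a real planet
prior_fp     <- 1 - prior_planet # prior probability that it is a false positive

lik_planet <- 0.9    # assumed probability of seeing this signal if a planet produced it
lik_fp     <- 0.005  # assumed probability of the signal under the false-positive mechanisms combined

# Posterior odds in favor of the planet hypothesis
posterior_odds <- (prior_planet * lik_planet) / (prior_fp * lik_fp)

# The criterion described above: the planet explanation must be at least
# 100 times more probable than the false-positive explanation.
posterior_odds           # 120 with these made-up numbers
posterior_odds >= 100    # TRUE: this toy candidate would be "validated"
```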

The basic idea behind verification by multiplicity is that planets are often clustered in multi-planet star systems, while false positive measurements (mistaken identification of potential planets) occur randomly. Putting this another way: if false positives are random, then they won’t tend to occur together near the same star. So if you observe a star with multiple “planet signals,” it’s unlikely that all the signals are false positives. We can use that observation to quantify how much more likely it is that a star with multiple candidates actually hosts a planet. The resulting probability can be used as an improved prior for the planet model when doing the statistical validation described above.
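Here’s a rough back-of-the-envelope sketch of why random false positives rarely pile up on the same star. The star count, candidate count, and false-positive rate are round numbers I’ve assumed purely for illustration, and the independence assumption is a simplification; the point is only how quickly the probability of multiple coincident false positives falls off.

```r
# Back-of-the-envelope illustration of the multiplicity argument.
# A simplification for intuition, not the analysis in Lissauer et al.;
# the counts below are assumed round numbers, not figures from the paper.

n_stars      <- 150000   # assumed number of Kepler target stars
n_candidates <- 3600     # assumed total number of planet candidates
fp_rate      <- 0.10     # assumed fraction of candidates that are false positives

n_fp   <- n_candidates * fp_rate  # expected number of false positives overall
lambda <- n_fp / n_stars          # expected false positives per star, if they land at random

# If false positives scatter independently (roughly Poisson per star):
p_one_fp <- dpois(1, lambda)  # probability a given star shows exactly one spurious signal
p_two_fp <- dpois(2, lambda)  # probability a given star shows two spurious signals

p_one_fp  # about 2.4e-3
p_two_fp  # about 2.9e-6 -- hundreds of times less likely

# So a star showing two or more candidate signals is very unlikely to be
# explained entirely by random false positives, which is what justifies
# assigning a higher prior probability of "planet" to candidates in
# multi-candidate systems.
```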
