A Trunkful of Win-Vector R Packages


If you follow the Win-Vector blog, you know that we have developed a number of R packages that encapsulate our data science working process and philosophy. The biggest package, of course, is our data preparation package, vtreat, which implements many of the data treatment principles that I describe in my white-paper.

John Oliver on Scientific Studies

An excellent rant from John Oliver on the way science stories are handled in the media, on the need for some healthy skepticism, and on the need to track down the sources of the studies yourself, to the extent that this is possible.

Also, I love the “TODD Talks” skit at the end.

Design, Problem Solving, and Good Taste


Image: A Case for Spaceships (Jure Triglav)

I ran across this essay recently on the role of design standards for scientific data visualization. The author, Jure Triglav, draws his inspiration from the creation and continued use of the NYCTA Graphics Standards, which were instituted in the late 1960s to unify the signage for the New York City subway system. As the author puts it, the Graphics Standards Manual is “a timeless example of great design elegantly solving a real problem.” Thanks to the unified iconography, a traveler on the New York subway knows exactly what to look for to navigate the subway system, no matter which station they may be in. And the iconography is beautiful, too.


Unimark, the design firm that created the Graphics Standards.
Aren’t they a hip, mod-looking group? And I’m jealous of those lab coats.
Image: A Case for Spaceships (Jure Triglav)

What works to clarify subway travel will also work to clarify the morass of graphs and charts that pass for scientific visualization, Triglav argues. And we should start with the work of the Joint Committee on Standards for Graphical Presentation, a group of statisticians, engineers, scientists, and mathematicians who first published a set of standards in 1914 and revised them in 1936, 1938, and 1960.

I agree with him — mostly.


Kepler’s Planetary Bonanza on SciShow

I meant to post this earlier, but it slipped down the list. If you haven’t seen it yet, SciShow did a segment earlier this month on the bonanza of planets discovered via the Kepler Space Telescope. My article on verification by multiplicity got a special shout-out in the credits on the YouTube page (see the about tab)!

What is Verification by Multiplicity?


There’s been a buzz the last few days about the 715 new planets that NASA has verified, using data from the Kepler Space Telescope. This discovery doubles the number of known planets and turns up four new planets that could possibly support life.

Beyond the sheer joy of the discovery, one of the interesting aspects of this announcement is the statistical technique that NASA scientists used to winnow out so many planets from the data in bulk: verification by multiplicity. Using this technique, scientists can verify the presence of suspected planets around a star sooner, without having to wait for additional measurements and observations.

I got curious: what is verification by multiplicity? I’m no astronomer, but it’s not too difficult to grasp the basic statistical reasoning behind the method, as described in Lissauer et al., “Almost All of Kepler’s Multiple Planet Candidates Are Planets,” to be published in The Astrophysical Journal on March 10 (a preprint is available at arxiv.org). My discussion isn’t exactly what the researchers did; I stick to a simple case and avoid the actual astrophysics, but it gets the idea across. I’ll use R to work the example, but you should be able to follow the discussion even if you’re not familiar with that programming language.

The need for statistical verification

From what I understand of the introduction to the paper, there are two ways to determine whether a planet candidate is really a planet. The first is to confirm the finding with additional measurements of the target star’s gravitational wobble, or with measurements of the transit times of the apparent planets across the face of the star; getting sufficient measurements can take time. The other way is to “validate” the planet by showing that it’s highly unlikely that the sighting was a false positive. Specifically, the probability that the signal observed was caused by a planet should be at least 100 times larger than the probability that the signal is a false positive. The validation analysis is a Bayesian approach that considers the various mechanisms that produce false positives, determines the probability that each mechanism could have produced the signal in question, and compares those probabilities to the probability that a planet produced the signal.
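As a toy illustration of that 100-to-1 criterion, the comparison is just a posterior-odds calculation by Bayes’ rule. All of the numbers below are invented for illustration; they are not values from the paper.

```r
# Toy validation check: compare P(planet | signal) to P(false positive | signal).
# All numbers here are assumptions for illustration, not values from the paper.

prior_planet <- 0.5   # assumed prior probability that a candidate is a planet
lik_planet   <- 0.9   # assumed probability of the signal, given a planet
lik_fp       <- 0.3   # assumed probability of the signal, given a false positive

# posterior odds of "planet" versus "false positive", by Bayes' rule
odds <- (lik_planet * prior_planet) / (lik_fp * (1 - prior_planet))

odds          # 3
odds >= 100   # FALSE: this candidate would not be validated
```

With these made-up numbers the candidate falls far short of the 100-to-1 bar; raising the prior probability that the candidate is a planet is exactly what multiplicity, described below, buys you.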

The basic idea behind verification by multiplicity is that planets are often clustered in multi-planet star systems, while false positive measurements (mistaken identification of potential planets) occur randomly. Putting this another way: if false positives are random, then they won’t tend to occur together near the same star. So if you observe a star with multiple “planet signals,” it’s unlikely that all the signals are false positives. We can use that observation to quantify how much more likely it is that a star with multiple candidates actually hosts a planet. The resulting probability can be used as an improved prior for the planet model when doing the statistical validation described above.
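To see why the clustering argument works, here’s a small simulation in R. The counts below (number of target stars, number of false positives) are assumptions chosen for illustration, not Kepler’s actual numbers. If false-positive signals land on stars at random, almost none of them pair up on the same star:

```r
# If false positives occur at random, how often do two of them hit the same star?
# The counts below are assumptions for illustration, not Kepler's actual numbers.

set.seed(2014)

n_stars <- 160000   # assumed number of target stars in the survey
n_fp    <- 1000     # assumed number of false-positive "planet signals"

# scatter the false positives uniformly at random over the stars
hits <- sample.int(n_stars, n_fp, replace = TRUE)

# how many stars end up with two or more false positives?
n_multi <- sum(table(hits) >= 2)

# the expected count is small: roughly n_fp^2 / (2 * n_stars), about 3 stars here
n_multi
```

So even with a thousand random false positives scattered over the survey, only a handful of stars would show two spurious signals at once. A star with multiple candidate signals is therefore very likely to host at least one real planet, and that is what justifies the improved prior.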
