New on Win-Vector: A Simpler Explanation of Differential Privacy


I have a new article up on Win-Vector, discussing differential privacy and the recent results on applying it to enable the reuse of holdout data in machine learning.

Differential privacy was originally developed to facilitate secure analysis over sensitive data, with mixed success. It’s back in the news again now, with exciting results from Cynthia Dwork et al. (see references at the end of the article) that apply ideas from differential privacy to machine learning.

In this article we’ll work through the definition of differential privacy and demonstrate how Dwork’s recent results can be used to improve the model fitting process.
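As a rough illustration of the idea (my own sketch, not code from the article): the classic Laplace mechanism achieves ε-differential privacy for a numeric query by adding noise scaled to the query’s sensitivity, i.e. the most the query’s answer can change when one individual’s record is added or removed:

```python
import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon, rng=None):
    """Return true_value plus Laplace noise with scale sensitivity/epsilon.

    Releasing the noisy value satisfies epsilon-differential privacy
    for any query whose answer changes by at most `sensitivity` when
    one individual's record is added or removed.
    """
    rng = rng if rng is not None else np.random.default_rng()
    scale = sensitivity / epsilon
    return true_value + rng.laplace(loc=0.0, scale=scale)

# Example: privately release a count (sensitivity 1) with epsilon = 0.5
noisy_count = laplace_mechanism(true_value=1000, sensitivity=1, epsilon=0.5)
```

Smaller ε means stronger privacy, at the cost of noisier answers.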

Read the article here.

A Couple Recent Win-Vector Posts

I’ve been neglecting to announce my Win-Vector posts here — but I’ve not stopped writing them. Here are the two most recent:

Wanted: A Perfect Scatterplot (with Marginals)

In which I explore how to make what Matlab calls a “scatterhist”: a scatterplot with marginal distribution plots on the sides. My version optionally adds the best linear fit to the scatterplot:


I also show how to do it with ggMarginal(), from the ggExtra package.
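The post itself works in R; purely as an illustration of the same kind of figure (scatterplot, best linear fit, marginal histograms), here is a hypothetical Python/matplotlib sketch on synthetic data:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, suitable for scripting
import matplotlib.pyplot as plt

# Synthetic correlated data for the example
rng = np.random.default_rng(42)
x = rng.normal(size=200)
y = 1.5 * x + rng.normal(scale=0.8, size=200)

# 2x2 grid: big scatter panel, thin marginal panels on top and right
fig = plt.figure(figsize=(6, 6))
gs = fig.add_gridspec(2, 2, width_ratios=(4, 1), height_ratios=(1, 4),
                      wspace=0.05, hspace=0.05)
ax = fig.add_subplot(gs[1, 0])
ax_top = fig.add_subplot(gs[0, 0], sharex=ax)
ax_right = fig.add_subplot(gs[1, 1], sharey=ax)

ax.scatter(x, y, alpha=0.5)
slope, intercept = np.polyfit(x, y, 1)   # best linear fit, as in the post
xs = np.linspace(x.min(), x.max(), 100)
ax.plot(xs, slope * xs + intercept, color="red")

ax_top.hist(x, bins=30)                              # marginal of x
ax_right.hist(y, bins=30, orientation="horizontal")  # marginal of y
ax_top.tick_params(labelbottom=False)
ax_right.tick_params(labelleft=False)

fig.savefig("scatterhist.png")
```

The gridspec ratios and 30-bin histograms are arbitrary choices for the sketch.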

Working with Sessionized Data 1: Evaluating Hazard Models

This is the start of a mini-series of posts, discussing the analysis of sessionized log data.


Log data is a very thin data form, where different facts about different individuals are spread across many rows. Converting log data into a form ready for analysis is called sessionizing. We are going to share a short series of articles showing important aspects of sessionizing and modeling log data. Each article will touch on one aspect of the problem in a simplified and idealized setting. In this article we discuss the importance of dealing with time, and of picking a business-appropriate goal when evaluating predictive models.
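As a rough illustration of sessionizing (my own sketch, not code from the series), here is a minimal pandas example that splits each user’s event log into sessions wherever more than 30 minutes pass between consecutive events; the 30-minute cutoff is an arbitrary choice for the example:

```python
import pandas as pd

# Toy event log: one row per (user, timestamp) event
log = pd.DataFrame({
    "user": ["a", "a", "a", "b", "b"],
    "time": pd.to_datetime([
        "2015-07-01 10:00", "2015-07-01 10:05", "2015-07-01 12:00",
        "2015-07-01 09:00", "2015-07-01 09:10",
    ]),
})

log = log.sort_values(["user", "time"])
gap = log.groupby("user")["time"].diff()   # time since the user's previous event
# A session starts at a user's first event, or after a gap over 30 minutes
new_session = gap.isna() | (gap > pd.Timedelta(minutes=30))
log["session_id"] = new_session.cumsum()   # running session counter

print(log)
```

User “a” above ends up with two sessions (the two-hour gap splits them), and user “b” with one.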

Click on the links to read.


Balancing Classes Before Training Classifiers – Addressing a Folk Theorem

We’ve been wanting to get more into training over at Win-Vector, but I don’t want to completely give up client work, because clients and their problems are often the inspiration for cool solutions — and good blog articles. Working on the video course for the last couple of months has given me some good ideas, too.

A lot of my recreational writing revolves around folklore and superstition — the ghosty, monster-laden kind. Engineers and statisticians have their own folk beliefs, too: things we wish were true, totemistic practices we believe help. Sometimes there’s a rational basis for those beliefs; sometimes there isn’t. My latest Win-Vector blog post is about one such folk theorem.

It’s a folk theorem I sometimes hear from colleagues and clients: that you must balance the class prevalence before training a classifier. Certainly, I believe that classification tends to be easier when the classes are nearly balanced, especially when the class you are actually interested in is the rarer one. But I have always been skeptical of the claim that artificially balancing the classes (through resampling, for instance) always helps, when the model is to be run on a population with the native class prevalences.

For some problems, with some classifiers, it does help — but for others, it doesn’t. I’ve already gotten a great, thoughtful comment on the post that helps articulate possible reasons behind my results. It’s good for us to introspect sometimes about our techniques and practices, rather than just blindly asserting that “this is how we do it.” Because even when we’re right, sometimes we’re right for the wrong reasons, which to me is worse than simply being wrong.
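To make “artificially balancing the classes” concrete, here is a minimal sketch (my own, not from the post) of the common approach of oversampling the rare class with replacement. Note that a model trained on data balanced this way produces scores calibrated to the balanced prevalence, not the native one, which is part of why the practice can mislead:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy imbalanced training labels: roughly 5% positive class
n = 2000
y = (rng.random(n) < 0.05).astype(int)
pos = np.where(y == 1)[0]
neg = np.where(y == 0)[0]

# "Balance" by oversampling the positive class with replacement
# until there are as many positive rows as negative rows
pos_upsampled = rng.choice(pos, size=len(neg), replace=True)
balanced_idx = np.concatenate([neg, pos_upsampled])

native_prevalence = y.mean()                   # about 0.05
balanced_prevalence = y[balanced_idx].mean()   # exactly 0.5 by construction
```

The index array `balanced_idx` would then be used to select the training rows (features and labels) fed to the classifier.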

Read the post here.

New Data Science Video Course


John Mount and I are proud to announce our new data science video course, Introduction to Data Science, now available through Udemy!

The course is 28 lectures, totaling over five hours. We cover the use of common predictive modeling algorithms in R, including linear and logistic regression, random forest, and gradient boosting. We also show how to validate and evaluate the models that you’ve fit. In addition, we discuss data treatment, especially how to deal with missing values, and data visualization.

To celebrate the launch, we have a limited-time half-off offer, available through this link.

Two New Articles

Two new articles: one on the Win-Vector blog, and a guest post on the Fliptop blog:

Random Test/Train Split is not Always Enough discusses the potential limitations of a randomized test/train split when your training data and future data are not truly exchangeable, due to time-dependent effects, serial correlation, concept changes, or data grouping.
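As a small illustration of the point (my own sketch, not from the article), compare a random split with a time-ordered split on time-stamped data; only the latter guarantees that no “future” rows leak into training, which better mimics how the model will actually be used:

```python
import numpy as np
import pandas as pd

# Toy time-stamped dataset: one row per day
df = pd.DataFrame({
    "t": pd.date_range("2015-01-01", periods=100, freq="D"),
    "x": np.arange(100, dtype=float),
})

# Random 80/20 split: ignores time, so "future" rows can land in training
shuffled = df.sample(frac=1.0, random_state=0)
train_rand, test_rand = shuffled.iloc[:80], shuffled.iloc[80:]

# Time-ordered 80/20 split: training strictly precedes test
df_sorted = df.sort_values("t")
train_time, test_time = df_sorted.iloc[:80], df_sorted.iloc[80:]

# Every test timestamp comes after every training timestamp
assert train_time["t"].max() < test_time["t"].min()
```

When effects drift over time, the random split tends to give optimistic performance estimates relative to the time-ordered one.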

Don’t Use Black-Box Testing to Select a Predictive Lead Scoring Vendor is a commissioned piece for one of our clients, and hosted on their blog. This is related to the first post: if you are running an evaluation of a potential vendor’s decision system, then that test should reflect the environment in which the decision system will be deployed. In particular, if your data has any of the non-exchangeability properties discussed above, then your evaluation setup should reflect that.