Categories
Data Science Statistics

New on Win-Vector: A Simpler Explanation of Differential Privacy

I have a new article up on Win-Vector, discussing differential privacy and recent results on applying it to enable reuse of holdout data in machine learning. Differential privacy was originally developed to facilitate secure analysis over sensitive data, with mixed success. It’s back in the news now, with exciting results from […]

Categories
Data Science Statistics Writing

A Couple Recent Win-Vector Posts

I’ve been neglecting to announce my Win-Vector posts here, but I’ve not stopped writing them. Here are the two most recent: “Wanted: A Perfect Scatterplot (with Marginals),” in which I explore how to make what Matlab calls a “scatterhist”: a scatterplot with marginal distribution plots on the sides. My version optionally adds the best […]

Categories
Data Science Statistics

Balancing Classes Before Training Classifiers – Addressing a Folk Theorem

We’ve been wanting to get more into training over at Win-Vector, but I don’t want to completely give up client work, because clients and their problems are often the inspiration for cool solutions — and good blog articles. Working on the video course for the last couple of months has given me some good ideas, […]

Categories
Data Science

New Data Science Video Course

John Mount and I are proud to announce our new data science video course, Introduction to Data Science, now available through Udemy! The course comprises 28 lectures, totaling over five hours. We cover the use of common predictive modeling algorithms in R, including linear and logistic regression, random forest, and gradient boosting. We also […]

Categories
Data Science Statistics Writing

Two New Articles

Two new articles: one on the Win-Vector blog, plus a guest post on the Fliptop blog. “Random Test/Train Split is not Always Enough” discusses the potential limitations of a randomized test/train split when your training data and future data are not truly exchangeable, due to time-dependent effects, serial correlation, concept changes, or data grouping. Don’t […]