On Persistence and Sincerity

…propaganda, Boris Artzybasheff. Image: James Vaughan, some rights reserved.

We’re in the middle of marketing efforts here at Win-Vector, and I’ve just spent a few hours going through the Win-Vector blog so I could update our Popular Articles page (I have to do that for Multo, someday, too).

As I went through the blog, I had a number of thoughts:

  • Wow, this is a lot of posts.
  • Wow, we write about a lot of topics.
  • Wow, this is some really great stuff!

I can’t take credit for all that. The Win-Vector blog is John’s baby; he started it way back in July of 2007, and as it’s his only blog, it’s his primary mode of expression (Facebook for cooking, Win-Vector for the techy stuff). He writes more of the posts than I do. But the blog has been good for some of my hobby horses, too.[1]

The excuse for the Win-Vector blog is that it’s “marketing” for the company. And it is; we promote ourselves sometimes: our company, our book, our video courses. But mostly it’s here because we wanted a place to talk about what we care about, and a place to share things we thought would help other people.


New on Win-Vector: Checking your Data for Signal


I have a new article up on the Win-Vector Blog, on checking your input variables for signal:

An all too common approach to modeling in data science is to throw all possible variables at a modeling procedure and “let the algorithm sort it out.” This is tempting when you are not sure which variables are the true causes or predictors of the phenomenon you are interested in, but it presents dangers, too. Very wide data sets are computationally difficult for some modeling procedures; and more importantly, they can lead to overfit models that generalize poorly on new data. In extreme cases, wide data can fool modeling procedures into finding models that look good on training data, even when that data has no signal. We showed some examples of this previously in our “Bad Bayes” blog post.

In this latest “Statistics as it should be” article, we will look at a heuristic to help determine which of your input variables have signal.

Read the article here.
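The article works through examples in detail; as a quick taste of the phenomenon, here is a toy sketch of my own (not the article's code) in which a model fit to many pure-noise columns looks convincing on the training data and falls apart on holdout:

```r
# Toy sketch: many pure-noise columns, a pure-noise outcome, and ordinary
# logistic regression still looks impressive on the training data.
set.seed(2015)
nrows <- 200; nvars <- 50
d <- as.data.frame(matrix(rnorm(nrows * nvars), nrow = nrows))
d$y <- rbinom(nrows, size = 1, prob = 0.5)  # the outcome is pure coin flips

train <- d[1:100, ]
test  <- d[101:200, ]

model <- glm(y ~ ., data = train, family = binomial)

accuracy <- function(dframe) {
  pred <- predict(model, newdata = dframe, type = "response")
  mean((pred > 0.5) == dframe$y)
}
accuracy(train)  # typically well above 0.5: "looks like" signal
accuracy(test)   # typically near 0.5: there wasn't any
```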

Another underlying motivation for this article is to encourage giving empirical intuition for common statistical procedures, like testing for significance; in this case, testing your model against the null hypothesis that you are fitting to pure noise. As a data scientist, you may or may not use my suggested heuristic for variable selection, but it’s good to get in the habit of thinking about the things you measure: not just how to take the measurements, but why.
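To make that intuition concrete, here is a rough sketch (mine, and not necessarily the heuristic from the article) of a permutation-style check: compare the deviance a candidate variable explains against what permuted, signal-free copies of it explain.

```r
# Rough sketch of an empirical significance check for a single variable:
# how much deviance does x explain, compared to permuted (signal-free) copies of x?
set.seed(2015)
n <- 500
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(0.5 * x))  # y carries (weak) signal from x

dev_explained <- function(xcol, ycol) {
  null_dev  <- glm(ycol ~ 1, family = binomial)$deviance
  model_dev <- glm(ycol ~ xcol, family = binomial)$deviance
  null_dev - model_dev
}

observed  <- dev_explained(x, y)
null_dist <- replicate(200, dev_explained(sample(x), y))  # what pure noise achieves
empirical_p <- mean(null_dist >= observed)
empirical_p  # small: x explains more deviance than shuffled noise usually does
```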

New on Win-Vector: Variable Selection for Sessionized Data


Illustration: Boris Artzybasheff. Photo: James Vaughan, some rights reserved.


I’ve just put up the next installment of the new “Working with Sessionized Data” series on Win-Vector. As I mentioned in the previous installment, sessionizing log data can potentially lead to very wide data sets, with possibly more variables than there are rows in the training data. In this post, we look at variable selection strategies for this situation, or for any very wide data situation, really.

In the previous installment, we built a regularized (ridge) logistic regression model over all 132 features. This model didn’t perform too badly, but in general there is more danger of overfitting when working with very wide data sets; in addition, it is quite expensive to analyze a large number of variables with standard implementations of logistic regression. In this installment, we will look for potentially more robust and less expensive ways of analyzing this data.
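If you want to try the ridge step yourself, a minimal sketch using the glmnet package (with made-up placeholder data, not the data set from the post) might look like this:

```r
# Sketch of a ridge (L2-regularized) logistic regression over a wide feature
# matrix, using the glmnet package. xmat and yvec are placeholders here.
library(glmnet)

set.seed(2015)
xmat <- matrix(rnorm(1000 * 132), nrow = 1000)  # stand-in for 132 features
yvec <- rbinom(1000, size = 1, prob = 0.2)      # stand-in for the binary outcome

# alpha = 0 gives the ridge penalty; cv.glmnet chooses lambda by cross-validation
cvfit <- cv.glmnet(xmat, yvec, family = "binomial", alpha = 0)
preds <- predict(cvfit, newx = xmat, s = "lambda.min", type = "response")
```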

Read the rest of the post here.


A Couple Recent Win-Vector Posts

I’ve been neglecting to announce my Win-Vector posts here — but I’ve not stopped writing them. Here are the two most recent:

Wanted: A Perfect Scatterplot (with Marginals)

In which I explore how to make what Matlab calls a “scatterhist”: a scatterplot, with marginal distribution plots on the sides. My version optionally adds the best linear fit to the scatterplot.


I also show how to do it with ggMarginal(), from the ggExtra package.
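For the curious, here is a minimal sketch of the ggMarginal() approach, on toy data of my own rather than the data from the post:

```r
# Sketch: a scatterplot with a linear fit, plus marginal histograms via ggMarginal().
library(ggplot2)
library(ggExtra)

set.seed(2015)
d <- data.frame(x = rnorm(200))
d$y <- 2 * d$x + rnorm(200)

p <- ggplot(d, aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = "lm")      # the optional best linear fit

ggMarginal(p, type = "histogram") # marginal distribution plots on the sides
```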

Working with Sessionized Data 1: Evaluating Hazard Models

This is the start of a mini-series of posts, discussing the analysis of sessionized log data.


Log data is a very thin data form where different facts about different individuals are written across many different rows. Converting log data into a ready-for-analysis form is called sessionizing. We are going to share a short series of articles showing important aspects of sessionizing and modeling log data. Each article will touch on one aspect of the problem in a simplified and idealized setting. In this article we will discuss the importance of dealing with time and of picking a business-appropriate goal when evaluating predictive models.
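As a simplified illustration of what sessionizing means in practice (toy data of my own, not the series’ examples), here is a dplyr sketch that rolls thin log rows up into one analysis-ready row per user:

```r
# Sketch: rolling thin log rows up into one analysis-ready row per user with dplyr.
library(dplyr)

logdata <- data.frame(
  userid = c("u1", "u1", "u2", "u2", "u2"),
  event  = c("view", "purchase", "view", "view", "view"),
  tstamp = as.POSIXct(c("2015-07-01 10:00", "2015-07-01 10:05",
                        "2015-07-02 09:00", "2015-07-02 09:10",
                        "2015-07-02 09:20"))
)

sessions <- logdata %>%
  group_by(userid) %>%
  summarize(
    n_events     = n(),
    n_purchases  = sum(event == "purchase"),
    duration_min = as.numeric(difftime(max(tstamp), min(tstamp), units = "mins"))
  )
sessions  # one row per user, with the facts gathered into columns
```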

Click on the links to read.


Two New Articles

Two new articles, one on the Win-Vector blog, plus a guest post on the Fliptop blog:

Random Test/Train Split is not Always Enough discusses the potential limitations of a randomized test/train split when your training data and future data are not truly exchangeable, due to time-dependent effects, serial correlation, concept changes, or data grouping.
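As a small illustration of one alternative (a sketch of my own, with a hypothetical data frame d that has a date column), a time-ordered split holds out the most recent data rather than a random subset:

```r
# Sketch: a random split versus a time-ordered split, for a hypothetical data
# frame d with a date column. The time-ordered split better matches deployment
# when there are time-dependent effects.
set.seed(2015)
d <- data.frame(date = seq(as.Date("2015-01-01"), by = "day", length.out = 365),
                y = rnorm(365))

# random split: training and test rows are interleaved in time
is_train_random <- runif(nrow(d)) < 0.8

# time-ordered split: train on the past, evaluate on the most recent data
cutoff <- quantile(as.numeric(d$date), 0.8)
is_train_time <- as.numeric(d$date) <= cutoff
```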

Don’t Use Black-Box Testing to Select a Predictive Lead Scoring Vendor is a commissioned piece for one of our clients, and is hosted on their blog. It is related to the first post: if you are running an evaluation of a potential vendor’s decision system, then that test should reflect the environment in which the decision system will be deployed. In particular, if your data has any of the non-exchangeability properties discussed above, then your evaluation setup should reflect that.