A Couple of Recent Win-Vector Posts

I’ve been neglecting to announce my Win-Vector posts here — but I’ve not stopped writing them. Here are the two most recent:

Wanted: A Perfect Scatterplot (with Marginals)

In which I explore how to make what MATLAB calls a “scatterhist”: a scatterplot, with marginal distribution plots on the sides. My version optionally adds the best linear fit to the scatterplot.

I also show how to do it with ggMarginal(), from the ggExtra package.
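
For readers who want to try it, here is a minimal sketch of the ggMarginal() approach (toy data of my own, not the post's; it assumes the ggplot2 and ggExtra packages are installed):

    # Scatterplot with a linear fit, plus marginal histograms on the sides.
    library(ggplot2)
    library(ggExtra)

    set.seed(34903490)
    frm <- data.frame(x = rnorm(1000))
    frm$y <- 0.5 * frm$x + rnorm(1000)

    # Base scatterplot with the best linear fit overlaid.
    p <- ggplot(frm, aes(x = x, y = y)) +
      geom_point(alpha = 0.3) +
      geom_smooth(method = "lm", se = FALSE)

    # Add the marginal distribution plots.
    ggMarginal(p, type = "histogram")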

Working with Sessionized Data 1: Evaluating Hazard Models

This is the start of a mini-series of posts discussing the analysis of sessionized log data.

Log data is a very thin data form, where different facts about different individuals are written across many different rows. Converting log data into a ready-for-analysis form is called sessionizing. We are going to share a short series of articles showing important aspects of sessionizing and modeling log data. Each article will touch on one aspect of the problem in a simplified and idealized setting. In this article we will discuss the importance of dealing with time, and of picking a business-appropriate goal when evaluating predictive models.
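
To make the term concrete, here is a toy illustration of my own (not from the series, which works with realistic data and real hazard models): sessionizing rolls many thin log rows per individual up into one analysis-ready row each.

    # Thin log data: one row per event, many rows per individual.
    logData <- data.frame(
      user  = c("a", "a", "b", "a", "b"),
      event = c("visit", "click", "visit", "buy", "click")
    )

    # Sessionize: one row per user, one column counting each event type.
    sessions <- as.data.frame.matrix(table(logData$user, logData$event))
    print(sessions)
    ##   buy click visit
    ## a   1     1     1
    ## b   0     1     1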

Click on the links to read.

Enjoy.

Two New Articles

Two new articles: one on the Win-Vector blog, and a guest post on the Fliptop blog:

Random Test/Train Split is not Always Enough discusses the potential limitations of a randomized test/train split when your training data and future data are not truly exchangeable, due to time-dependent effects, serial correlation, concept changes, or data grouping.

Don’t Use Black-Box Testing to Select a Predictive Lead Scoring Vendor is a commissioned piece for one of our clients, hosted on their blog. It is related to the first post: if you are running an evaluation of a potential vendor’s decision system, then that test should reflect the environment in which the decision system will be deployed. In particular, if your data has any of the non-exchangeability properties discussed above, then your evaluation setup should reflect that.
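
To sketch what that can look like in practice (a toy example of mine, not from either article): with time-dependent data, a time-ordered holdout often reflects deployment better than a purely random one.

    # Serially correlated toy data: a random walk indexed by time.
    set.seed(2533)
    d <- data.frame(t = 1:100, y = cumsum(rnorm(100)))

    # Random split: training and test rows are interleaved in time,
    # which can make evaluation look better than deployment will.
    isTrain <- runif(nrow(d)) < 0.8
    dTrainRandom <- d[isTrain, ]
    dTestRandom  <- d[!isTrain, ]

    # Time-ordered split: train on the past, test on the future,
    # matching how the model will actually be used.
    dTrainTime <- d[d$t <= 80, ]
    dTestTime  <- d[d$t > 80, ]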

Recent post on Win-Vector blog, plus some musings on Audience

I put a new post up on Win-Vector a couple of days ago called “The Geometry of Classifiers”, a follow-up to a recent paper by Fernandez-Delgado et al. that investigates several classifiers against a body of data sets, mostly from the UCI Machine Learning Repository. Our article follows up the study with seven additional classifier implementations from scikit-learn and an interactive Shiny app for exploring the results.

As you might guess, we did our little study not only because we were interested in the questions of classifier performance and classifier similarity, but because we wanted an excuse to play with scikit-learn and Shiny. We’re proud of the results (the app is cool!), but we didn’t consider this an especially ground-breaking post. Much to our surprise, the article got over 2000 views the day we posted it (a huge number for us), and is up to nearly 3000 as I write this. It’s already our eighth most popular post of this year (an earlier post by John on the Fernandez-Delgado paper, a comment about some of their data treatment, is also doing quite well: #2 for the month and #21 for the year).

New article up on Win-Vector — Vtreat: a package for variable treatment

We are writing an R package to implement some of the data treatment practices that we discuss in Chapters 4 and 6 of Practical Data Science with R. There’s an article describing the package up on the Win-Vector blog:

When you apply machine learning algorithms on a regular basis, on a wide variety of data sets, you find that certain data issues come up again and again:

  • Missing values (NA or blanks)
  • Problematic numerical values (Inf, NaN, sentinel values like 999999999 or -1)
  • Valid categorical levels that don’t appear in the training data (especially when there are rare levels, or a large number of levels)
  • Invalid values

Of course, you should examine the data to understand the nature of the data issues: are the missing values missing at random, or are they systematic? What are the valid ranges for the numerical data? Are there sentinel values? If so, what are they, and what do they mean? What are the valid values for text fields? Do we know all the valid values for a categorical variable, and are there any missing? Is there any principled way to roll up category levels? In the end, though, the steps you take to deal with these issues will often be the same from data set to data set, so having a package of ready-to-go functions for data treatment is useful. In this article, we will discuss some of our usual data treatment procedures, and describe a prototype R package that implements them.
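
For a flavor of what such a package automates, here is a minimal usage sketch (toy data of mine; the function names follow the vtreat package as released on CRAN, which may differ in detail from the prototype described in the article):

    library(vtreat)

    # Toy training data showing two of the issues above: a missing value
    # in x, and a categorical variable with a limited set of levels.
    dTrain <- data.frame(
      x = c(1, 2, NA, 4, 5, 6),
      color = c("red", "blue", "red", "blue", "red", "blue"),
      y = c(1.1, 2.3, 1.9, 4.2, 5.1, 5.9)
    )

    # New data with an NA and a level ("green") never seen in training.
    dNew <- data.frame(
      x = c(3, NA),
      color = c("green", "red")
    )

    # Design a treatment plan for predicting the numeric outcome y ...
    plan <- designTreatmentsN(dTrain, varlist = c("x", "color"),
                              outcomename = "y")

    # ... and apply it to new data: the NA is imputed (with an added
    # indicator column), and the novel level is handled without error.
    dNewTreated <- prepare(plan, dNew)
    print(dNewTreated)

Note that sentinel values like 999999999 are domain knowledge: the analyst still has to map them to NA (or to their actual meaning) before treatment.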

You can read the article here. Enjoy.

Follow me via RSS!

I went back to using RSS to follow blogs and other websites recently; I don’t know why I ever stopped. My email doesn’t get clogged by notifications anymore, and I don’t lose blog updates in the ever-flowing stream of Twitter or Facebook or the WordPress reader. I can follow any blog on any platform as long as they have an RSS feed, and I don’t need to have accounts on every possible platform, either, just Feedly (and not even that, if I didn’t want to sync between devices).

It also occurred to me that RSS is really the only reliable medium for following an irregular blog like this one. Since I don’t blog on a regular schedule (or all that often), my posts tend to get lost in the WordPress reader, as do tweets and Facebook/Google+ updates.

So I’ve added a “Follow me on Feedly” button to the side of my blog; if you use another RSS reader, like Bloglovin or NetNewsWire, there is a generic RSS widget, as well. Even if you follow me on WordPress, or follow Win-Vector on Twitter, please do consider also following me (and other bloggers you love) via RSS, so you will be sure to never miss my blog updates. I promise, they will not all be about the book.

Thanks!