New article up on Win-Vector — Vtreat: a package for variable treatment

We are writing an R package to implement some of the data treatment practices that we discuss in Chapters 4 and 6 of Practical Data Science with R. There’s an article describing the package up on the Win-Vector blog:

When you apply machine learning algorithms on a regular basis, on a wide variety of data sets, you find that certain data issues come up again and again:

  • Missing values (NA or blanks)
  • Problematic numerical values (Inf, NaN, sentinel values like 999999999 or -1)
  • Valid categorical levels that don’t appear in the training data (especially when there are rare levels, or a large number of levels)
  • Invalid values
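To make the list above concrete, here is a minimal sketch of how one might count these issues in a single column before deciding what to do about them. It is written in plain Python for illustration only and is not the R package's interface; the function names `audit_numeric` and `novel_levels`, and the default sentinel values, are assumptions made for this example.

```python
import math

def audit_numeric(values, sentinels=(999999999, -1)):
    """Count missing, non-finite, and sentinel entries in a numeric column.

    The sentinel defaults are hypothetical; use whatever your data dictionary says.
    """
    counts = {"missing": 0, "non_finite": 0, "sentinel": 0, "ok": 0}
    for v in values:
        if v is None or v == "":
            counts["missing"] += 1          # NA or blank
        elif isinstance(v, float) and (math.isnan(v) or math.isinf(v)):
            counts["non_finite"] += 1       # NaN or +/-Inf
        elif v in sentinels:
            counts["sentinel"] += 1         # placeholder codes such as -1
        else:
            counts["ok"] += 1
    return counts

def novel_levels(train_levels, new_levels):
    """Return the levels in new data that never appeared in training."""
    seen = set(train_levels)
    return sorted({lv for lv in new_levels if lv is not None and lv not in seen})

numeric_report = audit_numeric([1.5, None, float("nan"), float("inf"), -1, 999999999, 3])
unseen = novel_levels(["a", "b"], ["a", "c", None, "d", "c"])
```

A report like this only says how often each issue occurs; whether a given value is truly a sentinel or truly invalid still depends on what the column means.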

Of course, you should examine the data to understand the nature of the data issues: are the missing values missing at random, or are they systematic? What are the valid ranges for the numerical data? Are there sentinel values, and if so, what are they and what do they mean? What are the valid values for text fields? Do we know all the valid values for a categorical variable, and are there any missing? Is there any principled way to roll up category levels? In the end, though, the steps you take to deal with these issues will often be the same from data set to data set, so having a package of ready-to-go functions for data treatment is useful. In this article, we will discuss some of our usual data treatment procedures, and describe a prototype R package that implements them.
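As a rough illustration of what a reusable treatment step can look like, the sketch below shows one common approach: learn a fill value and the set of known levels from training data, then apply them to new data. It is an assumption-laden example in plain Python, not the procedure or interface of the R package described in the article; every function name here (`is_bad`, `fit_numeric`, `apply_numeric`, `fit_levels`, `apply_levels`) is hypothetical.

```python
import math

def is_bad(v, sentinels=()):
    """True for None, NaN, +/-Inf, or a declared sentinel value."""
    if v is None:
        return True
    if isinstance(v, float) and not math.isfinite(v):
        return True
    return v in sentinels

def fit_numeric(train_values, sentinels=()):
    """Learn the fill value: the mean of the usable training values."""
    good = [v for v in train_values if not is_bad(v, sentinels)]
    return sum(good) / len(good) if good else 0.0

def apply_numeric(values, fill, sentinels=()):
    """Return (cleaned values, 0/1 flags marking which entries were replaced)."""
    flags = [1 if is_bad(v, sentinels) else 0 for v in values]
    cleaned = [fill if f else float(v) for v, f in zip(values, flags)]
    return cleaned, flags

def fit_levels(train_levels):
    """Record the categorical levels seen in training."""
    return sorted({lv for lv in train_levels if lv is not None})

def apply_levels(values, levels):
    """One 0/1 indicator per training level; novel or missing values become all zeros."""
    return [[1 if v == lv else 0 for lv in levels] for v in values]

# Fit on training data, then apply the same treatment to new data.
fill = fit_numeric([1.0, 3.0, None, float("nan"), -1], sentinels=(-1,))
cleaned, flags = apply_numeric([5, None, float("inf")], fill)
levels = fit_levels(["a", "b", "a", None])
encoded = apply_levels(["b", "zzz", None], levels)
```

Keeping the flag column alongside the filled values lets a downstream model still use the fact that a value was missing, which matters when values are missing systematically rather than at random.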

You can read the article here. Enjoy.

About nzumel
I dance. I'm a data scientist. I'm a dancing data scientist. In my spare time, I like to read folklore (and research about folklore), ghost stories, random cognitive science papers, and to sometimes blog about it all.
