I have a new article up on the Win-Vector Blog, on checking your input variables for signal:
An all too common approach to modeling in data science is to throw all possible variables at a modeling procedure and “let the algorithm sort it out.” This is tempting when you are not sure what the true causes or predictors of the phenomenon you are interested in are, but it presents dangers, too. Very wide data sets are computationally difficult for some modeling procedures; and more importantly, they can lead to overfit models that generalize poorly on new data. In extreme cases, wide data can fool modeling procedures into finding models that look good on training data, even when that data has no signal. We showed some examples of this previously in our “Bad Bayes” blog post.
In this latest “Statistics as it should be” article, we will look at a heuristic to help determine which of your input variables have signal.
Another underlying motivation for this article is to encourage giving empirical intuition for common statistical procedures, like testing for significance -- in this case, testing your model against the null hypothesis that you are fitting to pure noise. As a data scientist, you may or may not use my suggested heuristic for variable selection, but it's good to get in the habit of thinking about the things you measure: not just how to take the measurements, but why.
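To make the idea of "testing against the null hypothesis that you are fitting to pure noise" concrete, here is a minimal sketch of one empirical way to do it: a permutation test on a single input variable. This is an illustration under stated assumptions, not necessarily the heuristic from the full article. The function names (`pearson_r`, `permutation_p_value`), the use of absolute correlation as the score, and the synthetic data are all hypothetical choices made for this example.

```python
import random


def pearson_r(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def permutation_p_value(xs, ys, n_permutations=1000, seed=0):
    """Estimate how often shuffled (pure-noise) outcomes score as well as the real ones.

    Shuffling ys breaks any real relationship with xs, so the shuffled scores
    approximate what the score looks like under the null hypothesis.
    """
    rng = random.Random(seed)
    observed = abs(pearson_r(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if abs(pearson_r(xs, shuffled)) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_permutations + 1)


# Synthetic data (an assumption for illustration): one variable that drives
# the outcome, and one that is unrelated to it.
rng = random.Random(42)
n = 200
signal_x = [rng.gauss(0, 1) for _ in range(n)]
noise_x = [rng.gauss(0, 1) for _ in range(n)]
y = [x + rng.gauss(0, 1) for x in signal_x]

p_signal = permutation_p_value(signal_x, y)
p_noise = permutation_p_value(noise_x, y)
print("p-value for signal variable:", p_signal)
print("p-value for noise variable:", p_noise)
```

The variable that actually drives the outcome should get a p-value near the smallest value the test can report, while the unrelated variable will typically get a much larger one. Keep in mind that when you screen many variables this way, some pure-noise variables will look significant by chance, so the threshold you choose matters.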