I’ve just put up the next installment of the new “Working with Sessionized Data” series on Win-Vector.
photo: James Vaughan, some rights reserved
As I mentioned in the previous installment, sessionizing log data can lead to very wide data sets, possibly with more variables than there are rows in the training data. In this post, we look at variable selection strategies for this situation -- or for any very wide data situation, really.
In the previous installment, we built a regularized (ridge) logistic regression model over all 132 features. This model didn’t perform too badly, but in general there is more danger of overfitting when working with very wide data sets; in addition, it is quite expensive to analyze a large number of variables with standard implementations of logistic regression. In this installment, we will look for potentially more robust and less expensive ways of analyzing this data.
Read the rest of the post here.