Announcing Practical Data Science with R, 2nd Edition

I’ve told a few people privately, but now I can announce it publicly: we are working on the second edition of Practical Data Science with R!

Practical Data Science with R, 2nd edition

Manning Publications has just launched the MEAP for the second edition. The MEAP (Manning Early Access Program) lets you subscribe to drafts of chapters as they become available and give us feedback before the book goes into print. Currently, drafts of the first three chapters are available.

If you’ve been contemplating buying the first edition, and haven’t yet, don’t worry. If you subscribe to the MEAP for the second edition, an eBook copy of the previous edition, Practical Data Science with R (First Edition), is included at no additional cost.

In addition to the topics that we covered in the first edition, we plan to add: additional material on using the vtreat package for data preparation; a discussion of LIME for model explanation; and sections on modeling techniques that we didn’t cover in the first edition, such as gradient boosting, regularized regression, and auto-encoders.

Please subscribe to our book; your support now will help us improve it. Please also forward this offer to your friends and colleagues (and ask them to subscribe and forward it as well).

Manning is sharing a 50% off promotion code active until August 23, 2018: mlzumel3.

On Persistence and Sincerity

…propaganda, Boris Artzybasheff. Image: James Vaughan, some rights reserved.

We’re in the middle of marketing efforts here at Win-Vector, and I’ve just spent a few hours going through the Win-Vector blog so I could update our Popular Articles page (I have to do that for Multo, someday, too).

As I went through the blog, I had a number of thoughts:

  • Wow, this is a lot of posts.
  • Wow, we write about a lot of topics.
  • Wow, this is some really great stuff!

I can’t take credit for all that. The Win-Vector blog is John’s baby; he started it way back in July of 2007, and as it’s his only blog, it’s his primary mode of expression (Facebook for cooking, Win-Vector for the techy stuff). He writes more of the posts than I do. But the blog has been good for some of my hobby horses, too.[1]

The excuse for the Win-Vector blog is that it’s “marketing” for the company. And it is; we promote ourselves sometimes: our company, our book, our video courses. But mostly it’s here because we wanted a place to talk about what we care about, and a place to share things we thought would help other people.


New on Win-Vector: Checking your Data for Signal


I have a new article up on the Win-Vector Blog, on checking your input variables for signal:

An all too common approach to modeling in data science is to throw all possible variables at a modeling procedure and “let the algorithm sort it out.” This is tempting when you are not sure which variables are the true causes or predictors of the phenomenon you are interested in, but it presents dangers, too. Very wide data sets are computationally difficult for some modeling procedures; and more importantly, they can lead to overfit models that generalize poorly to new data. In extreme cases, wide data can fool modeling procedures into finding models that look good on training data, even when that data has no signal. We showed some examples of this previously in our “Bad Bayes” blog post.
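To make that concrete, here is a small illustration of the phenomenon (my sketch, not code from the article): an ordinary logistic regression fit to pure noise, with half as many variables as rows, reports an in-sample accuracy far above the 50% that the data actually supports.

```r
# Sketch: fit a logistic regression to pure noise and watch the
# in-sample fit look deceptively good. Synthetic data, no real signal.
set.seed(2015)
nrows <- 100
nvars <- 50   # wide relative to the number of rows

d <- as.data.frame(matrix(rnorm(nrows * nvars), ncol = nvars))
d$y <- rbinom(nrows, size = 1, prob = 0.5)  # outcome unrelated to inputs

model <- glm(y ~ ., data = d, family = binomial)

# Training accuracy typically comes out well above the 50% it deserves
pred <- as.numeric(predict(model, type = "response") > 0.5)
mean(pred == d$y)
```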

In this latest “Statistics as it should be” article, we will look at a heuristic to help determine which of your input variables have signal.

Read the article here.

Another underlying motivation for this article is to encourage giving empirical intuition for common statistical procedures, like testing for significance; in this case, testing your model against the null hypothesis that you are fitting to pure noise. As a data scientist, you may or may not use my suggested heuristic for variable selection, but it’s good to get in the habit of thinking about the things you measure: not just how to take the measurements, but why.
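One way to build that empirical intuition is a permutation test: refit the same model to copies of the data where the outcome has been shuffled, so any apparent fit is by construction pure noise, and see how the real fit compares. The sketch below is my own illustration of the idea, not necessarily the exact heuristic from the article.

```r
# Sketch: compare a variable's fit against fits to permuted outcomes.
set.seed(34903490)
n <- 200
x <- rnorm(n)
y <- as.numeric(x + rnorm(n) > 0)   # x carries genuine signal

fit_deviance <- function(y, x) deviance(glm(y ~ x, family = binomial))

observed <- fit_deviance(y, x)
# Shuffling y breaks any relationship, giving a pure-noise baseline
nulls <- replicate(200, fit_deviance(sample(y), x))

# Empirical significance: how often does noise fit at least this well?
mean(nulls <= observed)
```

A variable with no signal fits about as well as its shuffled copies; a low empirical frequency suggests real signal.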

New on Win-Vector: Variable Selection for Sessionized Data

Illustration: Boris Artzybasheff
photo: James Vaughan, some rights reserved

I’ve just put up the next installment of the new “Working with Sessionized Data” series on Win-Vector. As I mentioned in the previous installment, sessionizing log data can potentially lead to very wide data sets, possibly with more variables than there are rows in the training data. In this post, we look at variable selection strategies for this situation, or really for any very wide data situation.

In the previous installment, we built a regularized (ridge) logistic regression model over all 132 features. This model didn’t perform too badly, but in general there is more danger of overfitting when working with very wide data sets; in addition, it is quite expensive to analyze a large number of variables with standard implementations of logistic regression. In this installment, we will look for potentially more robust and less expensive ways of analyzing this data.
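For reference, a ridge logistic regression of the kind described above can be fit with the glmnet package, where alpha = 0 selects the ridge (L2) penalty. This is a generic sketch on synthetic stand-in data, not the code from the post.

```r
# Sketch: ridge-regularized logistic regression on a wide data set.
library(glmnet)

set.seed(2015)
n <- 100
p <- 132                     # wide: more variables than rows
X <- matrix(rnorm(n * p), nrow = n)
y <- rbinom(n, 1, plogis(X[, 1] - X[, 2]))  # only two variables matter

# alpha = 0 gives the ridge (L2) penalty; cv.glmnet picks the
# penalty strength lambda by cross-validation
cvfit <- cv.glmnet(X, y, family = "binomial", alpha = 0)
pred <- predict(cvfit, newx = X, s = "lambda.min", type = "response")
```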

Read the rest of the post here.

Enjoy.

A Couple Recent Win-Vector Posts

I’ve been neglecting to announce my Win-Vector posts here — but I’ve not stopped writing them. Here are the two most recent:

Wanted: A Perfect Scatterplot (with Marginals)

In which I explore how to make what Matlab calls a “scatterhist”: a scatterplot, with marginal distribution plots on the sides. My version optionally adds the best linear fit to the scatterplot.


I also show how to do it with ggMarginal(), from the ggExtra package.
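For the curious, the ggExtra approach looks roughly like this (a minimal sketch on made-up data; geom_smooth(method = "lm") supplies the linear fit):

```r
# Sketch: scatterplot with marginal histograms via ggExtra::ggMarginal
library(ggplot2)
library(ggExtra)

d <- data.frame(x = rnorm(200))
d$y <- 2 * d$x + rnorm(200)

p <- ggplot(d, aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)  # best linear fit

ggMarginal(p, type = "histogram")  # marginal distributions on the sides
```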

Working with Sessionized Data 1: Evaluating Hazard Models

This is the start of a mini-series of posts, discussing the analysis of sessionized log data.


Log data is a very thin data form, where different facts about different individuals are written across many different rows. Converting log data into a ready-for-analysis form is called sessionizing. We are going to share a short series of articles showing important aspects of sessionizing and modeling log data. Each article will touch on one aspect of the problem in a simplified and idealized setting. In this article we will discuss the importance of dealing with time, and of picking a business-appropriate goal when evaluating predictive models.
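As a toy picture of what sessionizing means (my own illustration, not the article’s code): thin log data, one event per row, gets rolled up into one wide row per individual.

```r
# Sketch: roll thin log data (one event per row) up into one
# wide row of event counts per user
log_data <- data.frame(
  user  = c("a", "a", "b", "a", "b"),
  event = c("view", "click", "view", "buy", "view")
)

# Cross-tabulate events per user; each user becomes a single wide row
wide <- as.data.frame.matrix(table(log_data$user, log_data$event))
wide
#>   buy click view
#> a   1     1    2
#> b   0     0    2
```

With many distinct event types (or event-by-time-window combinations), the resulting table can easily have more columns than the training data has rows, which is exactly the wide-data situation this series addresses.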

Click on the links to read.

Enjoy.