Practical Data Science with R, 2nd Edition — New Chapters!

We have two new chapters of Practical Data Science with R, Second Edition online and available for review! This makes six chapters in total accessible to MEAP subscribers.

Practical Data Science with R, 2nd Edition (MEAP)

The newly available chapters cover:

Data Engineering and Data Shaping – Explores how to use R to organize or wrangle data into a shape useful for analysis. The chapter covers applying data transforms, data manipulation packages, and more (a small data-shaping sketch follows the chapter descriptions below).

Choosing and Evaluating Models – The chapter starts by exploring machine learning approaches and then moves on to key model evaluation topics: mapping business problems to machine learning tasks, evaluating model quality, and explaining model predictions.
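
To give a flavor of the data shaping material mentioned above, here is a small sketch of one common task: moving a data frame from wide form to long form. This is not code from the book; the example data, the column names, and the choice of the tidyr package are just for illustration (the chapter itself surveys several packages and approaches).

    # toy "wide" data: one row per student, one column per subject
    library(tidyr)

    scores_wide <- data.frame(
      student = c("alice", "bob"),
      math    = c(90, 75),
      reading = c(85, 80)
    )

    # pivot the per-subject columns into key/value rows ("long" form)
    scores_long <- pivot_longer(
      scores_wide,
      cols      = c(math, reading),
      names_to  = "subject",
      values_to = "score"
    )
    scores_long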

If you haven’t signed up for our book’s MEAP (Manning Early Access Program), we encourage you to do so. The MEAP includes a free copy of Practical Data Science with R, First Edition, as well as early access to chapter drafts of the second edition as we complete them.

For those of you who have already subscribed — thank you! We hope you enjoy the new chapters, and we look forward to your feedback.

Announcing Practical Data Science with R, 2nd Edition

I’ve told a few people privately, but now I can announce it publicly: we are working on the second edition of Practical Data Science with R!

Practical Data Science with R, 2nd Edition

Manning Publications has just launched the MEAP for the second edition. The MEAP (Manning Early Access Program) allows you to subscribe to drafts of chapters as they become available, and to give us feedback before the book goes into print. Currently, drafts of the first three chapters are available.

If you’ve been contemplating buying the first edition, and haven’t yet, don’t worry. If you subscribe to the MEAP for the second edition, an eBook copy of the previous edition, Practical Data Science with R (First Edition), is included at no additional cost.

In addition to the topics that we covered in the first edition, we plan to add: additional material on using the vtreat package for data preparation; a discussion of LIME for model explanation; and sections on modeling techniques that we didn’t cover in the first edition, such as gradient boosting, regularized regression, and auto-encoders.
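
As a preview of the vtreat material, here is a minimal sketch of what basic vtreat usage looks like for a binary classification problem. This is not code from the book; the data frames dTrain and dTest, the outcome column y, and the pruning threshold are hypothetical placeholders.

    library(vtreat)

    # design a treatment plan from the training data (binary/categorical outcome)
    treatplan <- designTreatmentsC(
      dTrain,
      varlist       = setdiff(colnames(dTrain), "y"),
      outcomename   = "y",
      outcometarget = TRUE
    )

    # apply the plan to get clean, purely numeric frames ready for modeling
    dTrainTreated <- prepare(treatplan, dTrain, pruneSig = 0.05)
    dTestTreated  <- prepare(treatplan, dTest,  pruneSig = 0.05)

designTreatmentsC() learns how to encode categorical levels and handle missing values from the training data; prepare() then applies that fixed plan to new data.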

Please subscribe to our book; your support now will help us improve it. Please also forward this offer to your friends and colleagues (and ask them to subscribe and forward it as well).

Manning is sharing a 50% off promotion code active until August 23, 2018: mlzumel3.

On Persistence and Sincerity

…propaganda, Boris Artzybasheff. Image: James Vaughan, some rights reserved.

We’re in the middle of marketing efforts here at Win-Vector, and I’ve just spent a few hours going through the Win-Vector blog so I could update our Popular Articles page (I have to do that for Multo, someday, too).

As I went through the blog, I had a number of thoughts:

  • Wow, this is a lot of posts.
  • Wow, we write about a lot of topics.
  • Wow, this is some really great stuff!

I can’t take credit for all that. The Win-Vector blog is John’s baby; he started it way back in July of 2007, and as it’s his only blog, it’s his primary mode of expression (Facebook for cooking, Win-Vector for the techy stuff). He writes more of the posts than I do. But the blog has been good for some of my hobby horses, too.[1]

The excuse for the Win-Vector blog is that it’s “marketing” for the company. And it is; we promote ourselves sometimes: our company, our book, our video courses. But mostly it’s here because we wanted a place to talk about what we care about, and a place to share things we thought would help other people.


New on Win-Vector: Checking your Data for Signal


I have a new article up on the Win-Vector Blog, on checking your input variables for signal:

An all too common approach to modeling in data science is to throw all possible variables at a modeling procedure and “let the algorithm sort it out.” This is tempting when you are not sure which variables are the true causes or predictors of the phenomenon you are interested in, but it presents dangers, too. Very wide data sets are computationally difficult for some modeling procedures; and more importantly, they can lead to overfit models that generalize poorly to new data. In extreme cases, wide data can fool modeling procedures into finding models that look good on training data, even when that data has no signal. We showed some examples of this previously in our “Bad Bayes” blog post.

In this latest “Statistics as it should be” article, we will look at a heuristic to help determine which of your input variables have signal.

Read the article here.

Another underlying motivation for this article is to encourage building empirical intuition for common statistical procedures, like testing for significance; in this case, that means testing your model against the null hypothesis that you are fitting to pure noise. As a data scientist, you may or may not use my suggested heuristic for variable selection, but it is good to get into the habit of thinking about the things you measure: not just how to take the measurements, but why.
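
To make that concrete, here is a small sketch of the general idea (not necessarily the exact heuristic from the article): score a candidate variable by the significance of the deviance it explains in a single-variable logistic regression, and compare against a permuted, pure-noise copy of the same variable. The toy data here is invented.

    set.seed(2015)
    d <- data.frame(x = rnorm(200))
    d$y <- (d$x + rnorm(200)) > 0     # outcome that genuinely depends on x

    sigOfVar <- function(varname, data) {
      fmla  <- as.formula(paste("y ~", varname))
      model <- glm(fmla, data = data, family = binomial)
      # chi-squared significance of the deviance explained by this one variable
      pchisq(model$null.deviance - model$deviance, df = 1, lower.tail = FALSE)
    }

    sigOfVar("x", d)        # small: x appears to carry signal

    d$xnull <- sample(d$x)  # permuted copy: no real relationship to y
    sigOfVar("xnull", d)    # typically not small: consistent with pure noise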

New on Win-Vector: Variable Selection for Sessionized Data


Illustration: Boris Artzybasheff
photo: James Vaughan, some rights reserved

 

I’ve just put up the next installment of the new “Working with Sessionized Data” series on Win-Vector. As I mentioned in the previous installment, sessionizing log data can potentially lead to very wide data sets, with possibly more variables than there are rows in the training data. In this post, we look at variable selection strategies for this situation — or for any very wide data situation, really.

In the previous installment, we built a regularized (ridge) logistic regression model over all 132 features. This model didn’t perform too badly, but in general there is more danger of overfitting when working with very wide data sets; in addition, it is quite expensive to analyze a large number of variables with standard implementations of logistic regression. In this installment, we will look for potentially more robust and less expensive ways of analyzing this data.
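
For readers who want to see the shape of that kind of ridge model, here is a minimal sketch using the glmnet package. This is not the actual code from the series; the data frame sessionData and its outcome column converted are hypothetical stand-ins for the sessionized data set.

    library(glmnet)

    # numeric feature matrix (factor columns expand to indicators) and
    # the binary outcome (0/1 or a two-level factor)
    x <- model.matrix(converted ~ . - 1, data = sessionData)
    y <- sessionData$converted

    # alpha = 0 requests the ridge (L2) penalty; cv.glmnet chooses the
    # penalty strength lambda by cross-validation
    cvfit <- cv.glmnet(x, y, family = "binomial", alpha = 0)

    # predicted class probabilities at the cross-validated lambda
    pred <- predict(cvfit, newx = x, s = "lambda.min", type = "response")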

Read the rest of the post here.

Enjoy.