WVPlots and Color Controls

I’ve put a new release of the WVPlots package up on CRAN. This release adds consistent palette and color controls to most of the functions in the package.


WVPlots was originally a convenience package just for us; we put it up on CRAN in the hopes that other people might find our plots to be useful as well. Because it was just for us, we tended to hard-code in our preferred color choices. For example, for plots that color-code by group, I tend to prefer the Brewer Dark2 palette because it is

  • saturated,
  • color-blind friendly for a small number of classes,
  • gray-scale printing friendly (though not photocopy-friendly),
  • reasonably perceptually uniform.

This last property is important when you don’t want the viewer to prefer certain groups over others. Of course, you may have other desiderata. Sequential or diverging palettes are useful when you do wish to imply an order or ranking among groups; sequential palettes can also be color-blind friendly over a larger number of classes. If perceptual uniformity is important, then the viridis palettes are analytically designed to be perceptually uniform and color-blind friendly (and apparently print-friendly as well). And when you are reporting results and wish to “tell stories” with your data—that is, visually draw your audience to the conclusion you wish them to reach—then hand-tuning your color palette to draw attention to the important groups rather than the less relevant ones can be crucial.
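As a concrete illustration (plain ggplot2 on made-up data, not WVPlots itself), switching among these palette families is just a matter of swapping the color scale:

```r
library(ggplot2)

# toy data: three groups (purely illustrative)
d <- data.frame(
  x = rep(1:10, 3),
  y = c(rnorm(10), rnorm(10, 1), rnorm(10, 2)),
  group = rep(c("a", "b", "c"), each = 10)
)

p <- ggplot(d, aes(x = x, y = y, color = group)) +
  geom_point()

p + scale_color_brewer(palette = "Dark2")  # qualitative: no implied ranking
p + scale_color_viridis_d()                # perceptually uniform, color-blind friendly
p + scale_color_manual(                    # hand-tuned: draw the eye to group "b"
  values = c(a = "gray70", b = "firebrick", c = "gray70"))
```

(`scale_color_viridis_d()` requires a recent ggplot2; with older versions, the standalone viridis package’s `scale_color_viridis(discrete = TRUE)` does the same job.)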

The Brewer family of palettes, developed by cartographer Cynthia Brewer in the early 2000s, includes a variety of qualitative, diverging, and sequential palettes, originally designed for map making. Since the perceptual issues around making legible maps are similar to the issues around making legible data visualizations, I find the Brewer palettes incredibly useful for data science, and WVPlots reflects this preference. If you prefer other palettes, it is also possible to “turn off” the Brewer palettes and use ggplot2’s default color scheme, to use viridis, or to manually specify the color palette.
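For instance, a group-colored WVPlots scatterplot might be switched off the Brewer default like so. This is only a sketch: the `palette` argument name and its `NULL`-to-disable behavior are my reading of the new release, so check the package documentation before relying on them.

```r
library(WVPlots)

# toy grouped data (purely illustrative)
set.seed(2018)
d <- data.frame(
  x = rnorm(60),
  y = rnorm(60),
  group = rep(c("a", "b", "c"), each = 20)
)

# default: colors groups with the Brewer Dark2 palette
ScatterHistC(d, "x", "y", "group", title = "grouped scatter")

# assumed API: palette = NULL reverts to ggplot2's default hues
ScatterHistC(d, "x", "y", "group", title = "grouped scatter", palette = NULL)
```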

You can see some examples of the palettes and color controls in use in my official announcement of the new version release on the Win-Vector blog.

Here are some further references on color, color palettes, and their uses:

colorbrewer2.org: A super-useful website for browsing the Brewer palettes. Provides the color designations in Hex, RGB, and CMYK, along with advisory information on whether palettes are color-blind friendly, print friendly, photocopy friendly, and LCD friendly.

Designing Better Maps: A Guide for GIS Users: Professor Brewer’s textbook on map design. Primarily for cartographers, of course, but possibly of interest to data scientists who need to analyze, visualize, and present geographically-based data. Includes a couple of chapters on the use of color, and an appendix about the colorbrewer website.

David Nichols’s Coloring for Colorblindness site includes an online tool to help non-colorblind people visualize what various palettes look like to viewers with various types of colorblindness. It also includes links and suggestions on various colorblind-friendly palettes.

A discussion and video about the development of the viridis palettes by Stéfan van der Walt and Nathaniel Smith, the palette designers. The viridis palettes were originally developed for Python’s matplotlib library.

Storytelling with Data: A Data Visualization Guide for Business Professionals: Cole Nussbaumer Knaflic’s excellent text on effective communication via data visualization. Includes numerous tips on the use of color to guide your narrative.

Happy plotting!

Practical Data Science with R, 2nd Edition — New Chapters!

We have two new chapters of Practical Data Science with R, Second Edition online and available for review! This makes six chapters in total accessible to MEAP subscribers.

Practical Data Science with R, 2nd Edition (MEAP)

The newly available chapters cover:

Data Engineering and Data Shaping – Explores how to use R to organize or wrangle data into a shape useful for analysis. The chapter covers applying data transforms, data manipulation packages, and more.

Choosing and Evaluating Models – The chapter starts with exploring machine learning approaches and then moves to studying key model evaluation topics like mapping business problems to machine learning tasks, evaluating model quality, and how to explain model predictions.

If you haven’t signed up for our book’s MEAP (Manning Early Access Program), we encourage you to do so. The MEAP includes a free copy of Practical Data Science with R, First Edition, as well as early access to chapter drafts of the second edition as we complete them.

For those of you who have already subscribed — thank you! We hope you enjoy the new chapters, and we look forward to your feedback.

Announcing Practical Data Science with R, 2nd Edition

I’ve told a few people privately, but now I can announce it publicly: we are working on the second edition of Practical Data Science with R!

Practical Data Science with R, 2nd edition

Manning Publications has just launched the MEAP for the second edition. The MEAP (Manning Early Access Program) allows you to subscribe to drafts of chapters as they become available and to give us feedback before the book goes into print. Currently, drafts of the first three chapters are available.

If you’ve been contemplating buying the first edition, and haven’t yet, don’t worry. If you subscribe to the MEAP for the second edition, an eBook copy of the previous edition, Practical Data Science with R (First Edition), is included at no additional cost.

In addition to the topics that we covered in the first edition, we plan to add: additional material on using the vtreat package for data preparation; a discussion of LIME for model explanation; and sections on modeling techniques that we didn’t cover in the first edition, such as gradient boosting, regularized regression, and auto-encoders.

Please subscribe to our book; your support now will help us improve it. Please also forward this offer to your friends and colleagues (and please ask them to also subscribe and forward).

Manning is sharing a 50% off promotion code active until August 23, 2018: mlzumel3.

A Trunkful of Win-Vector R Packages


If you follow the Win-Vector blog, you know that we have developed a number of R packages that encapsulate our data science working process and philosophy. The biggest package, of course, is our data preparation package, vtreat, which implements many of the data treatment principles that I describe in my white-paper, here. Read more of this post

New Win-Vector Package replyr: for easier dplyr

Using dplyr with a specific data frame, where all the columns are known, is an effective and pleasant way to execute declarative (SQL-like) operations on dataframes and dataframe-like objects in R. It also has the advantage of working not only on local data, but also on dplyr-supported remote data stores, like SQL databases or Spark.

However, once we no longer know the column names, the pleasure quickly fades. The currently recommended way to handle dplyr’s non-standard evaluation is via the lazyeval package. This is not pretty. I never want to write anything like the following, ever again.

library("dplyr")
library("lazyeval")

# target is a moving target, so to speak
target <- "column_I_want"

# return all the rows where the target column is NA
dframe %>%
  filter_(interp(~ is.na(col), col = as.name(target)))

This example is fairly simple, but the more complex the dplyr expression, and the more columns involved, the more unwieldy the lazyeval solution becomes.

The difficulty of parameterizing dplyr expressions is part of the motivation for Win-Vector’s new package, replyr. I’ve just posted an article to the Win-Vector blog on the function replyr::let, which lets us parameterize dplyr expressions without lazyeval.
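With replyr::let, the same query reads like ordinary dplyr code: COL below is a placeholder name that let substitutes with the actual column name before evaluation. This is a sketch based on the announcement; see the linked article for the definitive usage.

```r
library("dplyr")
library("replyr")

# toy data frame with the column we want to query
dframe <- data.frame(column_I_want = c(1, NA, 3))

# target is still a moving target
target <- "column_I_want"

# COL is a stand-in; let() rewrites it to the value of target,
# so the dplyr pipeline itself stays in plain, standard notation
replyr::let(
  alias = list(COL = target),
  dframe %>% filter(is.na(COL))
)
```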

Read more of this post