WVPlots and Color Controls

I’ve put a new release of the WVPlots package up on CRAN. This release adds consistent palette and color controls to most of the functions in the package.


WVPlots was originally a convenience package just for us; we put it up on CRAN in the hopes that other people might find our plots useful as well. Because it was just for us, we tended to hard-code our preferred color choices. For example, for plots that color-code by group, I tend to prefer the Brewer Dark2 palette because it is

  • saturated,
  • color-blind friendly for a small number of classes,
  • gray-scale printing friendly (though not photocopy-friendly),
  • reasonably perceptually uniform.

This last property is important when you don’t want the viewer to prefer certain groups over others. Of course, you may have other desiderata for your visualizations. Sequential or diverging palettes are useful when you do wish to imply an order or ranking among groups; sequential palettes can also remain color-blind friendly over a larger number of classes. If perceptual uniformity is important, the viridis palettes are analytically designed to be perceptually uniform and color-blind friendly (and apparently print-friendly as well). And when you are reporting results and wish to “tell stories” with your data, that is, visually draw your audience to the conclusion you wish them to reach, hand-tuning your color palette to draw attention to the important groups rather than the less relevant ones can be crucial.

The Brewer family of palettes, developed by cartographer Cynthia Brewer in the early 2000s, includes a variety of qualitative, diverging, and sequential palettes, originally designed for map making. Since the perceptual issues around making legible maps are similar to the issues around making legible data visualizations, I find the Brewer palettes incredibly useful for data science, and WVPlots reflects this preference. If you prefer other palettes, it is also possible to “turn off” the Brewer palettes and use ggplot2’s default color scheme, to use viridis, or to manually specify the color palette.
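In plain ggplot2 terms (not the WVPlots API itself, whose argument names I won’t reproduce here), those options correspond roughly to the following sketch. The data frame and column names are placeholders, and `scale_color_viridis_d()` assumes ggplot2 3.0.0 or later:

```r
library(ggplot2)

# toy stand-in for real grouped data
set.seed(2018)
d <- data.frame(
  x = rep(1:10, 3),
  y = c(1:10, 2 * (1:10), 0.5 * (1:10)) + rnorm(30),
  group = rep(c("a", "b", "c"), each = 10)
)

p <- ggplot(d, aes(x = x, y = y, color = group)) +
  geom_point()

p + scale_color_brewer(palette = "Dark2")   # Brewer Dark2 qualitative palette
p + scale_color_viridis_d()                 # discrete viridis palette
p + scale_color_manual(                     # hand-tuned palette
  values = c(a = "#1b9e77", b = "grey50", c = "grey70"))
# leaving the scale off entirely keeps ggplot2's default hue scheme
```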

You can see some examples of the palettes and color controls in use in my official announcement of the new version release on the Win-Vector blog.

Here are more interesting references on color, color palettes and their uses:

colorbrewer2.org: A super-useful website for browsing the Brewer palettes. Provides the color designations in Hex, RGB, and CMYK, along with advisory information on whether palettes are color-blind friendly, print friendly, photocopy friendly, and LCD friendly.

Designing Better Maps: A Guide for GIS Users: Professor Brewer’s textbook on map design. Primarily for cartographers, of course, but possibly of interest to data scientists who need to analyze, visualize, and present geographically-based data. Includes a couple of chapters on the use of color, and an appendix about the colorbrewer website.

David Nichols’s Coloring for Colorblindness site includes an online tool to help non-colorblind people visualize what various palettes look like to viewers with various types of colorblindness. It also includes links and suggestions on various colorblind-friendly palettes.

A discussion and video about the development of the viridis palettes by Stéfan van der Walt and Nathaniel Smith, the palette designers. The viridis palettes were originally developed for Python’s matplotlib library.

Storytelling with Data: A Data Visualization Guide for Business Professionals: Cole Nussbaumer Knaflic’s excellent text on effective communication via data visualization. Includes numerous tips on the use of color to guide your narrative.

Happy plotting!

A Couple Recent Win-Vector Posts

I’ve been neglecting to announce my Win-Vector posts here — but I’ve not stopped writing them. Here are the two most recent:

Wanted: A Perfect Scatterplot (with Marginals)

In which I explore how to make what MATLAB calls a “scatterhist”: a scatterplot with marginal distribution plots on the sides. My version optionally adds the best linear fit to the scatterplot.


I also show how to do it with ggMarginal(), from the ggExtra package.
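A minimal sketch of the ggMarginal() route, assuming the ggExtra package is installed; the data here are synthetic placeholders, not the data from the post:

```r
library(ggplot2)
library(ggExtra)  # provides ggMarginal()

set.seed(2018)
d <- data.frame(x = rnorm(100))
d$y <- 2 * d$x + rnorm(100)

# scatterplot, with the best linear fit added
p <- ggplot(d, aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

# attach marginal histograms to the top and right margins
ggMarginal(p, type = "histogram")
```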

Working with Sessionized Data 1: Evaluating Hazard Models

This is the start of a mini-series of posts, discussing the analysis of sessionized log data.


Log data is a very thin data form, where different facts about different individuals are written across many different rows. Converting log data into a ready-for-analysis form is called sessionizing. We are going to share a short series of articles showing important aspects of sessionizing and modeling log data. Each article will touch on one aspect of the problem in a simplified and idealized setting. In this article we will discuss the importance of dealing with time, and of picking a business-appropriate goal when evaluating predictive models.

Click on the links to read.


Design, Problem Solving, and Good Taste


Image: A Case for Spaceships (Jure Triglav)

I ran across this essay recently on the role of design standards for scientific data visualization. The author, Jure Triglav, draws his inspiration from the creation and continued use of the NYCTA Graphics Standards, which were instituted in the late 1960s to unify the signage for the New York City subway system. As the author puts it, the Graphics Standards Manual is “a timeless example of great design elegantly solving a real problem.” Thanks to the unified iconography, a traveler on the New York subway knows exactly what to look for to navigate the subway system, no matter which station they may be in. And the iconography is beautiful, too.


Unimark, the design firm that created the Graphics Standards.
Aren’t they a hip, mod looking group? And I’m jealous of those lab coats.
Image: A Case for Spaceships (Jure Triglav)

What works to clarify subway travel will work to clarify the morass of graphs and charts that pass for scientific visualization, Triglav argues. And we should start with the work of the Joint Committee on Standards for Graphical Presentation, a group of statisticians, engineers, scientists, and mathematicians who first adopted a set of standards in 1914, revised in 1936, 1938, and 1960.

I agree with him — mostly.

Read more of this post

New Data Visualization Post up on the Win-Vector Blog


I have a new post up on the Win-Vector blog, exploring a couple of data visualization tasks, and touching on the difference between graphing for data exploration and graphing for the communication of results.

Visualization is a useful tool for data exploration and statistical analysis, and it’s an important method for communicating your discoveries to others. While those two uses of visualization are related, they aren’t identical.

One of the reasons that I like ggplot so much is that it excels at layering together multiple views and summaries of data in ways that improve both data exploration and communication. Of course, getting at the right graph can be a bit of work, and often I will stop when I get to a visualization that tells me what I need to know — even if no one can read that graph but me. In this post I’ll look at a couple of ggplot graphs that take the extra step: communicating effectively to others.
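As a generic illustration of that kind of layering (not the actual graphs from the post), here is a sketch using the built-in mtcars data: raw points, a smoothed trend with its uncertainty band, and marginal rug marks, all in one plot:

```r
library(ggplot2)

# three layered views of the same data in a single graph
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(alpha = 0.6) +          # raw data
  geom_smooth(method = "loess") +    # trend plus uncertainty band
  geom_rug()                         # marginal positions of the data
p
```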

The post concerns itself mostly with the ggplot code to generate the graphs, but there is a bigger-picture point, too. Data visualization is a bit like the drafts of a piece of writing: early graphs are rough, sometimes ugly, and highly detailed. By the time you get to the point of presenting the results — of articulating the “story” that is found in the data — you might want to use graphs that abstract away some of that detail, so that your viewers more clearly see the point you are trying to make, or the key insight that you are trying to convey.

You can read the post here.

Exploring Graphical Perception in ggplot2


I’ve posted a new article to the Win-Vector blog, applying principles from William Cleveland’s The Elements of Graphing Data in ggplot2.

I was flipping through my copy of William Cleveland’s The Elements of Graphing Data the other day; it’s a book worth revisiting. I’ve always liked Cleveland’s approach to visualization as statistical analysis. His quest to ground visualization principles in the context of human visual cognition (he called it “graphical perception”) generated useful advice for designing effective graphics.

I confess I don’t always follow his advice. Sometimes it’s because I don’t agree with him, but also it’s because I use ggplot for visualization, and I’m lazy. I like ggplot because it excels at layering multiple graphics into a single plot and because it looks good; but deviating from the default presentation is often a bit of work. How much am I losing out on by this? I decided to do the work and find out.

Details of specific plots aside, the key points of Cleveland’s philosophy are:

  • A graphic should display as much information as it can, with the lowest possible cognitive strain to the viewer.
  • Visualization is an iterative process. Graph the data, learn what you can, and then regraph the data to answer the questions that arise from your previous graphic.

Of course, when you are your own viewer, part of the cognitive strain in visualization comes from the difficulty of generating the desired graphic. So we’ll start by making the easiest possible ggplot graph and work our way up from there, Cleveland style.
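A sketch of what that iteration might look like with the built-in mtcars data (these are illustrative plots, not the ones from the post):

```r
library(ggplot2)

# iteration 1: the easiest possible graph -- all defaults
p1 <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# iteration 2: regraph to chase a question the first plot raises --
# does the weight/mileage relationship differ by cylinder count?
p2 <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  scale_color_brewer(palette = "Dark2")
```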

Read the rest of the post here.